aboutsummaryrefslogtreecommitdiffstats
path: root/ipc/msg.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-04-11ipc/msg: introduce msgctl(MSG_STAT_ANY)Davidlohr Bueso1-5/+12
There is a permission discrepancy when consulting msq ipc object metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the msq metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm. This patch introduces a new MSG_STAT_ANY command such that the msq ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Robert Kettler <robert.kettler@outlook.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-03Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds1-26/+36
Pull namespace updates from Eric Biederman: "There was a lot of work this cycle fixing bugs that were discovered after the merge window and getting everything ready where we can reasonably support fully unprivileged fuse. The bug fixes you already have and much of the unprivileged fuse work is coming in via other trees. Still left for fully unprivileged fuse is figuring out how to cleanly handle .set_acl and .get_acl in the legacy case, and properly handling of evm xattrs on unprivileged mounts. Included in the tree is a cleanup from Alexely that replaced a linked list with a statically allocated fix sized array for the pid caches, which simplifies and speeds things up. Then there is are some cleanups and fixes for the ipc namespace. The motivation was that in reviewing other code it was discovered that access ipc objects from different pid namespaces recorded pids in such a way that when asked the wrong pids were returned. In the worst case there has been a measured 30% performance impact for sysvipc semaphores. Other test cases showed no measurable performance impact. Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc performance both gave the nod that this is good enough. Casey Schaufler and James Morris have given their approval to the LSM side of the changes. I simplified the types and the code dealing with sysvipc to pass just kern_ipc_perm for all three types of ipc. Which reduced the header dependencies throughout the kernel and simplified the lsm code. Which let me work on the pid fixes without having to worry about trivial changes causing complete kernel recompiles" * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc/shm: Fix pid freeing. ipc/shm: fix up for struct file no longer being available in shm.h ipc/smack: Tidy up from the change in type of the ipc security hooks ipc: Directly call the security hook in ipc_ops.associate ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces. ipc/util: Helpers for making the sysvipc operations pid namespace aware ipc: Move IPCMNI from include/ipc.h into ipc/util.h msg: Move struct msg_queue into ipc/msg.c shm: Move struct shmid_kernel into ipc/shm.c sem: Move struct sem and struct sem_array into ipc/sem.c msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks pidns: simpler allocation of pid_* caches
2018-04-02ipc: add msgsnd syscall/compat_syscall wrappersDominik Brodowski1-4/+16
Provide ksys_msgsnd() and compat_ksys_msgsnd() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_msgsnd() and compat_sys_msgsnd(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add msgrcv syscall/compat_syscall wrappersDominik Brodowski1-3/+16
Provide ksys_msgrcv() and compat_ksys_msgrcv() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_msgrcv() and compat_sys_msgrcv(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add msgctl syscall/compat_syscall wrappersDominik Brodowski1-2/+12
Provide ksys_msgctl() and compat_ksys_msgctl() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_msgctl() and compat_sys_msgctl(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add msgget syscall wrapperDominik Brodowski1-1/+6
Provide ksys_msgget() wrapper to avoid in-kernel calls to this syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_msgget(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-03-27ipc: Directly call the security hook in ipc_ops.associateEric W. Biederman1-9/+1
After the last round of cleanups the shm, sem, and msg associate operations just became trivial wrappers around the appropriate security method. Simplify things further by just calling the security method directly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-27ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespacesEric W. Biederman1-10/+13
Today msg_lspid and msg_lrpid are remembered in the pid namespace of the creator and the processes that last send or received a sysvipc message. If you have processes in multiple pid namespaces that is just wrong. The process ids reported will not make the least bit of sense. This fix is slightly more susceptible to a performance problem than the related fix for System V shared memory. By definition the pids are updated by msgsnd and msgrcv, the fast path of System V message queues. The only concern over the previous implementation is the incrementing and decrementing of the pid reference count. As that is the only difference and multiple updates by of the task_tgid by threads in the same process have been shown in af_unix sockets to create a cache line ping-pong between cpus of the same processor. In this case I don't expect cache lines holding pid reference counts to ping pong between cpus. As senders and receivers update different pids there is a natural separation there. Further if multiple threads of the same process either send or receive messages the pid will be updated to the same value and ipc_update_pid will avoid the reference count update. Which means in the common case I expect msg_lspid and msg_lrpid to remain constant, and reference counts not to be updated when messages are sent. In rare cases it may be possible to trigger the issue which was observed for af_unix sockets, but it will require multiple processes with multiple threads to be either sending or receiving messages. It just does not feel likely that anyone would do that in practice. This change updates msgctl(..., IPC_STAT, ...) to return msg_lspid and msg_lrpid in the pid namespace of the process calling stat. This change also updates cat /proc/sysvipc/msg to return print msg_lspid and msg_lrpid in the pid namespace of the process that opened the proc file. Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user") Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-24msg: Move struct msg_queue into ipc/msg.cEric W. Biederman1-0/+17
All of the users are now in ipc/msg.c so make the definition local to that file to make code maintenance easier. AKA to prevent rebuilding the entire kernel when struct msg_queue changes. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-22msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooksEric W. Biederman1-10/+8
All of the implementations of security hooks that take msg_queue only access q_perm the struct kern_ipc_perm member. This means the dependencies of the msg_queue security hooks can be simplified by passing the kern_ipc_perm member of msg_queue. Making this change will allow struct msg_queue to become private to ipc/msg.c. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-02-06ipc: fix ipc data structures inconsistencyPhilippe Mikoyan1-6/+14
As described in the title, this patch fixes <ipc>id_ds inconsistency when <ipc>ctl_stat executes concurrently with some ds-changing function, e.g. shmat, msgsnd or whatever. For instance, if shmctl(IPC_STAT) is running concurrently with shmat, following data structure can be returned: {... shm_lpid = 0, shm_nattch = 1, ...} Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systems Signed-off-by: Philippe Mikoyan <philippe.mikoyan@skat.systems> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+2
Pull misc vfs updates from Al Viro: "Assorted stuff, really no common topic here" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: grab the lock instead of blocking in __fd_install during resizing vfs: stop clearing close on exec when closing a fd include/linux/fs.h: fix comment about struct address_space fs: make fiemap work from compat_ioctl coda: fix 'kernel memory exposure attempt' in fsync pstore: remove unneeded unlikely() vfs: remove unneeded unlikely() stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case make vfs_ustat() static do_handle_open() should be static elf_fdpic: fix unused variable warning fold destroy_super() into __put_super() new helper: destroy_unused_super() fix address space warnings in ipc/ acct.h: get rid of detritus
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-11fix address space warnings in ipc/Linus Torvalds1-2/+2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-14Merge branch 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-111/+255
Pull ipc compat cleanup and 64-bit time_t from Al Viro: "IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa Dinamani" * 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: utimes: Make utimes y2038 safe ipc: shm: Make shmid_kernel timestamps y2038 safe ipc: sem: Make sem_array timestamps y2038 safe ipc: msg: Make msg_queue timestamps y2038 safe ipc: mqueue: Replace timespec with timespec64 ipc: Make sys_semtimedop() y2038 safe get rid of SYSVIPC_COMPAT on ia64 semtimedop(): move compat to native shmat(2): move compat to native msgrcv(2), msgsnd(2): move compat to native ipc(2): move compat to native ipc: make use of compat ipc_perm helpers semctl(): move compat to native semctl(): separate all layout-dependent copyin/copyout msgctl(): move compat to native msgctl(): split the actual work from copyin/copyout ipc: move compat shmctl to native shmctl: split the work from copyin/copyout
2017-09-08ipc: optimize semget/shmget/msgget for lots of keysGuillaume Knispel1-4/+6
ipc_findkey() used to scan all objects to look for the wanted key. This is slow when using a high number of keys. This change adds an rhashtable of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n). This change gives a 865% improvement of benchmark reaim.jobs_per_min on a 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1] Other (more micro) benchmark results, by the author: On an i5 laptop, the following loop executed right after a reboot took, without and with this change: for (int i = 0, k=0x424242; i < KEYS; ++i) semget(k++, 1, IPC_CREAT | 0600); total total max single max single KEYS without with call without call with 1 3.5 4.9 µs 3.5 4.9 10 7.6 8.6 µs 3.7 4.7 32 16.2 15.9 µs 4.3 5.3 100 72.9 41.8 µs 3.7 4.7 1000 5,630.0 502.0 µs * * 10000 1,340,000.0 7,240.0 µs * * 31900 17,600,000.0 22,200.0 µs * * *: unreliable measure: high variance The duration for a lookup-only usage was obtained by the same loop once the keys are present: total total max single max single KEYS without with call without call with 1 2.1 2.5 µs 2.1 2.5 10 4.5 4.8 µs 2.2 2.3 32 13.0 10.8 µs 2.3 2.8 100 82.9 25.1 µs * 2.3 1000 5,780.0 217.0 µs * * 10000 1,470,000.0 2,520.0 µs * * 31900 17,400,000.0 7,810.0 µs * * Finally, executing each semget() in a new process gave, when still summing only the durations of these syscalls: creation: total total KEYS without with 1 3.7 5.0 µs 10 32.9 36.7 µs 32 125.0 109.0 µs 100 523.0 353.0 µs 1000 20,300.0 3,280.0 µs 10000 2,470,000.0 46,700.0 µs 31900 27,800,000.0 219,000.0 µs lookup-only: total total KEYS without with 1 2.5 2.7 µs 10 25.4 24.4 µs 32 106.0 72.6 µs 100 591.0 352.0 µs 1000 22,400.0 2,250.0 µs 10000 2,510,000.0 25,700.0 µs 31900 28,200,000.0 115,000.0 µs [1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com> Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Serge Hallyn <serge@hallyn.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com> Cc: Marc Pardo <marc.pardo@supersonicimagine.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-03ipc: msg: Make msg_queue timestamps y2038 safeDeepa Dinamani1-3/+3
time_t is not y2038 safe. Replace all uses of time_t by y2038 safe time64_t. Similarly, replace the calls to get_seconds() with y2038 safe ktime_get_real_seconds(). Note that this preserves fast access on 64 bit systems, but 32 bit systems need sequence counters. The syscall interfaces themselves are not changed as part of the patch. They will be part of a different series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-02ipc: add missing container_of()s for randstructKees Cook1-1/+2
When building with the randstruct gcc plugin, the layout of the IPC structs will be randomized, which requires any sub-structure accesses to use container_of(). The proc display handlers were missing the needed container_of()s since the iterator is passing in the top-level struct kern_ipc_perm. This would lead to crashes when running the "lsipc" program after the system had IPC registered (e.g. after starting up Gnome): general protection fault: 0000 [#1] PREEMPT SMP ... RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0 ... Call Trace: sysvipc_shm_proc_show+0x5e/0x150 sysvipc_proc_show+0x1a/0x30 seq_read+0x2e9/0x3f0 ... Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-15msgrcv(2), msgsnd(2): move compat to nativeAl Viro1-2/+43
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15ipc: make use of compat ipc_perm helpersAl Viro1-24/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15msgctl(): move compat to nativeAl Viro1-0/+133
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15msgctl(): split the actual work from copyin/copyoutAl Viro1-106/+96
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-12ipc/msg: remove special msg_alloc/freeKees Cook1-20/+4
There is nothing special about the msg_alloc/free routines any more, so remove them to make code more readable. [manfred@colorfullife.com: Rediff to keep rcu protection for security_msg_queue_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-19-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: move atomic_set() to where it is neededKees Cook1-2/+0
Only after ipc_addid() has succeeded will refcounting be used, so move initialization into ipc_addid() and remove from open-coded *_alloc() routines. Link: http://lkml.kernel.org/r/20170525185107.12869-17-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/msg.c: avoid ipc_rcu_putref for failed ipc_addid()Manfred Spraul1-5/+5
Loosely based on a patch from Kees Cook <keescook@chromium.org>: - id and retval can be merged - if ipc_addid() fails, then use call_rcu() directly. The difference is that call_rcu is used for failed ipc_addid() calls, to continue to guaranteed an rcu delay for security_msg_queue_free(). Link: http://lkml.kernel.org/r/20170525185107.12869-16-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/msg: avoid ipc_rcu_alloc()Kees Cook1-4/+14
Instead of using ipc_rcu_alloc() which only performs the refcount bump, open code it. This also allows for msg_queue structure layout to be randomized in the future. Link: http://lkml.kernel.org/r/20170525185107.12869-12-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/msg: do not use ipc_rcu_free()Kees Cook1-2/+7
Avoid using ipc_rcu_free, since it just re-finds the original structure pointer. For the pre-list-init failure path, there is no RCU needed, since it was just allocated. It can be directly freed. Link: http://lkml.kernel.org/r/20170525185107.12869-8-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: merge ipc_rcu and kern_ipc_permManfred Spraul1-8/+11
ipc has two management structures that exist for every id: - struct kern_ipc_perm, it contains e.g. the permissions. - struct ipc_rcu, it contains the rcu head for rcu handling and the refcount. The patch merges both structures. As a bonus, we may save one cacheline, because both structures are cacheline aligned. In addition, it reduces the number of casts, instead most codepaths can use container_of. To simplify code, the ipc_rcu_alloc initializes the allocation to 0. [manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()] Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-02sched/headers: Move the wake-queue types and interfaces from sched.h into <linux/sched/wake_q.h>Ingo Molnar1-1/+0
Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to <linux/sched/wake_q.h>Ingo Molnar1-0/+1
We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/wake_q.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-14ipc: msg, make msgrcv work with LONG_MINJiri Slaby1-1/+4
When LONG_MIN is passed to msgrcv, one would expect to recieve any message. But convert_mode does *msgtyp = -*msgtyp and -LONG_MIN is undefined. In particular, with my gcc -LONG_MIN produces -LONG_MIN again. So handle this case properly by assigning LONG_MAX to *msgtyp if LONG_MIN was specified as msgtyp to msgrcv. This code: long msg[] = { 100, 200 }; int m = msgget(IPC_PRIVATE, IPC_CREAT | 0644); msgsnd(m, &msg, sizeof(msg), 0); msgrcv(m, &msg, sizeof(msg), LONG_MIN, 0); produces currently nothing: msgget(IPC_PRIVATE, IPC_CREAT|0644) = 65538 msgsnd(65538, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0 msgrcv(65538, ... Except a UBSAN warning: UBSAN: Undefined behaviour in ipc/msg.c:745:13 negation of -9223372036854775808 cannot be represented in type 'long int': With the patch, I see what I expect: msgget(IPC_PRIVATE, IPC_CREAT|0644) = 0 msgsnd(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0 msgrcv(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, -9223372036854775808, 0) = 16 Link: http://lkml.kernel.org/r/20161024082633.10148-1-jslaby@suse.cz Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-21sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_QWaiman Long1-4/+4
Currently the wake_q data structure is defined by the WAKE_Q() macro. This macro, however, looks like a function doing something as "wake" is a verb. Even checkpatch.pl was confused as it reported warnings like WARNING: Missing a blank line after declarations #548: FILE: kernel/futex.c:3665: + int ret; + WAKE_Q(wake_q); This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies what the macro is doing and eliminates the checkpatch.pl warnings. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com [ Resolved conflict and added missing rename. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-11ipc/msg: avoid waking sender upon full queueDavidlohr Bueso1-10/+43
Blocked tasks queued in q_senders waiting for their message to fit in the queue are blindly awoken every time we think there's a remote chance this might happen. This could cause numerous (and expensive -- thundering herd-ish) bogus wakeups if the queue is still really full. Adding to the scheduling cost/overhead, there's also the fact that we need to take the ipc object lock and requeue ourselves in the q_senders list. By keeping track of the blocked sender's message size, we can know previously if the wakeup ought to occur or not. Otherwise, to maintain the current wakeup order we just move it to the tail. This is exactly what occurs right now if the sender needs to go back to sleep. The case of EIDRM is left completely untouched, as we need to wakeup all the tasks, and shouldn't be playing games in the first place. This patch was seen to save on the 'msgctl10' ltp testcase ~15% in context switches (avg out of ten runs). Although these tests are really about functionality (as opposed to performance), is does show the direct benefits of the optimization. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1469748819-19484-6-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/msg: make ss_wakeup() kill arg booleanDavidlohr Bueso1-4/+4
... 'tis annoying. Link: http://lkml.kernel.org/r/1469748819-19484-4-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/msg: batch queue sender wakeupsDavidlohr Bueso1-10/+20
Currently the use of wake_qs in sysv msg queues are only for the receiver tasks that are blocked on the queue. But blocked sender tasks (due to queue size constraints) still are awoken with the ipc object lock held, which can be a problem particularly for small sized queues and far from gracious for -rt (just like it was for the receiver side). The paths that actually wakeup a sender are obviously related to when we are either getting rid of the queue or after (some) space is freed-up after a receiver takes the msg (msgrcv). Furthermore, with the exception of msgrcv, we can always piggy-back on expunge_all that has its own tasks lined-up for waking. Finally, upon unlinking the message, it should be no problem delaying the wakeups a bit until after we've released the lock. Link: http://lkml.kernel.org/r/1469748819-19484-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/msg: implement lockless pipelined wakeupsSebastian Andrzej Siewior1-93/+40
This patch moves the wakeup_process() invocation so it is not done under the ipc global lock by making use of a lockless wake_q. With this change, the waiter is woken up once the message has been assigned and it does not need to loop on SMP if the message points to NULL. In the signal case we still need to check the pointer under the lock to verify the state. This change should also avoid the introduction of preempt_disable() in -RT which avoids a busy-loop which pools for the NULL -> !NULL change if the waiter has a higher priority compared to the waker. By making use of wake_qs, the logic of sysv msg queues is greatly simplified (and very well suited as we can batch lockless wakeups), particularly around the lockless receive algorithm. This has been tested with Manred's pmsg-shared tool on a "AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G": test | before | after | diff -----------------|------------|------------|---------- pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 % pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 % pmsg-shared 2 60 | 22,884,224 | 24,278,200 | + ~6.09 % Link: http://lkml.kernel.org/r/1469748819-19484-2-git-send-email-dave@stgolabs.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02sysv, ipc: fix security-layer leakingFabian Frederick1-1/+1
Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref() to receive rcu freeing function but used generic ipc_rcu_free() instead of msg_rcu_free() which does security cleaning. Running LTP msgsnd06 with kmemleak gives the following: cat /sys/kernel/debug/kmemleak unreferenced object 0xffff88003c0a11f8 (size 8): comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s) hex dump (first 8 bytes): 1b 00 00 00 01 00 00 00 ........ backtrace: kmemleak_alloc+0x23/0x40 kmem_cache_alloc_trace+0xe1/0x180 selinux_msg_queue_alloc_security+0x3f/0xd0 security_msg_queue_alloc+0x2e/0x40 newque+0x4e/0x150 ipcget+0x159/0x1b0 SyS_msgget+0x39/0x40 entry_SYSCALL_64_fastpath+0x13/0x8f Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to only use ipc_rcu_free in case of security allocation failure in newary() Fixes: 53dad6d3a8e ("ipc: fix race with LSMs") Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-30Initialize msg/shm IPC objects before doing ipc_addid()Linus Torvalds1-7/+7
As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before having initialized the IPC object state. Yes, we initialize the IPC object in a locked state, but with all the lockless RCU lookup work, that IPC object lock no longer means that the state cannot be seen. We already did this for the IPC semaphore code (see commit e8577d1f0329: "ipc/sem.c: fully initialize sem_array before making it visible") but we clearly forgot about msg and shm. Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc: rename ipc_obtain_objectDavidlohr Bueso1-1/+1
... to ipc_obtain_object_idr, which is more meaningful and makes the code slightly easier to follow. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc,msg: provide barrier pairings for lockless receiveDavidlohr Bueso1-10/+38
We currently use a full barrier on the sender side to to avoid receiver tasks disappearing on us while still performing on the sender side wakeup. We lack however, the proper CPU-CPU interactions pairing on the receiver side which busy-waits for the message. Similarly, we do not need a full smp_mb, and can relax the semantics for the writer and reader sides of the message. This is safe as we are only ordering loads and stores to r_msg. And in both smp_wmb and smp_rmb, there are no stores after the calls _anyway_. This obviously applies for pipelined_send and expunge_all, for EIRDM when destroying a queue. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15ipc: remove use of seq_printf return valueJoe Perches1-16/+18
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13ipc/msg: increase MSGMNI, remove scalingManfred Spraul1-35/+1
SysV can be abused to allocate locked kernel memory. For most systems, a small limit doesn't make sense, see the discussion with regards to SHMMAX. Therefore: increase MSGMNI to the maximum supported. And: If we ignore the risk of locking too much memory, then an automatic scaling of MSGMNI doesn't make sense. Therefore the logic can be removed. The code preserves auto_msgmni to avoid breaking any user space applications that expect that the value exists. Notes: 1) If an administrator must limit the memory allocations, then he can set MSGMNI as necessary. Or he can disable sysv entirely (as e.g. done by Android). 2) MSGMAX and MSGMNB are intentionally not increased, as these values are used to control latency vs. throughput: If MSGMNB is large, then msgsnd() just returns and more messages can be queued before a task switch to a task that calls msgrcv() is forced. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc,msg: document volatile r_msgDavidlohr Bueso1-3/+7
The need for volatile is not obvious, document it. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc,msg: move some msgq ns code aroundDavidlohr Bueso1-69/+63
Nothing big and no logical changes, just get rid of some redundant function declarations. Move msg_[init/exit]_ns down the end of the file. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc,msg: use current->state helpersDavidlohr Bueso1-2/+2
Call __set_current_state() instead of assigning the new state directly. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullif.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: clear whitespacePaul McQuade1-16/+15
trailing whitespace Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: use Linux headersPaul McQuade1-1/+1
Use #include <linux/uaccess.h> instead of <asm/uaccess.h> Use #include <linux/types.h> instead of <asm/types.h> Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc: constify ipc_opsMathias Krause1-5/+4
There is no need to recreate the very same ipc_ops structure on every kernel entry for msgget/semget/shmget. Just declare it static and be done with it. While at it, constify it as we don't modify the structure at runtime. Found in the PaX patch, written by the PaX Team. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-16ipc: Fix 2 bugs in msgrcv() MSG_COPY implementationMichael Kerrisk1-0/+2
While testing and documenting the msgrcv() MSG_COPY flag that Stanislav Kinsbursky added in commit 4a674f34ba04 ("ipc: introduce message queue copy feature" => kernel 3.8), I discovered a couple of bugs in the implementation. The two bugs concern MSG_COPY interactions with other msgrcv() flags, namely: (A) MSG_COPY + MSG_EXCEPT (B) MSG_COPY + !IPC_NOWAIT The bugs are distinct (and the fix for the first one is obvious), however my fix for both is a single-line patch, which is why I'm combining them in a single mail, rather than writing two mails+patches. ===== (A) MSG_COPY + MSG_EXCEPT ===== With the addition of the MSG_COPY flag, there are now two msgrcv() flags--MSG_COPY and MSG_EXCEPT--that modify the meaning of the 'msgtyp' argument in unrelated ways. Specifying both in the same call is a logical error that is currently permitted, with the effect that MSG_COPY has priority and MSG_EXCEPT is ignored. The call should give an error if both flags are specified. The patch below implements that behavior. ===== (B) (B) MSG_COPY + !IPC_NOWAIT ===== The test code that was submitted in commit 3a665531a3b7 ("selftests: IPC message queue copy feature test") shows MSG_COPY being used in conjunction with IPC_NOWAIT. In other words, if there is no message at the position 'msgtyp'. return immediately with the error in ENOMSG. What was not (fully) tested is the behavior if MSG_COPY is specified *without* IPC_NOWAIT, and there is an odd behavior. If the queue contains less than 'msgtyp' messages, then the call blocks until the next message is written to the queue. At that point, the msgrcv() call returns a copy of the newly added message, regardless of whether that message is at the ordinal position 'msgtyp'. This is clearly bogus, and problematic for applications that might want to make use of the MSG_COPY flag. I considered the following possible solutions to this problem: (1) Force the call to block until a message *does* appear at the position 'msgtyp'. (2) If the MSG_COPY flag is specified, the kernel should implicitly add IPC_NOWAIT, so that the call fails with ENOMSG for this case. (3) If the MSG_COPY flag is specified, but IPC_NOWAIT is not, generate an error (probably, EINVAL is the right one). I do not know if any application would really want to have the functionality of solution (1), especially since an application can determine in advance the number of messages in the queue using msgctl() IPC_STAT. Obviously, this solution would be the most work to implement. Solution (2) would have the effect of silently fixing any applications that tried to employ broken behavior. However, it would mean that if we later decided to implement solution (1), then user-space could not easily detect what the kernel supports (but, since I'm somewhat doubtful that solution (1) is needed, I'm not sure that this is much of a problem). Solution (3) would have the effect of informing broken applications that they are doing something broken. The downside is that this would cause a ABI breakage for any applications that are currently employing the broken behavior. However: a) Those applications are almost certainly not getting the results they expect. b) Possibly, those applications don't even exist, because MSG_COPY is currently hidden behind CONFIG_CHECKPOINT_RESTORE. The upside of solution (3) is that if we later decided to implement solution (1), user-space could determine what the kernel supports, via the error return. In my view, solution (3) is mildly preferable to solution (2), and solution (1) could still be done later if anyone really cares. The patch below implements solution (3). PS. For anyone out there still listening, it's the usual story: documenting an API (and the thinking about, and the testing of the API, that documentation entails) is the one of the single best ways of finding bugs in the API, as I've learned from a lot of experience. Best to do that documentation before releasing the API. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: stable@vger.kernel.org Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27ipc,msg: document barriersDavidlohr Bueso1-2/+17
Both expunge_all() and pipeline_send() rely on both a nil msg value and a full barrier to guarantee the correct ordering when waking up a task. While its counterpart at the receiving end is well documented for the lockless recv algorithm, we still need to document these specific smp_mb() calls. [akpm@linux-foundation.org: fix typo, per Mike] [akpm@linux-foundation.org: mroe tpyos] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>