path: root/fs
Age | Commit message | Author | Files | Lines
2018-08-24 | Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace | Linus Torvalds | 1 | -1/+1
Pull namespace fixes from Eric Biederman:
 "This is a set of four fairly obvious bug fixes:

   - a switch from d_find_alias to d_find_any_alias because the xattr code perversely takes a dentry

   - two mutex vs copy_to_user fixes from Jann Horn

   - a fix to use a sanitized size not the size userspace passed in from Christian Brauner"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  getxattr: use correct xattr length
  sys: don't hold uts_sem while accessing userspace memory
  userns: move user access out of the mutex
  cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
2018-08-23 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 8 | -17/+67
Merge yet more updates from Andrew Morton:

 - the rest of MM

 - various misc fixes and tweaks

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  mm: Change return type int to vm_fault_t for fault handlers
  lib/fonts: convert comments to utf-8
  s390: ebcdic: convert comments to UTF-8
  treewide: convert ISO_8859-1 text comments to utf-8
  drivers/gpu/drm/gma500/: change return type to vm_fault_t
  docs/core-api: mm-api: add section about GFP flags
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: split memory management API to a separate file
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: kill trailing whitespace in kernel-api.rst
  mm/util: add kernel-doc for kvfree
  mm/util: make strndup_user description a kernel-doc comment
  fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
  treewide: correct "differenciate" and "instanciate" typos
  fs/afs: use new return type vm_fault_t
  drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
  mm: soft-offline: close the race against page allocation
  mm: fix race on soft-offlining free huge pages
  namei: allow restricted O_CREAT of FIFOs and regular files
  hfs: prevent crash on exit from failed search
  ...
2018-08-23 | mm: Change return type int to vm_fault_t for fault handlers | Souptick Joarder | 1 | -4/+2
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type.

Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and handle_mm_fault() to vm_fault_t. As part of that cleanup, the return type of all other recursively called functions has been changed to vm_fault_t.

The places from which handle_mm_fault() is invoked will be changed to vm_fault_t, but in a separate patch.

vmf_error() is the inline function newly introduced in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
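
The difference between a VM_FAULT value and an errno can be sketched in plain userspace C. This is only an illustration: the typedef, the two VM_FAULT_* values, and the names vmf_error_sketch() and example_fault() are stand-ins chosen for the example, not the kernel's definitions.

```c
#include <errno.h>

/* Illustrative stand-ins; the kernel has its own vm_fault_t and VM_FAULT_* bits. */
typedef unsigned int vm_fault_t;
#define VM_FAULT_OOM    0x0001u
#define VM_FAULT_SIGBUS 0x0002u

/* Translate a negative errno into a fault code instead of leaking the errno. */
static vm_fault_t vmf_error_sketch(int err)
{
	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	return VM_FAULT_SIGBUS;
}

/* Hypothetical fault handler: its return value is always a fault code. */
static vm_fault_t example_fault(int alloc_result)
{
	if (alloc_result < 0)
		return vmf_error_sketch(alloc_result);
	return 0; /* success: no fault bits set */
}
```

Once the return type is distinct, a handler that accidentally returns -ENOMEM directly becomes something a checker can flag.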
2018-08-23 | fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds | Arnd Bergmann | 1 | -0/+2
Without CONFIG_MMU, we get a build warning:

  fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
   static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,

The function is only referenced from an #ifdef'ed caller, so this uses the same #ifdef around it.

Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a3d ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 | fs/afs: use new return type vm_fault_t | Souptick Joarder | 2 | -2/+3
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type.

See 1c8f422059ae ("mm: change return type to vm_fault_t") for reference.

Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 | namei: allow restricted O_CREAT of FIFOs and regular files | Salvatore Mesoraca | 1 | -3/+50
Disallows open of FIFOs or regular files not owned by the user in world writable sticky directories, unless the owner is the same as that of the directory or the file is opened without the O_CREAT flag. The purpose is to make data spoofing attacks harder. This protection can be turned on and off separately for FIFOs and regular files via sysctl, just like the symlinks/hardlinks protection. This patch is based on Openwall's "HARDEN_FIFO" feature by Solar Designer.

This is a brief list of old vulnerabilities that could have been prevented by this feature, some of them even allow for privilege escalation:

CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489

This list is not meant to be complete. It's difficult to track down all vulnerabilities of this kind because they were often reported without any mention of this particular attack vector. In fact, before hardlinks/symlinks restrictions, fifos/regular files weren't the favorite vehicle to exploit them.

[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subject]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
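
The rule in the first paragraph can be restated as a small userspace decision function. This is a simplified sketch of the policy as the message describes it, not the code in namei.c: struct open_request, enum file_kind, the mode constants, and check_sticky_create() are all hypothetical names made up for the illustration, and -EACCES is assumed as the refusal code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of one open() attempt; not a kernel structure. */
enum file_kind { KIND_FIFO, KIND_REGULAR, KIND_OTHER };

struct open_request {
	bool o_creat;            /* was O_CREAT passed? */
	unsigned int opener_uid; /* user performing the open */
	unsigned int file_uid;   /* owner of the existing file */
	enum file_kind file_kind;
	unsigned int dir_uid;    /* owner of the containing directory */
	unsigned int dir_mode;   /* permission bits of the directory */
	bool protect_fifos;      /* stand-ins for the two sysctl switches */
	bool protect_regular;
};

#define MODE_STICKY      01000u /* sticky bit */
#define MODE_WORLD_WRITE 00002u /* writable by others */

/* Returns 0 if the open may proceed, -EACCES if the described policy refuses it. */
static int check_sticky_create(const struct open_request *r)
{
	bool covered = (r->file_kind == KIND_FIFO && r->protect_fifos) ||
		       (r->file_kind == KIND_REGULAR && r->protect_regular);

	if (!r->o_creat || !covered)
		return 0;
	/* Only world-writable sticky directories are restricted. */
	if (!(r->dir_mode & MODE_STICKY) || !(r->dir_mode & MODE_WORLD_WRITE))
		return 0;
	/* Files owned by the opener or by the directory owner are fine. */
	if (r->file_uid == r->opener_uid || r->file_uid == r->dir_uid)
		return 0;
	return -EACCES;
}
```

The typical target is a program that does open(path, O_CREAT | ...) in /tmp and ends up opening a file an attacker planted there first.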
2018-08-23 | hfs: prevent crash on exit from failed search | Ernesto A. Fernández | 1 | -3/+4
hfs_find_exit() expects fd->bnode to be NULL after a search has failed. hfs_brec_insert() may instead set it to an error-valued pointer. Fix this to prevent a crash.

Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
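
Why an error-valued pointer is dangerous for a cleanup path that only tests for NULL can be sketched in userspace. The helpers below imitate the kernel's ERR_PTR convention under the assumption that the top 4095 address values encode errnos; struct find_data, store_node(), and cleanup_would_touch() are hypothetical and do not mirror the hfs code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer value, and detect such a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical search state whose cleanup only checks for NULL. */
struct find_data {
	void *bnode;
};

/* Store a lookup result, normalising errors to NULL before returning them. */
static int store_node(struct find_data *fd, void *node)
{
	if (is_err(node)) {
		fd->bnode = NULL; /* cleanup must never see the error pointer */
		return (int)(intptr_t)node;
	}
	fd->bnode = node;
	return 0;
}

/* Models the cleanup path: it dereferences bnode whenever it is non-NULL. */
static int cleanup_would_touch(const struct find_data *fd)
{
	return fd->bnode != NULL;
}
```

An error pointer is non-NULL, so storing it unmodified would make the cleanup path dereference an invalid address.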
2018-08-23 | hfsplus: prevent crash on exit from failed search | Ernesto A. Fernandez | 1 | -3/+4
hfs_find_exit() expects fd->bnode to be NULL after a search has failed. hfs_brec_insert() may instead set it to an error-valued pointer. Fix this to prevent a crash.

Link: http://lkml.kernel.org/r/803590a35221fbf411b2c141419aea3233a6e990.1530294813.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 | hfsplus: fix NULL dereference in hfsplus_lookup() | Ernesto A. Fernández | 1 | -2/+2
An HFS+ filesystem can be mounted read-only without having a metadata directory, which is needed to support hardlinks. But if the catalog data is corrupted, a directory lookup may still find dentries claiming to be hardlinks.

hfsplus_lookup() does check that ->hidden_dir is not NULL in such a situation, but mistakenly does so after dereferencing it for the first time. Reorder this check to prevent a crash.

This happens when looking up corrupted catalog data (dentry) on a filesystem with no metadata directory (this could only ever happen on a read-only mount). Wen Xu sent the replication steps in detail to the fsdevel list: https://bugzilla.kernel.org/show_bug.cgi?id=200297

Link: http://lkml.kernel.org/r/20180712215344.q44dyrhymm4ajkao@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 | Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs | Linus Torvalds | 26 | -161/+814
Pull NFS client updates from Anna Schumaker:
 "These patches include adding async support for the v4.2 COPY operation. I think Bruce is planning to send the server patches for the next release, but I figured we could get the client side out of the way now since it's been in my tree for a while. This shouldn't cause any problems, since the server will still respond with synchronous copies even if the client requests async.

  Features:
   - Add support for asynchronous server-side COPY operations

  Stable bugfixes:
   - Fix an off-by-one in bl_map_stripe() (v3.17+)
   - NFSv4 client live hangs after live data migration recovery (v4.9+)
   - xprtrdma: Fix disconnect regression (v4.18+)
   - Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
   - Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)

  Other bugfixes and cleanups:
   - Optimizations and fixes involving NFS v4.1 / pNFS layout handling
   - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
   - Immediately reschedule writeback when the server replies with an error
   - Fix excessive attribute revalidation in nfs_execute_ok()
   - Add error checking to nfs_idmap_prepare_message()
   - Use new vm_fault_t return type
   - Return a delegation when reclaiming one that the server has recalled
   - Referrals should inherit proto setting from parents
   - Make rpc_auth_create_args a const
   - Improvements to rpc_iostats tracking
   - Fix a potential reference leak when there is an error processing a callback
   - Fix rmdir / mkdir / rename nlink accounting
   - Fix updating inode change attribute
   - Fix error handling in nfs4_sp4_select_mode()
   - Use an appropriate work queue for direct-write completion
   - Don't busy wait if NFSv4 session draining is interrupted"

* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
  pNFS: Remove unwanted optimisation of layoutget
  pNFS/flexfiles: ff_layout_pg_init_read should exit on error
  pNFS: Treat RECALLCONFLICT like DELAY...
  pNFS: When updating the stateid in layoutreturn, also update the recall range
  NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
  NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
  NFSv4: Fix a typo in nfs4_init_channel_attrs()
  NFSv4: Don't busy wait if NFSv4 session draining is interrupted
  NFS recover from destination server reboot for copies
  NFS add a simple sync nfs4_proc_commit after async COPY
  NFS handle COPY ERR_OFFLOAD_NO_REQS
  NFS send OFFLOAD_CANCEL when COPY killed
  NFS export nfs4_async_handle_error
  NFS handle COPY reply CB_OFFLOAD call race
  NFS add support for asynchronous COPY
  NFS COPY xdr handle async reply
  NFS OFFLOAD_CANCEL xdr
  NFS CB_OFFLOAD xdr
  NFS: Use an appropriate work queue for direct-write completion
  NFSv4: Fix error handling in nfs4_sp4_select_mode()
  ...
2018-08-23 | Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux | Linus Torvalds | 16 | -103/+143
Pull nfsd updates from Bruce Fields:
 "Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from multi-homed servers.

  The only new feature is a minor bit of protocol (change_attr_type) which the client doesn't even use yet.

  Other than that, various bugfixes and cleanup"

* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
  sunrpc: Add comment defining gssd upcall API keywords
  nfsd: Remove callback_cred
  nfsd: Use correct credential for NFSv4.0 callback with GSS
  sunrpc: Extract target name into svc_cred
  sunrpc: Enable the kernel to specify the hostname part of service principals
  sunrpc: Don't use stack buffer with scatterlist
  rpc: remove unneeded variable 'ret' in rdma_listen_handler
  nfsd: use true and false for boolean values
  nfsd: constify write_op[]
  fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
  NFSD: Handle full-length symlinks
  NFSD: Refactor the generic write vector fill helper
  svcrdma: Clean up Read chunk path
  svcrdma: Avoid releasing a page in svc_xprt_release()
  nfsd: Mark expected switch fall-through
  sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
  nfsd: fix leaked file lock with nfs exported overlayfs
  nfsd: don't advertise a SCSI layout for an unsupported request_queue
  nfsd: fix corrupted reply to badly ordered compound
  nfsd: clarify check_op_ordering
  ...
2018-08-23 | Merge tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs | Linus Torvalds | 34 | -558/+724
Pull UBI/UBIFS updates from Richard Weinberger:

 - Year 2038 preparations

 - New UBI feature to skip CRC checks of static volumes

 - A new Kconfig option to disable xattrs in UBIFS

 - Lots of fixes in UBIFS, found by our new test framework

* tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
  ubifs: Set default assert action to read-only
  ubifs: Allow setting assert action as mount parameter
  ubifs: Rework ubifs_assert()
  ubifs: Pass struct ubifs_info to ubifs_assert()
  ubifs: Turn two ubifs_assert() into a WARN_ON()
  ubi: expose the volume CRC check skip flag
  ubi: provide a way to skip CRC checks
  ubifs: Use kmalloc_array()
  ubifs: Check data node size before truncate
  Revert "UBIFS: Fix potential integer overflow in allocation"
  ubifs: Add comment on c->commit_sem
  ubifs: introduce Kconfig symbol for xattr support
  ubifs: use swap macro in swap_dirty_idx
  ubifs: tnc: use monotonic znode timestamp
  ubifs: use timespec64 for inode timestamps
  ubifs: xattr: Don't operate on deleted inodes
  ubifs: gc: Fix typo
  ubifs: Fix memory leak in lprobs self-check
  ubi: Initialize Fastmap checkmapping correctly
  ubifs: Fix synced_i_size calculation for xattr inodes
  ...
2018-08-23 | getxattr: use correct xattr length | Christian Brauner | 1 | -1/+1
When running in a container with a user namespace, if you call getxattr with name = "system.posix_acl_access" and size % 8 != 4, then getxattr silently skips the user namespace fixup that it normally does, resulting in un-fixed-up data being returned.

This is caused by posix_acl_fix_xattr_to_user() being passed the total buffer size and not the actual size of the xattr as returned by vfs_getxattr().

This commit passes the actual length of the xattr as returned by vfs_getxattr() down.

A reproducer for the issue is:

  touch acl_posix

  setfacl -m user:0:rwx acl_posix

and then compile:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>
  #include <attr/xattr.h>

  /* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
  int main(int argc, char **argv)
  {
          ssize_t ret1, ret2;
          char buf1[128], buf2[132];
          int fret = EXIT_SUCCESS;
          char *file;

          if (argc < 2) {
                  fprintf(stderr,
                          "Please specify a file with "
                          "\"system.posix_acl_access\" permissions set\n");
                  _exit(EXIT_FAILURE);
          }
          file = argv[1];

          /* First call: buffer size 128, so size % 8 != 4. */
          ret1 = getxattr(file, "system.posix_acl_access",
                          buf1, sizeof(buf1));
          if (ret1 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          /* Second call: buffer size 132, so size % 8 == 4. */
          ret2 = getxattr(file, "system.posix_acl_access",
                          buf2, sizeof(buf2));
          if (ret2 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          if (ret1 != ret2) {
                  fprintf(stderr, "The value of \"system.posix_acl_"
                                  "access\" for file \"%s\" changed "
                                  "between two successive calls\n", file);
                  _exit(EXIT_FAILURE);
          }

          /* Both buffers must hold identical, fixed-up data. */
          for (ssize_t i = 0; i < ret2; i++) {
                  if (buf1[i] == buf2[i])
                          continue;

                  fprintf(stderr,
                          "Unexpected different in byte %zd: "
                          "%02x != %02x\n", i, buf1[i], buf2[i]);
                  fret = EXIT_FAILURE;
          }

          if (fret == EXIT_SUCCESS)
                  fprintf(stderr, "Test passed\n");
          else
                  fprintf(stderr, "Test failed\n");

          _exit(fret);
  }

and run:

  ./tester acl_posix

On a non-fixed up kernel this should return something like:

  root@c1:/# ./t
  Unexpected different in byte 16: ffffffa0 != 00
  Unexpected different in byte 17: ffffff86 != 00
  Unexpected different in byte 18: 01 != 00

and on a fixed kernel:

  root@c1:~# ./t
  Test passed

Cc: stable@vger.kernel.org
Fixes: 2f6f0654ab61 ("userns: Convert vfs posix_acl support to use kuids and kgids")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
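
The "size % 8 != 4" condition can be illustrated with a short size check. This is a sketch under stated assumptions: a 4-byte header followed by 8-byte entries is assumed for the xattr layout because it matches the condition quoted above, and acl_entry_count() is a hypothetical helper, not the kernel function.

```c
#include <stddef.h>

/* Assumed layout: 4-byte header, then 8-byte entries (matches "size % 8 != 4"). */
#define ACL_XATTR_HEADER_SIZE 4u
#define ACL_XATTR_ENTRY_SIZE  8u

/* Number of entries a blob of 'size' bytes holds, or -1 if the size is malformed. */
static int acl_entry_count(size_t size)
{
	if (size < ACL_XATTR_HEADER_SIZE)
		return -1;
	size -= ACL_XATTR_HEADER_SIZE;
	if (size % ACL_XATTR_ENTRY_SIZE)
		return -1; /* a fixup routine would bail out here */
	return (int)(size / ACL_XATTR_ENTRY_SIZE);
}
```

Fed the caller's buffer size of 128, such a check reports a malformed blob and the fixup is skipped; fed the length that was actually read, it sees a well-formed blob regardless of how large the caller's buffer is.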
2018-08-22 | nfsd: Remove callback_cred | Chuck Lever | 3 | -30/+2
Clean up: The global callback_cred is no longer used, so it can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 | nfsd: Use correct credential for NFSv4.0 callback with GSS | Chuck Lever | 1 | -1/+8
I've had trouble when operating a multi-homed Linux NFS server with Kerberos using NFSv4.0. Lately, I've seen my clients reporting this (and then hanging):

May 9 11:43:26 manet kernel: NFS: NFSv4 callback contains invalid cred

The client-side commit f11b2a1cfbf5 ("nfs4: copy acceptor name from context to nfs_client") appears to be related, but I suspect this problem has been going on for some time before that.

RFC 7530 Section 3.3.3 says:

> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database. This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...

In other words, an NFSv4.0 client expects that the server will use the same GSS principal for callback that the client used to establish its lease. For example, if the client used the service principal "nfs@server.domain" to establish its lease, the server is required to use "nfs@server.domain" when performing NFSv4.0 callback operations.

The Linux NFS server currently does not. It uses a common service principal for all callback connections. Sometimes this works as expected, and other times -- for example, when the server is accessible via multiple hostnames -- it won't work at all.

This patch scrapes the target name from the client credential, and uses that for the NFSv4.0 callback credential. That should be correct much more often.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 | sunrpc: Extract target name into svc_cred | Chuck Lever | 1 | -2/+5
NFSv4.0 callback needs to know the GSS target name the client used when it established its lease. That information is available from the GSS context created by gssproxy. Make it available in each svc_cred.

Note this will also give us access to the real target service principal name (which is typically "nfs", but spec does not require that).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 | Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs | Linus Torvalds | 18 | -527/+1645
Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've tuned f2fs to improve general performance by serializing block allocation and enhancing discard flows like fstrim which avoids user IO contention. And we've added fsync_mode=nobarrier which gives an option to user where it skips issuing cache_flush commands to underlying flash storage. And there are many bug fixes related to fuzzed images, revoked atomic writes, quota ops, and minor direct IO.

  Enhancements:
   - add fsync_mode=nobarrier which bypasses cache_flush command
   - enhance the discarding flow which avoids user IOs and issues in LBA order
   - readahead some encrypted blocks during GC
   - enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
   - enhance nat_bits behavior
   - set -o discard by default
   - set REQ_RAHEAD to bio in ->readpages

  Bug fixes:
   - fix a corner case to corrupt atomic_writes revoking flow
   - revisit i_gc_rwsem to fix race conditions
   - fix some dio behaviors captured by xfstests
   - correct handling errors given by quota-related failures
   - add many sanity check flows to avoid fuzz test failures
   - add more error number propagation to their callers
   - fix several corner cases to continue fault injection w/ shutdown loop"

* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
  f2fs: readahead encrypted block during GC
  f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
  f2fs: fix performance issue observed with multi-thread sequential read
  f2fs: fix to skip verifying block address for non-regular inode
  f2fs: rework fault injection handling to avoid a warning
  f2fs: support fault_type mount option
  f2fs: fix to return success when trimming meta area
  f2fs: fix use-after-free of dicard command entry
  f2fs: support discard submission error injection
  f2fs: split discard command in prior to block layer
  f2fs: wake up gc thread immediately when gc_urgent is set
  f2fs: fix incorrect range->len in f2fs_trim_fs()
  f2fs: refresh recent accessed nat entry in lru list
  f2fs: fix avoid race between truncate and background GC
  f2fs: avoid race between zero_range and background GC
  f2fs: fix to do sanity check with block address in main area v2
  f2fs: fix to do sanity check with inline flags
  f2fs: fix to reset i_gc_failures correctly
  f2fs: fix invalid memory access
  f2fs: fix to avoid broken of dnode block list
  ...
2018-08-22 | ovl: set I_CREATING on inode being created | Miklos Szeredi | 1 | -0/+4
...otherwise there will be list corruption due to inode_sb_list_add() being called for inode already on the sb list.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e950564b97fd ("vfs: don't evict uninitialized inode")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 50 | -1022/+921
Merge more updates from Andrew Morton:

 - the rest of MM
 - procfs updates
 - various misc things
 - more y2038 fixes
 - get_maintainer updates
 - lib/ updates
 - checkpatch updates
 - various epoll updates
 - autofs updates
 - hfsplus
 - some reiserfs work
 - fatfs updates
 - signal.c cleanups
 - ipc/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
  ipc/util.c: update return value of ipc_getref from int to bool
  ipc/util.c: further variable name cleanups
  ipc: simplify ipc initialization
  ipc: get rid of ids->tables_initialized hack
  lib/rhashtable: guarantee initial hashtable allocation
  lib/rhashtable: simplify bucket_table_alloc()
  ipc: drop ipc_lock()
  ipc/util.c: correct comment in ipc_obtain_object_check
  ipc: rename ipcctl_pre_down_nolock()
  ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
  ipc: reorganize initialization of kern_ipc_perm.seq
  ipc: compute kern_ipc_perm.id under the ipc lock
  init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
  fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
  adfs: use timespec64 for time conversion
  kernel/sysctl.c: fix typos in comments
  drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
  fork: don't copy inconsistent signal handler state to child
  signal: make get_signal() return bool
  signal: make sigkill_pending() return bool
  ...
2018-08-22 | fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp | Arnd Bergmann | 1 | -3/+3
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which returns a 64-bit timestamp.

In the SYSV file system, the superblock timestamp is only 32 bits wide, and it is used to check whether a file system is clean, so the best solution seems to be to force a wraparound and explicitly convert it to an unsigned 32-bit value.

This is independent of the inode timestamps that are also 32-bit wide on disk and that come from current_time().

Link: http://lkml.kernel.org/r/20180713145236.3152513-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
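
The forced wraparound amounts to an explicit narrowing conversion. A minimal userspace sketch, with sb_stamp_from_time64() as a hypothetical helper name and int64_t standing in for the kernel's 64-bit seconds type:

```c
#include <stdint.h>

/*
 * Reduce a 64-bit seconds value to the 32-bit on-disk width.
 * Conversion to an unsigned type is defined as reduction modulo 2^32,
 * so the wraparound is explicit and well defined.
 */
static uint32_t sb_stamp_from_time64(int64_t seconds)
{
	return (uint32_t)seconds;
}
```

Since the stamp is only compared against another 32-bit value, the bits that are dropped do not matter for that comparison.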
2018-08-22 | adfs: use timespec64 for time conversion | Arnd Bergmann | 1 | -7/+4
We just truncate the seconds to 32-bit in one place now, so this can trivially be converted over to using timespec64 consistently.

Link: http://lkml.kernel.org/r/20180620100133.4035614-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 | fat: propagate 64-bit inode timestamps | Arnd Bergmann | 6 | -48/+30
Now that we pass down 64-bit timestamps from VFS, we just need to convert that correctly into on-disk timestamps. To make that work correctly, this changes the last use of time_to_tm() in the kernel to time64_to_tm(), which also lets us remove that deprecated interface.

Similarly, the time_t use in fat_time_fat2unix() truncates the timestamp on the way in, which can be avoided by using types that are wide enough to hold the intermediate values during the conversion.

[hirofumi@mail.parknet.co.jp: remove useless temporary variable, needless long long]
Link: http://lkml.kernel.org/r/20180619153646.3637529-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 | fat: validate ->i_start before using | OGAWA Hirofumi | 3 | -10/+20
A corrupted FAT filesystem may have an invalid ->i_start. To handle this, check ->i_start before using it, and return a proper error code.

Link: http://lkml.kernel.org/r/87o9f8y1t5.fsf_-_@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
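
A range check of this kind might look like the sketch below. The assumptions are stated loudly: data clusters are taken to be numbered from 2 up to an exclusive max_cluster bound, -EIO is assumed as the error code, and check_start_cluster() is a hypothetical helper rather than the function the patch touches.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: cluster numbers 0 and 1 are reserved, data clusters start at 2. */
#define FIRST_DATA_CLUSTER 2u

/*
 * Validate a start cluster read from disk before following it.
 * Returns 0 if usable, -EIO (an assumed error code) if out of range.
 */
static int check_start_cluster(uint32_t i_start, uint32_t max_cluster)
{
	if (i_start < FIRST_DATA_CLUSTER || i_start >= max_cluster)
		return -EIO;
	return 0;
}
```

Rejecting the value up front keeps a corrupted image from steering later table lookups outside the allocation table.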
2018-08-22 | fat: add FITRIM ioctl for FAT file system | Wentao Wang | 3 | -0/+136
Add FITRIM ioctl for FAT file system

[witallwang@gmail.com: use u64s]
Link: http://lkml.kernel.org/r/87h8l37hub.fsf@mail.parknet.co.jp
[hirofumi@mail.parknet.co.jp: bug fixes, coding style fixes, add signal check]
Link: http://lkml.kernel.org/r/87fu10anhj.fsf@mail.parknet.co.jp
Signed-off-by: Wentao Wang <witallwang@gmail.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 | reiserfs: fix broken xattr handling (heap corruption, bad retval) | Jann Horn | 1 | -1/+3
This fixes the following issues:

 - When a buffer size is supplied to reiserfs_listxattr() such that each individual name fits, but the concatenation of all names doesn't fit, reiserfs_listxattr() overflows the supplied buffer. This leads to a kernel heap overflow (verified using KASAN) followed by an out-of-bounds usercopy and is therefore a security bug.

 - When a buffer size is supplied to reiserfs_listxattr() such that a name doesn't fit, -ERANGE should be returned. But reiserfs instead just truncates the list of names; I have verified that if the only xattr on a file has a longer name than the supplied buffer length, listxattr() incorrectly returns zero.

With my patch applied, -ERANGE is returned in both cases and the memory corruption doesn't happen anymore.

Credit for making me clean this code up a bit goes to Al Viro, who pointed out that the ->actor calling convention is suboptimal and should be changed.

Link: http://lkml.kernel.org/r/20180802151539.5373-1-jannh@google.com
Fixes: 48b32a3553a5 ("reiserfs: use generic xattr handlers")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
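
The two failure modes come down to comparing each name against the space that remains, not against the total buffer size. A userspace sketch of that check follows; struct list_buf and append_name() are hypothetical names, and treating a NULL buffer as a size query is an assumption made for the illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical accumulator for a NUL-separated list of names. */
struct list_buf {
	char *buf;   /* destination, or NULL to only count bytes */
	size_t size; /* total capacity of buf */
	size_t pos;  /* bytes used so far (pos <= size while buf is set) */
};

/* Append one name including its NUL, or fail with -ERANGE if it cannot fit. */
static int append_name(struct list_buf *lb, const char *name)
{
	size_t len = strlen(name) + 1;

	if (lb->buf) {
		/* Compare against the remaining space, not the whole buffer. */
		if (len > lb->size - lb->pos)
			return -ERANGE;
		memcpy(lb->buf + lb->pos, name, len);
	}
	lb->pos += len;
	return 0;
}
```

With an 8-byte buffer, two 7-byte names each fit individually but not together, so the second append reports -ERANGE instead of writing past the end.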
2018-08-22 | reiserfs: change j_timestamp type to time64_t | Arnd Bergmann | 1 | -1/+1
This uses the deprecated time_t type but is write-only, and could be removed, but as Jeff explains, having a timestamp can be useful for post-mortem analysis in crash dumps.

In order to remove one of the last instances of time_t, this changes the type to time64_t, same as j_trans_start_time.

Link: http://lkml.kernel.org/r/20180622133315.221210-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 | reiserfs: remove obsolete print_time function | Arnd Bergmann | 1 | -12/+4
Before linux-2.4.6, print_time() was used to pretty-print an inode time when running reiserfs in user space. Since then it has been obsolete, and it is also slightly incorrect: it behaves differently on 32-bit and 64-bit machines, and uses a static buffer to hold a string, which could lead to undefined behavior if we ever called this from multiple places simultaneously.

Since we always want to treat the timestamps as 'unsigned' anyway, simply printing them as an integer is both simpler and safer while avoiding the deprecated time_t type.

Link: http://lkml.kernel.org/r/20180620142522.27639-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 | reiserfs: use monotonic time for j_trans_start_time | Arnd Bergmann | 3 | -14/+21
Using CLOCK_REALTIME time_t timestamps breaks on 32-bit systems in 2038, and gives surprising results with a concurrent settimeofday().

This changes the reiserfs journal timestamps to use ktime_get_seconds() instead, which makes it use a 64-bit CLOCK_MONOTONIC stamp.

In the procfs output, the monotonic timestamp needs to be converted back to CLOCK_REALTIME to keep the existing ABI.

Link: http://lkml.kernel.org/r/20180620142522.27639-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
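
Converting a stored monotonic stamp back to wall-clock time only needs the current reading of both clocks. The helper below is a sketch of that arithmetic; mono_stamp_to_real() and its parameter names are hypothetical, and how the patch itself performs the conversion is not shown in the message.

```c
#include <stdint.h>

/*
 * Convert a stamp taken on the monotonic clock into wall-clock seconds.
 * The stamp's age is (now_mono - stamp_mono); subtracting that age from
 * the current wall-clock time gives when the stamp was taken.
 */
static int64_t mono_stamp_to_real(int64_t stamp_mono, int64_t now_mono,
				  int64_t now_real)
{
	return now_real - (now_mono - stamp_mono);
}
```

Because the stamp itself never depends on the wall clock, a concurrent clock change only shifts the reported value and cannot make stored stamps run backwards relative to each other.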
2018-08-22hfsplus: drop ACL supportErnesto A. Fernández11-232/+4
The HFS+ Access Control Lists have not worked at all for the past five years, and nobody seems to have noticed. Besides, POSIX draft ACLs are not compatible with MacOS. Drop the feature entirely. Link: http://lkml.kernel.org/r/20180714190608.wtnmmtjqeyladkut@eaf Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22hfsplus: fix decomposition of Hangul charactersErnesto A. Fernández1-6/+56
Files created under macOS cannot be opened under Linux if their names contain Korean characters, and vice versa. The Korean alphabet is special because its normalization is done without a table. The module deals with it correctly when composing, but forgets about it for the decomposition. Fix this using the Hangul decomposition function provided in the Unicode Standard. The code fits a bit awkwardly because it requires a buffer, while all the other normalizations are returned as pointers to the decomposition table. This is actually also a bug because reordering may still be needed, but for now leave it as it is. The patch will cause trouble for Hangul filenames already created by the module in the past. This shouldn't really be a concern because its main purpose was always sharing with macOS. If a user actually needs to access such a file, the nodecompose mount option should be enough. Link: http://lkml.kernel.org/r/20180717220951.p6qqrgautc4pxvzu@eaf Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Reported-by: Ting-Chang Hou <tchou@synology.com> Tested-by: Ting-Chang Hou <tchou@synology.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
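For reference, the table-free Hangul decomposition from the Unicode Standard can be sketched as follows. This is a stand-alone illustration, not the code added to fs/hfsplus: the function name hangul_decompose() and the enum names are hypothetical, while the numeric constants are the ones the standard defines for precomposed syllables. It also shows why a caller-supplied buffer is needed, since the result is computed rather than looked up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants for arithmetic Hangul syllable decomposition (Unicode Standard). */
enum {
    S_BASE  = 0xAC00,             /* first precomposed syllable */
    L_BASE  = 0x1100,             /* first leading consonant */
    V_BASE  = 0x1161,             /* first vowel */
    T_BASE  = 0x11A7,             /* one before the first trailing consonant */
    L_COUNT = 19,
    V_COUNT = 21,
    T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,  /* 588 */
    S_COUNT = L_COUNT * N_COUNT   /* 11172 */
};

/*
 * Decompose one precomposed Hangul syllable into its jamo.
 * Writes 2 or 3 code units into out[] and returns how many were written,
 * or 0 if cp is not a precomposed Hangul syllable.
 */
static size_t hangul_decompose(uint32_t cp, uint16_t out[3])
{
    uint32_t s_index, t_index;

    assert(out != NULL);
    if (cp < S_BASE || cp >= S_BASE + S_COUNT)
        return 0;

    s_index = cp - S_BASE;
    t_index = s_index % T_COUNT;

    out[0] = (uint16_t)(L_BASE + s_index / N_COUNT);
    out[1] = (uint16_t)(V_BASE + (s_index % N_COUNT) / T_COUNT);
    if (t_index == 0)
        return 2;                 /* syllable has no trailing consonant */

    out[2] = (uint16_t)(T_BASE + t_index);
    return 3;
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB, and U+AC00 to U+1100 U+1161.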
2018-08-22hfsplus: avoid deadlock on file truncationErnesto A. Fernández1-4/+14
After an extent is removed from the extent tree, the corresponding bits are also cleared from the block allocation file. This is currently done without releasing the tree lock. The problem is that the allocation file has extents of its own; if it is fragmented enough, some of them may be in the extent tree as well, and hfsplus_get_block() will try to take the lock again. To avoid deadlock, only hold the extent tree lock during the actual tree operations. Link: http://lkml.kernel.org/r/20180709202549.auxwkb6memlegb4a@eaf Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22hfsplus: don't return 0 when fill_super() failedTetsuo Handa1-1/+3
syzbot is reporting NULL pointer dereference at mount_fs() [1]. This is because hfsplus_fill_super() is by error returning 0 when hfsplus_fill_super() detected invalid filesystem image, and mount_bdev() is returning NULL because dget(s->s_root) == NULL if s->s_root == NULL, and mount_fs() is accessing root->d_sb because IS_ERR(root) == false if root == NULL. Fix this by returning -EINVAL when hfsplus_fill_super() detected invalid filesystem image. [1] https://syzkaller.appspot.com/bug?id=21acb6850cecbc960c927229e597158cf35f33d0 Link: http://lkml.kernel.org/r/d83ce31a-874c-dd5b-f790-41405983a5be@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+01ffaf5d9568dd1609f7@syzkaller.appspotmail.com> Reviewed-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
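The failure mode above comes from the error-pointer convention: a NULL pointer is not an error pointer, so an IS_ERR()-style check lets it through. The following user-space sketch re-creates that convention to show the effect. All names here (err_ptr(), is_err(), finish_mount(), safe_to_dereference()) are hypothetical stand-ins, not the kernel's own functions, and the sketch assumes the usual layout in which small negative errno values map to the top of the address space.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer (stand-in for ERR_PTR()). */
static void *err_ptr(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* True only for encoded errors (stand-in for IS_ERR()); NULL is NOT an error. */
static int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
}

/*
 * Hypothetical mount step: if fill_super reported a failure, hand back an
 * error pointer; otherwise hand back whatever root it set up, which is
 * NULL when fill_super wrongly returned 0 without creating a root.
 */
static void *finish_mount(int fill_super_rc, void *root)
{
    if (fill_super_rc)
        return err_ptr(fill_super_rc);
    return root;
}

/* A caller that only checks is_err() would dereference NULL; check both. */
static int safe_to_dereference(const void *root)
{
    return root != NULL && !is_err(root);
}
```

With this convention, returning -EINVAL from the fill_super step is what turns the invalid image into a value the caller's error check actually catches.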
2018-08-22fs/nilfs2/file.c: use new return type vm_fault_tSouptick Joarder1-1/+1
Use new return type vm_fault_t for page_mkwrite handler. Link: http://lkml.kernel.org/r/1529555928-2411-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22nilfs2: use 64-bit superblock timestampsArnd Bergmann1-1/+1
The mount time field in the superblock uses a 64-bit timestamp, but calling get_seconds() may truncate the current time to 32 bits. This changes it to ktime_get_real_seconds() to avoid the potential overflow. Link: http://lkml.kernel.org/r/20180620075041.4154396-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Layton <jlayton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
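The truncation concern can be shown with plain integer arithmetic. This is a minimal sketch, assuming the current time passes through a 32-bit unsigned value before being stored in the 64-bit field; the function name through_32bit() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a seconds value squeezed through a 32-bit unsigned
 * type before landing in a 64-bit on-disk field: everything above
 * bit 31 is silently dropped.
 */
static uint64_t through_32bit(int64_t seconds)
{
    assert(seconds >= 0);
    return (uint32_t)seconds;
}
```

Values up to 0xFFFFFFFF survive unchanged, but 0x100000000 + 5 comes back as 5, which is why reading the time through a 64-bit interface matters for a 64-bit field.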
2018-08-22autofs: add AUTOFS_EXP_FORCED flagIan Kent1-12/+50
The userspace automount(8) daemon is meant to perform a forced expire when sent a SIGUSR2. But since the expiration is routed through the kernel, and the kernel doesn't send an expire request if the mount is busy, this hasn't worked at least since autofs version 5. Add an AUTOFS_EXP_FORCED flag to allow implementation of the feature and bump the protocol version so user space can check if it's implemented if needed. Link: http://lkml.kernel.org/r/152937734715.21213.6594007182776598970.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22autofs: make expire flags usage consistent with v5 paramsIan Kent2-34/+29
Make the usage of the expire flags consistent by naming the expire flags the same as they are named in the version 5 miscellaneous ioctl parameters, and only check the bit flags when needed. Link: http://lkml.kernel.org/r/152937734046.21213.9454131988766280028.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22autofs: make autofs_expire_indirect() staticIan Kent2-7/+4
autofs_expire_indirect() isn't used outside of fs/autofs/expire.c so make it static. Link: http://lkml.kernel.org/r/152937733512.21213.10509996499623738446.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22autofs: make autofs_expire_direct() staticIan Kent2-7/+4
autofs_expire_direct() isn't used outside of fs/autofs/expire.c so make it static. Link: http://lkml.kernel.org/r/152937732944.21213.11821977712410930973.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22autofs: fix clearing AUTOFS_EXP_LEAVES in autofs_expire_indirect()Ian Kent1-1/+1
The expire flag AUTOFS_EXP_LEAVES is cleared before the second call to should_expire() in autofs_expire_indirect() but the parameter passed in the second call is incorrect. Fortunately AUTOFS_EXP_LEAVES expire flag has not been used for a long time but might be needed in the future so fix it rather than remove the expire leaves functionality. Link: http://lkml.kernel.org/r/152937732410.21213.7447294898147765076.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22autofs: fix inconsistent use of now variableIan Kent1-7/+3
The global variable "now" in fs/autofs/expire.c is used in an inconsistent way, sometimes using jiffies directly, and sometimes using the "now" variable, and setting it isn't done consistently either. But the autofs dentry info last_used field is only updated during path walks or during expire so jiffies can be used directly and the global variable "now" removed. Link: http://lkml.kernel.org/r/152937731702.21213.7371321165189170865.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22autofs: fix directory and symlink accessIan Kent1-3/+30
Depending on how it is configured, the autofs user space daemon can leave in-use mounts mounted at exit and re-connect to them at start up. But for this to work best, the state of the autofs file system needs to be left intact over the restart. Also, at system shutdown, mounts in an autofs file system might be umounted, exposing a mount point trigger for which subsequent access can lead to a hang. So recent versions of automount(8) now do their best to set autofs file system mounts catatonic at shutdown. When autofs file system mounts are catatonic it's currently possible to create and remove directories and symlinks, which can be a problem at restart, as described above. So return EACCES in the directory, symlink and unlink methods if the autofs file system is catatonic. Link: http://lkml.kernel.org/r/152902119090.4144.9561910674530214291.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22fs/eventpoll.c: simplify ep_is_linked() callersDavidlohr Bueso1-8/+8
Instead of having each caller pass the rdllink explicitly, just have ep_is_linked() pass it while the callers just need the epi pointer. This helper is all about the rdllink, and this change, furthermore, improves the function's self documentation. Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22fs/eventpoll.c: loosen irq safety in ep_poll()Davidlohr Bueso1-6/+7
Similar to other calls, ep_poll() is not called with interrupts disabled, and we can therefore avoid the irq save/restore dance and just disable local irqs. In fact, the call should never be called in irq context at all, considering that the only path is epoll_wait(2) -> do_epoll_wait() -> ep_poll(). When running a common pipe-based epoll_wait(2) microbenchmark on a 2-socket 40-core (ht) IvyBridge, the following performance improvements are seen:

# threads   vanilla     dirty
    1       1805587   2106412
    2       1854064   2090762
    4       1805484   2017436
    8       1751222   1974475
   16       1725299   1962104
   32       1378463   1571233
   64        787368    900784

That is fairly consistently close to 15%. Also add a lockdep check such that we detect any mischief before deadlocking. Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22fs/eventpoll.c: simplify CONFIG_NET_RX_BUSY_POLL ifdeferyDavidlohr Bueso1-7/+16
... 'tis easier on the eye. [akpm@linux-foundation.org: use inlines rather than macros] Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22fs/epoll: robustify irq safety with lockdep_assert_irqs_enabled()Davidlohr Bueso1-0/+8
Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not save and restore interrupts when dealing with the ep->wq.lock. These are ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify and ep_remove. [akpm@linux-foundation.org: remove too-obvious comments] Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5 Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22fs/epoll: loosen irq safety in epoll_insert() and epoll_remove()Davidlohr Bueso1-8/+6
Both functions are similar to the context of ep_modify(), called via epoll_ctl(2). Just like ep_modify(), saving and restoring interrupts is an overkill in these calls as it will never be called with irqs disabled. While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also be called when releasing the file, but this also complies with the above. Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22fs/epoll: loosen irq safety in ep_scan_ready_list()Davidlohr Bueso1-5/+4
Patch series "fs/epoll: loosen irq safety when possible". Both patches replace saving+restoring interrupts when taking the ep->lock (now the waitqueue lock), with just disabling local irqs. This shows immediate performance benefits in patch 1 for an epoll workload running on Xen. The main concern we need to have with this sort of change in epoll is the ep_poll_callback(), which is passed to the wait queue wakeup and is done very often under irq context; this patch does not touch this call. Patches have been tested pretty heavily with the customer workload, microbenchmarks, ltp testcases and two high level workloads that use epoll under the hood: nginx and libevent benchmarks.

This patch (of 2): Saving and restoring interrupts in ep_scan_ready_list() is overkill as it is never called with interrupts disabled. Loosen this to simply disabling local irqs, such that archs where managing irqs is expensive, or virtual environments, benefit. This patch yields some throughput improvements on a workload that is epoll intensive running on a single Xen DomU.

1 Job    7500 -->  8800 enq/s (+17%)
2 Jobs  14000 --> 15200 enq/s (+8%)
3 Jobs  20500 --> 22300 enq/s (+8%)
4 Jobs  25000 --> 28000 enq/s (+8-12)%

On bare metal: the following was run on a 2-socket 40-core (ht) IvyBridge with a few workloads. Unfortunately I don't have a Xen environment, and for the Xen results I do have (the numbers in patch 1) I don't have the actual workload, so I cannot compare them directly.

1) Different configurations were used for an epoll_wait (pipes io) microbench (http://linux-scalability.org/epoll/epoll-test.c), which shows around a 7-10% improvement in the overall total number of times the epoll_wait() loops when using both regular and nested epolls; very raw numbers, but measurable nonetheless.

# threads   vanilla     dirty
    1       1677717   1805587
    2       1660510   1854064
    4       1610184   1805484
    8       1577696   1751222
   16       1568837   1725299
   32       1291532   1378463
   64        752584    787368

Note that stddev is pretty small.

2) Another pipe test, which shows no real measurable improvement. (http://www.xmailserver.org/linux-patches/pipetest.c)

Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22userfaultfd: use fault_wqh lockMatthew Wilcox1-3/+3
The userfaultfd code currently uses the unlocked waitqueue helpers for managing fault_wqh, but instead of holding the waitqueue lock for this waitqueue around these calls, it uses the waitqueue lock of fault_pending_wqh, which is a different waitqueue instance. Given that the waitqueue is not exposed to the rest of the kernel this actually works ok at the moment, but prevents the userfaultfd locking rules from being enforced using lockdep. Switch to the internally locked waitqueue helpers instead. This means that the lock inside fault_wqh now nests inside the fault_pending_wqh lock, but that's not a problem since it was entirely unused before. [hch@lst.de: slight changelog updates] [rppt@linux.vnet.ibm.com: spotted changelog spellos] Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22epoll: use the waitqueue lock to protect ep->wqChristoph Hellwig1-36/+29
Patch series "waitqueue lockdep annotation", v3. This series adds a strategic lockdep_assert_held to __wake_up_common to ensure callers really do hold the wait_queue_head lock when calling the unlocked wake_up variants. It turns out epoll did not do this for a fairly common path (hit all the time by systemd during bootup), so the second patch fixed this instance as well. This patch (of 3): The epoll code currently uses the unlocked waitqueue helpers for managing ep->wq, but instead of holding the waitqueue lock around these calls, it uses its own ep->lock spinlock. Given that the waitqueue is not exposed to the rest of the kernel this actually works ok at the moment, but prevents the epoll locking rules from being enforced using lockdep. Remove ep->lock and use the waitqueue lock to not only reduce the size of struct eventpoll but also to make sure we can assert locking invariants in the waitqueue code. Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Baron <jbaron@akamai.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: add vmcoreinfo note to /proc/kcoreOmar Sandoval2-2/+17
The vmcoreinfo information is useful for runtime debugging tools, not just for crash dumps. A lot of this information can be determined by other means, but this is much more convenient, and it only adds a page at most to the file. Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>