aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-03-05signal: add pidfd_send_signal() syscallChristian Brauner1-0/+9
The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. With this patch a process can be killed via: #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Aleksa Sarai <cyphar@cyphar.com>
2019-01-26Merge tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds10-67/+136
Pull smb3 fixes from Steve French: "A set of small smb3 fixes, some fixing various crediting issues discovered during xfstest runs, five for stable" * tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: print CIFSMaxBufSize as part of /proc/fs/cifs/DebugData smb3: add credits we receive from oplock/break PDUs CIFS: Fix mounts if the client is low on credits CIFS: Do not assume one credit for async responses CIFS: Fix credit calculations in compound mid callback CIFS: Fix credit calculation for encrypted reads with errors CIFS: Fix credits calculations for reads with errors CIFS: Do not reconnect TCP session in add_credits() smb3: Cleanup license mess CIFS: Fix possible hang during async MTU reads and writes cifs: fix memory leak of an allocated cifs_ntsd structure
2019-01-26Merge tag 'for-linus-20190125' of git://git.kernel.dk/linux-blockLinus Torvalds2-4/+41
Pull block fixes from Jens Axboe: "A collection of fixes for this release. This contains: - Silence sparse rightfully complaining about non-static wbt functions (Bart) - Fixes for the zoned comments/ioctl documentation (Damien) - direct-io fix that's been lingering for a while (Ernesto) - cgroup writeback fix (Tejun) - Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju) - Block recursion tracking fix (Ming) - Fix debugfs command flag naming for a few flags (Jianchao)" * tag 'for-linus-20190125' of git://git.kernel.dk/linux-block: block: Fix comment typo uapi: fix ioctl documentation blk-wbt: Declare local functions static blk-mq: fix the cmd_flag_name array nvme-multipath: drop optimization for static ANA group IDs nvmet-rdma: fix null dereference under heavy load nvme-rdma: rework queue maps handling nvme-tcp: fix timeout handler nvme-rdma: fix timeout handler writeback: synchronize sync(2) against cgroup writeback membership switches block: cover another queue enter recursion via BIO_QUEUE_ENTERED direct-io: allow direct writes to empty inodes
2019-01-24cifs: print CIFSMaxBufSize as part of /proc/fs/cifs/DebugDataRonnie Sahlberg1-0/+1
Was helpful in debug for some recent problems. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24smb3: add credits we receive from oplock/break PDUsRonnie Sahlberg1-0/+7
Otherwise we gradually leak credits leading to potential hung session. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix mounts if the client is low on creditsPavel Shilovsky1-0/+17
If the server doesn't grant us at least 3 credits during the mount we won't be able to complete it because query path info operation requires 3 credits. Use the cached file handle if possible to allow the mount to succeed. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Do not assume one credit for async responsesPavel Shilovsky1-4/+11
If we don't receive a response we can't assume that the server granted one credit. Assume zero credits in such cases. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix credit calculations in compound mid callbackPavel Shilovsky2-11/+6
The current code doesn't do proper accounting for credits in SMB1 case: it adds one credit per response only if we get a complete response while it needs to return it unconditionally. Fix this and also include malformed responses for SMB2+ into accounting for credits because such responses have Credit Granted field, thus nothing prevents to get a proper credit value from them. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix credit calculation for encrypted reads with errorsPavel Shilovsky1-10/+14
We do need to account for credits received in error responses to read requests on encrypted sessions. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix credits calculations for reads with errorsPavel Shilovsky1-12/+23
Currently we mark MID as malformed if we get an error from server in a read response. This leads to not properly processing credits in the readv callback. Fix this by marking such a response as normal received response and process it appropriately. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Do not reconnect TCP session in add_credits()Pavel Shilovsky2-7/+46
When executing add_credits() we currently call cifs_reconnect() if the number of credits is zero and there are no requests in flight. In this case we may call cifs_reconnect() recursively twice and cause memory corruption given the following sequence of functions: mid1.callback() -> add_credits() -> cifs_reconnect() -> -> mid2.callback() -> add_credits() -> cifs_reconnect(). Fix this by avoiding to call cifs_reconnect() in add_credits() and checking for zero credits in the demultiplex thread. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-25Merge tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fsLinus Torvalds1-2/+4
Pull inotify fix from Jan Kara: "Fix a file refcount leak in an inotify error path" * tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: inotify: Fix fd refcount leak in inotify_add_watch().
2019-01-24smb3: Cleanup license messThomas Gleixner2-20/+0
Precise and non-ambiguous license information is important. The recently added aegis header file has a SPDX license identifier, which is nice, but at the same time it has a contradictionary license boiler plate text. SPDX-License-Identifier: GPL-2.0 versus * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. Oh well. Assuming that the SPDX identifier is correct and according to x86/hyper-v contributions from Microsoft GPL V2 only is the usual license. Remove the boiler plate as it is wrong and even if correct it is redundant. Fixes: eccb4422cf97 ("smb3: Add ftrace tracepoints for improved SMB3 debugging") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steve French <sfrench@samba.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix possible hang during async MTU reads and writesPavel Shilovsky1-3/+3
When doing MTU i/o we need to leave some credits for possible reopen requests and other operations happening in parallel. Currently we leave 1 credit which is not enough even for reopen only: we need at least 2 credits if durable handle reconnect fails. Also there may be other operations at the same time including compounding ones which require 3 credits at a time each. Fix this by leaving 8 credits which is big enough to cover most scenarios. Was able to reproduce this when server was configured to give out fewer credits than usual. The proper fix would be to reconnect a file handle first and then obtain credits for an MTU request but this leads to bigger code changes and should happen in other patches. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24cifs: fix memory leak of an allocated cifs_ntsd structureColin Ian King1-0/+8
The call to SMB2_queary_acl can allocate memory to pntsd and also return a failure via a call to SMB2_query_acl (and then query_info). This occurs when query_info allocates the structure and then in query_info the call to smb2_validate_and_copy_iov fails. Currently the failure just returns without kfree'ing pntsd hence causing a memory leak. Currently, *data is allocated if it's not already pointing to a buffer, so it needs to be kfree'd only if was allocated in query_info, so the fix adds an allocated flag to track this. Also set *dlen to zero on an error just to be safe since *data is kfree'd. Also set errno to -ENOMEM if the allocation of *data fails. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Dan Carpener <dan.carpenter@oracle.com>
2019-01-22writeback: synchronize sync(2) against cgroup writeback membership switchesTejun Heo1-2/+38
sync_inodes_sb() can race against cgwb (cgroup writeback) membership switches and fail to writeback some inodes. For example, if an inode switches to another wb while sync_inodes_sb() is in progress, the new wb might not be visible to bdi_split_work_to_wbs() at all or the inode might jump from a wb which hasn't issued writebacks yet to one which already has. This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb switch path against sync_inodes_sb() so that sync_inodes_sb() is guaranteed to see all the target wbs and inodes can't jump wbs to escape syncing. v2: Fixed misplaced rwsem init. Spotted by Jiufei. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jiufei Xue <xuejiufei@gmail.com> Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-22direct-io: allow direct writes to empty inodesErnesto A. Fernández1-2/+3
On a DIO_SKIP_HOLES filesystem, the ->get_block() method is currently not allowed to create blocks for an empty inode. This confusion comes from trying to bit shift a negative number, so check the size of the inode first. The problem is most visible for hfsplus, because the fallback to buffered I/O doesn't happen and the write fails with EIO. This is in part the fault of the module, because it gives a wrong return value on ->get_block(); that will be fixed in a separate patch. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-21ceph: quota: cleanup license messThomas Gleixner1-13/+0
Precise and non-ambiguous license information is important. The recently added quota.c file has a SPDX license identifier, which is nice, but at the same time it has a contradictionary license boiler plate text. SPDX-License-Identifier: GPL-2.0 versus * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. Oh well. As the other ceph related files are licensed under the GPL v2 only, it's assumed that the SPDX id is correct and the boiler plate was randomly copied into that patch. Remove the boiler plate as it is wrong and even if correct it is redundant. Fixes: fb18a57568c2 ("ceph: quota: add initial infrastructure to support cephfs quotas") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Luis Henriques <lhenriques@suse.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Sage Weil <sage@redhat.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: ceph-devel@vger.kernel.org Acked-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-21ceph: clear inode pointer when snap realm gets dropped by its inodeYan, Zheng1-0/+2
snap realm and corresponding inode have pointers to each other. The two pointer should get clear at the same time. Otherwise, snap realm's pointer may reference freed inode. Cc: stable@vger.kernel.org # 4.17+ Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-21Merge tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds1-8/+4
Pull pstore fixes from Kees Cook: - Fix console ramoops to show the previous boot logs (Sai Prakash Ranjan) - Avoid allocation and leak of platform data * tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Avoid allocation and leak of platform data pstore/ram: Fix console ramoops to show the previous boot logs
2019-01-20pstore/ram: Avoid allocation and leak of platform dataKees Cook1-6/+3
Yue Hu noticed that when parsing device tree the allocated platform data was never freed. Since it's not used beyond the function scope, this switches to using a stack variable instead. Reported-by: Yue Hu <huyue2@yulong.com> Fixes: 35da60941e44 ("pstore/ram: add Device Tree bindings") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2019-01-21Merge tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds4-10/+35
Pull btrfs fixes from David Sterba: "A handful of fixes (some of them in testing for a long time): - fix some test failures regarding cleanup after transaction abort - revert of a patch that could cause a deadlock - delayed iput fixes, that can help in ENOSPC situation when there's low space and a lot data to write" * tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: wakeup cleaner thread when adding delayed iput btrfs: run delayed iputs before committing btrfs: wait on ordered extents on abort cleanup btrfs: handle delayed ref head accounting cleanup in abort Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"
2019-01-20Merge tag 'nfs-for-5.0-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-7/+1
Pull NFS client fixes from Anna Schumaker: "These are mostly fixes for SUNRPC bugs, with a single v4.2 copy_file_range() fix mixed in. Stable bugfixes: - Fix TCP receive code on archs with flush_dcache_page() Other bugfixes: - Fix error code in rpcrdma_buffer_create() - Fix a double free in rpcrdma_send_ctxs_create() - Fix kernel BUG at kernel/cred.c:825 - Fix unnecessary retry in nfs42_proc_copy_file_range() - Ensure rq_bytes_sent is reset before request transmission - Ensure we respect the RPCSEC_GSS sequence number limit - Address Kerberos performance/behavior regression" * tag 'nfs-for-5.0-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC: Address Kerberos performance/behavior regression SUNRPC: Ensure we respect the RPCSEC_GSS sequence number limit SUNRPC: Ensure rq_bytes_sent is reset before request transmission NFSv4.2 fix unnecessary retry in nfs4_copy_file_range sunrpc: kernel BUG at kernel/cred.c:825! SUNRPC: Fix TCP receive code on archs with flush_dcache_page() xprtrdma: Double free in rpcrdma_sendctxs_create() xprtrdma: Fix error code in rpcrdma_buffer_create()
2019-01-20Merge tag 'for-linus-20190118' of git://git.kernel.dk/linux-blockLinus Torvalds1-10/+18
Pull block fixes from Jens Axboe: - block size setting fixes for loop/nbd (Jan Kara) - md bio_alloc_mddev() cleanup (Marcos) - Ensure we don't lose the REQ_INTEGRITY flag (Ming) - Two NVMe fixes by way of Christoph: - Fix NVMe IRQ calculation (Ming) - Uninitialized variable in nvmet-tcp (Sagi) - BFQ comment fix (Paolo) - License cleanup for recently added blk-mq-debugfs-zoned (Thomas) * tag 'for-linus-20190118' of git://git.kernel.dk/linux-block: block: Cleanup license notice nvme-pci: fix nvme_setup_irqs() nvmet-tcp: fix uninitialized variable access block: don't lose track of REQ_INTEGRITY flag blockdev: Fix livelocks on loop device nbd: Use set_blocksize() to set device blocksize md: Make bio_alloc_mddev use bio_alloc_bioset block, bfq: fix comments on __bfq_deactivate_entity
2019-01-18btrfs: wakeup cleaner thread when adding delayed iputJosef Bacik3-0/+8
The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. Delaying iputs means we are potentially delaying the eviction of an inode and it's respective space. The cleaner thread only gets woken up every 30 seconds, or when we require space. If there are a lot of inodes that need to be deleted we could induce a serious amount of latency while we wait for these inodes to be evicted. So instead wakeup the cleaner if it's not already awake to process any new delayed iputs we add to the list. If we suddenly need space we will less likely be backed up behind a bunch of inodes that are waiting to be deleted, and we could possibly free space before we need to get into the flushing logic which will save us some latency. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18btrfs: run delayed iputs before committingJosef Bacik1-0/+9
Delayed iputs means we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be free'd. So before we decide to commit the transaction for ENOPSC reasons, run the delayed iputs so that any potential space is free'd up. If there is and we freed enough we can then commit the transaction and potentially be able to make our reservation. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18btrfs: wait on ordered extents on abort cleanupJosef Bacik1-0/+8
If we flip read-only before we initiate writeback on all dirty pages for ordered extents we've created then we'll have ordered extents left over on umount, which results in all sorts of bad things happening. Fix this by making sure we wait on ordered extents if we have to do the aborted transaction cleanup stuff. generic/475 can produce this warning: [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs] [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G W 5.0.0-rc1-default+ #394 [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs] [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286 [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000 [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000 [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000 [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0 [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000 [ 8531.202707] FS: 00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000 [ 8531.204851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0 [ 8531.207516] Call Trace: [ 8531.208175] btrfs_free_fs_roots+0xdb/0x170 [btrfs] [ 8531.210209] ? wait_for_completion+0x5b/0x190 [ 8531.211303] close_ctree+0x157/0x350 [btrfs] [ 8531.212412] generic_shutdown_super+0x64/0x100 [ 8531.213485] kill_anon_super+0x14/0x30 [ 8531.214430] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8531.215539] deactivate_locked_super+0x29/0x60 [ 8531.216633] cleanup_mnt+0x3b/0x70 [ 8531.217497] task_work_run+0x98/0xc0 [ 8531.218397] exit_to_usermode_loop+0x83/0x90 [ 8531.219324] do_syscall_64+0x15b/0x180 [ 8531.220192] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8531.221286] RIP: 0033:0x7fc68e5e4d07 [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6 [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07 [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80 [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80 [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80 [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000 Leaving a tree in the rb-tree: 3853 void btrfs_free_fs_root(struct btrfs_root *root) 3854 { 3855 iput(root->ino_cache_inode); 3856 WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); CC: stable@vger.kernel.org Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add stacktrace ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18btrfs: handle delayed ref head accounting cleanup in abortJosef Bacik3-7/+10
We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. The test generic/475 on a 2G VM can trigger the problems eg.: [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs] [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394 [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014 [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs] [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206 [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000 [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400 [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000 [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108 [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100 [ 8502.170296] FS: 00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000 [ 8502.172439] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0 [ 8502.175094] Call Trace: [ 8502.175759] close_ctree+0x17f/0x350 [btrfs] [ 8502.176721] generic_shutdown_super+0x64/0x100 [ 8502.177702] kill_anon_super+0x14/0x30 [ 8502.178607] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8502.179602] deactivate_locked_super+0x29/0x60 [ 8502.180595] cleanup_mnt+0x3b/0x70 [ 8502.181406] task_work_run+0x98/0xc0 [ 8502.182255] exit_to_usermode_loop+0x83/0x90 [ 8502.183113] do_syscall_64+0x15b/0x180 [ 8502.183919] entry_SYSCALL_64_after_hwframe+0x49/0xbe Corresponding to release_global_block_rsv() { ... WARN_ON(fs_info->delayed_refs_rsv.reserved > 0); CC: stable@vger.kernel.org Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add log dump ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"David Sterba1-3/+0
This reverts commit e73e81b6d0114d4a303205a952ab2e87c44bd279. This patch causes a few problems: - adds latency to btrfs_finish_ordered_io - as btrfs_finish_ordered_io is used for free space cache, generating more work from btrfs_btree_balance_dirty_nodelay could end up in the same workque, effectively deadlocking 12260 kworker/u96:16+btrfs-freespace-write D [<0>] balance_dirty_pages+0x6e6/0x7ad [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90 [<0>] btrfs_finish_ordered_io+0x3da/0x770 [<0>] normal_work_helper+0x1c5/0x5a0 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x46/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Transaction commit will wait on the freespace cache: 838 btrfs-transacti D [<0>] btrfs_start_ordered_extent+0x154/0x1e0 [<0>] btrfs_wait_ordered_range+0xbd/0x110 [<0>] __btrfs_wait_cache_io+0x49/0x1a0 [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0 [<0>] commit_cowonly_roots+0x215/0x2b0 [<0>] btrfs_commit_transaction+0x37e/0x910 [<0>] transaction_kthread+0x14d/0x180 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff And then writepages ends up waiting on transaction commit: 9520 kworker/u96:13+flush-btrfs-1 D [<0>] wait_current_trans+0xac/0xe0 [<0>] start_transaction+0x21b/0x4b0 [<0>] cow_file_range_inline+0x10b/0x6b0 [<0>] cow_file_range.isra.69+0x329/0x4a0 [<0>] run_delalloc_range+0x105/0x3c0 [<0>] writepage_delalloc+0x119/0x180 [<0>] __extent_writepage+0x10c/0x390 [<0>] extent_write_cache_pages+0x26f/0x3d0 [<0>] extent_writepages+0x4f/0x80 [<0>] do_writepages+0x17/0x60 [<0>] __writeback_single_inode+0x59/0x690 [<0>] writeback_sb_inodes+0x291/0x4e0 [<0>] __writeback_inodes_wb+0x87/0xb0 [<0>] wb_writeback+0x3bb/0x500 [<0>] wb_workfn+0x40d/0x610 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x1e0/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Eventually, we have every process in the system waiting on balance_dirty_pages(), and nobody is able to make progress on page writeback. The original patch tried to fix an OOM condition, that happened on 4.4 but no success reproducing that on later kernels (4.19 and 4.20). This is more likely a problem in OOM itself. Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/ Reported-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org # 4.18+ CC: ethanlien <ethanlien@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18Merge tag 'afs-fixes-20190117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsLinus Torvalds6-18/+59
Pull AFS fixes from David Howells: "Here's a set of fixes for AFS: - Use struct_size() for kzalloc() size calculation. - When calling YFS.CreateFile rather than AFS.CreateFile, it is possible to create a file with a file lock already held. The default value indicating no lock required is actually -1, not 0. - Fix an oops in inode/vnode validation if the target inode doesn't have a server interest assigned (ie. a server that will notify us of changes by third parties). - Fix refcounting of keys in file locking. - Fix a race in refcounting asynchronous operations in the event of an error during request transmission. The provision of a dedicated function to get an extra ref on a call is split into a separate commit" * tag 'afs-fixes-20190117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix race in async call refcounting afs: Provide a function to get a ref on a call afs: Fix key refcounting in file locking code afs: Don't set vnode->cb_s_break in afs_validate() afs: Set correct lock type for the yfs CreateFile afs: Use struct_size() in kzalloc()
2019-01-17pstore/ram: Fix console ramoops to show the previous boot logsSai Prakash Ranjan1-2/+1
commit b05c950698fe ("pstore/ram: Simplify ramoops_get_next_prz() arguments") changed update assignment in getting next persistent ram zone by adding a check for record type. But the check always returns true since the record type is assigned 0. And this breaks console ramoops by showing current console log instead of previous log on warm reset and hard reset (actually hard reset should not be showing any logs). Fix this by having persistent ram zone type check instead of record type check. Tested this on SDM845 MTP and dragonboard 410c. Reproducing this issue is simple as below: 1. Trigger hard reset and mount pstore. Will see console-ramoops record in the mounted location which is the current log. 2. Trigger warm reset and mount pstore. Will see the current console-ramoops record instead of previous record. Fixes: b05c950698fe ("pstore/ram: Simplify ramoops_get_next_prz() arguments") Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> [kees: dropped local variable usage] Signed-off-by: Kees Cook <keescook@chromium.org>
2019-01-17afs: Fix race in async call refcountingDavid Howells1-5/+30
There's a race between afs_make_call() and afs_wake_up_async_call() in the case that an error is returned from rxrpc_kernel_send_data() after it has queued the final packet. afs_make_call() will try and clean up the mess, but the call state may have been moved on thereby causing afs_process_async_call() to also try and to delete the call. Fix this by: (1) Getting an extra ref for an asynchronous call for the call itself to hold. This makes sure the call doesn't evaporate on us accidentally and will allow the call to be retained by the caller in a future patch. The ref is released on leaving afs_make_call() or afs_wait_for_call_to_complete(). (2) In the event of an error from rxrpc_kernel_send_data(): (a) Don't set the call state to AFS_CALL_COMPLETE until *after* the call has been aborted and ended. This prevents afs_deliver_to_call() from doing anything with any notifications it gets. (b) Explicitly end the call immediately to prevent further callbacks. (c) Cancel any queued async_work and wait for the work if it's executing. This allows us to be sure the race won't recur when we change the state. We put the work queue's ref on the call if we managed to cancel it. (d) Put the call's ref that we got in (1). This belongs to us as long as the call is in state AFS_CALL_CL_REQUESTING. Fixes: 341f741f04be ("afs: Refcount the afs_call struct") Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17afs: Provide a function to get a ref on a callDavid Howells1-6/+12
Provide a function to get a reference on an afs_call struct. Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17afs: Fix key refcounting in file locking codeDavid Howells2-2/+4
Fix the refcounting of the authentication keys in the file locking code. The vnode->lock_key member points to a key on which it expects to be holding a ref, but it isn't always given an extra ref, however. Fixes: 0fafdc9f888b ("afs: Fix file locking") Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17afs: Don't set vnode->cb_s_break in afs_validate()Marc Dionne1-1/+0
A cb_interest record is not necessarily attached to the vnode on entry to afs_validate(), which can cause an oops when we try to bring the vnode's cb_s_break up to date in the default case (ie. no current callback promise and the vnode has not been deleted). Fix this by simply removing the line, as vnode->cb_s_break will be set when needed by afs_register_server_cb_interest() when we next get a callback promise from RPC call. The oops looks something like: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 ... RIP: 0010:afs_validate+0x66/0x250 [kafs] ... Call Trace: afs_d_revalidate+0x8d/0x340 [kafs] ? __d_lookup+0x61/0x150 lookup_dcache+0x44/0x70 ? lookup_dcache+0x44/0x70 __lookup_hash+0x24/0xa0 do_unlinkat+0x11d/0x2c0 __x64_sys_unlink+0x23/0x30 do_syscall_64+0x4d/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: ae3b7361dc0e ("afs: Fix validation/callback interaction") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-15NFSv4.2 fix unnecessary retry in nfs4_copy_file_rangeOlga Kornievskaia1-7/+1
Currently nfs42_proc_copy_file_range() can not return EAGAIN. Fixes: e4648aa4f98a ("NFS recover from destination server reboot for copies") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-15blockdev: Fix livelocks on loop deviceJan Kara1-10/+18
bd_set_size() updates also block device's block size. This is somewhat unexpected from its name and at this point, only blkdev_open() uses this functionality. Furthermore, this can result in changing block size under a filesystem mounted on a loop device which leads to livelocks inside __getblk_gfp() like: Sending NMI from CPU 0 to CPUs 1: NMI backtrace for cpu 1 CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106 ... Call Trace: init_page_buffers+0x3e2/0x530 fs/buffer.c:904 grow_dev_page fs/buffer.c:947 [inline] grow_buffers fs/buffer.c:1009 [inline] __getblk_slow fs/buffer.c:1036 [inline] __getblk_gfp+0x906/0xb10 fs/buffer.c:1313 __bread_gfp+0x2d/0x310 fs/buffer.c:1347 sb_bread include/linux/buffer_head.h:307 [inline] fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75 fat_ent_read_block fs/fat/fatent.c:441 [inline] fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489 fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101 __fat_get_block fs/fat/inode.c:148 [inline] ... Trivial reproducer for the problem looks like: truncate -s 1G /tmp/image losetup /dev/loop0 /tmp/image mkfs.ext4 -b 1024 /dev/loop0 mount -t ext4 /dev/loop0 /mnt losetup -c /dev/loop0 l /mnt Fix the problem by moving initialization of a block device block size into a separate function and call it when needed. Thanks to Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> for help with debugging the problem. Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-14Merge tag 'for-5.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds3-13/+64
Pull btrfs fixes from David Sterba: - two regression fixes in clone/dedupe ioctls, the generic check callback needs to lock extents properly and wait for io to avoid problems with writeback and relocation - fix deadlock when using free space tree due to block group creation - a recently added check refuses a valid fileystem with seeding device, make that work again with a quickfix, proper solution needs more intrusive changes * tag 'for-5.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: Use real device structure to verify dev extent Btrfs: fix deadlock when using free space tree due to block group creation Btrfs: fix race between reflink/dedupe and relocation Btrfs: fix race between cloning range ending at eof and writeback
2019-01-14Merge tag 'driver-core-5.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-coreLinus Torvalds4-5/+10
Pull driver core fixes from Greg KH: "Here is one small sysfs change, and a documentation update for 5.0-rc2 The sysfs change moves from using BUG_ON to WARN_ON, as discussed in an email thread on lkml while trying to track down another driver bug. sysfs should not be crashing and preventing people from seeing where they went wrong. Now it properly recovers and warns the developer. The documentation update removes the use of BUS_ATTR() as the kernel is moving away from this to use the specific BUS_ATTR_RW() and friends instead. There are pending patches in all of the different subsystems to remove the last users of this macro, but for now, don't advertise it should be used anymore to keep new ones from being introduced. Both have been in linux-next with no reported issues" * tag 'driver-core-5.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: Documentation: driver core: remove use of BUS_ATTR sysfs: convert BUG_ON to WARN_ON
2019-01-14Merge tag '5.0-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds10-65/+211
Pull cifs fixes from Steve French: "A set of cifs/smb3 fixes, 4 for stable, most from Pavel. His patches fix an important set of crediting (flow control) problems, and also two problems in cifs_writepages, ddressing some large i/o and also compounding issues" * tag '5.0-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module version number CIFS: Fix error paths in writeback code CIFS: Move credit processing to mid callbacks for SMB3 CIFS: Fix credits calculation for cancelled requests cifs: Fix potential OOB access of lock element array cifs: Limit memory used by lock request calls to a page cifs: move large array from stack to heap CIFS: Do not hide EINTR after sending network packets CIFS: Fix credit computation for compounded requests CIFS: Do not set credits to 1 if the server didn't grant anything CIFS: Fix adjustment of credits for MTU requests cifs: Fix a tiny potential memory leak cifs: Fix a debug message
2019-01-11Merge tag 'ceph-for-5.0-rc2' of git://github.com/ceph/ceph-clientLinus Torvalds2-6/+3
Pull ceph updates from Ilya Dryomov: "A patch to allow setting abort_on_full and a fix for an old "rbd unmap" edge case, marked for stable" * tag 'ceph-for-5.0-rc2' of git://github.com/ceph/ceph-client: rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is set ceph: use vmf_error() in ceph_filemap_fault() libceph: allow setting abort_on_full for rbd
2019-01-11cifs: update internal module version numberSteve French1-1/+1
To 2.16 Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11CIFS: Fix error paths in writeback codePavel Shilovsky4-9/+56
This patch aims to address writeback code problems related to error paths. In particular it respects EINTR and related error codes and stores and returns the first error occurred during writeback. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11CIFS: Move credit processing to mid callbacks for SMB3Pavel Shilovsky1-17/+34
Currently we account for credits in the thread initiating a request and waiting for a response. The demultiplex thread receives the response, wakes up the thread and the latter collects credits from the response buffer and add them to the server structure on the client. This approach is not accurate, because it may race with reconnect events in the demultiplex thread which resets the number of credits. Fix this by moving credit processing to new mid callbacks that collect credits granted by the server from the response in the demultiplex thread. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11CIFS: Fix credits calculation for cancelled requestsPavel Shilovsky2-2/+27
If a request is cancelled, we can't assume that the server returns 1 credit back. Instead we need to wait for a response and process the number of credits granted by the server. Create a separate mid callback for cancelled request, parse the number of credits in a response buffer and add them to the client's credits. If the didn't get a response (no response buffer available) assume 0 credits granted. The latter most probably happens together with session reconnect, so the client's credits are adjusted anyway. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11cifs: Fix potential OOB access of lock element arrayRoss Lagerwall2-6/+6
If maxBuf is small but non-zero, it could result in a zero sized lock element array which we would then try and access OOB. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-01-11cifs: Limit memory used by lock request calls to a pageRoss Lagerwall2-0/+12
The code tries to allocate a contiguous buffer with a size supplied by the server (maxBuf). This could fail if memory is fragmented since it results in high order allocations for commonly used server implementations. It is also wasteful since there are probably few locks in the usual case. Limit the buffer to be no larger than a page to avoid memory allocation failures due to fragmentation. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11cifs: move large array from stack to heapAurelien Aptel2-14/+32
This addresses some compile warnings that you can see depending on configuration settings. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11CIFS: Do not hide EINTR after sending network packetsPavel Shilovsky1-1/+1
Currently we hide EINTR code returned from sock_sendmsg() and return 0 instead. This makes a caller think that we successfully completed the network operation which is not true. Fix this by properly returning EINTR to callers. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-10CIFS: Fix credit computation for compounded requestsPavel Shilovsky1-18/+41
In SMB3 protocol every part of the compound chain consumes credits individually, so we need to call wait_for_free_credits() for each of the PDUs in the chain. If an operation is interrupted, we must ensure we return all credits taken from the server structure back. Without this patch server can sometimes disconnect the session due to credit mismatches, especially when first operation(s) are large writes. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>