aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2019-06-06Merge tag 'ovl-fixes-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsLinus Torvalds8-22/+249
Pull overlayfs fixes from Miklos Szeredi: "Here's one fix for a class of bugs triggered by syzcaller, and one that makes xfstests fail less" * tag 'ovl-fixes-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: doc: add non-standard corner cases ovl: detect overlapping layers ovl: support the FS_IOC_FS[SG]ETXATTR ioctls
2019-06-06Merge tag 'fuse-fixes-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuseLinus Torvalds2-12/+38
Pull fuse fixes from Miklos Szeredi: "This fixes a leaked inode lock in an error cleanup path and a data consistency issue with copy_file_range(). It also adds a new flag for the WRITE request that allows userspace filesystems to clear suid/sgid bits on the file if necessary" * tag 'fuse-fixes-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: extract helper for range writeback fuse: fix copy_file_range() in the writeback case fuse: add FUSE_WRITE_KILL_PRIV fuse: fallocate: fix return with locked inode
2019-06-06Merge tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds3-37/+28
Pull NFS client fixes from Anna Schumaker: "These are mostly stable bugfixes found during testing, many during the recent NFS bake-a-thon. Stable bugfixes: - SUNRPC: Fix regression in umount of a secure mount - SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential - NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter - NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled Other bugfixes: - xprtrdma: Use struct_size() in kzalloc()" * tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential SUNRPC fix regression in umount of a secure mount xprtrdma: Use struct_size() in kzalloc()
2019-06-06pktgen: do not sleep with the thread lock held.Paolo Abeni1-0/+11
Currently, the process issuing a "start" command on the pktgen procfs interface, acquires the pktgen thread lock and never release it, until all pktgen threads are completed. The above can blocks indefinitely any other pktgen command and any (even unrelated) netdevice removal - as the pktgen netdev notifier acquires the same lock. The issue is demonstrated by the following script, reported by Matteo: ip -b - <<'EOF' link add type dummy link add type veth link set dummy0 up EOF modprobe pktgen echo reset >/proc/net/pktgen/pgctrl { echo rem_device_all echo add_device dummy0 } >/proc/net/pktgen/kpktgend_0 echo count 0 >/proc/net/pktgen/dummy0 echo start >/proc/net/pktgen/pgctrl & sleep 1 rmmod veth Fix the above releasing the thread lock around the sleep call. Additionally we must prevent racing with forcefull rmmod - as the thread lock no more protects from them. Instead, acquire a self-reference before waiting for any thread. As a side effect, running rmmod pktgen while some thread is running now fails with "module in use" error, before this patch such command hanged indefinitely. Note: the issue predates the commit reported in the fixes tag, but this fix can't be applied before the mentioned commit. v1 -> v2: - no need to check for thread existence after flipping the lock, pktgen threads are freed only at net exit time - Fixes: 6146e6a43b35 ("[PKTGEN]: Removes thread_{un,}lock() macros.") Reported-and-tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06Merge tag 'for-rc-adfs' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds4-129/+89
Pull ADFS cleanups/fixes from Russell King: "As a result of some of Al Viro's great work, here are a few cleanups with fixes for adfs: - factor out filename comparison, so we can be sure that adfs_compare() (used for namei compare) and adfs_match() (used for lookup) have the same behaviour. - factor out filename lowering (which is not the same as tolower() which will lower top-bit-set characters) to ensure that we have the same behaviour when comparing filenames as when we hash them. - factor out the object fixups, so we are applying all fixups to directory objects in the same way, independent of the disk format. - factor out the object name fixup (into the previously factored out function) to ensure that filenames are appropriately translated - for example, adfs allows '/' in filenames, which being the Unix path separator, need to be translated to a different character, which is normally '.' (DOS 8.3 filenames represent the . as a / on adfs, so this is the expected reverse translation.) - remove filename truncation; Al asked about this and apparently the decision is to remove it. In any case, adfs's truncation was buggy, so this rids us of that bug by removing the truncation feature. - we now have only one location which adds the "filetype" suffix to the filename, so there's no point that code being out of line. - since we translate '/' into '.', an adfs filename of "/" or "//" would end up being translated to "." and ".." which have special meanings. In this case, change the first character to "^" to avoid these special directory names being abused" * tag 'for-rc-adfs' of git://git.armlinux.org.uk/~rmk/linux-arm: fs/adfs: fix filename fixup handling for "/" and "//" names fs/adfs: move append_filetype_suffix() into adfs_object_fixup() fs/adfs: remove truncated filename hashing fs/adfs: factor out filename fixup fs/adfs: factor out object fixups fs/adfs: factor out filename case lowering fs/adfs: factor out filename comparison
2019-06-06net: mvpp2: Use strscpy to handle stat stringsMaxime Chevallier1-2/+2
Use a safe strscpy call to copy the ethtool stat strings into the relevant buffers, instead of a memcpy that will be accessing out-of-bound data. Fixes: 118d6298f6f0 ("net: mvpp2: add ethtool GOP statistics") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06net: rds: fix memory leak in rds_ib_flush_mr_poolZhu Yanjun1-4/+6
When the following tests last for several hours, the problem will occur. Server: rds-stress -r 1.1.1.16 -D 1M Client: rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30 The following will occur. " Starting up.... tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu % 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 " >From vmcore, we can find that clean_list is NULL. >From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker. Then rds_ib_mr_pool_flush_worker calls " rds_ib_flush_mr_pool(pool, 0, NULL); " Then in function " int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all, struct rds_ib_mr **ibmr_ret) " ibmr_ret is NULL. In the source code, " ... list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail); if (ibmr_ret) *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode); /* more than one entry in llist nodes */ if (clean_nodes->next) llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list); ... " When ibmr_ret is NULL, llist_entry is not executed. clean_nodes->next instead of clean_nodes is added in clean_list. So clean_nodes is discarded. It can not be used again. The workqueue is executed periodically. So more and more clean_nodes are discarded. Finally the clean_list is NULL. Then this problem will occur. Fixes: 1bc144b62524 ("net, rds, Replace xlist in net/rds/xlist.h with llist") Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06Merge branch 'ipv6-fix-EFAULT-on-sendto-with-icmpv6-and-hdrincl'David S. Miller1-7/+18
Olivier Matz says: ==================== ipv6: fix EFAULT on sendto with icmpv6 and hdrincl The following code returns EFAULT (Bad address): s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6); setsockopt(s, SOL_IPV6, IPV6_HDRINCL, 1); sendto(ipv6_icmp6_packet, addr); /* returns -1, errno = EFAULT */ The problem is fixed in the second patch. The first one aligns the code to ipv4, to avoid a race condition in the second patch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06ipv6: fix EFAULT on sendto with icmpv6 and hdrinclOlivier Matz1-5/+8
The following code returns EFAULT (Bad address): s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6); setsockopt(s, SOL_IPV6, IPV6_HDRINCL, 1); sendto(ipv6_icmp6_packet, addr); /* returns -1, errno = EFAULT */ The IPv4 equivalent code works. A workaround is to use IPPROTO_RAW instead of IPPROTO_ICMPV6. The failure happens because 2 bytes are eaten from the msghdr by rawv6_probe_proto_opt() starting from commit 19e3c66b52ca ("ipv6 equivalent of "ipv4: Avoid reading user iov twice after raw_probe_proto_opt""), but at that time it was not a problem because IPV6_HDRINCL was not yet introduced. Only eat these 2 bytes if hdrincl == 0. Fixes: 715f504b1189 ("ipv6: add IPV6_HDRINCL option for raw sockets") Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06ipv6: use READ_ONCE() for inet->hdrincl as in ipv4Olivier Matz1-2/+10
As it was done in commit 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg") and commit 20b50d79974e ("net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()") for ipv4, copy the value of inet->hdrincl in a local variable, to avoid introducing a race condition in the next commit. Signed-off-by: Olivier Matz <olivier.matz@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06nvme-rdma: use dynamic dma mapping per commandMax Gurtovoy1-17/+36
Commit 87fd125344d6 ("nvme-rdma: remove redundant reference between ib_device and tagset") caused a kernel panic when disconnecting from an inaccessible controller (disconnect during re-connection). -- nvme nvme0: Removing ctrl: NQN "testnqn1" nvme_rdma: nvme_rdma_exit_request: hctx 0 queue_idx 1 BUG: unable to handle kernel paging request at 0000000080000228 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI ... Call Trace: blk_mq_exit_hctx+0x5c/0xf0 blk_mq_exit_queue+0xd4/0x100 blk_cleanup_queue+0x9a/0xc0 nvme_rdma_destroy_io_queues+0x52/0x60 [nvme_rdma] nvme_rdma_shutdown_ctrl+0x3e/0x80 [nvme_rdma] nvme_do_delete_ctrl+0x53/0x80 [nvme_core] nvme_sysfs_delete+0x45/0x60 [nvme_core] kernfs_fop_write+0x105/0x180 vfs_write+0xad/0x1a0 ksys_write+0x5a/0xd0 do_syscall_64+0x55/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fa215417154 -- The reason for this crash is accessing an already freed ib_device for performing dma_unmap during exit_request commands. The root cause for that is that during re-connection all the queues are destroyed and re-created (and the ib_device is reference counted by the queues and freed as well) but the tagset stays alive and all the DMA mappings (that we perform in init_request) kept in the request context. The original commit fixed a different bug that was introduced during bonding (aka nic teaming) tests that for some scenarios change the underlying ib_device and caused memory leakage and possible segmentation fault. This commit is a complementary commit that also changes the wrong DMA mappings that were saved in the request context and making the request sqe dma mappings dynamic with the command lifetime (i.e. mapped in .queue_rq and unmapped in .complete). It also fixes the above crash of accessing freed ib_device during destruction of the tagset. Fixes: 87fd125344d6 ("nvme-rdma: remove redundant reference between ib_device and tagset") Reported-by: Jim Harris <james.r.harris@intel.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-06-06nvme: Fix u32 overflow in the number of namespace list calculationJaesoo Lee1-1/+2
The Number of Namespaces (nn) field in the identify controller data structure is defined as u32 and the maximum allowed value in NVMe specification is 0xFFFFFFFEUL. This change fixes the possible overflow of the DIV_ROUND_UP() operation used in nvme_scan_ns_list() by casting the nn to u64. Signed-off-by: Jaesoo Lee <jalee@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-06-06Revert "gfs2: Replace gl_revokes with a GLF flag"Bob Peterson6-31/+15
Commit 73118ca8baf7 introduced a glock reference counting bug in gfs2_trans_remove_revoke. Given that, replacing gl_revokes with a GLF flag is no longer useful, so revert that commit. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-06-06Merge tag 'misc-habanalabs-fixes-2019-06-06' of git://people.freedesktop.org/~gabbayo/linux into char-misc-linusGreg Kroah-Hartman3-9/+16
Oded writes: This tag contains the following fixes: - Fix the code that checks whether we can use 2MB page size when mapping memory in the ASIC's MMU. The current code had a "hole" which happened in architectures other then x86-64. - Fix the debugfs interface to read/write from/to the device using device virtual addresses. There was a bug in the translation regarding addresses that were mapped using 2MB page size. - Fix a bug in the debug/profiling code, where the code didn't read the full address but only the lower 32-bits of the address. * tag 'misc-habanalabs-fixes-2019-06-06' of git://people.freedesktop.org/~gabbayo/linux: habanalabs: Read upper bits of trace buffer from RWPHI habanalabs: Fix virtual address access via debugfs for 2MB pages habanalabs: fix bug in checking huge page optimization
2019-06-06arm64: Silence gcc warnings about arch ABI driftDave Martin1-0/+1
Since GCC 9, the compiler warns about evolution of the platform-specific ABI, in particular relating for the marshaling of certain structures involving bitfields. The kernel is a standalone binary, and of course nobody would be so stupid as to expose structs containing bitfields as function arguments in ABI. (Passing a pointer to such a struct, however inadvisable, should be unaffected by this change. perf and various drivers rely on that.) So these warnings do more harm than good: turn them off. We may miss warnings about future ABI drift, but that's too bad. Future ABI breaks of this class will have to be debugged and fixed the traditional way unless the compiler evolves finer-grained diagnostics. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-06-06parisc: Fix crash due alternative coding for NP iopdir_fdc bitHelge Deller1-1/+2
According to the found documentation, data cache flushes and sync instructions are needed on the PCX-U+ (PA8200, e.g. C200/C240) platforms, while PCX-W (PA8500, e.g. C360) platforms aparently don't need those flushes when changing the IO PDIR data structures. We have no documentation for PCX-W+ (PA8600) and PCX-W2 (PA8700) CPUs, but Carlo Pisani reported that his C3600 machine (PA8600, PCX-W+) fails when the fdc instructions were removed. His firmware didn't set the NIOP bit, so one may assume it's a firmware bug since other C3750 machines had the bit set. Even if documentation (as mentioned above) states that PCX-W (PA8500, e.g. J5000) does not need fdc flushes, Sven could show that an Adaptec 29320A PCI-X SCSI controller reliably failed on a dd command during the first five minutes in his J5000 when fdc flushes were missing. Going forward, we will now NOT replace the fdc and sync assembler instructions by NOPS if: a) the NP iopdir_fdc bit was set by firmware, or b) we find a CPU up to and including a PCX-W+ (PA8600). This fixes the HPMC crashes on a C240 and C36XX machines. For other machines we rely on the firmware to set the bit when needed. In case one finds HPMC issues, people could try to boot their machines with the "no-alternatives" kernel option to turn off any alternative patching. Reported-by: Sven Schnelle <svens@stackframe.org> Reported-by: Carlo Pisani <carlojpisani@gmail.com> Tested-by: Sven Schnelle <svens@stackframe.org> Fixes: 3847dab77421 ("parisc: Add alternative coding infrastructure") Signed-off-by: Helge Deller <deller@gmx.de> Cc: stable@vger.kernel.org # 5.0+
2019-06-06parisc: Use lpa instruction to load physical addresses in driver codeJohn David Anglin3-2/+26
Most I/O in the kernel is done using the kernel offset mapping. However, there is one API that uses aliased kernel address ranges: > The final category of APIs is for I/O to deliberately aliased address > ranges inside the kernel. Such aliases are set up by use of the > vmap/vmalloc API. Since kernel I/O goes via physical pages, the I/O > subsystem assumes that the user mapping and kernel offset mapping are > the only aliases. This isn't true for vmap aliases, so anything in > the kernel trying to do I/O to vmap areas must manually manage > coherency. It must do this by flushing the vmap range before doing > I/O and invalidating it after the I/O returns. For this reason, we should use the hardware lpa instruction to load the physical address of kernel virtual addresses in the driver code. I believe we only use the vmap/vmalloc API with old PA 1.x processors which don't have a sba, so we don't hit this problem. Tested on c3750, c8000 and rp3440. Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de>
2019-06-06parisc: configs: Remove useless UEVENT_HELPER_PATHKrzysztof Kozlowski7-7/+0
Remove the CONFIG_UEVENT_HELPER_PATH because: 1. It is disabled since commit 1be01d4a5714 ("driver: base: Disable CONFIG_UEVENT_HELPER by default") as its dependency (UEVENT_HELPER) was made default to 'n', 2. It is not recommended (help message: "This should not be used today [...] creates a high system load") and was kept only for ancient userland, 3. Certain userland specifically requests it to be disabled (systemd README: "Legacy hotplug slows down the system and confuses udev"). Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Helge Deller <deller@gmx.de>
2019-06-06parisc: Use implicit space register selection for loading the coherence index of I/O pdirsJohn David Anglin2-5/+2
We only support I/O to kernel space. Using %sr1 to load the coherence index may be racy unless interrupts are disabled. This patch changes the code used to load the coherence index to use implicit space register selection. This saves one instruction and eliminates the race. Tested on rp3440, c8000 and c3750. Signed-off-by: John David Anglin <dave.anglin@bell.net> Cc: stable@vger.kernel.org Signed-off-by: Helge Deller <deller@gmx.de>
2019-06-06ARM64: trivial: s/TIF_SECOMP/TIF_SECCOMP/ comment typo fixGeorge G. Davis1-1/+1
Fix a s/TIF_SECOMP/TIF_SECCOMP/ comment typo Cc: Jiri Kosina <trivial@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org Signed-off-by: George G. Davis <george_davis@mentor.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-06-06drm/komeda: Potential error pointer dereferenceDan Carpenter1-1/+1
We need to check whether drm_atomic_get_crtc_state() returns an error pointer before dereferencing "crtc_st". Fixes: 9e5603094176 ("drm/komeda: Add komeda_plane/plane_helper_funcs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: "james qian wang (Arm Technology China)" <james.qian.wang@arm.com> Signed-off-by: Liviu Dudau <liviu.dudau@arm.com>
2019-06-06drm/komeda: remove set but not used variable 'kcrtc'YueHaibing1-2/+0
Fixes gcc '-Wunused-but-set-variable' warning: drivers/gpu/drm/arm/display/komeda/komeda_plane.c: In function komeda_plane_atomic_check: drivers/gpu/drm/arm/display/komeda/komeda_plane.c:49:22: warning: variable kcrtc set but not used [-Wunused-but-set-variable] It is never used since introduction in commit 9e5603094176 ("drm/komeda: Add komeda_plane/plane_helper_funcs") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: James Qian Wang (Arm Technology China) <james.qian.wang@arm.com> Signed-off-by: Liviu Dudau <liviu.dudau@arm.com>
2019-06-06Merge tag 'drm-misc-fixes-2019-06-05' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixesDave Airlie7-38/+53
- Allow fb changes in async commits (fixes igt failures) (Helen) - Actually unmap the scatterlist when unmapping udmabuf (Lucas) Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Helen Koike <helen.koike@collabora.com> Signed-off-by: Dave Airlie <airlied@redhat.com> From: Sean Paul <sean@poorly.run> Link: https://patchwork.freedesktop.org/patch/msgid/20190605210335.GA35431@art_vandelay
2019-06-06Merge branch 'drm-fixes-5.2' of git://people.freedesktop.org/~agd5f/linux into drm-fixesDave Airlie8-14/+63
- A fix to make VCE resume more reliable - Updates for new raven variants Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190605182332.4073-1-alexander.deucher@amd.com
2019-06-06Merge tag 'drm-intel-fixes-2019-06-03' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixesDave Airlie5-11/+62
- Add missing Icelake W/A to disable GPU hang on cache ECC error - GVT a fix for recently seen arbitrary DMA map fault and more enforcement fixes. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190603132928.GA4866@jlahtine-desk.ger.corp.intel.com
2019-06-05hwmon: (pmbus/core) Treat parameters as paged if on multiple pagesRobert Hancock1-4/+30
Some chips have attributes which exist on more than one page but the attribute is not presently marked as paged. This causes the attributes to be generated with the same label, which makes it impossible for userspace to tell them apart. Marking all such attributes as paged would result in the page suffix being added regardless of whether they were present on more than one page or not, which might break existing setups. Therefore, we add a second check which treats the attribute as paged, even if not marked as such, if it is present on multiple pages. Fixes: b4ce237b7f7d ("hwmon: (pmbus) Introduce infrastructure to detect sensors and limit registers") Signed-off-by: Robert Hancock <hancock@sedsystems.ca> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2019-06-05hwmon: (pmbus/core) mutex_lock write in pmbus_set_samplesAdamski, Krzysztof (Nokia - PL/Wroclaw)1-0/+3
update_lock is a mutex intended to protect write operations. It was not taken, however, when _pmbus_write_word_data is called from pmbus_set_samples() function which may cause problems especially when some PMBUS_VIRT_* operation is implemented as a read-modify-write cycle. This patch makes sure the lock is held during the operation. Fixes: 49c4455dccf2 ("hwmon: (pmbus) Introduce PMBUS_VIRT_*_SAMPLES registers") Signed-off-by: Krzysztof Adamski <krzysztof.adamski@nokia.com> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> [groeck: Declared and initialized missing 'data' variable] Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2019-06-05hwmon: (core) add thermal sensors only if dev->of_node is presentEduardo Valentin1-1/+1
Drivers may register to hwmon and request for also registering with the thermal subsystem (HWMON_C_REGISTER_TZ). However, some of these driver, e.g. marvell phy, may be probed from Device Tree or being dynamically allocated, and in the later case, it will not have a dev->of_node entry. Registering with hwmon without the dev->of_node may result in different outcomes depending on the device tree, which may be a bit misleading. If the device tree blob has no 'thermal-zones' node, the *hwmon_device_register*() family functions are going to gracefully succeed, because of-thermal, *thermal_zone_of_sensor_register() return -ENODEV in this case, and the hwmon error path handles this error code as success to cover for the case where CONFIG_THERMAL_OF is not set. However, if the device tree blob has the 'thermal-zones' entry, the *hwmon_device_register*() will always fail on callers with no dev->of_node, propagating -EINVAL. If dev->of_node is not present, calling of-thermal does not make sense. For this reason, this patch checks first if the device has a of_node before going over the process of registering with the thermal subsystem of-thermal interface. And in this case, when a caller of *hwmon_device_register*() with HWMON_C_REGISTER_TZ and no dev->of_node will still register with hwmon, but not with the thermal subsystem. If all the hwmon part bits are in place, the registration will succeed. Fixes: d560168b5d0f ("hwmon: (core) New hwmon registration API") Cc: Jean Delvare <jdelvare@suse.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: linux-hwmon@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Eduardo Valentin <eduval@amazon.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2019-06-05Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"Hangbin Liu1-3/+3
This reverts commit e9919a24d3022f72bcadc407e73a6ef17093a849. Nathan reported the new behaviour breaks Android, as Android just add new rules and delete old ones. If we return 0 without adding dup rules, Android will remove the new added rules and causing system to soft-reboot. Fixes: e9919a24d302 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Yaro Slav <yaro330@gmail.com> Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05net: aquantia: fix wol configuration not applied sometimesNikita Danilov2-8/+10
WoL magic packet configuration sometimes does not work due to couple of leakages found. Mainly there was a regression introduced during readx_poll refactoring. Next, fw request waiting time was too small. Sometimes that caused sleep proxy config function to return with an error and to skip WoL configuration. At last, WoL data were passed to FW from not clean buffer. That could cause FW to accept garbage as a random configuration data. Fixes: 6a7f2277313b ("net: aquantia: replace AQ_HW_WAIT_FOR with readx_poll_timeout_atomic") Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05ethtool: fix potential userspace buffer overflowVivien Didelot1-1/+4
ethtool_get_regs() allocates a buffer of size ops->get_regs_len(), and pass it to the kernel driver via ops->get_regs() for filling. There is no restriction about what the kernel drivers can or cannot do with the open ethtool_regs structure. They usually set regs->version and ignore regs->len or set it to the same size as ops->get_regs_len(). But if userspace allocates a smaller buffer for the registers dump, we would cause a userspace buffer overflow in the final copy_to_user() call, which uses the regs.len value potentially reset by the driver. To fix this, make this case obvious and store regs.len before calling ops->get_regs(), to only copy as much data as requested by userspace, up to the value returned by ops->get_regs_len(). While at it, remove the redundant check for non-null regbuf. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05Fix memory leak in sctp_process_initNeil Horman2-10/+8
syzbot found the following leak in sctp_process_init BUG: memory leak unreferenced object 0xffff88810ef68400 (size 1024): comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s) hex dump (first 32 bytes): 1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25 ..(........h...% 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000a02cebbd>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<00000000a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline] [<00000000a02cebbd>] slab_alloc mm/slab.c:3326 [inline] [<00000000a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline] [<00000000a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675 [<000000009e6245e6>] kmemdup+0x27/0x60 mm/util.c:119 [<00000000dfdc5d2d>] kmemdup include/linux/string.h:432 [inline] [<00000000dfdc5d2d>] sctp_process_init+0xa7e/0xc20 net/sctp/sm_make_chunk.c:2437 [<00000000b58b62f8>] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682 [inline] [<00000000b58b62f8>] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384 [inline] [<00000000b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194 [inline] [<00000000b58b62f8>] sctp_do_sm+0xbdc/0x1d60 net/sctp/sm_sideeffect.c:1165 [<0000000044e11f96>] sctp_assoc_bh_rcv+0x13c/0x200 net/sctp/associola.c:1074 [<00000000ec43804d>] sctp_inq_push+0x7f/0xb0 net/sctp/inqueue.c:95 [<00000000726aa954>] sctp_backlog_rcv+0x5e/0x2a0 net/sctp/input.c:354 [<00000000d9e249a8>] sk_backlog_rcv include/net/sock.h:950 [inline] [<00000000d9e249a8>] __release_sock+0xab/0x110 net/core/sock.c:2418 [<00000000acae44fa>] release_sock+0x37/0xd0 net/core/sock.c:2934 [<00000000963cc9ae>] sctp_sendmsg+0x2c0/0x990 net/sctp/socket.c:2122 [<00000000a7fc7565>] inet_sendmsg+0x64/0x120 net/ipv4/af_inet.c:802 [<00000000b732cbd3>] sock_sendmsg_nosec net/socket.c:652 [inline] [<00000000b732cbd3>] sock_sendmsg+0x54/0x70 net/socket.c:671 [<00000000274c57ab>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2292 [<000000008252aedb>] __sys_sendmsg+0x80/0xf0 net/socket.c:2330 [<00000000f7bf23d1>] __do_sys_sendmsg net/socket.c:2339 [inline] [<00000000f7bf23d1>] __se_sys_sendmsg net/socket.c:2337 [inline] [<00000000f7bf23d1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2337 [<00000000a8b4131f>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:3 The problem was that the peer.cookie value points to an skb allocated area on the first pass through this function, at which point it is overwritten with a heap allocated value, but in certain cases, where a COOKIE_ECHO chunk is included in the packet, a second pass through sctp_process_init is made, where the cookie value is re-allocated, leaking the first allocation. Fix is to always allocate the cookie value, and free it when we are done using it. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> CC: "David S. Miller" <davem@davemloft.net> CC: netdev@vger.kernel.org Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05net: rds: fix memory leak when unload rds_rdmaZhu Yanjun2-1/+4
When KASAN is enabled, after several rds connections are created, then "rmmod rds_rdma" is run. The following will appear. " BUG rds_ib_incoming (Not tainted): Objects remaining in rds_ib_incoming on __kmem_cache_shutdown() Call Trace: dump_stack+0x71/0xab slab_err+0xad/0xd0 __kmem_cache_shutdown+0x17d/0x370 shutdown_cache+0x17/0x130 kmem_cache_destroy+0x1df/0x210 rds_ib_recv_exit+0x11/0x20 [rds_rdma] rds_ib_exit+0x7a/0x90 [rds_rdma] __x64_sys_delete_module+0x224/0x2c0 ? __ia32_sys_delete_module+0x2c0/0x2c0 do_syscall_64+0x73/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 " This is rds connection memory leak. The root cause is: When "rmmod rds_rdma" is run, rds_ib_remove_one will call rds_ib_dev_shutdown to drop the rds connections. rds_ib_dev_shutdown will call rds_conn_drop to drop rds connections as below. " rds_conn_path_drop(&conn->c_path[0], false); " In the above, destroy is set to false. void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy) { atomic_set(&cp->cp_state, RDS_CONN_ERROR); rcu_read_lock(); if (!destroy && rds_destroy_pending(cp->cp_conn)) { rcu_read_unlock(); return; } queue_work(rds_wq, &cp->cp_down_w); rcu_read_unlock(); } In the above function, destroy is set to false. rds_destroy_pending is called. This does not move rds connections to ib_nodev_conns. So destroy is set to true to move rds connections to ib_nodev_conns. In rds_ib_unregister_client, flush_workqueue is called to make rds_wq finsh shutdown rds connections. The function rds_ib_destroy_nodev_conns is called to shutdown rds connections finally. Then rds_ib_recv_exit is called to destroy slab. void rds_ib_recv_exit(void) { kmem_cache_destroy(rds_ib_incoming_slab); kmem_cache_destroy(rds_ib_frag_slab); } The above slab memory leak will not occur again. >From tests, 256 rds connections [root@ca-dev14 ~]# time rmmod rds_rdma real 0m16.522s user 0m0.000s sys 0m8.152s 512 rds connections [root@ca-dev14 ~]# time rmmod rds_rdma real 0m32.054s user 0m0.000s sys 0m15.568s To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed. And with 512 rds connections, about 32 seconds are needed. >From ftrace, when one rds connection is destroyed, " 19) | rds_conn_destroy [rds]() { 19) 7.782 us | rds_conn_path_drop [rds](); 15) | rds_shutdown_worker [rds]() { 15) | rds_conn_shutdown [rds]() { 15) 1.651 us | rds_send_path_reset [rds](); 15) 7.195 us | } 15) + 11.434 us | } 19) 2.285 us | rds_cong_remove_conn [rds](); 19) * 24062.76 us | } " So if many rds connections will be destroyed, this function rds_ib_destroy_nodev_conns uses most of time. Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05ipv6: fix the check before getting the cookie in rt6_get_cookieXin Long1-2/+1
In Jianlin's testing, netperf was broken with 'Connection reset by peer', as the cookie check failed in rt6_check() and ip6_dst_check() always returned NULL. It's caused by Commit 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes"), where the cookie can be got only when 'c1'(see below) for setting dst_cookie whereas rt6_check() is called when !'c1' for checking dst_cookie, as we can see in ip6_dst_check(). Since in ip6_dst_check() both rt6_dst_from_check() (c1) and rt6_check() (!c1) will check the 'from' cookie, this patch is to remove the c1 check in rt6_get_cookie(), so that the dst_cookie can always be set properly. c1: (rt->rt6i_flags & RTF_PCPU || unlikely(!list_empty(&rt->rt6i_uncached))) Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05ipv4: not do cache for local delivery if bc_forwarding is enabledXin Long1-12/+12
With the topo: h1 ---| rp1 | | route rp3 |--- h3 (192.168.200.1) h2 ---| rp2 | If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on h2, and the packets can still be forwared. This issue was caused by the input route cache. It should only do the cache for either bc forwarding or local delivery. Otherwise, local delivery can use the route cache for bc forwarding of other interfaces. This patch is to fix it by not doing cache for local delivery if all.bc_forwarding is enabled. Note that we don't fix it by checking route cache local flag after rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the common route code shouldn't be touched for bc_forwarding. Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05selftests: vm: Fix test build failure when built by itselfShuah Khan1-4/+0
vm test build fails when test is built by itself using make -C tools/testing/selftests/vm or cd tools/testing/selftests/vm; make When the test is built invoking its Makefile directly, it defines OUTPUT which conflicts with lib.mk's logic to install headers. make --no-builtin-rules INSTALL_HDR_PATH=$OUTPUT/usr \ ARCH=x86 -C ../../../.. headers_install make[1]: Entering directory '/mnt/data/lkml/linux_5.2' REMOVE shmparam.h rm: cannot remove '/usr/include/asm-generic/shmparam.h': Permission denied scripts/Makefile.headersinst:96: recipe for target '/usr/include/asm-generic/.install' failed make[3]: *** [/usr/include/asm-generic/.install] Error 1 scripts/Makefile.headersinst:32: recipe for target 'asm-generic' failed make[2]: *** [asm-generic] Error 2 Makefile:1199: recipe for target 'headers_install' failed make[1]: *** [headers_install] Error 2 make[1]: Leaving directory '/mnt/data/lkml/linux_5.2' ../lib.mk:52: recipe for target 'khdr' failed make: *** [khdr] Error 2 Fixes: 8ce72dc32578 ("selftests: fix headers_install circular dependency") Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2019-06-05Merge tag 'linux-kselftest-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftestLinus Torvalds4-2/+13
Pull Kselftest fixes from Shuah Khan: - fixes to cgroup tests (Alex Shi) - fix to userfaultfd compiler warning (Alakesh Haloi) - fix to vm install to include test script to run the test (Naresh Kamboju) * tag 'linux-kselftest-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests: vm: install test_vmalloc.sh for run_vmtests userfaultfd: selftest: fix compiler warning kselftest/cgroup: fix incorrect test_core skip kselftest/cgroup: fix unexpected testing failure on test_core kselftest/cgroup: fix unexpected testing failure on test_memcontrol
2019-06-05Merge tag 'pidfd-fixes-v5.2-rc4' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds3-6/+13
Pull pidfd fixes from Christian Brauner: "The contains two small patches to the pidfd samples and test binaries respectively. They were lacking appropriate ifdefines for __NR_pidfd_send_signal and could hence lead to compilation errors when that was not defined. This was spotted on mips independently by Guenter Roeck (who was kind enough to send a fix for the samples binary) and Arnd who spotted it in linux-next. Apart from these two patches, there's also a patch to update the comments for the pidfd_send_signal() syscall which were slightly wrong/inconsistenly worded" * tag 'pidfd-fixes-v5.2-rc4' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux: tests: fix pidfd-test compilation signal: improve comments samples: fix pidfd-metadata compilation
2019-06-05Merge tag 'pstore-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds2-15/+28
Pull pstore fixes from Kees Cook: - Avoid NULL deref when unloading/reloading ramoops module (Pi-Hsun Shih) - Run ramoops without crash dump region * tag 'pstore-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Run without kernel crash dump region pstore: Set tfm to NULL on free_buf_for_compression
2019-06-05mmc: also set max_segment_size in the deviceChristoph Hellwig1-0/+2
If we only set the max_segment_size on the queue an IOMMU merge might create bigger segments again, so limit the IOMMU merges as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-05mtip32xx: also set max_segment_size in the deviceChristoph Hellwig1-0/+1
If we only set the max_segment_size on the queue an IOMMU merge might create bigger segments again, so limit the IOMMU merges as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-05rsxx: don't call dma_set_max_seg_sizeChristoph Hellwig1-1/+0
This driver does never uses dma_map_sg, so the setting is rather pointless. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-05nvme-pci: don't limit DMA segement sizeChristoph Hellwig1-0/+6
NVMe uses PRPs (or optionally unlimited SGLs) for data transfers and has no specific limit for a single DMA segement. Limiting the size will cause problems because the block layer assumes PRP-ish devices using a virt boundary mask don't have a segment limit. And while this is true, we also really need to tell the DMA mapping layer about it, otherwise dma-debug will trip over it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Sebastian Ott <sebott@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-05Merge branch 's390-qeth-fixes'David S. Miller3-12/+44
Julian Wiedmann says: ==================== s390/qeth: fixes 2019-06-05 one more shot... now with patch 2 fixed up so that it uses the dst entry returned from dst_check(). From the v1 cover letter: Please apply the following set of qeth fixes to -net. - The first two patches fix issues in the L3 driver's cast type selection for transmitted skbs. - Alexandra adds a sanity check when retrieving VLAN information from neighbour address events. - The last patch adds some missing error handling for qeth's new multiqueue code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05s390/qeth: handle error when updating TX queue countJulian Wiedmann1-6/+16
netif_set_real_num_tx_queues() can return an error, deal with it. Fixes: 73dc2daf110f ("s390/qeth: add TX multiqueue support for OSA devices") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05s390/qeth: fix VLAN attribute in bridge_hostnotify udev eventAlexandra Winter1-1/+1
Enabling sysfs attribute bridge_hostnotify triggers a series of udev events for the MAC addresses of all currently connected peers. In case no VLAN is set for a peer, the device reports the corresponding MAC addresses with VLAN ID 4096. This currently results in attribute VLAN=4096 for all non-VLAN interfaces in the initial series of events after host-notify is enabled. Instead, no VLAN attribute should be reported in the udev event for non-VLAN interfaces. Only the initial events face this issue. For dynamic changes that are reported later, the device uses a validity flag. This also changes the code so that it now sets the VLAN attribute for MAC addresses with VID 0. On Linux, no qeth interface will ever be registered with VID 0: Linux kernel registers VID 0 on all network interfaces initially, but qeth will drop .ndo_vlan_rx_add_vid for VID 0. Peers with other OSs could register MACs with VID 0. Fixes: 9f48b9db9a22 ("qeth: bridgeport support - address notifications") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05s390/qeth: check dst entry before useJulian Wiedmann1-5/+25
While qeth_l3 uses netif_keep_dst() to hold onto the dst, a skb's dst may still have been obsoleted (via dst_dev_put()) by the time that we end up using it. The dst then points to the loopback interface, which means the neighbour lookup in qeth_l3_get_cast_type() determines a bogus cast type of RTN_BROADCAST. For IQD interfaces this causes us to place such skbs on the wrong HW queue, resulting in TX errors. Fix-up the various call sites to first validate the dst entry with dst_check(), and fall back accordingly. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05s390/qeth: handle limited IPv4 broadcast in L3 TX pathJulian Wiedmann1-0/+2
When selecting the cast type of a neighbourless IPv4 skb (eg. on a raw socket), qeth_l3 falls back to the packet's destination IP address. For this case we should classify traffic sent to 255.255.255.255 as broadcast. This fixes DHCP requests, which were misclassified as unicast (and for IQD interfaces thus ended up on the wrong HW queue). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05ceph: fix error handling in ceph_get_caps()Yan, Zheng1-11/+11
The function return 0 even when interrupted or try_get_cap_refs() return error. Fixes: 1199d7da2d ("ceph: simplify arguments and return semantics of try_get_cap_refs") Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-06-05ceph: avoid iput_final() while holding mutex or in dispatch threadYan, Zheng6-27/+71
iput_final() may wait for reahahead pages. The wait can cause deadlock. For example: Workqueue: ceph-msgr ceph_con_workfn [libceph] Call Trace: schedule+0x36/0x80 io_schedule+0x16/0x40 __lock_page+0x101/0x140 truncate_inode_pages_range+0x556/0x9f0 truncate_inode_pages_final+0x4d/0x60 evict+0x182/0x1a0 iput+0x1d2/0x220 iterate_session_caps+0x82/0x230 [ceph] dispatch+0x678/0xa80 [ceph] ceph_con_workfn+0x95b/0x1560 [libceph] process_one_work+0x14d/0x410 worker_thread+0x4b/0x460 kthread+0x105/0x140 ret_from_fork+0x22/0x40 Workqueue: ceph-msgr ceph_con_workfn [libceph] Call Trace: __schedule+0x3d6/0x8b0 schedule+0x36/0x80 schedule_preempt_disabled+0xe/0x10 mutex_lock+0x2f/0x40 ceph_check_caps+0x505/0xa80 [ceph] ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph] writepages_finish+0x2d3/0x410 [ceph] __complete_request+0x26/0x60 [libceph] handle_reply+0x6c8/0xa10 [libceph] dispatch+0x29a/0xbb0 [libceph] ceph_con_workfn+0x95b/0x1560 [libceph] process_one_work+0x14d/0x410 worker_thread+0x4b/0x460 kthread+0x105/0x140 ret_from_fork+0x22/0x40 In above example, truncate_inode_pages_range() waits for readahead pages while holding s_mutex. ceph_check_caps() waits for s_mutex and blocks OSD dispatch thread. Later OSD replies (for readahead) can't be handled. ceph_check_caps() also may lock snap_rwsem for read. So similar deadlock can happen if iput_final() is called while holding snap_rwsem. In general, it's not good to call iput_final() inside MDS/OSD dispatch threads or while holding any mutex. The fix is introducing ceph_async_iput(), which calls iput_final() in workqueue. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>