From ff9f84b7d79ddccab4c293c9d3e289f95ae594f7 Mon Sep 17 00:00:00 2001 From: Steve French Date: Sat, 26 Sep 2015 09:48:58 -0500 Subject: [SMB3] Missing null tcon check Pointed out by Dan Carpenter via smatch code analysis tool CC: Dan Carpenter Signed-off-by: Steve French --- fs/cifs/smb2pdu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index ce83e2edbe0a..597a417ba94d 100644 --- a/fs/cifs/smb2pdu.c +++ b/fs/cifs/smb2pdu.c @@ -922,7 +922,7 @@ SMB2_tcon(const unsigned int xid, struct cifs_ses *ses, const char *tree, if (tcon && tcon->bad_network_name) return -ENOENT; - if ((tcon->seal) && + if ((tcon && tcon->seal) && ((ses->server->capabilities & SMB2_GLOBAL_CAP_ENCRYPTION) == 0)) { cifs_dbg(VFS, "encryption requested but no server support"); return -EOPNOTSUPP; -- cgit v1.3-7-g2ca7 From cf6f54e3f133229f02a90c04fe0ff9dd9d3264b4 Mon Sep 17 00:00:00 2001 From: Richard Weinberger Date: Wed, 8 Jul 2015 11:46:36 +0200 Subject: UBIFS: Kill unneeded locking in ubifs_init_security Fixes the following lockdep splat: [ 1.244527] ============================================= [ 1.245193] [ INFO: possible recursive locking detected ] [ 1.245193] 4.2.0-rc1+ #37 Not tainted [ 1.245193] --------------------------------------------- [ 1.245193] cp/742 is trying to acquire lock: [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] ubifs_init_security+0x29/0xb0 [ 1.245193] [ 1.245193] but task is already holding lock: [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280 [ 1.245193] [ 1.245193] other info that might help us debug this: [ 1.245193] Possible unsafe locking scenario: [ 1.245193] [ 1.245193] CPU0 [ 1.245193] ---- [ 1.245193] lock(&sb->s_type->i_mutex_key#9); [ 1.245193] lock(&sb->s_type->i_mutex_key#9); [ 1.245193] [ 1.245193] *** DEADLOCK *** [ 1.245193] [ 1.245193] May be due to missing lock nesting notation [ 1.245193] [ 1.245193] 2 locks held by cp/742: [ 1.245193] #0: (sb_writers#5){.+.+.+}, at: [] mnt_want_write+0x1f/0x50 [ 1.245193] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280 [ 1.245193] [ 1.245193] stack backtrace: [ 1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37 [ 1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014 [ 1.245193] ffffffff8252d530 ffff88007b023a38 ffffffff814f6f49 ffffffff810b56c5 [ 1.245193] ffff88007c30cc80 ffff88007b023af8 ffffffff810a150d ffff88007b023a68 [ 1.245193] 000000008101302a ffff880000000000 00000008f447e23f ffffffff8252d500 [ 1.245193] Call Trace: [ 1.245193] [] dump_stack+0x4c/0x65 [ 1.245193] [] ? console_unlock+0x1c5/0x510 [ 1.245193] [] __lock_acquire+0x1a6d/0x1ea0 [ 1.245193] [] ? __lock_is_held+0x58/0x80 [ 1.245193] [] lock_acquire+0xd3/0x270 [ 1.245193] [] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [] mutex_lock_nested+0x6b/0x3a0 [ 1.245193] [] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [] ? ubifs_init_security+0x29/0xb0 [ 1.245193] [] ubifs_init_security+0x29/0xb0 [ 1.245193] [] ubifs_create+0xa6/0x1f0 [ 1.245193] [] ? path_openat+0x3af/0x1280 [ 1.245193] [] vfs_create+0x95/0xc0 [ 1.245193] [] path_openat+0x7cc/0x1280 [ 1.245193] [] ? __lock_acquire+0x543/0x1ea0 [ 1.245193] [] ? sched_clock_cpu+0x90/0xc0 [ 1.245193] [] ? calc_global_load_tick+0x60/0x90 [ 1.245193] [] ? sched_clock_cpu+0x90/0xc0 [ 1.245193] [] ? __alloc_fd+0xaf/0x180 [ 1.245193] [] do_filp_open+0x75/0xd0 [ 1.245193] [] ? _raw_spin_unlock+0x26/0x40 [ 1.245193] [] ? __alloc_fd+0xaf/0x180 [ 1.245193] [] do_sys_open+0x129/0x200 [ 1.245193] [] SyS_open+0x19/0x20 [ 1.245193] [] entry_SYSCALL_64_fastpath+0x12/0x6f While the lockdep splat is a false positive, becuase path_openat holds i_mutex of the parent directory and ubifs_init_security() tries to acquire i_mutex of a new inode, it reveals that taking i_mutex in ubifs_init_security() is in vain because it is only being called in the inode allocation path and therefore nobody else can see the inode yet. Cc: stable@vger.kernel.org # 3.20- Reported-and-tested-by: Boris Brezillon Reviewed-and-tested-by: Dongsheng Yang Signed-off-by: Richard Weinberger Signed-off-by: dedekind1@gmail.com --- fs/ubifs/xattr.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'fs') diff --git a/fs/ubifs/xattr.c b/fs/ubifs/xattr.c index 96f3448b6eb4..fd65b3f1923c 100644 --- a/fs/ubifs/xattr.c +++ b/fs/ubifs/xattr.c @@ -652,11 +652,8 @@ int ubifs_init_security(struct inode *dentry, struct inode *inode, { int err; - mutex_lock(&inode->i_mutex); err = security_inode_init_security(inode, dentry, qstr, &init_xattrs, 0); - mutex_unlock(&inode->i_mutex); - if (err) { struct ubifs_info *c = dentry->i_sb->s_fs_info; ubifs_err(c, "cannot initialize security for inode %lu, error %d", -- cgit v1.3-7-g2ca7 From 8346c416d17bf5b4ea1508662959bb62e73fd6a5 Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Thu, 1 Oct 2015 15:36:59 -0700 Subject: dax: fix NULL pointer in __dax_pmd_fault() Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") moved some code in __dax_pmd_fault() that was responsible for zeroing newly allocated PMD pages. The new location didn't properly set up 'kaddr', so when run this code resulted in a NULL pointer BUG. Fix this by getting the correct 'kaddr' via bdev_direct_access(). Signed-off-by: Ross Zwisler Reported-by: Dan Williams Reviewed-by: Dan Williams Cc: Alexander Viro Cc: Matthew Wilcox Cc: "Kirill A. Shutemov" Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/dax.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/dax.c b/fs/dax.c index 7ae6df7ea1d2..bcfb14bfc1e4 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -569,8 +569,20 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) goto fallback; + sector = bh.b_blocknr << (blkbits - 9); + if (buffer_unwritten(&bh) || buffer_new(&bh)) { int i; + + length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, + bh.b_size); + if (length < 0) { + result = VM_FAULT_SIGBUS; + goto out; + } + if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) + goto fallback; + for (i = 0; i < PTRS_PER_PMD; i++) clear_pmem(kaddr + i * PAGE_SIZE, PAGE_SIZE); wmb_pmem(); @@ -623,7 +635,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, result = VM_FAULT_NOPAGE; spin_unlock(ptl); } else { - sector = bh.b_blocknr << (blkbits - 9); length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, bh.b_size); if (length < 0) { -- cgit v1.3-7-g2ca7 From 646200a041203f440fb6fcf9cacd9efeda9de74c Mon Sep 17 00:00:00 2001 From: Steve French Date: Mon, 28 Sep 2015 17:21:07 -0500 Subject: [SMB3] Do not fall back to SMBWriteX in set_file_size error cases The error paths in set_file_size for cifs and smb3 are incorrect. In the unlikely event that a server did not support set file info of the file size, the code incorrectly falls back to trying SMBWriteX (note that only the original core SMB Write, used for example by DOS, can set the file size this way - this actually does not work for the more recent SMBWriteX). The idea was since the old DOS SMB Write could set the file size if you write zero bytes at that offset then use that if server rejects the normal set file info call. Fortunately the SMBWriteX will never be sent on the wire (except when file size is zero) since the length and offset fields were reversed in the two places in this function that call SMBWriteX causing the fall back path to return an error. It is also important to never call an SMB request from an SMB2/sMB3 session (which theoretically would be possible, and can cause a brief session drop, although the client recovers) so this should be fixed. In practice this path does not happen with modern servers but the error fall back to SMBWriteX is clearly wrong. Removing the calls to SMBWriteX in the error paths in cifs_set_file_size Pointed out by PaX/grsecurity team Signed-off-by: Steve French Reported-by: PaX Team CC: Emese Revfy CC: Brad Spengler CC: Stable --- fs/cifs/inode.c | 34 ---------------------------------- 1 file changed, 34 deletions(-) (limited to 'fs') diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index f621b44cb800..6b66dd5d1540 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -2034,7 +2034,6 @@ cifs_set_file_size(struct inode *inode, struct iattr *attrs, struct tcon_link *tlink = NULL; struct cifs_tcon *tcon = NULL; struct TCP_Server_Info *server; - struct cifs_io_parms io_parms; /* * To avoid spurious oplock breaks from server, in the case of @@ -2056,18 +2055,6 @@ cifs_set_file_size(struct inode *inode, struct iattr *attrs, rc = -ENOSYS; cifsFileInfo_put(open_file); cifs_dbg(FYI, "SetFSize for attrs rc = %d\n", rc); - if ((rc == -EINVAL) || (rc == -EOPNOTSUPP)) { - unsigned int bytes_written; - - io_parms.netfid = open_file->fid.netfid; - io_parms.pid = open_file->pid; - io_parms.tcon = tcon; - io_parms.offset = 0; - io_parms.length = attrs->ia_size; - rc = CIFSSMBWrite(xid, &io_parms, &bytes_written, - NULL, NULL, 1); - cifs_dbg(FYI, "Wrt seteof rc %d\n", rc); - } } else rc = -EINVAL; @@ -2093,28 +2080,7 @@ cifs_set_file_size(struct inode *inode, struct iattr *attrs, else rc = -ENOSYS; cifs_dbg(FYI, "SetEOF by path (setattrs) rc = %d\n", rc); - if ((rc == -EINVAL) || (rc == -EOPNOTSUPP)) { - __u16 netfid; - int oplock = 0; - rc = SMBLegacyOpen(xid, tcon, full_path, FILE_OPEN, - GENERIC_WRITE, CREATE_NOT_DIR, &netfid, - &oplock, NULL, cifs_sb->local_nls, - cifs_remap(cifs_sb)); - if (rc == 0) { - unsigned int bytes_written; - - io_parms.netfid = netfid; - io_parms.pid = current->tgid; - io_parms.tcon = tcon; - io_parms.offset = 0; - io_parms.length = attrs->ia_size; - rc = CIFSSMBWrite(xid, &io_parms, &bytes_written, NULL, - NULL, 1); - cifs_dbg(FYI, "wrt seteof rc %d\n", rc); - CIFSSMBClose(xid, tcon, netfid); - } - } if (tlink) cifs_put_tlink(tlink); -- cgit v1.3-7-g2ca7 From 40f90271a835478e5910dc84f32b8e25885419a4 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Thu, 1 Oct 2015 11:36:38 -0400 Subject: NFS: Fix up page writeback accounting Currently, we are crediting all the calls to nfs_writepages_callback() (i.e. the nfs_writepages() callback) to nfs_writepage(). Aside from being inconsistent with the behaviour of the equivalent readpage/readpages accounting, this also means that we cannot distinguish between bulk writes and single page writebacks (which confuses the 'nfsiostat -p' tool). Signed-off-by: Trond Myklebust --- fs/nfs/write.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'fs') diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 72624dc4a623..cfa47761a8ce 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -569,19 +569,17 @@ static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio, if (!nfs_pageio_add_request(pgio, req)) { nfs_redirty_request(req); ret = pgio->pg_error; - } + } else + nfs_add_stats(page_file_mapping(page)->host, + NFSIOS_WRITEPAGES, 1); out: return ret; } static int nfs_do_writepage(struct page *page, struct writeback_control *wbc, struct nfs_pageio_descriptor *pgio) { - struct inode *inode = page_file_mapping(page)->host; int ret; - nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE); - nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1); - nfs_pageio_cond_complete(pgio, page_file_index(page)); ret = nfs_page_async_flush(pgio, page, wbc->sync_mode == WB_SYNC_NONE); if (ret == -EAGAIN) { @@ -597,9 +595,11 @@ static int nfs_do_writepage(struct page *page, struct writeback_control *wbc, st static int nfs_writepage_locked(struct page *page, struct writeback_control *wbc) { struct nfs_pageio_descriptor pgio; + struct inode *inode = page_file_mapping(page)->host; int err; - nfs_pageio_init_write(&pgio, page->mapping->host, wb_priority(wbc), + nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE); + nfs_pageio_init_write(&pgio, inode, wb_priority(wbc), false, &nfs_async_write_completion_ops); err = nfs_do_writepage(page, wbc, &pgio); nfs_pageio_complete(&pgio); -- cgit v1.3-7-g2ca7 From 8fa4592a14ebb3c22a21d846d1e4f65dab7d1a7c Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Thu, 1 Oct 2015 18:38:27 -0400 Subject: NFS: Fix a write performance regression If all other conditions in nfs_can_extend_write() are met, and there are no locks, then we should be able to assume close-to-open semantics and the ability to extend our write to cover the whole page. With this patch, the xfstests generic/074 test completes in 242s instead of >1400s on my test rig. Fixes: bd61e0a9c852 ("locks: convert posix locks to file_lock_context") Cc: Jeff Layton Signed-off-by: Trond Myklebust --- fs/nfs/write.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/nfs/write.c b/fs/nfs/write.c index cfa47761a8ce..75ab7622e0cc 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1223,7 +1223,7 @@ static int nfs_can_extend_write(struct file *file, struct page *page, struct ino return 1; if (!flctx || (list_empty_careful(&flctx->flc_flock) && list_empty_careful(&flctx->flc_posix))) - return 0; + return 1; /* Check to see if there are whole file write locks */ ret = 0; -- cgit v1.3-7-g2ca7 From 4a0954ef347de7409ddf8f8153d893827d3feba8 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Fri, 2 Oct 2015 11:11:16 -0400 Subject: NFSv4: Don't try to reclaim unused state owners Currently, we don't test if the state owner is in use before we try to recover it. The problem is that if the refcount is zero, then the state owner will be waiting on the lru list for garbage collection. The expectation in that case is that if you bump the refcount, then you must also remove the state owner from the lru list. Otherwise the call to nfs4_put_state_owner will corrupt that list by trying to add our state owner a second time. Avoid the whole problem by just skipping state owners that hold no state. Reported-by: Andrew W Elble Signed-off-by: Trond Myklebust --- fs/nfs/nfs4state.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index 5db324635e92..d854693a15b0 100644 --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -1725,7 +1725,8 @@ restart: if (!test_and_clear_bit(ops->owner_flag_bit, &sp->so_flags)) continue; - atomic_inc(&sp->so_count); + if (!atomic_inc_not_zero(&sp->so_count)) + continue; spin_unlock(&clp->cl_lock); rcu_read_unlock(); -- cgit v1.3-7-g2ca7 From 72d79ff83c34d430a9f7f72c45717905762ef4d9 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Fri, 2 Oct 2015 11:44:54 -0400 Subject: NFSv4.1: nfs4_opendata_check_deleg needs to handle NFS4_OPEN_CLAIM_DELEG_CUR_FH We need to warn against broken NFSv4.1 servers that try to hand out delegations in response to NFS4_OPEN_CLAIM_DELEG_CUR_FH. Signed-off-by: Trond Myklebust --- fs/nfs/nfs4proc.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'fs') diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index f93b9cdb4934..12e9808d5df0 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -1458,12 +1458,18 @@ nfs4_opendata_check_deleg(struct nfs4_opendata *data, struct nfs4_state *state) if (delegation) delegation_flags = delegation->flags; rcu_read_unlock(); - if (data->o_arg.claim == NFS4_OPEN_CLAIM_DELEGATE_CUR) { + switch (data->o_arg.claim) { + default: + break; + case NFS4_OPEN_CLAIM_DELEGATE_CUR: + case NFS4_OPEN_CLAIM_DELEG_CUR_FH: pr_err_ratelimited("NFS: Broken NFSv4 server %s is " "returning a delegation for " "OPEN(CLAIM_DELEGATE_CUR)\n", clp->cl_hostname); - } else if ((delegation_flags & 1UL<inode, data->owner->so_cred, &data->o_res); -- cgit v1.3-7-g2ca7 From e92c1e0d40c50472f80820bd829645ce9fefd6c1 Mon Sep 17 00:00:00 2001 From: Benjamin Coddington Date: Thu, 1 Oct 2015 09:17:33 -0400 Subject: NFSv4: Fix a nograce recovery hang Since commit 5cae02f42793130e1387f4ec09c4d07056ce9fa5 an OPEN_CONFIRM should have a privileged sequence in the recovery case to allow nograce recovery to proceed for NFSv4.0. Signed-off-by: Benjamin Coddington Signed-off-by: Trond Myklebust --- fs/nfs/nfs4proc.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'fs') diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 12e9808d5df0..83d577085bd9 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -1869,6 +1869,8 @@ static int _nfs4_proc_open_confirm(struct nfs4_opendata *data) data->rpc_done = 0; data->rpc_status = 0; data->timestamp = jiffies; + if (data->is_recover) + nfs4_set_sequence_privileged(&data->c_arg.seq_args); task = rpc_run_task(&task_setup_data); if (IS_ERR(task)) return PTR_ERR(task); -- cgit v1.3-7-g2ca7 From 5e99b532bb95f8e6bf39f4500f0caef070bac16d Mon Sep 17 00:00:00 2001 From: Jeff Layton Date: Fri, 2 Oct 2015 13:14:37 -0400 Subject: nfs4: reset states to use open_stateid when returning delegation voluntarily When the client goes to return a delegation, it should always update any nfs4_state currently set up to use that delegation stateid to instead use the open stateid. It already does do this in some cases, particularly in the state recovery code, but not currently when the delegation is voluntarily returned (e.g. in advance of a RENAME). This causes the client to try to continue using the delegation stateid after the DELEGRETURN, e.g. in LAYOUTGET. Set the nfs4_state back to using the open stateid in nfs4_open_delegation_recall, just before clearing the NFS_DELEGATED_STATE bit. Signed-off-by: Jeff Layton Signed-off-by: Trond Myklebust --- fs/nfs/nfs4proc.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'fs') diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 83d577085bd9..5133bb18830e 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -1777,6 +1777,9 @@ int nfs4_open_delegation_recall(struct nfs_open_context *ctx, if (IS_ERR(opendata)) return PTR_ERR(opendata); nfs4_stateid_copy(&opendata->o_arg.u.delegation, stateid); + write_seqlock(&state->seqlock); + nfs4_stateid_copy(&state->stateid, &state->open_stateid); + write_sequnlock(&state->seqlock); clear_bit(NFS_DELEGATED_STATE, &state->flags); switch (type & (FMODE_READ|FMODE_WRITE)) { case FMODE_READ|FMODE_WRITE: -- cgit v1.3-7-g2ca7 From 616a5399b8f11a526cedcbaeb68bc0a1b08a3f43 Mon Sep 17 00:00:00 2001 From: Steve French Date: Sat, 3 Oct 2015 16:54:17 -0500 Subject: [CIFS] Update cifs version number Update modinfo cifs.ko version number to 2.08 Signed-off-by: Steve French --- fs/cifs/cifsfs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h index 27aea110e923..c3cc1609025f 100644 --- a/fs/cifs/cifsfs.h +++ b/fs/cifs/cifsfs.h @@ -136,5 +136,5 @@ extern long cifs_ioctl(struct file *filep, unsigned int cmd, unsigned long arg); extern const struct export_operations cifs_export_ops; #endif /* CONFIG_CIFS_NFSD_EXPORT */ -#define CIFS_VERSION "2.07" +#define CIFS_VERSION "2.08" #endif /* _CIFSFS_H */ -- cgit v1.3-7-g2ca7 From b786f16ac3c5d4f7a5fd136656b6a1301b29b73b Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Sat, 26 Sep 2015 15:30:23 +0100 Subject: Btrfs: send, fix corner case for reference overwrite detection When the inode given to did_overwrite_ref() matches the current progress and has a reference that collides with the reference of other inode that has the same number as the current progress, we were always telling our caller that the inode's reference was overwritten, which is incorrect because the other inode might be a new inode (different generation number) in which case we must return false from did_overwrite_ref() so that its callers don't use an orphanized path for the inode (as it will never be orphanized, instead it will be unlinked and the new inode created later). The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -fr $send_files_dir rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _need_to_be_root send_files_dir=$TEST_DIR/btrfs-test-$seq rm -f $seqres.full rm -fr $send_files_dir mkdir $send_files_dir _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test file with a single extent of 64K. mkdir -p $SCRATCH_MNT/foo $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \ | _filter_xfs_io _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \ $SCRATCH_MNT/mysnap1 _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \ $SCRATCH_MNT/mysnap2 echo "File digest before being replaced:" md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch # Remove the file and then create a new one in the same location with # the same name but with different content. This new file ends up # getting the same inode number as the previous one, because that inode # number was the highest inode number used by the snapshot's root and # therefore when attempting to find the a new inode number for the new # file, we end up reusing the same inode number. This happens because # currently btrfs uses the highest inode number summed by 1 for the # first inode created once a snapshot's root is loaded (done at # fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel # tree). # Having these two different files in the snapshots with the same inode # number (but different generation numbers) caused the btrfs send code # to emit an incorrect path for the file when issuing an unlink # operation because it failed to realize they were different files. rm -f $SCRATCH_MNT/mysnap2/foo/bar $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \ $SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \ $SCRATCH_MNT/mysnap2_ro _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \ $SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap echo "File digest in the original filesystem after being replaced:" md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch # Now recreate the filesystem by receiving both send streams and verify # we get the same file contents that the original filesystem had. _scratch_unmount _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap echo "File digest in the new filesystem:" # Must match the digest from the new file. md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch status=0 exit Reported-by: Martin Raiber Fixes: 8b191a684968 ("Btrfs: incremental send, check if orphanized dir inode needs delayed rename") Signed-off-by: Filipe Manana --- fs/btrfs/send.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'fs') diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index aa72bfd28f7d..a739b825bdd3 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -1920,10 +1920,12 @@ static int did_overwrite_ref(struct send_ctx *sctx, /* * We know that it is or will be overwritten. Check this now. * The current inode being processed might have been the one that caused - * inode 'ino' to be orphanized, therefore ow_inode can actually be the - * same as sctx->send_progress. + * inode 'ino' to be orphanized, therefore check if ow_inode matches + * the current inode being processed. */ - if (ow_inode <= sctx->send_progress) + if ((ow_inode < sctx->send_progress) || + (ino != sctx->cur_ino && ow_inode == sctx->cur_ino && + gen == sctx->cur_inode_gen)) ret = 1; else ret = 0; -- cgit v1.3-7-g2ca7 From 808f80b46790f27e145c72112189d6a3be2bc884 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 28 Sep 2015 09:56:26 +0100 Subject: Btrfs: update fix for read corruption of compressed and shared extents My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of compressed and shared extents") was effective only if the compressed extents cover a file range with a length that is not a multiple of 16 pages. That's because the detection of when we reached a different range of the file that shares the same compressed extent as the previously processed range was done at extent_io.c:__do_contiguous_readpages(), which covers subranges with a length up to 16 pages, because extent_readpages() groups the pages in clusters no larger than 16 pages. So fix this by tracking the start of the previously processed file range's extent map at extent_readpages(). The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create our test file with a single extent of 64Kb that is going to # be compressed no matter which compression algo is used (zlib/lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone the compressed extent into an adjacent file offset. $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo echo "File digest before unmount:" md5sum $SCRATCH_MNT/foo | _filter_scratch # Remount the fs or clear the page cache to trigger the bug in # btrfs. Because the extent has an uncompressed length that is a # multiple of 16 pages, all the pages belonging to the second range # of the file (64K to 128K), which points to the same extent as the # first range (0K to 64K), had their contents full of zeroes instead # of the byte 0xaa. This was a bug exclusively in the read path of # compressed extents, the correct data was stored on disk, btrfs # just failed to fill in the pages correctly. _scratch_remount echo "File digest after remount:" # Must match the digest we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana Tested-by: Timofey Titovets --- fs/btrfs/extent_io.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) (limited to 'fs') diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 11aa8f743b90..363726b08a51 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3144,12 +3144,12 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree, get_extent_t *get_extent, struct extent_map **em_cached, struct bio **bio, int mirror_num, - unsigned long *bio_flags, int rw) + unsigned long *bio_flags, int rw, + u64 *prev_em_start) { struct inode *inode; struct btrfs_ordered_extent *ordered; int index; - u64 prev_em_start = (u64)-1; inode = pages[0]->mapping->host; while (1) { @@ -3165,7 +3165,7 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree, for (index = 0; index < nr_pages; index++) { __do_readpage(tree, pages[index], get_extent, em_cached, bio, - mirror_num, bio_flags, rw, &prev_em_start); + mirror_num, bio_flags, rw, prev_em_start); page_cache_release(pages[index]); } } @@ -3175,7 +3175,8 @@ static void __extent_readpages(struct extent_io_tree *tree, int nr_pages, get_extent_t *get_extent, struct extent_map **em_cached, struct bio **bio, int mirror_num, - unsigned long *bio_flags, int rw) + unsigned long *bio_flags, int rw, + u64 *prev_em_start) { u64 start = 0; u64 end = 0; @@ -3196,7 +3197,7 @@ static void __extent_readpages(struct extent_io_tree *tree, index - first_index, start, end, get_extent, em_cached, bio, mirror_num, bio_flags, - rw); + rw, prev_em_start); start = page_start; end = start + PAGE_CACHE_SIZE - 1; first_index = index; @@ -3207,7 +3208,8 @@ static void __extent_readpages(struct extent_io_tree *tree, __do_contiguous_readpages(tree, &pages[first_index], index - first_index, start, end, get_extent, em_cached, bio, - mirror_num, bio_flags, rw); + mirror_num, bio_flags, rw, + prev_em_start); } static int __extent_read_full_page(struct extent_io_tree *tree, @@ -4218,6 +4220,7 @@ int extent_readpages(struct extent_io_tree *tree, struct page *page; struct extent_map *em_cached = NULL; int nr = 0; + u64 prev_em_start = (u64)-1; for (page_idx = 0; page_idx < nr_pages; page_idx++) { page = list_entry(pages->prev, struct page, lru); @@ -4234,12 +4237,12 @@ int extent_readpages(struct extent_io_tree *tree, if (nr < ARRAY_SIZE(pagepool)) continue; __extent_readpages(tree, pagepool, nr, get_extent, &em_cached, - &bio, 0, &bio_flags, READ); + &bio, 0, &bio_flags, READ, &prev_em_start); nr = 0; } if (nr) __extent_readpages(tree, pagepool, nr, get_extent, &em_cached, - &bio, 0, &bio_flags, READ); + &bio, 0, &bio_flags, READ, &prev_em_start); if (em_cached) free_extent_map(em_cached); -- cgit v1.3-7-g2ca7 From d9a0540a79f87456907f2ce031f058cf745c5bff Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Sat, 3 Oct 2015 13:13:13 +0100 Subject: Btrfs: fix deadlock when finalizing block group creation Josef ran into a deadlock while a transaction handle was finalizing the creation of its block groups, which produced the following trace: [260445.593112] fio D ffff88022a9df468 0 8924 4518 0x00000084 [260445.593119] ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488 [260445.593126] ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0 [260445.593132] ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00 [260445.593137] Call Trace: [260445.593145] [] schedule+0x37/0x80 [260445.593189] [] btrfs_tree_lock+0xa7/0x1f0 [btrfs] [260445.593197] [] ? prepare_to_wait_event+0xf0/0xf0 [260445.593225] [] btrfs_lock_root_node+0x34/0x50 [btrfs] [260445.593253] [] btrfs_search_slot+0x88b/0xa00 [btrfs] [260445.593295] [] ? free_extent_buffer+0x4f/0x90 [btrfs] [260445.593324] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [260445.593351] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [260445.593394] [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs] [260445.593427] [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs] [260445.593459] [] do_chunk_alloc+0x2a4/0x2e0 [btrfs] [260445.593491] [] find_free_extent+0xa55/0xd90 [btrfs] [260445.593524] [] btrfs_reserve_extent+0xd2/0x220 [btrfs] [260445.593532] [] ? account_page_dirtied+0xdd/0x170 [260445.593564] [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs] [260445.593597] [] ? btree_set_page_dirty+0xe/0x10 [btrfs] [260445.593626] [] __btrfs_cow_block+0x12d/0x5b0 [btrfs] [260445.593654] [] btrfs_cow_block+0x11f/0x1c0 [btrfs] [260445.593682] [] btrfs_search_slot+0x1e7/0xa00 [btrfs] [260445.593724] [] ? free_extent_buffer+0x4f/0x90 [btrfs] [260445.593752] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [260445.593830] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [260445.593905] [] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs] [260445.593946] [] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs] [260445.593990] [] btrfs_commit_transaction+0xa8/0xb40 [btrfs] [260445.594042] [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs] [260445.594089] [] btrfs_sync_file+0x294/0x350 [btrfs] [260445.594115] [] vfs_fsync_range+0x3b/0xa0 [260445.594133] [] ? syscall_trace_enter_phase1+0x131/0x180 [260445.594149] [] do_fsync+0x3d/0x70 [260445.594169] [] ? syscall_trace_leave+0xb8/0x110 [260445.594187] [] SyS_fsync+0x10/0x20 [260445.594204] [] entry_SYSCALL_64_fastpath+0x12/0x71 This happened because the same transaction handle created a large number of block groups and while finalizing their creation (inserting new items and updating existing items in the chunk and device trees) a new metadata extent had to be allocated and no free space was found in the current metadata block groups, which made find_free_extent() attempt to allocate a new block group via do_chunk_alloc(). However at do_chunk_alloc() we ended up allocating a new system chunk too and exceeded the threshold of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the final part of block group creation again (at btrfs_create_pending_block_groups()) and attempt to lock again the root of the chunk tree when it's already write locked by the same task. Similarly we can deadlock on extent tree nodes/leafs if while we are running delayed references we end up creating a new metadata block group in order to allocate a new node/leaf for the extent tree (as part of a CoW operation or growing the tree), as btrfs_create_pending_block_groups inserts items into the extent tree as well. In this case we get the following trace: [14242.773581] fio D ffff880428ca3418 0 3615 3100 0x00000084 [14242.773588] ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438 [14242.773594] ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460 [14242.773600] ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190 [14242.773606] Call Trace: [14242.773613] [] schedule+0x37/0x80 [14242.773656] [] btrfs_tree_lock+0xa7/0x1f0 [btrfs] [14242.773664] [] ? prepare_to_wait_event+0xf0/0xf0 [14242.773692] [] btrfs_lock_root_node+0x34/0x50 [btrfs] [14242.773720] [] btrfs_search_slot+0x88b/0xa00 [btrfs] [14242.773750] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [14242.773758] [] ? kmem_cache_alloc+0x1d2/0x200 [14242.773786] [] btrfs_insert_item+0x71/0xf0 [btrfs] [14242.773818] [] btrfs_create_pending_block_groups+0x102/0x200 [btrfs] [14242.773850] [] do_chunk_alloc+0x2ae/0x2f0 [btrfs] [14242.773934] [] find_free_extent+0xa55/0xd90 [btrfs] [14242.773998] [] btrfs_reserve_extent+0xc2/0x1d0 [btrfs] [14242.774041] [] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs] [14242.774078] [] __btrfs_cow_block+0x12d/0x5b0 [btrfs] [14242.774118] [] btrfs_cow_block+0x11f/0x1c0 [btrfs] [14242.774155] [] btrfs_search_slot+0x1e7/0xa00 [btrfs] [14242.774194] [] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs] [14242.774235] [] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [14242.774274] [] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [14242.774318] [] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs] [14242.774358] [] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs] [14242.774391] [] btrfs_run_delayed_refs+0x17/0x20 [btrfs] [14242.774432] [] commit_cowonly_roots+0x8d/0x2bd [btrfs] [14242.774474] [] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs] [14242.774516] [] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs] [14242.774558] [] btrfs_commit_transaction+0x590/0xb40 [btrfs] [14242.774599] [] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs] [14242.774642] [] btrfs_sync_file+0x294/0x350 [btrfs] [14242.774650] [] vfs_fsync_range+0x3b/0xa0 [14242.774657] [] ? syscall_trace_enter_phase1+0x131/0x180 [14242.774663] [] do_fsync+0x3d/0x70 [14242.774669] [] ? syscall_trace_leave+0xb8/0x110 [14242.774675] [] SyS_fsync+0x10/0x20 [14242.774681] [] entry_SYSCALL_64_fastpath+0x12/0x71 Fix this by never recursing into the finalization phase of block group creation and making sure we never trigger the finalization of block group creation while running delayed references. Reported-by: Josef Bacik Fixes: 00d80e342c0f ("Btrfs: fix quick exhaustion of the system array in the superblock") Signed-off-by: Filipe Manana --- fs/btrfs/extent-tree.c | 9 ++++++++- fs/btrfs/transaction.c | 1 + fs/btrfs/transaction.h | 1 + 3 files changed, 10 insertions(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 9f9604201333..601d7d45d164 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2828,6 +2828,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head; int ret; int run_all = count == (unsigned long)-1; + bool can_flush_pending_bgs = trans->can_flush_pending_bgs; /* We'll clean this up in btrfs_cleanup_transaction */ if (trans->aborted) @@ -2844,6 +2845,7 @@ again: #ifdef SCRAMBLE_DELAYED_REFS delayed_refs->run_delayed_start = find_middle(&delayed_refs->root); #endif + trans->can_flush_pending_bgs = false; ret = __btrfs_run_delayed_refs(trans, root, count); if (ret < 0) { btrfs_abort_transaction(trans, root, ret); @@ -2893,6 +2895,7 @@ again: } out: assert_qgroups_uptodate(trans); + trans->can_flush_pending_bgs = can_flush_pending_bgs; return 0; } @@ -4306,7 +4309,8 @@ out: * the block groups that were made dirty during the lifetime of the * transaction. */ - if (trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) { + if (trans->can_flush_pending_bgs && + trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) { btrfs_create_pending_block_groups(trans, trans->root); btrfs_trans_release_chunk_metadata(trans); } @@ -9560,7 +9564,9 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, struct btrfs_block_group_item item; struct btrfs_key key; int ret = 0; + bool can_flush_pending_bgs = trans->can_flush_pending_bgs; + trans->can_flush_pending_bgs = false; list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) { if (ret) goto next; @@ -9581,6 +9587,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, next: list_del_init(&block_group->bg_list); } + trans->can_flush_pending_bgs = can_flush_pending_bgs; } int btrfs_make_block_group(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a2d6f7bcef6c..376191c7da13 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -557,6 +557,7 @@ again: h->delayed_ref_elem.seq = 0; h->type = type; h->allocating_chunk = false; + h->can_flush_pending_bgs = true; h->reloc_reserved = false; h->sync = false; INIT_LIST_HEAD(&h->qgroup_ref_list); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 87964bf8892d..a994bb097ee5 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -118,6 +118,7 @@ struct btrfs_trans_handle { short aborted; short adding_csums; bool allocating_chunk; + bool can_flush_pending_bgs; bool reloc_reserved; bool sync; unsigned int type; -- cgit v1.3-7-g2ca7 From e5fffbac4a49270e4976d71a0e054c0cf3ef4f8e Mon Sep 17 00:00:00 2001 From: chandan Date: Mon, 5 Oct 2015 22:14:25 +0530 Subject: Btrfs: open_ctree: Fix possible memory leak After reading one of chunk or tree root tree's root node from disk, if the root node does not have EXTENT_BUFFER_UPTODATE flag set, we fail to release the memory used by the root node. Fix this. Signed-off-by: Chandan Rajendra --- fs/btrfs/disk-io.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'fs') diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index aa59871885da..807f6854acaa 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2842,6 +2842,8 @@ int open_ctree(struct super_block *sb, !extent_buffer_uptodate(chunk_root->node)) { printk(KERN_ERR "BTRFS: failed to read chunk root on %s\n", sb->s_id); + if (!IS_ERR(chunk_root->node)) + free_extent_buffer(chunk_root->node); chunk_root->node = NULL; goto fail_tree_roots; } @@ -2880,6 +2882,8 @@ retry_root_backup: !extent_buffer_uptodate(tree_root->node)) { printk(KERN_WARNING "BTRFS: failed to read tree root on %s\n", sb->s_id); + if (!IS_ERR(tree_root->node)) + free_extent_buffer(tree_root->node); tree_root->node = NULL; goto recovery_tree_root; } -- cgit v1.3-7-g2ca7 From 7d35199e15b82a4d1a20049153b03e6258ce79f7 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 8 May 2015 10:16:23 +1000 Subject: BTRFS: support NFSv2 export The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as that returned by encode_fh - it may be larger. With NFSv2, the filehandle is fixed length, so it may appear longer than expected and be zero-padded. So we must test that fh_len is at least some value, not exactly equal to it. Signed-off-by: NeilBrown Acked-by: David Sterba --- fs/btrfs/export.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'fs') diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c index 8d052209f473..2513a7f53334 100644 --- a/fs/btrfs/export.c +++ b/fs/btrfs/export.c @@ -112,11 +112,11 @@ static struct dentry *btrfs_fh_to_parent(struct super_block *sb, struct fid *fh, u32 generation; if (fh_type == FILEID_BTRFS_WITH_PARENT) { - if (fh_len != BTRFS_FID_SIZE_CONNECTABLE) + if (fh_len < BTRFS_FID_SIZE_CONNECTABLE) return NULL; root_objectid = fid->root_objectid; } else if (fh_type == FILEID_BTRFS_WITH_PARENT_ROOT) { - if (fh_len != BTRFS_FID_SIZE_CONNECTABLE_ROOT) + if (fh_len < BTRFS_FID_SIZE_CONNECTABLE_ROOT) return NULL; root_objectid = fid->parent_root_objectid; } else @@ -136,11 +136,11 @@ static struct dentry *btrfs_fh_to_dentry(struct super_block *sb, struct fid *fh, u32 generation; if ((fh_type != FILEID_BTRFS_WITH_PARENT || - fh_len != BTRFS_FID_SIZE_CONNECTABLE) && + fh_len < BTRFS_FID_SIZE_CONNECTABLE) && (fh_type != FILEID_BTRFS_WITH_PARENT_ROOT || - fh_len != BTRFS_FID_SIZE_CONNECTABLE_ROOT) && + fh_len < BTRFS_FID_SIZE_CONNECTABLE_ROOT) && (fh_type != FILEID_BTRFS_WITHOUT_PARENT || - fh_len != BTRFS_FID_SIZE_NON_CONNECTABLE)) + fh_len < BTRFS_FID_SIZE_NON_CONNECTABLE)) return NULL; objectid = fid->objectid; -- cgit v1.3-7-g2ca7 From 39d0d3bdf7bab3021a31e501172ac0f18947f9b3 Mon Sep 17 00:00:00 2001 From: Anna Schumaker Date: Mon, 5 Oct 2015 16:43:26 -0400 Subject: NFS: Fix a tracepoint NULL-pointer dereference Running xfstest generic/013 with the tracepoint nfs:nfs4_open_file enabled produces a NULL-pointer dereference when calculating fileid and filehandle of the opened file. Fix this by checking if state is NULL before trying to use the inode pointer. Reported-by: Olga Kornievskaia Signed-off-by: Anna Schumaker Signed-off-by: Trond Myklebust --- fs/nfs/nfs4trace.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h index 28df12e525ba..671cf68fe56b 100644 --- a/fs/nfs/nfs4trace.h +++ b/fs/nfs/nfs4trace.h @@ -409,7 +409,7 @@ DECLARE_EVENT_CLASS(nfs4_open_event, __entry->flags = flags; __entry->fmode = (__force unsigned int)ctx->mode; __entry->dev = ctx->dentry->d_sb->s_dev; - if (!IS_ERR(state)) + if (!IS_ERR_OR_NULL(state)) inode = state->inode; if (inode != NULL) { __entry->fileid = NFS_FILEID(inode); -- cgit v1.3-7-g2ca7 From daf3761c9fcde0f4ca64321cbed6c1c86d304193 Mon Sep 17 00:00:00 2001 From: Trond Myklebust Date: Fri, 9 Oct 2015 13:44:34 -0400 Subject: namei: results of d_is_negative() should be checked after dentry revalidation Leandro Awa writes: "After switching to version 4.1.6, our parallelized and distributed workflows now fail consistently with errors of the form: T34: ./regex.c:39:22: error: config.h: No such file or directory From our 'git bisect' testing, the following commit appears to be the possible cause of the behavior we've been seeing: commit 766c4cbfacd8" Al Viro says: "What happens is that 766c4cbfacd8 got the things subtly wrong. We used to treat d_is_negative() after lookup_fast() as "fall with ENOENT". That was wrong - checking ->d_flags outside of ->d_seq protection is unreliable and failing with hard error on what should've fallen back to non-RCU pathname resolution is a bug. Unfortunately, we'd pulled the test too far up and ran afoul of another kind of staleness. The dentry might have been absolutely stable from the RCU point of view (and we might be on UP, etc), but stale from the remote fs point of view. If ->d_revalidate() returns "it's actually stale", dentry gets thrown away and the original code wouldn't even have looked at its ->d_flags. What we need is to check ->d_flags where 766c4cbfacd8 does (prior to ->d_seq validation) but only use the result in cases where we do not discard this dentry outright" Reported-by: Leandro Awa Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911 Fixes: 766c4cbfacd8 ("namei: d_is_negative() should be checked...") Tested-by: Leandro Awa Cc: stable@vger.kernel.org # v4.1+ Signed-off-by: Trond Myklebust Acked-by: Al Viro Signed-off-by: Linus Torvalds --- fs/namei.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'fs') diff --git a/fs/namei.c b/fs/namei.c index 726d211db484..33e9495a3129 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1558,8 +1558,6 @@ static int lookup_fast(struct nameidata *nd, negative = d_is_negative(dentry); if (read_seqcount_retry(&dentry->d_seq, seq)) return -ECHILD; - if (negative) - return -ENOENT; /* * This sequence count validates that the parent had no @@ -1580,6 +1578,12 @@ static int lookup_fast(struct nameidata *nd, goto unlazy; } } + /* + * Note: do negative dentry check after revalidation in + * case that drops it. + */ + if (negative) + return -ENOENT; path->mnt = mnt; path->dentry = dentry; if (likely(__follow_mount_rcu(nd, path, inode, seqp))) -- cgit v1.3-7-g2ca7