path: root/fs
Age | Commit message | Author | Files | Lines
2015-07-18 | Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+2
Pull x86 fixes from Ingo Molnar:
 "Two families of fixes:

   - Fix an FPU context related boot crash on newer x86 hardware with
     larger context sizes than what most people test. To fix this
     without ugly kludges or extensive reverts we had to touch the core
     task allocator, to allow x86 to determine the task size
     dynamically, at boot time.

     I've tested it on a number of x86 platforms, and I cross-built it
     to a handful of architectures:

                                   (warns)              (warns)
       testing x86-64:  -git: pass (    0),  -tip: pass (    0)
       testing x86-32:  -git: pass (    0),  -tip: pass (    0)
       testing arm:     -git: pass ( 1359),  -tip: pass ( 1359)
       testing cris:    -git: pass ( 1031),  -tip: pass ( 1031)
       testing m32r:    -git: pass ( 1135),  -tip: pass ( 1135)
       testing m68k:    -git: pass ( 1471),  -tip: pass ( 1471)
       testing mips:    -git: pass ( 1162),  -tip: pass ( 1162)
       testing mn10300: -git: pass ( 1058),  -tip: pass ( 1058)
       testing parisc:  -git: pass ( 1846),  -tip: pass ( 1846)
       testing sparc:   -git: pass ( 1185),  -tip: pass ( 1185)

     ... so I hope the cross-arch impact is 'none', as intended.

     (by Dave Hansen)

   - Fix various NMI handling related bugs unearthed by the big asm code
     rewrite and generally make the NMI code more robust and more
     maintainable while at it. These changes are a bit late in the
     cycle, I hope they are still acceptable.

     (by Andy Lutomirski)"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86
  x86/fpu, sched: Dynamically allocate 'struct fpu'
  x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code
  x86/nmi/64: Make the "NMI executing" variable more consistent
  x86/nmi/64: Minor asm simplification
  x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection
  x86/nmi/64: Reorder nested NMI checks
  x86/nmi/64: Improve nested NMI comments
  x86/nmi/64: Switch stacks on userspace NMI entry
  x86/nmi/64: Remove asm code that saves CR2
  x86/nmi: Enable nested do_nmi() handling for 64-bit kernels
2015-07-18 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 4 | -22/+27
Merge fixes from Andrew Morton:
 "25 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
  lib/decompress: set the compressor name to NULL on error
  mm/cma_debug: correct size input to bitmap function
  mm/cma_debug: fix debugging alloc/free interface
  mm/page_owner: set correct gfp_mask on page_owner
  mm/page_owner: fix possible access violation
  fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()
  /proc/$PID/cmdline: fixup empty ARGV case
  dma-debug: skip debug_dma_assert_idle() when disabled
  hexdump: fix for non-aligned buffers
  checkpatch: fix long line messages about patch context
  mm: clean up per architecture MM hook header files
  MAINTAINERS: uclinux-h8-devel is moderated for non-subscribers
  mailmap: update Sudeep Holla's email id
  Update Viresh Kumar's email address
  mm, meminit: suppress unused memory variable warning
  configfs: fix kernel infoleak through user-controlled format string
  include, lib: add __printf attributes to several function prototypes
  s390/hugetlb: add hugepages_supported define
  mm: hugetlb: allow hugepages_supported to be architecture specific
  revert "s390/mm: make hugepages_supported a boot time decision"
  ...
2015-07-17 | Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds | 4 | -6/+34
Pull btrfs fixes from Chris Mason:
 "These are all from Filipe, and cover a few problems we've had reported
  on the list recently (along with ones he found on his own)"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file corruption after cloning inline extents
  Btrfs: fix order by which delayed references are run
  Btrfs: fix list transaction->pending_ordered corruption
  Btrfs: fix memory leak in the extent_same ioctl
  Btrfs: fix shrinking truncate when the no_holes feature is enabled
2015-07-18 | x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86 | Ingo Molnar | 1 | -2/+2
Don't burden architectures without dynamic task_struct sizing
with the overhead of dynamic sizing.

Also optimize the x86 code a bit by caching task_struct_size.

Acked-and-Tested-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437128892-9831-3-git-send-email-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-18 | x86/fpu, sched: Dynamically allocate 'struct fpu' | Dave Hansen | 1 | -2/+2
The FPU rewrite removed the dynamic allocations of 'struct fpu'.
But, this potentially wastes massive amounts of memory (2k per
task on systems that do not have AVX-512 for instance).

Instead of having a separate slab, this patch just appends the
space that we need to the 'task_struct' which we dynamically
allocate already. This saves us from doing an extra slab
allocation at fork().

The only real downside here is that we have to stick everything
at the end of the task_struct. But, I think the BUILD_BUG_ON()s
I stuck in there should keep that from being too fragile.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
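The layout described above can be sketched in userspace C: a variable-sized FPU area is attached to the end of a structure that is already allocated dynamically, and the total size is computed once and then cached. This is an illustration under assumptions, not the kernel code. `struct task`, `setup_task_size()`, `alloc_task()` and the `xstate_size` parameter are hypothetical. A C99 flexible array member stands in for the BUILD_BUG_ON() checks, because the language itself requires such a member to be last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical, heavily reduced stand-in for task_struct. The FPU save
 * area is a flexible array member, which C requires to be the last
 * member, so the compiler enforces the "must stay at the end" rule.
 */
struct task {
	int pid;
	unsigned int fpu_last_cpu;
	unsigned char fpu_state[];	/* sized at boot, not at compile time */
};

/* Computed once at "boot" and cached, like the commit's task_struct_size. */
static size_t task_struct_size;

/* 'xstate_size' stands in for the FPU context size reported by the CPU. */
static void setup_task_size(size_t xstate_size)
{
	task_struct_size = offsetof(struct task, fpu_state) + xstate_size;
}

/* One allocation covers the task and its FPU area: no separate slab. */
static struct task *alloc_task(void)
{
	assert(task_struct_size >= offsetof(struct task, fpu_state));
	return calloc(1, task_struct_size);
}
```

With this layout, a CPU that reports a larger context size only changes the value passed to `setup_task_size()`. The structure definition does not change.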
2015-07-17 | fsnotify: fix oops in fsnotify_clear_marks_by_group_flags() | Jan Kara | 1 | -20/+14
fsnotify_clear_marks_by_group_flags() can race with
fsnotify_destroy_marks(), so when fsnotify_destroy_mark_locked() drops
mark_mutex, a mark from the list iterated by
fsnotify_clear_marks_by_group_flags() can be freed and we dereference freed
memory in the loop there.

Fix the problem by keeping mark_mutex held in
fsnotify_destroy_mark_locked(). The reason why we drop that mutex is that
we need to call a ->freeing_mark() callback which may acquire mark_mutex
again. To avoid this and similar lock inversion issues, we move the call
to ->freeing_mark() callback to the kthread destroying the mark.

Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Ashish Sangwan <a.sangwan@samsung.com>
Suggested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
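The locking change can be shown with a minimal single-threaded model. It assumes a non-recursive mutex, modelled here as a flag that asserts if it is taken twice. Marks are only queued while `mark_mutex` is held for the whole iteration. The callback that may take the mutex again runs later, without the mutex held. All names are hypothetical, and `reap_destroyed_marks()` only stands in for the destroying kthread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of a non-recursive mutex: taking it twice would deadlock. */
static bool mark_mutex_held;

static void mark_lock(void)
{
	assert(!mark_mutex_held);	/* re-acquiring here would deadlock */
	mark_mutex_held = true;
}

static void mark_unlock(void)
{
	assert(mark_mutex_held);
	mark_mutex_held = false;
}

struct mark {
	struct mark *next_destroy;	/* link on the deferred-destroy list */
	int freed_callbacks;
};

static struct mark *destroy_list;

/* Like ->freeing_mark(), this callback may need mark_mutex itself. */
static void freeing_mark(struct mark *m)
{
	mark_lock();
	m->freed_callbacks++;
	mark_unlock();
}

/* Called with mark_mutex held: only queue the mark, never drop the lock. */
static void destroy_mark_locked(struct mark *m)
{
	assert(mark_mutex_held);
	m->next_destroy = destroy_list;
	destroy_list = m;
}

/* Stands in for the destroyer kthread; runs without mark_mutex held. */
static void reap_destroyed_marks(void)
{
	while (destroy_list) {
		struct mark *m = destroy_list;

		destroy_list = m->next_destroy;
		freeing_mark(m);
	}
}

/* The lock stays held for the whole iteration, so no mark vanishes mid-loop. */
static void clear_marks(struct mark **marks, size_t n)
{
	mark_lock();
	for (size_t i = 0; i < n; i++)
		destroy_mark_locked(marks[i]);
	mark_unlock();
	reap_destroyed_marks();
}
```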
2015-07-17 | /proc/$PID/cmdline: fixup empty ARGV case | Alexey Dobriyan | 1 | -0/+5
/proc/*/cmdline code checks if it should look at the ENVP area by checking
the last byte of the ARGV area:

    rv = access_remote_vm(mm, arg_end - 1, &c, 1, 0);
    if (rv <= 0)
        goto out_free_page;

If ARGV is somehow made empty (by doing execve(..., NULL, ...) or
manually setting ->arg_start and ->arg_end to equal values), the decision
will be based on a byte which doesn't even belong to ARGV/ENVP.

So, quickly check if the ARGV area is empty and report 0 to match previous
behaviour.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
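Here is a sketch of the emptiness check. `struct fake_mm` and both functions are hypothetical simplifications: they use offsets into a plain buffer instead of `access_remote_vm()`, and they ignore the ENVP branch entirely. The sketch only shows why `arg_end - 1` must not be read when `arg_start == arg_end`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant mm_struct fields (offsets into mem). */
struct fake_mm {
	const char *mem;
	size_t arg_start, arg_end;
};

/*
 * Return the last byte of the ARGV area, or -1 when the area is empty.
 * Without the emptiness check, mem[arg_end - 1] would be a byte that
 * belongs to neither ARGV nor ENVP (or would underflow if arg_end == 0).
 */
static int last_argv_byte(const struct fake_mm *mm)
{
	if (mm->arg_start >= mm->arg_end)
		return -1;
	return (unsigned char)mm->mem[mm->arg_end - 1];
}

/* Simplified: how many ARGV bytes /proc/PID/cmdline would report. */
static size_t cmdline_len(const struct fake_mm *mm)
{
	if (last_argv_byte(mm) < 0)
		return 0;	/* empty ARGV: report 0, as before */
	return mm->arg_end - mm->arg_start;
}
```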
2015-07-17 | configfs: fix kernel infoleak through user-controlled format string | Nicolas Iooss | 1 | -2/+2
Some modules call config_item_init_type_name() and
config_group_init_type_name() with parameter "name" directly controlled
by userspace. These two functions call config_item_set_name() with this
name used as a format string, which can be used to leak information such
as the content of the stack to userspace.

For example, make_netconsole_target() in the netconsole module calls
config_item_init_type_name() with the name of a newly-created directory.
This means that the following commands give some unexpected output, with
configfs mounted in /sys/kernel/config/ and on a system with a configured
eth0 ethernet interface:

    # modprobe netconsole
    # mkdir /sys/kernel/config/netconsole/target_%lx
    # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name
    # echo 1 > /sys/kernel/config/netconsole/target_%lx/enabled
    # echo eth0 > /sys/kernel/config/netconsole/target_%lx/dev_name
    # dmesg |tail -n1
    [ 142.697668] netconsole: target (target_ffffffffc0ae8080) is enabled, disable to update parameters

The directory name is correct but %lx has been interpreted in the
internal item name, displayed here in the error message used by
store_dev_name() in drivers/net/netconsole.c.

To fix this, update every caller of config_item_set_name to use "%s"
when operating on untrusted input.

This issue was found using the -Wformat-security gcc flag, once a __printf
attribute has been added to config_item_set_name().

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Felipe Balbi <balbi@ti.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
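The sketch below contrasts passing untrusted text as the format with passing it as an argument. It uses a reduced, hypothetical `item_set_name()` that takes a printf-style format, like `config_item_set_name()`. Marking the function with gcc's `format(printf, ...)` attribute is what lets `-Wformat-security` flag a call such as `item_set_name(it, untrusted)`.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical configfs-like item with a fixed-size name buffer. */
struct item {
	char name[32];
};

/* Like config_item_set_name(): the second argument is a printf format. */
static void item_set_name(struct item *it, const char *fmt, ...)
	__attribute__((format(printf, 2, 3)));

static void item_set_name(struct item *it, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(it->name, sizeof(it->name), fmt, ap);
	va_end(ap);
}

/* Safe pattern: untrusted text is an argument, never the format itself. */
static void item_init_name(struct item *it, const char *untrusted)
{
	item_set_name(it, "%s", untrusted);
}
```

With `"%s"`, a directory named `target_%lx` keeps that literal name and nothing is read from the stack.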
2015-07-17 | fs, proc: add help for CONFIG_PROC_CHILDREN | Iago López Galeiras | 1 | -0/+6
The purpose of the option was documented in
Documentation/filesystems/proc.txt but the help text was missing. Add a
small help text that also points to the documentation.

Signed-off-by: Iago López Galeiras <iago@endocode.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-16 | Merge tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggy | Linus Torvalds | 3 | -17/+16
Pull jfs fixes from David Kleikamp:
 "A couple trivial fixes and an error path fix"

* tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggy:
  jfs: clean up jfs_rename and fix out of order unlock
  jfs: fix indentation on if statement
  jfs: removed a prohibited space after opening parenthesis
2015-07-15 | Merge tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linux | Linus Torvalds | 2 | -30/+26
Pull file locking updates from Jeff Layton:
 "I had thought that I was going to get away without a pull request this
  cycle. There was an NFSv4 file locking problem that cropped up that I
  tried to fix in the NFSv4 code alone, but that fix has turned out to be
  problematic. These patches fix this in the correct way.

  Note that this touches some NFSv4 code as well. Ordinarily I'd wait for
  Trond to ACK this, but he's on holiday right now and the bug is rather
  nasty. So I suggest we merge this and if he raises issues with it we
  can sort it out when he gets back"

Acked-by: Bruce Fields <bfields@fieldses.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
  [ +1 to this series fixing a 100% reproducible slab corruption +
    general protection fault in my nfs-root test environment. - Dan ]
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linux:
  locks: inline posix_lock_file_wait and flock_lock_file_wait
  nfs4: have do_vfs_lock take an inode pointer
  locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait
  locks: have flock_lock_file take an inode pointer instead of a filp
  Revert "nfs: take extra reference to fl->fl_file when running a LOCKU operation"
2015-07-15 | jfs: clean up jfs_rename and fix out of order unlock | Dave Kleikamp | 1 | -14/+13
The end of jfs_rename(), which is also used by the error paths,
included a call to IWRITE_UNLOCK(new_ip) after labels out1, out2
and out3. If we come in through these labels, IWRITE_LOCK() has not
been called yet.

In moving that call to the correct spot, I also moved some
exceptional truncate code earlier, since the early error paths
don't need to deal with it, and I renamed out4: to out_tx: so a
future patch by Jan Kara doesn't need to deal with renumbering or
confusing out-of-order labels.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
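Below is the general shape of the fix: error labels unwind in reverse order of acquisition, so an early failure jumps past the unlock. `struct fake_inode`, `iwrite_lock()`, `iwrite_unlock()` and `do_rename()` are hypothetical stand-ins, not the jfs code. The lock is modelled as a counter so that a test can detect an unbalanced unlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock modelled as a counter so imbalance is detectable. */
struct fake_inode {
	int lock_depth;
	int unlock_underflows;
};

static void iwrite_lock(struct fake_inode *ip)
{
	ip->lock_depth++;
}

static void iwrite_unlock(struct fake_inode *ip)
{
	if (ip->lock_depth == 0)
		ip->unlock_underflows++;	/* unlocking a lock never taken */
	else
		ip->lock_depth--;
}

/* Labels are ordered so each failure point only undoes what it acquired. */
static int do_rename(struct fake_inode *new_ip, bool fail_early, bool fail_late)
{
	int rc = 0;

	if (fail_early) {
		rc = -1;
		goto out;		/* lock not taken yet: skip the unlock */
	}

	iwrite_lock(new_ip);
	if (fail_late) {
		rc = -2;
		goto out_unlock;
	}
	/* ... the actual rename work would go here ... */

out_unlock:
	iwrite_unlock(new_ip);
out:
	return rc;
}
```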
2015-07-14 | Btrfs: fix file corruption after cloning inline extents | Filipe Manana | 1 | -0/+14
Using the clone ioctl (or extent_same ioctl, which calls the same extent
cloning function as well) we end up allowing the copy of an inline extent
from the source file into a non-zero offset of the destination file. This
is not expected, and the btrfs code is not prepared to deal with it: all
inline extents must be at a file offset equal to 0.

For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test files. File foo has the same 2K of data at offset 4K
  # as file bar has at its offset 0.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
      -c "pwrite -S 0xbb 4k 2K" \
      -c "pwrite -S 0xcc 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # File bar consists of a single inline extent (2K size).
  $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
      $SCRATCH_MNT/bar | _filter_xfs_io

  # Now call the clone ioctl to clone the extent of file bar into file
  # foo at its offset 4K. This made file foo have an inline extent at
  # offset 4K, something which the btrfs code can not deal with in future
  # IO operations because all inline extents are supposed to start at an
  # offset of 0, resulting in all sorts of chaos.
  # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
  # what it returns for other cases dealing with inlined extents.
  $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
      $SCRATCH_MNT/bar $SCRATCH_MNT/foo

  # Because of the inline extent at offset 4K, the following write made
  # the kernel crash with a BUG_ON().
  $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io

  status=0
  exit

The stack trace of the BUG_ON() triggered by the last write is:

  [152154.035903] ------------[ cut here ]------------
  [152154.036424] kernel BUG at mm/page-writeback.c:2286!
  [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
  [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G W 4.1.0-rc6-btrfs-next-11+ #2
  [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
  [152154.036424] RIP: 0010:[<ffffffff8111a9d5>] [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424] RSP: 0018:ffff880429effc68 EFLAGS: 00010246
  [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
  [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
  [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
  [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
  [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
  [152154.036424] FS: 00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
  [152154.036424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
  [152154.036424] Stack:
  [152154.036424] ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
  [152154.036424] ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
  [152154.036424] 0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
  [152154.036424] Call Trace:
  [152154.036424] [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
  [152154.036424] [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
  [152154.036424] [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
  [152154.036424] [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
  [152154.036424] [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
  [152154.036424] [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5
  [152154.036424] [<ffffffff81165f89>] vfs_write+0xa0/0xe4
  [152154.036424] [<ffffffff81166855>] SyS_pwrite64+0x64/0x82
  [152154.036424] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
  [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$
  [152154.036424] RIP [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424] RSP <ffff880429effc68>
  [152154.242621] ---[ end trace e3d3376b23a57041 ]---

Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:

   00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split")
   3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents")

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
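The rule this fix enforces can be reduced to a single predicate. This is a hypothetical sketch: `check_inline_clone()` is not a btrfs function, and the real clone code checks more conditions than this. The sketch only captures the case from the test above, where an inline source extent with a non-zero destination offset is refused with EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: inline extents may only live at file offset 0,
 * so cloning one to any other destination offset is refused.
 */
static int check_inline_clone(bool src_is_inline, uint64_t dst_offset)
{
	if (src_is_inline && dst_offset != 0)
		return -EOPNOTSUPP;
	return 0;
}
```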
2015-07-13 | locks: inline posix_lock_file_wait and flock_lock_file_wait | Jeff Layton | 1 | -28/+0
They just call file_inode and then the corresponding *_lock_inode_wait
function. Just make them static inlines instead.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
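The resulting structure can be sketched with drastically reduced, hypothetical versions of the VFS types: the inode-based helper does the work, and the filp-based name survives as a trivial static inline wrapper. Only the names `file_inode()`, `flock_lock_inode_wait()` and `flock_lock_file_wait()` come from the commits. The struct layouts and function bodies are placeholders.

```c
#include <assert.h>

/* Hypothetical, heavily reduced stand-ins for the VFS structures. */
struct inode { int nr_flocks; };
struct file { struct inode *f_inode; };
struct file_lock { struct file *fl_file; };

static inline struct inode *file_inode(const struct file *f)
{
	return f->f_inode;
}

/* The inode-based helper does the real work and never touches the filp. */
static int flock_lock_inode_wait(struct inode *inode, struct file_lock *fl)
{
	(void)fl;		/* placeholder: real code queues/applies the lock */
	inode->nr_flocks++;
	return 0;
}

/* The old filp-based entry point becomes a trivial inline wrapper. */
static inline int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
{
	return flock_lock_inode_wait(file_inode(filp), fl);
}
```

Because the helper only needs the inode, a caller such as the NFSv4 LOCKU path can keep using it after the filp is gone, provided it grabbed the inode beforehand.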
2015-07-13 | nfs4: have do_vfs_lock take an inode pointer | Jeff Layton | 1 | -8/+8
Now that we have file locking helpers that can deal with an inode
instead of a filp, we can change the NFSv4 locking code to use that
instead.

This should fix the case where we have a filp that is closed while flock
or OFD locks are set on it, and the task is signaled so that it doesn't
wait for the LOCKU reply to come in before the filp is freed. At that
point we can end up with a use-after-free with the current code, which
relies on dereferencing the fl_file in the lock request.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-13 | locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait | Jeff Layton | 1 | -12/+38
Allow callers to pass in an inode instead of a filp.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-13 | locks: have flock_lock_file take an inode pointer instead of a filp | Jeff Layton | 1 | -6/+6
...and rename it to better describe how it works.

In order to fix a use-after-free in NFS, we need to be able to remove
locks from an inode after the filp associated with them may have already
been freed. flock_lock_file already only dereferences the filp to get to
the inode, so just change it so the callers do that.

All of the callers already pass in a lock request that has the fl_file
set properly, so we don't need to pass it in individually. With that
change it now only dereferences the filp to get to the inode, so just
push that out to the callers.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-13 | Revert "nfs: take extra reference to fl->fl_file when running a LOCKU operation" | Jeff Layton | 1 | -2/+0
This reverts commit db2efec0caba4f81a22d95a34da640b86c313c8e.

William reported that he was seeing instability with this patch, which is
likely due to the fact that it can cause the kernel to take a new
reference to a filp after the last reference has already been put.

Revert this patch for now, as we'll need to fix this in another way.

Cc: stable@vger.kernel.org
Reported-by: William Dauchy <william@gandi.net>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
2015-07-12 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 4 | -6/+10
Pull VFS fixes from Al Viro:
 "Fixes for this cycle's regression in overlayfs and a couple of
  long-standing (== all the way back to 2.6.12, at least) bugs"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  freeing unlinked file indefinitely delayed
  fix a braino in ovl_d_select_inode()
  9p: don't leave a half-initialized inode sitting around
2015-07-12 | freeing unlinked file indefinitely delayed | Al Viro | 1 | -2/+5
Normally opening a file, unlinking it and then closing will have
the inode freed upon close() (provided that it's not otherwise busy and
has no remaining links, of course). However, there's one case where that
does *not* happen. Namely, if you open it by fhandle with cold dcache,
then unlink() and close().

In the normal case, d_delete() in unlink(2) notices that the dentry is
busy and unhashes it; on the final dput() it will be forcibly evicted from
dcache, triggering iput() and inode removal. In this case, though, we end
up with *two* dentries - disconnected (created by open-by-fhandle) and
regular one (used by unlink()). The latter will have its reference to
inode dropped just fine, but the former will not - it's considered hashed
(it is on the ->s_anon list), so it will stay around until memory
pressure finally does it in. As a result, we have the final iput()
delayed indefinitely. It's trivial to reproduce -

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void flush_dcache(void)
    {
        system("mount -o remount,rw /");
    }

    static char buf[20 * 1024 * 1024];

    int main(void)
    {
        int fd;
        union {
            struct file_handle f;
            /* room for the header plus the largest possible handle */
            char buf[sizeof(struct file_handle) + MAX_HANDLE_SZ];
        } x;
        int m;

        x.f.handle_bytes = MAX_HANDLE_SZ;
        chdir("/root");
        mkdir("foo", 0700);
        fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
        close(fd);
        name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
        flush_dcache();
        fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
        unlink("foo/bar");
        write(fd, buf, sizeof(buf));
        system("df .");     /* 20Mb eaten */
        close(fd);
        system("df .");     /* should've freed those 20Mb */
        flush_dcache();
        system("df .");     /* should be the same as #2 */
        return 0;
    }

will spit out something like

    Filesystem     1K-blocks   Used Available Use% Mounted on
    /dev/root         322023 303843      1131 100% /
    Filesystem     1K-blocks   Used Available Use% Mounted on
    /dev/root         322023 303843      1131 100% /
    Filesystem     1K-blocks   Used Available Use% Mounted on
    /dev/root         322023 283282     21692  93% /

- inode gets freed only when dentry is finally evicted (here we trigger
that by remount; normally it would've happened in response to memory
pressure, hell knows when).

Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-12 | fix a braino in ovl_d_select_inode() | Al Viro | 1 | -0/+3
When opening a directory we want the overlayfs inode, not one from
the topmost layer.

Reported-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
Tested-By: Andrey Jr. Melnikov <temnota.am@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-12 | 9p: don't leave a half-initialized inode sitting around | Al Viro | 2 | -4/+2
Cc: stable@vger.kernel.org # all branches
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-11 | Btrfs: fix order by which delayed references are run | Filipe Manana | 1 | -0/+13
When we have an extent that got N references removed and N new
references added in the same transaction, we must run the insertion of
the references first because otherwise the last removed reference will
remove the extent item from the extent tree, resulting in a failure for
the insertions.

This is a regression introduced in the 4.2-rc1 release and this fix just
brings back the behaviour of selecting reference additions before any
reference removals.

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create prealloc extent covering range [160K, 620K[
  $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo

  # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
  # space that immediately follows it. This creates a new extent of 40K that spans
  # the range [620K, 660K[.
  $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now 2 back references to the prealloc extent in our
  # extent tree. Both are for our file offset 160K and one relates to a file
  # extent item with a data offset of 0 and a length of 380K, while the other
  # relates to a file extent item with a data offset of 380K and a length of 80K.

  # Make sure everything done so far is durably persisted (all back references are
  # in the extent tree, etc).
  sync

  # Now clone all extents of our file that cover the offset 160K up to its eof
  # (660K at this point) into itself at offset 2M. This leaves a hole in the file
  # covering the range [660K, 2M[. The prealloc extent will now be referenced by
  # the file twice, once for offset 160K and once for offset 2M. The 40K extent
  # that follows the prealloc extent will also be referenced twice by our file,
  # once for offset 620K and once for offset 2M + 460K.
  $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
      $SCRATCH_MNT/foo

  # Now create one new extent in our file with a size of 100Kb. It will span the
  # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
  # range [2M + 500K, 3M[. Our new file size is 3M + 100K.
  $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io

  # At this point, there are now (in memory) 4 back references to the prealloc
  # extent.
  #
  # Two of them are for file offset 160K, related to file extent items
  # matching the file offsets 160K and 540K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
  #
  # The other two references are for file offset 2M, related to file extent items
  # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
  # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
  #
  # The 40K extent has 2 back references, one for file offset 620K and the other
  # for file offset 2M + 460K.
  #
  # The 100K extent has a single back reference and it relates to file offset 3M.

  # Now clone our 100K extent into offset 600K. That offset covers the last 20K
  # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
  # offset 660K.
  $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # At this point there's only one reference to the 40K extent, at file offset
  # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
  # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
  # file offset 3M and a new one for file offset 600K).

  # Now fsync our file to make sure all its new data and metadata updates are
  # durably persisted and present if a power failure/crash happens after a
  # successful fsync and before the next transaction commit.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and unmount to simulate a crash/power failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file contents.
  # During log replay, the btrfs delayed references implementation used to run the
  # deletion of back references before the addition of new back references, which
  # made the addition fail as it didn't find the key in the extent tree that it
  # was looking for. The failure triggered by this test was related to the 40K
  # extent, which got 1 reference dropped and 1 reference added during the fsync
  # log replay - when running the delayed references at transaction commit time,
  # btrfs was applying the deletion before the insertion, resulting in a failure
  # of the insertion that ended up turning the fs into read-only mode.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

This issue turned the filesystem into read-only mode (current transaction
aborted) and produced the following traces:

  [ 8247.578385] ------------[ cut here ]------------
  [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
  (...)
  [ 8247.601697] Call Trace:
  [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.605488] [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.608226] [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
  [ 8247.617061] [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
  [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
  [ 8247.624366] [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
  [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
  [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.628531] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  (...)
  [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---
  [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
  [ 8247.728954] BTRFS: Transaction aborted (error -5)
  (...)
  [ 8247.760866] Call Trace:
  [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
  [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
  [ 8247.767582] [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
  [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
  [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
  [ 8247.773664] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
  [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
  [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
  [ 8247.777427] [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
  [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
  [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
  [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
  [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
  (...)
  [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
  [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
  [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)

Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Acked-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
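The ordering problem can be modelled with a single reference-counted extent. In this hypothetical sketch, dropping the last reference deletes the extent item. A queued drop followed by a queued add therefore fails in the same way as the 40K extent in the test, while running additions first succeeds. `struct extent`, `apply_ref()`, `run_refs()` and `adds_first()` are not btrfs names, and `qsort()` only stands in for however the real code selects additions before removals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model: one extent item and its reference count. */
struct extent {
	bool present;	/* is the extent item in the extent tree? */
	int refs;
};

struct ref_op {
	int delta;	/* +1 adds a reference, -1 drops one */
};

/* Apply one op; returns false if an addition finds no extent item. */
static bool apply_ref(struct extent *e, const struct ref_op *op)
{
	if (op->delta > 0) {
		if (!e->present)
			return false;	/* insertion can't find the key */
		e->refs++;
	} else if (--e->refs == 0) {
		e->present = false;	/* last reference gone: item deleted */
	}
	return true;
}

/* qsort() comparator: additions sort before removals. */
static int adds_first(const void *a, const void *b)
{
	const struct ref_op *x = a, *y = b;

	return (y->delta > 0) - (x->delta > 0);
}

/* Run queued ops in order, optionally sorting additions first. */
static bool run_refs(struct extent *e, struct ref_op *ops, size_t n, bool sort)
{
	if (sort)
		qsort(ops, n, sizeof(*ops), adds_first);
	for (size_t i = 0; i < n; i++)
		if (!apply_ref(e, &ops[i]))
			return false;
	return true;
}
```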
2015-07-11 | Btrfs: fix list transaction->pending_ordered corruption | Filipe Manana | 1 | -2/+2
When we call btrfs_commit_transaction(), we splice the list "ordered" of our transaction handle into the transaction's "pending_ordered" list, but we don't re-initialize the "ordered" list of our transaction handle. This means it still points to the same elements it pointed to before the splice. We then check if the current transaction's state is >= TRANS_STATE_COMMIT_START and, if it is, we end up calling btrfs_end_transaction(), which simply splices the "ordered" list of our handle into the transaction's "pending_ordered" list again. This leaves multiple pointers to the same ordered extents, which results in list corruption when we are iterating, removing and freeing ordered extents at btrfs_wait_pending_ordered(), leading to access to dangling pointers / use-after-free issues.

Similarly, btrfs_end_transaction() can in some cases end up calling btrfs_commit_transaction(), and both did a list splice of the transaction handle's "ordered" list into the transaction's "pending_ordered" list without re-initializing the handle's "ordered" list, resulting in exactly the same problem.

This produces the following warning on a kernel with linked list debugging enabled:

[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[109749.298080] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[109749.331605] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[109749.334849] [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
[109749.337093] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
[109749.337847] [<ffffffff81260642>] __list_del_entry+0x5a/0x98
[109749.338678] [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145] [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313] [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
[109749.350819] [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976] [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
[109749.360341] [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
[109749.368828] [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
[109749.369790] [<ffffffff8118f045>] SyS_fsync+0x10/0x14
[109749.370925] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---

On a non-debug kernel this leads to invalid memory accesses, causing a crash. Fix this by using list_splice_init() instead of list_splice() in btrfs_commit_transaction() and btrfs_end_transaction().

Cc: stable@vger.kernel.org
Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
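The difference between the two splice helpers can be reproduced in userspace. This is a minimal sketch that re-implements a kernel-style intrusive circular list under the assumption that it follows the same semantics as the kernel's list_splice() (entries are joined at the front of the destination and the source head is left untouched) and list_splice_init() (same, then the source head is reset). It is not the kernel's include/linux/list.h, and head_links_ok() is a hypothetical stand-in for the CONFIG_DEBUG_LIST checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of a kernel-style intrusive circular list (not the kernel header). */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Move every entry of 'list' to the front of 'head'. 'list' itself is NOT
 * reset, so it keeps pointing at the entries it just handed over. */
static void list_splice(const struct list_head *list, struct list_head *head)
{
    struct list_head *first = list->next, *last = list->prev, *at = head->next;

    if (list_empty(list))
        return;
    first->prev = head;
    head->next = first;
    last->next = at;
    at->prev = last;
}

/* Same move, then reset 'list' to empty, so splicing it again is a no-op. */
static void list_splice_init(struct list_head *list, struct list_head *head)
{
    list_splice(list, head);
    init_list_head(list);
}

/* Hypothetical sanity check in the spirit of CONFIG_DEBUG_LIST. */
static bool head_links_ok(const struct list_head *h)
{
    return h->next->prev == h && h->prev->next == h;
}
```

Splicing the same handle list twice with list_splice() corrupts the destination (its first entry's prev pointer no longer points back at the head), while the list_splice_init() variant leaves the source empty, so the second splice does nothing.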
2015-07-11Btrfs: fix memory leak in the extent_same ioctlFilipe Manana1-1/+3
We were allocating memory with memdup_user() but we were never releasing that memory. This affected pretty much every call to the ioctl, whether it deduplicated extents or not. This issue was reported on IRC by Julian Taylor and on the mailing list by Marcel Ritter; credit goes to them for finding the issue. Reported-by: Julian Taylor <jtaylor.debian@googlemail.com> Reported-by: Marcel Ritter <ritter.marcel@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de>
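The usual shape of such a fix is a single exit label that releases the duplicated argument on every path taken after the copy succeeds. The following is a userspace sketch of that pattern, not the btrfs ioctl code: dup_args() stands in for memdup_user(), and handle_same_request(), live_copies and the validation rules are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter so the sketch can show every copy is released. */
static int live_copies;

/* Userspace stand-in for memdup_user(): duplicate a caller-supplied buffer. */
static void *dup_args(const void *src, size_t len)
{
    void *p = malloc(len);

    if (!p)
        return NULL;
    memcpy(p, src, len);
    live_copies++;
    return p;
}

static void free_args(void *p)
{
    if (p)
        live_copies--;
    assert(live_copies >= 0);
    free(p);
}

/* Hypothetical ioctl-style handler. Once the copy exists, every exit goes
 * through 'out', so it is freed whether or not any extents are deduplicated. */
static int handle_same_request(const char *user_arg, size_t len, int do_dedupe)
{
    int ret = 0;
    char *args;

    if (len == 0)
        return -EINVAL;
    args = dup_args(user_arg, len);
    if (!args)
        return -ENOMEM;
    if (args[0] == '\0') {
        ret = -EINVAL;
        goto out;
    }
    if (!do_dedupe)
        goto out;           /* nothing to deduplicate, still free the copy */
    /* ... compare ranges and share extents here ... */
out:
    free_args(args);
    return ret;
}
```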
2015-07-11Btrfs: fix shrinking truncate when the no_holes feature is enabledFilipe Manana1-3/+2
If the no_holes feature is enabled, we attempt to shrink a file to a size that ends up in the middle of a hole, and we don't have any file extent items in the fs/subvol tree that go beyond the new file size (or any ordered extents that will insert such file extent items), we end up updating only the inode's i_size and not its disk_i_size. This means that after unmounting and mounting the filesystem, or after the inode is evicted and reloaded, its i_size ends up being incorrect (an inode's i_size is set to the disk_i_size field when an inode is loaded). This happens when btrfs_truncate_inode_items() doesn't find any file extent items to drop; in this case it never makes a call to btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.

Example reproducer:

  $ mkfs.btrfs -O no-holes -f /dev/sdd
  $ mount /dev/sdd /mnt

  # Create our test file with some data and durably persist it.
  $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
  $ sync

  # Append some data to the file, increasing its size, and leave a hole
  # between the old size and the start offset of the following write. So
  # our file gets a hole in the range [128Kb, 256Kb[.
  $ xfs_io -c "truncate 160K" /mnt/foo

  # We expect to see our file with a size of 160Kb, with the first 128Kb
  # of data all having the value 0xaa and the remaining 32Kb of data all
  # having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0500000

  # Now cleanly unmount and mount again the filesystem.
  $ umount /mnt
  $ mount /dev/sdd /mnt

  # We expect to get the same result as before, a file with a size of
  # 160Kb, with the first 128Kb of data all having the value 0xaa and the
  # remaining 32Kb of data all having the value 0x00.
  $ od -t x1 /mnt/foo
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0400000

In the example above the file size/data do not match what they were before the remount.
Fix this by always calling btrfs_ordered_update_i_size() with a size matching the size the file was truncated to if btrfs_truncate_inode_items() is not called for a log tree and no file extent items were dropped. This ensures the same behaviour as when the no_holes feature is not enabled. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>
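The interaction between the two size fields can be illustrated with a toy model. This is a hypothetical sketch, not btrfs code: struct toy_inode, toy_truncate() and toy_reload() are invented names, the log-tree case is ignored, and the only behaviour modelled is that a reload copies disk_i_size into i_size while the truncate path updates disk_i_size either only when extent items were dropped (before the fix) or always (after it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model: i_size is the in-memory size,
 * disk_i_size is what is persisted and used on reload. */
struct toy_inode {
    uint64_t i_size;
    uint64_t disk_i_size;
};

/* Shrinking truncate. 'dropped_items' says whether any file extent items
 * past new_size were found and removed; 'with_fix' selects the new behaviour. */
static void toy_truncate(struct toy_inode *inode, uint64_t new_size,
                         bool dropped_items, bool with_fix)
{
    inode->i_size = new_size;
    /* Before the fix only the "items dropped" path updated disk_i_size. */
    if (dropped_items || with_fix)
        inode->disk_i_size = new_size;
}

/* Eviction and reload: i_size is re-initialized from disk_i_size. */
static void toy_reload(struct toy_inode *inode)
{
    inode->i_size = inode->disk_i_size;
}
```

Without the fix, a truncate that finds nothing to drop leaves disk_i_size stale, so the size silently reverts after a reload; with the fix the truncated size survives.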
2015-07-11Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfsLinus Torvalds13-124/+641
Pull btrfs fixes from Chris Mason:
 "This is an assortment of fixes. Most of the commits are from Filipe (fsync, the inode allocation cache and a few others). Mark kicked in a series fixing corners in the extent sharing ioctls, and everyone else fixed up assorted other problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong check for btrfs_force_chunk_alloc()
  Btrfs: fix warning of bytes_may_use
  Btrfs: fix hang when failing to submit bio of directIO
  Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
  Btrfs: fix memory corruption on failure to submit bio for direct IO
  btrfs: don't update mtime/ctime on deduped inodes
  btrfs: allow dedupe of same inode
  btrfs: fix deadlock with extent-same and readpage
  btrfs: pass unaligned length to btrfs_cmp_data()
  Btrfs: fix fsync after truncate when no_holes feature is enabled
  Btrfs: fix fsync xattr loss in the fast fsync path
  Btrfs: fix fsync data loss after append write
  Btrfs: fix crash on close_ctree() if cleaner starts new transaction
  Btrfs: fix race between caching kthread and returning inode to inode cache
  Btrfs: use kmem_cache_free when freeing entry in inode cache
  Btrfs: fix race between balance and unused block group deletion
  btrfs: add error handling for scrub_workers_get()
  btrfs: cleanup noused initialization of dev in btrfs_end_bio()
  btrfs: qgroup: allow user to clear the limitation on qgroup
2015-07-11pNFS: Don't throw out valid layout segmentsTrond Myklebust1-0/+6
It is OK for layout segments to remain hashed even if no-one holds any references to them, provided that the segments are still valid. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-11pNFS: pnfs_roc_drain() fix a race with openTrond Myklebust1-6/+9
If a process reopens the file before we can send off the CLOSE/DELEGRETURN, then pnfs_roc_drain() may end up waiting for a new set of layout segments that are marked as return-on-close, but haven't yet been returned. Fix this by only waiting for those layout segments that were invalidated in pnfs_roc(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
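The idea of the fix, remembering at pnfs_roc() time exactly which segments the close must wait for, can be shown with a toy model. This is a hypothetical sketch, not the NFS client's code: struct toy_lseg, toy_roc() and toy_roc_drain() are invented names, and toy_roc_drain() returns true when the caller would have to sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy layout segment, not struct pnfs_layout_segment. */
struct toy_lseg {
    bool return_on_close;   /* server asked for return on CLOSE          */
    bool invalidated;       /* marked by toy_roc() for this close        */
    bool returned;          /* segment has already been given back       */
};

/* At CLOSE time: record exactly which segments this close must wait for. */
static void toy_roc(struct toy_lseg *segs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (segs[i].return_on_close)
            segs[i].invalidated = true;
}

/* Returns true if the closer still has to sleep. Only segments marked by
 * toy_roc() count; return-on-close segments from a later reopen do not. */
static bool toy_roc_drain(const struct toy_lseg *segs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (segs[i].invalidated && !segs[i].returned)
            return true;
    return false;
}
```

A segment acquired by a concurrent reopen is marked return-on-close but was never invalidated by this close, so it no longer keeps the closer waiting.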
2015-07-11pNFS: Fix races between return-on-close and layoutreturn.Trond Myklebust2-30/+35
If one or more of the layout segments reports an error during I/O, then we may have to send a layoutreturn to report the error back to the NFS metadata server. This patch ensures that the return-on-close code can detect the outstanding layoutreturn, and not preempt it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-11pNFS: pnfs_roc_drain should return 'true' when sleepingTrond Myklebust1-13/+11
Also clean up the case where we don't find a return-on-close layout segment. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-11pNFS: Layoutreturn must invalidate all existing layout segments.Trond Myklebust1-0/+3
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-09hpfs: hpfs_error: Remove static buffer, use vsprintf extension %pV insteadJoe Perches1-4/+7
Removing unnecessary static buffers is good. Use the vsprintf %pV extension instead. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Cc: stable@vger.kernel.org # v2.6.36+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
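The %pV extension lets a caller hand printk() a format string together with its va_list (the kernel's struct va_format), so the message is expanded directly into the log instead of into a shared static buffer. The following userspace sketch mimics that pattern with vfprintf(); struct va_format_sketch, print_vaf(), fs_error() and the "filesystem error" prefix are hypothetical and do not reproduce hpfs_error()'s exact output.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Userspace mimic of the kernel's struct va_format (what %pV consumes). */
struct va_format_sketch {
    const char *fmt;
    va_list *va;
};

/* Expand the caller's format straight into the stream; no static
 * intermediate buffer is shared between concurrent callers. */
static void print_vaf(FILE *out, const char *prefix,
                      const struct va_format_sketch *vaf)
{
    va_list copy;

    assert(out && vaf && vaf->fmt);
    va_copy(copy, *vaf->va);
    fprintf(out, "%s: ", prefix);
    vfprintf(out, vaf->fmt, copy);
    fputc('\n', out);
    va_end(copy);
}

/* Hypothetical error reporter in the style of hpfs_error(). */
static void fs_error(FILE *out, const char *fmt, ...)
{
    va_list args;
    struct va_format_sketch vaf;

    va_start(args, fmt);
    vaf.fmt = fmt;
    vaf.va = &args;
    print_vaf(out, "filesystem error", &vaf);
    va_end(args);
}
```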
2015-07-09hpfs: kstrdup() out of memory handlingSanidhya Kashyap1-2/+5
Under memory pressure, kstrdup() may fail and leave new_opts NULL, so return ENOMEM in that case. Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
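The pattern is to check the duplicated string before using it and propagate an out-of-memory error. A userspace sketch, assuming a hypothetical remount helper save_options() and a pluggable duplication function so the failure path can be exercised; copy_string() stands in for kstrdup() and failing_dup() simulates an allocation failure. Kernel code conventionally returns the negative value -ENOMEM, which the sketch follows.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator hook so the failure path can be exercised. */
typedef char *(*dup_fn)(const char *);

/* Plain userspace string duplication, standing in for kstrdup(). */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);

    if (p)
        memcpy(p, s, n);
    return p;
}

/* Hypothetical remount helper: replace the saved option string with a copy
 * of 'data'. If the copy fails, return -ENOMEM and keep the old options. */
static int save_options(const char *data, char **saved, dup_fn dup)
{
    char *new_opts = dup(data ? data : "");

    if (!new_opts)
        return -ENOMEM;
    free(*saved);
    *saved = new_opts;
    return 0;
}

/* Simulates the duplication failing under memory pressure. */
static char *failing_dup(const char *s)
{
    (void)s;
    return NULL;
}
```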
2015-07-09hpfs: Remove unnecessary castFiro Yang1-1/+1
Avoid a pointless kmem_cache_alloc() return value cast in fs/hpfs/super.c::hpfs_alloc_inode() Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09hpfs: add fstrim supportMikulas Patocka5-0/+128
This patch adds support for fstrim to the HPFS filesystem. Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-07-09ioctl_compat: handle FITRIMMikulas Patocka6-7/+1
The FITRIM ioctl has the same arguments on 32-bit and 64-bit architectures, so we can add it to the list of compatible ioctls and drop it from the compat_ioctl methods of various filesystems. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ted Ts'o <tytso@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
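FITRIM is issued from userspace (for example by fstrim(8)) on a file descriptor opened on the filesystem's mount point. Its argument, struct fstrim_range from <linux/fs.h>, holds only three __u64 fields, which is why its layout is identical for 32-bit and 64-bit callers and no compat translation is needed. A minimal sketch, assuming a Linux system with the kernel UAPI headers installed; trim_filesystem() is a hypothetical helper name, and a real trim needs CAP_SYS_ADMIN and a filesystem that implements FITRIM.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Three fixed-width 64-bit fields: same size and layout on every ABI. */
_Static_assert(sizeof(struct fstrim_range) == 3 * sizeof(uint64_t),
               "fstrim_range layout does not depend on the ABI");

/* Ask the filesystem mounted at 'mountpoint' to discard its free space.
 * Returns 0 and stores the reported byte count, or a negative errno. */
static int trim_filesystem(const char *mountpoint, uint64_t *trimmed)
{
    struct fstrim_range range = {
        .start = 0,
        .len = UINT64_MAX,  /* the whole filesystem */
        .minlen = 0,
    };
    int fd = open(mountpoint, O_RDONLY);

    if (fd < 0)
        return -errno;
    if (ioctl(fd, FITRIM, &range) < 0) {
        int err = -errno;
        close(fd);
        return err;
    }
    close(fd);
    *trimmed = range.len;   /* on success the kernel writes back the trimmed length */
    return 0;
}
```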
2015-07-09udf: Don't corrupt unalloc spacetable when writing itSteven J. Magnani1-12/+7
For a UDF filesystem configured with an Unallocated Space Table, a filesystem operation that triggers an update to the table results in on-disk corruption that prevents remounting:

  udf_read_tagged: tag version 0x0000 != 0x0002 || 0x0003, block 274

For example:

1. Create a filesystem
   $ mkudffs --media-type=hd --blocksize=512 --lvid=BUGTEST \
       --vid=BUGTEST --fsid=BUGTEST --space=unalloctable \
       /dev/mmcblk0

2. Mount it
   # mount /dev/mmcblk0 /mnt

3. Create a file
   $ echo "No corruption, please" > /mnt/new.file

4. Unmount
   # umount /mnt

5. Attempt remount
   # mount /dev/mmcblk0 /mnt

This appears to be a longstanding bug caused by zero-initialization of the Unallocated Space Entry block buffer and only partial repopulation of required fields before writing to disk. Commit 0adfb339fd64 ("udf: Fix unalloc space handling in udf_update_inode") addressed one such field, but several others are required.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.com>
2015-07-08NFSv4.2/flexfiles: Fix a typo in the flexfiles layoutstats codeTrond Myklebust1-1/+1
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds4-21/+40
Pull ext4 bugfixes from Ted Ts'o:
 "Bug fixes (all for stable kernels) for ext4:

   - address corner cases for indirect blocks->extent migration

   - fix reserved block accounting in invalidate_page when page_size != block_size (i.e., ppc or 1k block size file systems)

   - fix deadlocks when a memcg is under heavy memory pressure

   - fix fencepost error in lazytime optimization"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: replace open coded nofail allocation in ext4_free_blocks()
  ext4: correctly migrate a file with a hole at the beginning
  ext4: be more strict when migrating to non-extent based file
  ext4: fix reservation release on invalidatepage for delalloc fs
  ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
  bufferhead: Add _gfp version for sb_getblk()
  ext4: fix fencepost error in lazytime optimization
2015-07-05NFSv4: Leases are renewed in sequence_done when we have sessionsTrond Myklebust1-7/+5
Ensure that the calls to renew_lease() in open_done() etc. only apply to session-less versions of NFSv4.x (i.e. NFSv4.0). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05NFSv4.1: nfs41_sequence_done should handle sequence flag errorsTrond Myklebust1-2/+1
Instead of just kicking off lease recovery, we should look into the sequence flag errors and handle them. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05NFSv4.1: Handle SEQ4_STATUS_BACKCHANNEL_FAULT correctlyTrond Myklebust1-3/+3
RFC5661 states: The server has encountered an unrecoverable fault with the backchannel (e.g., it has lost track of the sequence ID for a slot in the backchannel). The client MUST stop sending more requests on the session's fore channel, wait for all outstanding requests to complete on the fore and back channel, and then destroy the session. Ensure we do so... Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05NFSv4.1: Handle SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit correctlyTrond Myklebust1-2/+4
Try to handle this for now by invalidating all outstanding layouts for this server and then testing all the open+lock+delegation stateids. At some later stage, we may want to optimise by separating out the testing of delegation stateids only, and adding testing of layout stateids. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-05NFSv4.1: Handle SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED status bit correctly.Trond Myklebust1-4/+13
If the server tells us that only some state has been revoked, then we need to run the full TEST_STATEID dog and pony show in order to discover which locks and delegations are still OK. Currently we blow away all state, which means that we lose all locks! Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
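The commits above refine how the client reacts to individual status bits in the SEQUENCE reply instead of always falling back to lease recovery. The following is a simplified, hypothetical dispatcher, not the Linux client's code: the SEQ4_STATUS_* values are those defined for SEQUENCE in RFC 5661, while the ACT_* actions and actions_for_flags() are invented for illustration. It shows the key distinction of the last commit: "some state revoked" maps to per-stateid testing, not to the full recovery used when all state has expired.

```c
#include <assert.h>
#include <stdint.h>

/* SEQUENCE status flag values from RFC 5661 (subset used here). */
enum {
    SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED  = 0x00000008,
    SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010,
    SEQ4_STATUS_RECALLABLE_STATE_REVOKED   = 0x00000040,
    SEQ4_STATUS_BACKCHANNEL_FAULT          = 0x00000400,
};

/* Hypothetical client actions for this sketch. */
enum {
    ACT_NONE            = 0,
    ACT_RECOVER_LEASE   = 1 << 0, /* all state lost: full lease recovery      */
    ACT_TEST_STATEIDS   = 1 << 1, /* TEST_STATEID open/lock/delegation state  */
    ACT_RETURN_LAYOUTS  = 1 << 2, /* invalidate all outstanding layouts       */
    ACT_DESTROY_SESSION = 1 << 3, /* stop fore channel, drain, destroy session */
};

static unsigned int actions_for_flags(uint32_t flags)
{
    unsigned int act = ACT_NONE;

    if (flags & SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED)
        act |= ACT_RECOVER_LEASE;
    if (flags & SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED)
        act |= ACT_TEST_STATEIDS;           /* keep the locks that are still valid */
    if (flags & SEQ4_STATUS_RECALLABLE_STATE_REVOKED)
        act |= ACT_RETURN_LAYOUTS | ACT_TEST_STATEIDS;
    if (flags & SEQ4_STATUS_BACKCHANNEL_FAULT)
        act |= ACT_DESTROY_SESSION;
    return act;
}
```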
2015-07-05ext4: replace open coded nofail allocation in ext4_free_blocks()Michal Hocko1-11/+5
ext4_free_blocks() is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open-coded loop and replace it with __GFP_NOFAIL. Without the flag, the allocator has no way to find out about the never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-07-04Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds69-467/+657
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04dax: bdev_direct_access() may sleepMatthew Wilcox1-0/+6
Currently, the brd driver is the only in-tree driver that may sleep. After some discussion on linux-fsdevel, we decided that any driver may choose to sleep in its ->direct_access method. To ensure that all callers of bdev_direct_access() are prepared for this, add a call to might_sleep(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04block: Add support for DAX reads/writes to block devicesMatthew Wilcox2-2/+8
If a block device supports the ->direct_access methods, bypass the normal DIO path and use DAX to go straight to memcpy() instead of allocating a DIO and a BIO. Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in do_blockdev_direct_IO(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-07-04dax: Use copy_from_iter_nocacheMatthew Wilcox1-1/+1
When userspace does a write, there's no need for the written data to pollute the CPU cache. This matches the original XIP code. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>