aboutsummaryrefslogtreecommitdiffstats
path: root/fs/dax.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-02-22mm, dax: make pmd_fault() and friends be the same as fault()Dave Jiang1-17/+11
Instead of passing in multiple parameters in the pmd_fault() handler, a vmf can be passed in just like a fault() handler. This will simplify code and remove the need for the actual pmd fault handlers to allocate a vmf. Related functions are also modified to do the same. [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off] Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22dax: add tracepoints to dax_pmd_insert_mapping()Ross Zwisler1-3/+9
Add tracepoints to dax_pmd_insert_mapping(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: big-1504 [001] .... 326.960743: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1504 [001] .... 326.960753: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1504 [001] .... 326.960981: dax_pmd_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10505000 length 0x200000 pfn 0x100600 DEV|MAP radix_entry 0xc000e big-1504 [001] .... 326.960986: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-6-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22dax: add tracepoints to dax_pmd_load_hole()Ross Zwisler1-4/+10
Add tracepoints to dax_pmd_load_hole(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: read_big-1478 [004] .... 238.242188: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 read_big-1478 [004] .... 238.242191: dax_pmd_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 read_big-1478 [004] .... 238.242390: dax_pmd_load_hole: dev 259:0 ino 0x1003 shared address 0x10400000 zero_page ffffea0002c20000 radix_entry 0x1e read_big-1478 [004] .... 238.242392: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-5-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22dax: add tracepoint infrastructure, PMD tracingRoss Zwisler1-10/+20
Tracepoints are the standard way to capture debugging and tracing information in many parts of the kernel, including the XFS and ext4 filesystems. Create a tracepoint header for FS DAX and add the first DAX tracepoints to the PMD fault handler. This allows the tracing for DAX to be done in the same way as the filesystem tracing so that developers can look at them together and get a coherent idea of what the system is doing. I added both an entry and exit tracepoint because future patches will add tracepoints to child functions of dax_iomap_pmd_fault() like dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to be wrapped by the parent function tracepoints so the code flow is more easily understood. Having entry and exit tracepoints for faults also allows us to easily see what filesystems functions were called during the fault. These filesystem functions get executed via iomap_begin() and iomap_end() calls, for example, and will have their own tracepoints. For PMD faults we primarily want to understand the type of mapping, the fault flags, the faulting address and whether it fell back to 4k faults. If it fell back to 4k faults the tracepoints should let us understand why. I named the new tracepoint header file "fs_dax.h" to allow for device DAX to have its own separate tracing header in the same directory at some point. Here is an example output for these events from a successful PMD fault: big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1441 [005] .... 32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-3-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-20Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds1-1/+5
Pull ext4 updates from Ted Ts'o: "For this cycle we add support for the shutdown ioctl, which is primarily used for testing, but which can be useful on production systems when a scratch volume is being destroyed and the data on it doesn't need to be saved. This found (and we fixed) a number of bugs with ext4's recovery to corrupted file system --- the bugs increased the amount of data that could be potentially lost, and in the case of the inline data feature, could cause the kernel to BUG. Also included are a number of other bug fixes, including in ext4's fscrypt, DAX, inline data support" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits) ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN ext4: fix fencepost in s_first_meta_bg validation ext4: don't BUG when truncating encrypted inodes on the orphan list ext4: do not use stripe_width if it is not set ext4: fix stripe-unaligned allocations dax: assert that i_rwsem is held exclusive for writes ext4: fix DAX write locking ext4: add EXT4_IOC_GOINGDOWN ioctl ext4: add shutdown bit and check for it ext4: rename s_resize_flags to s_ext4_flags ext4: return EROFS if device is r/o and journal replay is needed ext4: preserve the needs_recovery flag when the journal is aborted jbd2: don't leak modified metadata buffers on an aborted journal ext4: fix inline data error paths ext4: move halfmd4 into hash.c directly ext4: fix use-after-iput when fscrypt contexts are inconsistent jbd2: fix use after free in kjournald2() ext4: fix data corruption in data=journal mode ext4: trim allocation requests to group size ext4: replace BUG_ON with WARN_ON in mb_find_extent() ...
2017-02-08dax: assert that i_rwsem is held exclusive for writesChristoph Hellwig1-1/+5
Make sure all callers follow the same locking protocol, given that DAX transparantly replaced the normal buffered I/O path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2017-02-03fs: break out of iomap_file_buffered_write on fatal signalsMichal Hocko1-0/+5
Tetsuo has noticed that an OOM stress test which performs large write requests can cause the full memory reserves depletion. He has tracked this down to the following path __alloc_pages_nodemask+0x436/0x4d0 alloc_pages_current+0x97/0x1b0 __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728 pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331 grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773 iomap_write_begin+0x50/0xd0 fs/iomap.c:118 iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190 ? iomap_write_end+0x80/0x80 fs/iomap.c:150 iomap_apply+0xb3/0x130 fs/iomap.c:79 iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243 ? iomap_write_end+0x80/0x80 xfs_file_buffered_aio_write+0x132/0x390 [xfs] ? remove_wait_queue+0x59/0x60 xfs_file_write_iter+0x90/0x130 [xfs] __vfs_write+0xe5/0x140 vfs_write+0xc7/0x1f0 ? syscall_trace_enter+0x1d0/0x380 SyS_write+0x58/0xc0 do_syscall_64+0x6c/0x200 entry_SYSCALL64_slow_path+0x25/0x25 the oom victim has access to all memory reserves to make a forward progress to exit easier. But iomap_file_buffered_write and other callers of iomap_apply loop to complete the full request. We need to check for fatal signals and back off with a short write instead. As the iomap_apply delegates all the work down to the actor we have to hook into those. All callers that work with the page cache are calling iomap_write_begin so we will check for signals there. dax_iomap_actor has to handle the situation explicitly because it copies data to the userspace directly. Other callers like iomap_page_mkwrite work on a single page or iomap_fiemap_actor do not allocate memory based on the given len. Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-30iomap: constify struct iomap_opsChristoph Hellwig1-3/+3
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-24dax: fix build warnings with FS_DAX and !FS_IOMAPRoss Zwisler1-2/+0
As reported by Arnd: https://lkml.org/lkml/2017/1/10/756 Compiling with the following configuration: # CONFIG_EXT2_FS is not set # CONFIG_EXT4_FS is not set # CONFIG_XFS_FS is not set # CONFIG_FS_IOMAP depends on the above filesystems, as is not set CONFIG_FS_DAX=y generates build warnings about unused functions in fs/dax.c: fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function] static int dax_insert_mapping(struct address_space *mapping, ^~~~~~~~~~~~~~~~~~ fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function] static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size, ^~~~~~~~~~~~~ fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function] static int dax_load_hole(struct address_space *mapping, void **entry, ^~~~~~~~~~~~~ fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function] static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index, ^~~~~~~~~~~~~~~~~~ Now that the struct buffer_head based DAX fault paths and I/O path have been removed we really depend on iomap support being present for DAX. Make this explicit by selecting FS_IOMAP if we compile in DAX support. This allows us to remove conditional selections of FS_IOMAP when FS_DAX was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c. Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10dax: wrprotect pmd_t in dax_mapping_entry_mkcleanRoss Zwisler1-15/+36
Currently dax_mapping_entry_mkclean() fails to clean and write protect the pmd_t of a DAX PMD entry during an *sync operation. This can result in data loss in the following sequence: 1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the pmd_t dirty and writeable 2) fsync, flushing out PMD data and cleaning the radix tree entry. We currently fail to mark the pmd_t as clean and write protected. 3) more mmap writes to the PMD. These don't cause any page faults since the pmd_t is dirty and writeable. The radix tree entry remains clean. 4) fsync, which fails to flush the dirty PMD data because the radix tree entry was clean. 5) crash - dirty data that should have been fsync'd as part of 4) could still have been in the processor cache, and is lost. Fix this by marking the pmd_t clean and write protected in dax_mapping_entry_mkclean(), which is called as part of the fsync operation 2). This will cause the writes in step 3) above to generate page faults where we'll re-dirty the PMD radix tree entry, resulting in flushes in the fsync that happens in step 4). Fixes: 4b4bb46d00b3 ("dax: clear dirty entry tags on cache flush") Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-26dax: Call ->iomap_begin without entry lock during dax faultJan Kara1-55/+66
Currently ->iomap_begin() handler is called with entry lock held. If the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such as ext4 which will want to hold transaction open), this would cause lock inversion with the iomap_apply() from standard IO path which first calls ->iomap_begin() and only then calls ->actor() callback which grabs entry locks for DAX (if it faults when copying from/to user provided buffers). Fix the problem by nesting grabbing of entry lock inside ->iomap_begin() - ->iomap_end() pair. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26dax: Finish fault completely when loading holesJan Kara1-9/+18
The only case when we do not finish the page fault completely is when we are loading hole pages into a radix tree. Avoid this special case and finish the fault in that case as well inside the DAX fault handler. It will allow us for easier iomap handling. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26dax: Avoid page invalidation races and unnecessary radix tree traversalsJan Kara1-17/+11
Currently dax_iomap_rw() takes care of invalidating page tables and evicting hole pages from the radix tree when write(2) to the file happens. This invalidation is only necessary when there is some block allocation resulting from write(2). Furthermore in current place the invalidation is racy wrt page fault instantiating a hole page just after we have invalidated it. So perform the page invalidation inside dax_iomap_actor() where we can do it only when really necessary and after blocks have been allocated so nobody will be instantiating new hole pages anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26mm: Invalidate DAX radix tree entries only if appropriateJan Kara1-10/+61
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages() just delete all exceptional radix tree entries they find. For DAX this is not desirable as we track cache dirtiness in these entries and when they are evicted, we may not flush caches although it is necessary. This can for example manifest when we write to the same block both via mmap and via write(2) (to different offsets) and fsync(2) then does not properly flush CPU caches when modification via write(2) was the last one. Create appropriate DAX functions to handle invalidation of DAX entries for invalidate_inode_pages2_range() and invalidate_mapping_pages() and wire them up into the corresponding mm functions. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-14dax: clear dirty entry tags on cache flushJan Kara1-0/+64
Currently we never clear dirty tags in DAX mappings and thus address ranges to flush accumulate. Now that we have locking of radix tree entries, we have all the locking necessary to reliably clear the radix tree dirty tag when flushing caches for corresponding address range. Similarly to page_mkclean() we also have to write-protect pages to get a page fault when the page is next written to so that we can mark the entry dirty again. Link: http://lkml.kernel.org/r/1479460644-25076-21-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14dax: protect PTE modification on WP fault by radix tree entry lockJan Kara1-6/+16
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite() has released corresponding radix tree entry lock. When we want to writeprotect PTE on cache flush, we need PTE modification to happen under radix tree entry lock to ensure consistent updates of PTE and radix tree (standard faults use page lock to ensure this consistency). So move update of PTE bit into dax_pfn_mkwrite(). Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14dax: make cache flushing protected by entry lockJan Kara1-22/+39
Currently, flushing of caches for DAX mappings was ignoring entry lock. So far this was ok (modulo a bug that a difference in entry lock could cause cache flushing to be mistakenly skipped) but in the following patches we will write-protect PTEs on cache flushing and clear dirty tags. For that we will need more exclusion. So do cache flushing under an entry lock. This allows us to remove one lock-unlock pair of mapping->tree_lock as a bonus. Link: http://lkml.kernel.org/r/1479460644-25076-19-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: move handling of COW faults into DAX codeJan Kara1-32/+29
Move final handling of COW faults from generic code into DAX fault handler. That way generic code doesn't have to be aware of peculiarities of DAX locking so remove that knowledge and make locking functions private to fs/dax.c. Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14mm: use vmf->address instead of of vmf->virtual_addressJan Kara1-2/+2
Every single user of vmf->virtual_address typed that entry to unsigned long before doing anything with it so the type of virtual_address does not really provide us any additional safety. Just use masked vmf->address which already has the appropriate type. Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds1-674/+458
Pull ext4 updates from Ted Ts'o: "This merge request includes the dax-4.0-iomap-pmd branch which is needed for both ext4 and xfs dax changes to use iomap for DAX. It also includes the fscrypt branch which is needed for ubifs encryption work as well as ext4 encryption and fscrypt cleanups. Lots of cleanups and bug fixes, especially making sure ext4 is robust against maliciously corrupted file systems --- especially maliciously corrupted xattr blocks and a maliciously corrupted superblock. Also fix ext4 support for 64k block sizes so it works well on ppcle. Fixed mbcache so we don't miss some common xattr blocks that can be merged" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) dax: Fix sleep in atomic contex in grab_mapping_entry() fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL fscrypt: Delay bounce page pool allocation until needed fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page() fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page() fscrypt: Never allocate fscrypt_ctx on in-place encryption fscrypt: Use correct index in decrypt path. fscrypt: move the policy flags and encryption mode definitions to uapi header fscrypt: move non-public structures and constants to fscrypt_private.h fscrypt: unexport fscrypt_initialize() fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info() fscrypto: move ioctl processing more fully into common code fscrypto: remove unneeded Kconfig dependencies MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches ext4: do not perform data journaling when data is encrypted ext4: return -ENOMEM instead of success ext4: reject inodes with negative size ext4: remove another test in ext4_alloc_file_blocks() Documentation: fix description of ext4's block_validity mount option ext4: fix checks for data=ordered and journal_async_commit options ...
2016-12-12lib: radix-tree: update callback for changing leaf nodesJohannes Weiner1-1/+2
Support handing __radix_tree_replace() a callback that gets invoked for all leaf nodes that change or get freed as a result of the slot replacement, to assist users tracking nodes with node->private_list. This prepares for putting page cache shadow entries into the radix tree root again and drastically simplifying the shadow tracking. Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12lib: radix-tree: check accounting of existing slot replacement usersJohannes Weiner1-2/+2
The bug in khugepaged fixed earlier in this series shows that radix tree slot replacement is fragile; and it will become more so when not only NULL<->!NULL transitions need to be caught but transitions from and to exceptional entries as well. We need checks. Re-implement radix_tree_replace_slot() on top of the sanity-checked __radix_tree_replace(). This requires existing callers to also pass the radix tree root, but it'll warn us when somebody replaces slots with contents that need proper accounting (transitions between NULL entries, real entries, exceptional entries) and where a replacement through the slot pointer would corrupt the radix tree node counts. Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12lib: radix-tree: native accounting of exceptional entriesJohannes Weiner1-2/+3
The way the page cache is sneaking shadow entries of evicted pages into the radix tree past the node entry accounting and tracking them manually in the upper bits of node->count is fraught with problems. These shadow entries are marked in the tree as exceptional entries, which are a native concept to the radix tree. Maintain an explicit counter of exceptional entries in the radix tree node. Subsequent patches will switch shadow entry tracking over to that counter. DAX and shmem are the other users of exceptional entries. Since slot replacements that change the entry type from regular to exceptional must now be accounted, introduce a __radix_tree_replace() function that does replacement and accounting, and switch DAX and shmem over. The increase in radix tree node size is temporary. A followup patch switches the shadow tracking to this new scheme and we'll no longer need the upper bits in node->count and shrink that back to one byte. Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12dax: Fix sleep in atomic contex in grab_mapping_entry()Jan Kara1-8/+7
Commit 642261ac995e: "dax: add struct iomap based DAX PMD support" has introduced unmapping of page tables if huge page needs to be split in grab_mapping_entry(). However the unmapping happens after radix_tree_preload() call which disables preemption and thus unmap_mapping_range() tries to acquire i_mmap_lock in atomic context which is a bug. Fix the problem by moving unmapping before radix_tree_preload() call. Fixes: 642261ac995e01d7837db1f4b90181496f7e6835 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20dax: rip out get_block based IO supportJan Kara1-315/+0
No one uses functions using the get_block callback anymore. Rip them out and update documentation. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-10dax: Introduce IOMAP_FAULT flagJan Kara1-2/+2
Introduce a flag telling iomap operations whether they are handling a fault or other IO. That may influence behavior wrt inode size and similar things. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: add struct iomap based DAX PMD supportRoss Zwisler1-37/+341
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based locking. This patch allows DAX PMDs to participate in the DAX radix tree based locking scheme so that they can be re-enabled using the new struct iomap based fault handlers. There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX mappings that have an associated block allocation, and 4k DAX empty entries. The empty entries exist to provide locking for the duration of a given page fault. This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP) entries, PMD DAX entries that have associated block allocations, and 2 MiB DAX empty entries. Unlike the 4k case where we insert a struct page* into the radix tree for 4k zero pages, for HZP we insert a DAX exceptional entry with the new RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in every 2MiB hole mapping, and it doesn't make sense to have that same struct page* with multiple entries in multiple trees. This would cause contention on the single page lock for the one Huge Zero Page, and it would break the page->index and page->mapping associations that are assumed to be valid in many other places in the kernel. One difficult use case is when one thread is trying to use 4k entries in radix tree for a given offset, and another thread is using 2 MiB entries for that same offset. The current code handles this by making the 2 MiB user fall back to 4k entries for most cases. This was done because it is the simplest solution, and because the use of 2MiB pages is already opportunistic. If we were to try to upgrade from 4k pages to 2MiB pages for a given range, we run into the problem of how we lock out 4k page faults for the entire 2MiB range while we clean out the radix tree so we can insert the 2MiB entry. We can solve this problem if we need to, but I think that the cases where both 2MiB entries and 4K entries are being used for the same range will be rare enough and the gain small enough that it probably won't be worth the complexity. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: move put_(un)locked_mapping_entry() in dax.cRoss Zwisler1-25/+25
No functional change. The static functions put_locked_mapping_entry() and put_unlocked_mapping_entry() will soon be used in error cases in grab_mapping_entry(), so move their definitions above this function. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: move RADIX_DAX_* defines to dax.hRoss Zwisler1-14/+0
The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in mm/filemap.c. When we add PMD support, though, mm/filemap.c will also need access to the RADIX_DAX_PTE type so it can properly construct a 4k sized empty entry. Instead of shifting the defines between dax.c and dax.h as they are individually used in other code, just move them wholesale to dax.h so they'll be available when we need them. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: dax_iomap_fault() needs to call iomap_end()Ross Zwisler1-8/+29
Currently iomap_end() doesn't do anything for DAX page faults for both ext2 and XFS. ext2_iomap_end() just checks for a write underrun, and xfs_file_iomap_end() checks to see if it needs to finish a delayed allocation. However, in the future iomap_end() calls might be needed to make sure we have balanced allocations, locks, etc. So, add calls to iomap_end() with appropriate error handling to dax_iomap_fault(). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: add dax_iomap_sector() helper functionRoss Zwisler1-3/+7
To be able to correctly calculate the sector from a file position and a struct iomap there is a complex little bit of logic that currently happens in both dax_iomap_actor() and dax_iomap_fault(). This will need to be repeated yet again in the DAX PMD fault handler when it is added, so break it out into a helper function. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: correct dax iomap code namespaceRoss Zwisler1-8/+8
The recently added DAX functions that use the new struct iomap data structure were named iomap_dax_rw(), iomap_dax_fault() and iomap_dax_actor(). These are actually defined in fs/dax.c, though, so should be part of the "dax" namespace and not the "iomap" namespace. Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor() respectively. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: remove dax_pmd_fault()Ross Zwisler1-213/+0
dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX fault handler. This fault handler has been disabled for several kernel releases, and support for PMDs will be reintroduced using the struct iomap interface instead. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: coordinate locking for offsets in PMD rangeRoss Zwisler1-24/+41
DAX radix tree locking currently locks entries based on the unique combination of the 'mapping' pointer and the pgoff_t 'index' for the entry. This works for PTEs, but as we move to PMDs we will need to have all the offsets within the range covered by the PMD to map to the same bit lock. To accomplish this, for ranges covered by a PMD entry we will instead lock based on the page offset of the beginning of the PMD entry. The 'mapping' pointer is still used in the same way. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: consistent variable naming for DAX entriesRoss Zwisler1-17/+17
No functional change. Consistently use the variable name 'entry' instead of 'ret' for DAX radix tree entries. This was already happening in most of the code, so update get_unlocked_mapping_entry(), grab_mapping_entry() and dax_unlock_mapping_entry(). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: remove the last BUG_ON() from fs/dax.cRoss Zwisler1-1/+2
Don't take down the kernel if we get an invalid 'from' and 'length' argument pair. Just warn once and return an error. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: make 'wait_table' global variable staticRoss Zwisler1-1/+1
The global 'wait_table' variable is only used within fs/dax.c, and generates the following sparse warning: fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static? Make it static so it has scope local to fs/dax.c, and to make sparse happy. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-11-08dax: remove buffer_size_valid()Ross Zwisler1-21/+1
Now that ext4 properly sets bh.b_size when we call get_block() for a hole, rely on that value and remove the buffer_size_valid() sanity check. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-07thp: reduce usage of huge zero page's atomic counterAaron Lu1-1/+1
The global zero page is used to satisfy an anonymous read fault. If THP(Transparent HugePage) is enabled then the global huge zero page is used. The global huge zero page uses an atomic counter for reference counting and is allocated/freed dynamically according to its counter value. CPU time spent on that counter will greatly increase if there are a lot of processes doing anonymous read faults. This patch proposes a way to reduce the access to the global counter so that the CPU load can be reduced accordingly. To do this, a new flag of the mm_struct is introduced: MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch the global counter in two cases: 1 The first time it uses the global huge zero page; 2 The time when mm_user of its mm_struct reaches zero. Note that right now, the huge zero page is eligible to be freed as soon as its last use goes away. With this patch, the page will not be eligible to be freed until the exit of the last process from which it was ever used. And with the use of mm_user, the kthread is not eligible to use huge zero page either. Since no kthread is using huge zero page today, there is no difference after applying this patch. But if that is not desired, I can change it to when mm_count reaches zero. Case used for test on Haswell EP: usemem -n 72 --readonly -j 0x200000 100G Which spawns 72 processes and each will mmap 100G anonymous space and then do read only access to that space sequentially with a step of 2MB. CPU cycles from perf report for base commit: 54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page CPU cycles from perf report for this commit: 0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page Performance(throughput) of the workload for base commit: 1784430792 Performance(throughput) of the workload for this commit: 4726928591 164% increase. Runtime of the workload for base commit: 707592 us Runtime of the workload for this commit: 303970 us 50% drop. Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19dax: provide an iomap based fault handlerChristoph Hellwig1-0/+114
Very similar to the existing dax_fault function, but instead of using the get_block callback we rely on the iomap_ops vector from iomap.c. That also avoids having to do two calls into the file system for write faults. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19dax: provide an iomap based dax read/write pathChristoph Hellwig1-0/+114
This is a much simpler implementation of the DAX read/write path that makes use of the iomap infrastructure. It does not try to mirror the direct I/O calling conventions and thus doesn't have to deal with i_dio_count or the end_io handler, but instead leaves locking and filesystem-specific I/O completion to the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19dax: don't pass buffer_head to copy_user_daxChristoph Hellwig1-6/+6
This way we can use this helper for the iomap based DAX implementation as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19dax: don't pass buffer_head to dax_insert_mappingChristoph Hellwig1-6/+6
This way we can use this helper for the iomap based DAX implementation as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-28Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds1-9/+4
Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...
2016-07-26dax: remote unused fault wrappersRoss Zwisler1-59/+14
Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and dax_pmd_fault() respectively, and update all callers. The dax_fault() and dax_pmd_fault() wrappers were initially intended to capture some filesystem independent functionality around page faults (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime and ctime). However, the following commits: 5726b27b09cc ("ext2: Add locking for DAX faults") ea3d7209ca01 ("ext4: fix races between page faults and hole punching") added locking to the ext2 and ext4 filesystems after these common operations but before __dax_fault() and __dax_pmd_fault() were called. This means that these wrappers are no longer used, and are unlikely to be used in the future. XFS has had locking analogous to what was recently added to ext2 and ext4 since DAX support was initially introduced by: 6b698edeeef0 ("xfs: add DAX file operations support") Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-24Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-nextDan Williams1-9/+4
2016-07-12pmem: kill __pmem address spaceDan Williams1-3/+3
The __pmem address space was meant to annotate codepaths that touch persistent memory and need to coordinate a call to wmb_pmem(). Now that wmb_pmem() is gone, there is little need to keep this annotation. Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-12fs/dax: remove wmb_pmem()Dan Williams1-6/+1
Flushing posted-write queues is now deferred to REQ_FLUSH context, or otherwise handled by an ADR event at the platform level. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-06-27dax: fix offset overflow in dax_ioEric Sandeen1-1/+6
This isn't functionally apparent for some reason, but when we test io at extreme offsets at the end of the loff_t rang, such as in fstests xfs/071, the calculation of "max" in dax_io() can be wrong due to pos + size overflowing. For example, # xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file enters dax_io with: start 0x7ffffffffffff000 end 0x7ffffffffffff200 and the rounded up "size" variable is 0x1000. This yields: pos + size 0x8000000000000000 (overflows loff_t) end 0x7ffffffffffff200 Due to the overflow, the min() function picks the wrong value for the "max" variable, and when we send (max - pos) into i.e. copy_from_iter_pmem() it is also the wrong value. This somehow(tm) gets magically absorbed without incident, probably because iter->count is correct. But it seems best to fix it up properly by comparing the two values as unsigned. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-26Merge tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimmLinus Torvalds1-171/+421
Pull DAX locking updates from Ross Zwisler: "Filesystem DAX locking for 4.7 - We use a bit in an exceptional radix tree entry as a lock bit and use it similarly to how page lock is used for normal faults. This fixes races between hole instantiation and read faults of the same index. - Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD locking is implemented" * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Remove i_mmap_lock protection dax: Use radix tree entry lock to protect cow faults dax: New fault locking dax: Allow DAX code to replace exceptional entries dax: Define DAX lock bit for radix tree exceptional entry dax: Make huge page handling depend of CONFIG_BROKEN dax: Fix condition for filling of PMD holes