path: root/Documentation
diff options
authorLinus Torvalds <torvalds@linux-foundation.org>2015-02-17 08:38:30 -0800
committerLinus Torvalds <torvalds@linux-foundation.org>2015-02-17 08:38:30 -0800
commitc397f8fa4379040bada53256c848e62c8b060392 (patch)
tree8101efb5c0c3b0a73e5e65f3474843c0914cc4d0 /Documentation
parentMerge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux (diff)
parentrtc: add driver for DS1685 family of real time clocks (diff)
Merge branch 'akpm' (patches from Andrew)
Merge fifth set of updates from Andrew Morton: - A few things which were awaiting merges from linux-next: - rtc - ocfs2 - misc others - Willy's "dax" feature: direct fs access to memory (mainly NV-DIMMs) which isn't backed by pageframes. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (37 commits) rtc: add driver for DS1685 family of real time clocks MAINTAINERS: add entry for Maxim PMICs on Samsung boards lib/Kconfig: use bool instead of boolean powerpc: drop _PAGE_FILE and pte_file()-related helpers ocfs2: set append dio as a ro compat feature ocfs2: wait for orphan recovery first once append O_DIRECT write crash ocfs2: complete the rest request through buffer io ocfs2: do not fallback to buffer I/O write if appending ocfs2: allocate blocks in ocfs2_direct_IO_get_blocks ocfs2: implement ocfs2_direct_IO_write ocfs2: add orphan recovery types in ocfs2_recover_orphans ocfs2: add functions to add and remove inode in orphan dir ocfs2: prepare some interfaces used in append direct io MAINTAINERS: fix spelling mistake & remove trailing WS dax: does not work correctly with virtual aliasing caches brd: rename XIP to DAX ext4: add DAX functionality dax: add dax_zero_page_range ext2: get rid of most mentions of XIP in ext2 ext2: remove ext2_aops_xip ...
Diffstat (limited to 'Documentation')
7 files changed, 104 insertions, 85 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index ac28149aede4..9922939e7d99 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -34,6 +34,9 @@ configfs/
- directory containing configfs documentation and example code.
- info on the cram filesystem for small storage (ROMs etc).
+ - info on avoiding the page cache for files stored on CPU-addressable
+ storage devices.
- info on the debugfs filesystem.
@@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt
- info on XFS Self Describing Metadata.
- info and mount options for the XFS filesystem.
- - info on execute-in-place for file mappings.
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index b30753cbf431..2ca3d17eee56 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -199,8 +199,6 @@ prototypes:
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
- unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
@@ -225,7 +223,6 @@ invalidatepage: yes
releasepage: yes
freepage: yes
-get_xip_mem: maybe
migratepage: yes (both)
launder_page: yes
is_partially_uptodate: yes
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 000000000000..baf41118660d
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,94 @@
+Direct Access for files
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual. When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+Implementation Tips for Block Driver Writers
+To support DAX in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+Implementation Tips for Filesystem Writers
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+ i_flags
+- implementing the direct_IO address space operation, and calling
+ dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+ VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+ for fault and page_mkwrite (which should probably call dax_fault() and
+ dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+ truncates and page faults
+The get_block() callback passed to the DAX functions may return
+uninitialised extents. If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+The DAX code does not work correctly on architectures which have virtually
+mapped caches such as ARM, MIPS and SPARC.
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages. This problem is being worked on. That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here). Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f905f10..b9714569e472 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -20,6 +20,9 @@ minixdf Makes `df' act like Minix.
check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
+dax Use direct access (no page cache). See
+ Documentation/filesystems/dax.txt.
debug Extra debugging information is sent to the
kernel syslog. Useful for developers.
@@ -56,8 +59,6 @@ noacl Don't support POSIX ACLs.
nobh Do not attach buffer_heads to file pagecache.
-xip Use execute in place (no caching) if possible
grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a3293aaa4..6c0108eb0137 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,10 @@ max_dir_size_kb=n This limits the size of directories so that any
i_version Enable 64-bit inode version support. This option is
off by default.
+dax Use direct access (no page cache). See
+ Documentation/filesystems/dax.txt. Note that
+ this option is incompatible with data=journal.
Data Mode
There are 3 different data modes:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 43ce0507ee25..966b22829f3b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -591,8 +591,6 @@ struct address_space_operations {
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- struct page* (*get_xip_page)(struct address_space *, sector_t,
- int);
/* migrate the contents of a page to the specified target */
int (*migratepage) (struct page *, struct page *);
int (*launder_page) (struct page *);
@@ -748,11 +746,6 @@ struct address_space_operations {
and transfer data directly between the storage and the
application's address space.
- get_xip_page: called by the VM to translate a block number to a page.
- The page is valid until the corresponding filesystem is unmounted.
- Filesystems that want to use execute-in-place (XIP) need to implement
- it. An example implementation can be found in fs/ext2/xip.c.
migrate_page: This is used to compact the physical memory usage.
If the VM wants to relocate a page (maybe off a memory card
that is signalling imminent failure) it will pass a new page
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b77472949ede..000000000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory. It also returns a kernel virtual
-address that can be used to access the memory.
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that can be contiguously accessed at that offset. It may also
-return a negative errno if an error occurs.
-The block device operation is optional, these block devices support it as of
-- dcssblk: s390 dcss block device driver
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.