diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/stable/sysfs-fs-orangefs | 87 | ||||
-rw-r--r-- | Documentation/ABI/testing/sysfs-devices-system-cpu | 69 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/mtd/atmel-nand.txt | 31 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/mtd/fsl-quadspi.txt | 5 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/mtd/qcom_nandc.txt | 86 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/power/rockchip-io-domain.txt | 11 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/rtc/maxim,mcp795.txt | 11 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt | 3 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/vendor-prefixes.txt | 2 | ||||
-rw-r--r-- | Documentation/dma-buf-sharing.txt | 11 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/pnfs-scsi-server.txt | 23 | ||||
-rw-r--r-- | Documentation/filesystems/orangefs.txt | 406 | ||||
-rw-r--r-- | Documentation/kasan.txt | 5 |
13 files changed, 728 insertions, 22 deletions
diff --git a/Documentation/ABI/stable/sysfs-fs-orangefs b/Documentation/ABI/stable/sysfs-fs-orangefs new file mode 100644 index 000000000000..affdb114bd33 --- /dev/null +++ b/Documentation/ABI/stable/sysfs-fs-orangefs @@ -0,0 +1,87 @@ +What: /sys/fs/orangefs/perf_counters/* +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + Counters and settings for various caches. + Read only. + + +What: /sys/fs/orangefs/perf_counter_reset +Date: June 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + echo a 0 or a 1 into perf_counter_reset to + reset all the counters in + /sys/fs/orangefs/perf_counters + except ones with PINT_PERF_PRESERVE set. + + +What: /sys/fs/orangefs/perf_time_interval_secs +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + Length of perf counter intervals in + seconds. + + +What: /sys/fs/orangefs/perf_history_size +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + The perf_counters cache statistics have N, or + perf_history_size, samples. The default is + one. + + Every perf_time_interval_secs the (first) + samples are reset. + + If N is greater than one, the "current" set + of samples is reset, and the samples from the + other N-1 intervals remain available. + + +What: /sys/fs/orangefs/op_timeout_secs +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + Service operation timeout in seconds. + + +What: /sys/fs/orangefs/slot_timeout_secs +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + "Slot" timeout in seconds. A "slot" + is an indexed buffer in the shared + memory segment used for communication + between the kernel module and userspace. + Slots are requested and waited for, + the wait times out after slot_timeout_secs. + + +What: /sys/fs/orangefs/acache/* +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + Attribute cache configurable settings. + + +What: /sys/fs/orangefs/ncache/* +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + Name cache configurable settings. + + +What: /sys/fs/orangefs/capcache/* +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + Capability cache configurable settings. + + +What: /sys/fs/orangefs/ccache/* +Date: Jun 2015 +Contact: Mike Marshall <hubcap@omnibond.com> +Description: + Credential cache configurable settings. diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index b683e8ee69ec..16501334b99f 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -271,3 +271,72 @@ Description: Parameters for the CPU cache attributes - WriteBack: data is written only to the cache line and the modified cache line is written to main memory only when it is replaced + +What: /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/turbo_stat + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/sub_turbo_stat + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/unthrottle + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/powercap + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overtemp + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/supply_fault + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overcurrent + /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/occ_reset +Date: March 2016 +Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> + Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org> +Description: POWERNV CPUFreq driver's frequency throttle stats directory and + attributes + + 'cpuX/cpufreq/throttle_stats' directory contains the CPU frequency + throttle stat attributes for the chip. The throttle stats of a cpu + is common across all the cpus belonging to a chip. Below are the + throttle attributes exported in the 'throttle_stats' directory: + + - turbo_stat : This file gives the total number of times the max + frequency is throttled to lower frequency in turbo (at and above + nominal frequency) range of frequencies. + + - sub_turbo_stat : This file gives the total number of times the + max frequency is throttled to lower frequency in sub-turbo(below + nominal frequency) range of frequencies. + + - unthrottle : This file gives the total number of times the max + frequency is unthrottled after being throttled. + + - powercap : This file gives the total number of times the max + frequency is throttled due to 'Power Capping'. + + - overtemp : This file gives the total number of times the max + frequency is throttled due to 'CPU Over Temperature'. + + - supply_fault : This file gives the total number of times the + max frequency is throttled due to 'Power Supply Failure'. + + - overcurrent : This file gives the total number of times the + max frequency is throttled due to 'Overcurrent'. + + - occ_reset : This file gives the total number of times the max + frequency is throttled due to 'OCC Reset'. + + The sysfs attributes representing different throttle reasons like + powercap, overtemp, supply_fault, overcurrent and occ_reset map to + the reasons provided by OCC firmware for throttling the frequency. + +What: /sys/devices/system/cpu/cpufreq/policyX/throttle_stats + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/turbo_stat + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/sub_turbo_stat + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/unthrottle + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/powercap + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/overtemp + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/supply_fault + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/overcurrent + /sys/devices/system/cpu/cpufreq/policyX/throttle_stats/occ_reset +Date: March 2016 +Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> + Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org> +Description: POWERNV CPUFreq driver's frequency throttle stats directory and + attributes + + 'policyX/throttle_stats' directory and all the attributes are same as + the /sys/devices/system/cpu/cpuX/cpufreq/throttle_stats directory and + attributes which give the frequency throttle information of the chip. diff --git a/Documentation/devicetree/bindings/mtd/atmel-nand.txt b/Documentation/devicetree/bindings/mtd/atmel-nand.txt index 7d4c8eb775a5..d53aba98fbc9 100644 --- a/Documentation/devicetree/bindings/mtd/atmel-nand.txt +++ b/Documentation/devicetree/bindings/mtd/atmel-nand.txt @@ -1,7 +1,10 @@ Atmel NAND flash Required properties: -- compatible : should be "atmel,at91rm9200-nand" or "atmel,sama5d4-nand". +- compatible: The possible values are: + "atmel,at91rm9200-nand" + "atmel,sama5d2-nand" + "atmel,sama5d4-nand" - reg : should specify localbus address and size used for the chip, and hardware ECC controller if available. If the hardware ECC is PMECC, it should contain address and size for @@ -21,10 +24,11 @@ Optional properties: - nand-ecc-mode : String, operation mode of the NAND ecc mode, soft by default. Supported values are: "none", "soft", "hw", "hw_syndrome", "hw_oob_first", "soft_bch". -- atmel,has-pmecc : boolean to enable Programmable Multibit ECC hardware. - Only supported by at91sam9x5 or later sam9 product. +- atmel,has-pmecc : boolean to enable Programmable Multibit ECC hardware, + capable of BCH encoding and decoding, on devices where it is present. - atmel,pmecc-cap : error correct capability for Programmable Multibit ECC - Controller. Supported values are: 2, 4, 8, 12, 24. + Controller. Supported values are: 2, 4, 8, 12, 24. If the compatible string + is "atmel,sama5d2-nand", 32 is also valid. - atmel,pmecc-sector-size : sector size for ECC computation. Supported values are: 512, 1024. - atmel,pmecc-lookup-table-offset : includes two offsets of lookup table in ROM @@ -32,15 +36,16 @@ Optional properties: sector size 1024. If not specified, driver will build the table in runtime. - nand-bus-width : 8 or 16 bus width if not present 8 - nand-on-flash-bbt: boolean to enable on flash bbt option if not present false -- Nand Flash Controller(NFC) is a slave driver under Atmel nand flash - - Required properties: - - compatible : "atmel,sama5d3-nfc". - - reg : should specify the address and size used for NFC command registers, - NFC registers and NFC Sram. NFC Sram address and size can be absent - if don't want to use it. - - clocks: phandle to the peripheral clock - - Optional properties: - - atmel,write-by-sram: boolean to enable NFC write by sram. + +Nand Flash Controller(NFC) is an optional sub-node +Required properties: +- compatible : "atmel,sama5d3-nfc" or "atmel,sama5d4-nfc". +- reg : should specify the address and size used for NFC command registers, + NFC registers and NFC SRAM. NFC SRAM address and size can be absent + if don't want to use it. +- clocks: phandle to the peripheral clock +Optional properties: +- atmel,write-by-sram: boolean to enable NFC write by SRAM. Examples: nand0: nand@40000000,0 { diff --git a/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt b/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt index 00c587b3d3ae..0333ec87dc49 100644 --- a/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt +++ b/Documentation/devicetree/bindings/mtd/fsl-quadspi.txt @@ -3,7 +3,9 @@ Required properties: - compatible : Should be "fsl,vf610-qspi", "fsl,imx6sx-qspi", "fsl,imx7d-qspi", "fsl,imx6ul-qspi", - "fsl,ls1021-qspi" + "fsl,ls1021a-qspi" + or + "fsl,ls2080a-qspi" followed by "fsl,ls1021a-qspi" - reg : the first contains the register location and length, the second contains the memory mapping address and length - reg-names: Should contain the reg names "QuadSPI" and "QuadSPI-memory" @@ -19,6 +21,7 @@ Optional properties: But if there are two NOR flashes connected to the bus, you should enable this property. (Please check the board's schematic.) + - big-endian : That means the IP register is big endian Example: diff --git a/Documentation/devicetree/bindings/mtd/qcom_nandc.txt b/Documentation/devicetree/bindings/mtd/qcom_nandc.txt new file mode 100644 index 000000000000..70dd5118a324 --- /dev/null +++ b/Documentation/devicetree/bindings/mtd/qcom_nandc.txt @@ -0,0 +1,86 @@ +* Qualcomm NAND controller + +Required properties: +- compatible: should be "qcom,ipq806x-nand" +- reg: MMIO address range +- clocks: must contain core clock and always on clock +- clock-names: must contain "core" for the core clock and "aon" for the + always on clock +- dmas: DMA specifier, consisting of a phandle to the ADM DMA + controller node and the channel number to be used for + NAND. Refer to dma.txt and qcom_adm.txt for more details +- dma-names: must be "rxtx" +- qcom,cmd-crci: must contain the ADM command type CRCI block instance + number specified for the NAND controller on the given + platform +- qcom,data-crci: must contain the ADM data type CRCI block instance + number specified for the NAND controller on the given + platform +- #address-cells: <1> - subnodes give the chip-select number +- #size-cells: <0> + +* NAND chip-select + +Each controller may contain one or more subnodes to represent enabled +chip-selects which (may) contain NAND flash chips. Their properties are as +follows. + +Required properties: +- compatible: should contain "qcom,nandcs" +- reg: a single integer representing the chip-select + number (e.g., 0, 1, 2, etc.) +- #address-cells: see partition.txt +- #size-cells: see partition.txt +- nand-ecc-strength: see nand.txt +- nand-ecc-step-size: must be 512. see nand.txt for more details. + +Optional properties: +- nand-bus-width: see nand.txt + +Each nandcs device node may optionally contain a 'partitions' sub-node, which +further contains sub-nodes describing the flash partition mapping. See +partition.txt for more detail. + +Example: + +nand@1ac00000 { + compatible = "qcom,ebi2-nandc"; + reg = <0x1ac00000 0x800>; + + clocks = <&gcc EBI2_CLK>, + <&gcc EBI2_AON_CLK>; + clock-names = "core", "aon"; + + dmas = <&adm_dma 3>; + dma-names = "rxtx"; + qcom,cmd-crci = <15>; + qcom,data-crci = <3>; + + #address-cells = <1>; + #size-cells = <0>; + + nandcs@0 { + compatible = "qcom,nandcs"; + reg = <0>; + + nand-ecc-strength = <4>; + nand-ecc-step-size = <512>; + nand-bus-width = <8>; + + partitions { + compatible = "fixed-partitions"; + #address-cells = <1>; + #size-cells = <1>; + + partition@0 { + label = "boot-nand"; + reg = <0 0x58a0000>; + }; + + partition@58a0000 { + label = "fs-nand"; + reg = <0x58a0000 0x4000000>; + }; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt index b8627e763dba..c84fb47265eb 100644 --- a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt +++ b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt @@ -35,6 +35,8 @@ Required properties: - "rockchip,rk3288-io-voltage-domain" for rk3288 - "rockchip,rk3368-io-voltage-domain" for rk3368 - "rockchip,rk3368-pmu-io-voltage-domain" for rk3368 pmu-domains + - "rockchip,rk3399-io-voltage-domain" for rk3399 + - "rockchip,rk3399-pmu-io-voltage-domain" for rk3399 pmu-domains - rockchip,grf: phandle to the syscon managing the "general register files" @@ -79,6 +81,15 @@ Possible supplies for rk3368 pmu-domains: - pmu-supply: The supply connected to PMUIO_VDD. - vop-supply: The supply connected to LCDC_VDD. +Possible supplies for rk3399: +- bt656-supply: The supply connected to APIO2_VDD. +- audio-supply: The supply connected to APIO5_VDD. +- sdmmc-supply: The supply connected to SDMMC0_VDD. +- gpio1830 The supply connected to APIO4_VDD. + +Possible supplies for rk3399 pmu-domains: +- pmu1830-supply:The supply connected to PMUIO2_VDD. + Example: io-domains { diff --git a/Documentation/devicetree/bindings/rtc/maxim,mcp795.txt b/Documentation/devicetree/bindings/rtc/maxim,mcp795.txt new file mode 100644 index 000000000000..a59fdd8c236d --- /dev/null +++ b/Documentation/devicetree/bindings/rtc/maxim,mcp795.txt @@ -0,0 +1,11 @@ +* Maxim MCP795 SPI Serial Real-Time Clock + +Required properties: +- compatible: Should contain "maxim,mcp795". +- reg: SPI address for chip + +Example: + mcp795: rtc@0 { + compatible = "maxim,mcp795"; + reg = <0>; + }; diff --git a/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt b/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt index 03c0e989e020..66f6adf8d44d 100644 --- a/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt +++ b/Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt @@ -38,6 +38,9 @@ Optional properties: defined or a value in the array is "0" then it is assumed that the frequency is set by the parent clock or a fixed rate clock source. +-lanes-per-direction : number of lanes available per direction - either 1 or 2. + Note that it is assume same number of lanes is used both + directions at once. If not specified, default is 2 lanes per direction. Note: If above properties are not defined it can be assumed that the supply regulators or clocks are always on. diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index 42adb4101cc6..86740d4a270d 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -10,6 +10,7 @@ ad Avionic Design GmbH adapteva Adapteva, Inc. adh AD Holdings Plc. adi Analog Devices, Inc. +advantech Advantech Corporation aeroflexgaisler Aeroflex Gaisler AB al Annapurna Labs allwinner Allwinner Technology Co., Ltd. @@ -89,6 +90,7 @@ fcs Fairchild Semiconductor firefly Firefly focaltech FocalTech Systems Co.,Ltd fsl Freescale Semiconductor +ge General Electric Company GEFanuc GE Fanuc Intelligent Platforms Embedded Systems, Inc. gef GE Fanuc Intelligent Platforms Embedded Systems, Inc. geniatech Geniatech, Inc. diff --git a/Documentation/dma-buf-sharing.txt b/Documentation/dma-buf-sharing.txt index 32ac32e773e1..ca44c5820585 100644 --- a/Documentation/dma-buf-sharing.txt +++ b/Documentation/dma-buf-sharing.txt @@ -352,7 +352,8 @@ Being able to mmap an export dma-buf buffer object has 2 main use-cases: No special interfaces, userspace simply calls mmap on the dma-buf fd, making sure that the cache synchronization ioctl (DMA_BUF_IOCTL_SYNC) is *always* - used when the access happens. This is discussed next paragraphs. + used when the access happens. Note that DMA_BUF_IOCTL_SYNC can fail with + -EAGAIN or -EINTR, in which case it must be restarted. Some systems might need some sort of cache coherency management e.g. when CPU and GPU domains are being accessed through dma-buf at the same time. To @@ -366,10 +367,10 @@ Being able to mmap an export dma-buf buffer object has 2 main use-cases: want (with the new data being consumed by the GPU or say scanout device) - munmap once you don't need the buffer any more - Therefore, for correctness and optimal performance, systems with the memory - cache shared by the GPU and CPU i.e. the "coherent" and also the - "incoherent" are always required to use SYNC_START and SYNC_END before and - after, respectively, when accessing the mapped address. + For correctness and optimal performance, it is always required to use + SYNC_START and SYNC_END before and after, respectively, when accessing the + mapped address. Userspace cannot rely on coherent access, even when there + are systems where it just works without calling these ioctls. 2. Supporting existing mmap interfaces in importers diff --git a/Documentation/filesystems/nfs/pnfs-scsi-server.txt b/Documentation/filesystems/nfs/pnfs-scsi-server.txt new file mode 100644 index 000000000000..5bef7268bd9f --- /dev/null +++ b/Documentation/filesystems/nfs/pnfs-scsi-server.txt @@ -0,0 +1,23 @@ + +pNFS SCSI layout server user guide +================================== + +This document describes support for pNFS SCSI layouts in the Linux NFS server. +With pNFS SCSI layouts, the NFS server acts as Metadata Server (MDS) for pNFS, +which in addition to handling all the metadata access to the NFS export, +also hands out layouts to the clients so that they can directly access the +underlying SCSI LUNs that are shared with the client. + +To use pNFS SCSI layouts with with the Linux NFS server, the exported file +system needs to support the pNFS SCSI layouts (currently just XFS), and the +file system must sit on a SCSI LUN that is accessible to the clients in +addition to the MDS. As of now the file system needs to sit directly on the +exported LUN, striping or concatenation of LUNs on the MDS and clients +is not supported yet. + +On a server built with CONFIG_NFSD_SCSI, the pNFS SCSI volume support is +automatically enabled if the file system is exported using the "pnfs" +option and the underlying SCSI device support persistent reservations. +On the client make sure the kernel has the CONFIG_PNFS_BLOCK option +enabled, and the file system is mounted using the NFSv4.1 protocol +version (mount -o vers=4.1). diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt new file mode 100644 index 000000000000..e1a0056a365f --- /dev/null +++ b/Documentation/filesystems/orangefs.txt @@ -0,0 +1,406 @@ +ORANGEFS +======== + +OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal +for large storage problems faced by HPC, BigData, Streaming Video, +Genomics, Bioinformatics. + +Orangefs, originally called PVFS, was first developed in 1993 by +Walt Ligon and Eric Blumer as a parallel file system for Parallel +Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns +of parallel programs. + +Orangefs features include: + + * Distributes file data among multiple file servers + * Supports simultaneous access by multiple clients + * Stores file data and metadata on servers using local file system + and access methods + * Userspace implementation is easy to install and maintain + * Direct MPI support + * Stateless + + +MAILING LIST +============ + +http://beowulf-underground.org/mailman/listinfo/pvfs2-users + + +DOCUMENTATION +============= + +http://www.orangefs.org/documentation/ + + +USERSPACE FILESYSTEM SOURCE +=========================== + +http://www.orangefs.org/download + +Orangefs versions prior to 2.9.3 would not be compatible with the +upstream version of the kernel client. + + +BUILDING THE USERSPACE FILESYSTEM ON A SINGLE SERVER +==================================================== + +When Orangefs is upstream, "--with-kernel" shouldn't be needed, but +until then the path to where the kernel with the Orangefs kernel client +patch was built is needed to ensure that pvfs2-client-core (the bridge +between kernel space and user space) will build properly. You can omit +--prefix if you don't care that things are sprinkled around in +/usr/local. + +./configure --prefix=/opt/ofs --with-kernel=/path/to/orangefs/kernel + +make + +make install + +Create an orangefs config file: +/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf + + for "Enter hostnames", use the hostname, don't let it default to + localhost. + +create a pvfs2tab file in /etc: +cat /etc/pvfs2tab +tcp://myhostname:3334/orangefs /mymountpoint pvfs2 defaults,noauto 0 0 + +create the mount point you specified in the tab file if needed: +mkdir /mymountpoint + +bootstrap the server: +/opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf -f + +start the server: +/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf + +Now the server is running. At this point you might like to +prove things are working with: + +/opt/osf/bin/pvfs2-ls /mymountpoint + +You might not want to enforce selinux, it doesn't seem to matter by +linux 3.11... + +If stuff seems to be working, turn on the client core: +/opt/osf/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core + +Mount your filesystem. +mount -t pvfs2 tcp://myhostname:3334/orangefs /mymountpoint + + +OPTIONS +======= + +The following mount options are accepted: + + acl + Allow the use of Access Control Lists on files and directories. + + intr + Some operations between the kernel client and the user space + filesystem can be interruptible, such as changes in debug levels + and the setting of tunable parameters. + + local_lock + Enable posix locking from the perspective of "this" kernel. The + default file_operations lock action is to return ENOSYS. Posix + locking kicks in if the filesystem is mounted with -o local_lock. + Distributed locking is being worked on for the future. + + +DEBUGGING +========= + +If you want the debug (GOSSIP) statements in a particular +source file (inode.c for example) go to syslog: + + echo inode > /sys/kernel/debug/orangefs/kernel-debug + +No debugging (the default): + + echo none > /sys/kernel/debug/orangefs/kernel-debug + +Debugging from several source files: + + echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug + +All debugging: + + echo all > /sys/kernel/debug/orangefs/kernel-debug + +Get a list of all debugging keywords: + + cat /sys/kernel/debug/orangefs/debug-help + + +PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE +============================================ + +Orangefs is a user space filesystem and an associated kernel module. +We'll just refer to the user space part of Orangefs as "userspace" +from here on out. Orangefs descends from PVFS, and userspace code +still uses PVFS for function and variable names. Userspace typedefs +many of the important structures. Function and variable names in +the kernel module have been transitioned to "orangefs", and The Linux +Coding Style avoids typedefs, so kernel module structures that +correspond to userspace structures are not typedefed. + +The kernel module implements a pseudo device that userspace +can read from and write to. Userspace can also manipulate the +kernel module through the pseudo device with ioctl. + +THE BUFMAP: + +At startup userspace allocates two page-size-aligned (posix_memalign) +mlocked memory buffers, one is used for IO and one is used for readdir +operations. The IO buffer is 41943040 bytes and the readdir buffer is +4194304 bytes. Each buffer contains logical chunks, or partitions, and +a pointer to each buffer is added to its own PVFS_dev_map_desc structure +which also describes its total size, as well as the size and number of +the partitions. + +A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a +mapping routine in the kernel module with an ioctl. The structure is +copied from user space to kernel space with copy_from_user and is used +to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which +then contains: + + * refcnt - a reference counter + * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's + partition size, which represents the filesystem's block size and + is used for s_blocksize in super blocks. + * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of + partitions in the IO buffer. + * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. + * total_size - the total size of the IO buffer. + * page_count - the number of 4096 byte pages in the IO buffer. + * page_array - a pointer to page_count * (sizeof(struct page*)) bytes + of kcalloced memory. This memory is used as an array of pointers + to each of the pages in the IO buffer through a call to get_user_pages. + * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc)) + bytes of kcalloced memory. This memory is further intialized: + + user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc + structure. user_desc->ptr points to the IO buffer. + + pages_per_desc = bufmap->desc_size / PAGE_SIZE + offset = 0 + + bufmap->desc_array[0].page_array = &bufmap->page_array[offset] + bufmap->desc_array[0].array_count = pages_per_desc = 1024 + bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) + offset += 1024 + . + . + . + bufmap->desc_array[9].page_array = &bufmap->page_array[offset] + bufmap->desc_array[9].array_count = pages_per_desc = 1024 + bufmap->desc_array[9].uaddr = (user_desc->ptr) + + (9 * 1024 * 4096) + offset += 1024 + + * buffer_index_array - a desc_count sized array of ints, used to + indicate which of the IO buffer's partitions are available to use. + * buffer_index_lock - a spinlock to protect buffer_index_array during update. + * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element + int array used to indicate which of the readdir buffer's partitions are + available to use. + * readdir_index_lock - a spinlock to protect readdir_index_array during + update. + +OPERATIONS: + +The kernel module builds an "op" (struct orangefs_kernel_op_s) when it +needs to communicate with userspace. Part of the op contains the "upcall" +which expresses the request to userspace. Part of the op eventually +contains the "downcall" which expresses the results of the request. + +The slab allocator is used to keep a cache of op structures handy. + +At init time the kernel module defines and initializes a request list +and an in_progress hash table to keep track of all the ops that are +in flight at any given time. + +Ops are stateful: + + * unknown - op was just initialized + * waiting - op is on request_list (upward bound) + * inprogr - op is in progress (waiting for downcall) + * serviced - op has matching downcall; ok + * purged - op has to start a timer since client-core + exited uncleanly before servicing op + * given up - submitter has given up waiting for it + +When some arbitrary userspace program needs to perform a +filesystem operation on Orangefs (readdir, I/O, create, whatever) +an op structure is initialized and tagged with a distinguishing ID +number. The upcall part of the op is filled out, and the op is +passed to the "service_operation" function. + +Service_operation changes the op's state to "waiting", puts +it on the request list, and signals the Orangefs file_operations.poll +function through a wait queue. Userspace is polling the pseudo-device +and thus becomes aware of the upcall request that needs to be read. + +When the Orangefs file_operations.read function is triggered, the +request list is searched for an op that seems ready-to-process. +The op is removed from the request list. The tag from the op and +the filled-out upcall struct are copy_to_user'ed back to userspace. + +If any of these (and some additional protocol) copy_to_users fail, +the op's state is set to "waiting" and the op is added back to +the request list. Otherwise, the op's state is changed to "in progress", +and the op is hashed on its tag and put onto the end of a list in the +in_progress hash table at the index the tag hashed to. + +When userspace has assembled the response to the upcall, it +writes the response, which includes the distinguishing tag, back to +the pseudo device in a series of io_vecs. This triggers the Orangefs +file_operations.write_iter function to find the op with the associated +tag and remove it from the in_progress hash table. As long as the op's +state is not "canceled" or "given up", its state is set to "serviced". +The file_operations.write_iter function returns to the waiting vfs, +and back to service_operation through wait_for_matching_downcall. + +Service operation returns to its caller with the op's downcall +part (the response to the upcall) filled out. + +The "client-core" is the bridge between the kernel module and +userspace. The client-core is a daemon. The client-core has an +associated watchdog daemon. If the client-core is ever signaled +to die, the watchdog daemon restarts the client-core. Even though +the client-core is restarted "right away", there is a period of +time during such an event that the client-core is dead. A dead client-core +can't be triggered by the Orangefs file_operations.poll function. +Ops that pass through service_operation during a "dead spell" can timeout +on the wait queue and one attempt is made to recycle them. Obviously, +if the client-core stays dead too long, the arbitrary userspace processes +trying to use Orangefs will be negatively affected. Waiting ops +that can't be serviced will be removed from the request list and +have their states set to "given up". In-progress ops that can't +be serviced will be removed from the in_progress hash table and +have their states set to "given up". + +Readdir and I/O ops are atypical with respect to their payloads. + + - readdir ops use the smaller of the two pre-allocated pre-partitioned + memory buffers. The readdir buffer is only available to userspace. + The kernel module obtains an index to a free partition before launching + a readdir op. Userspace deposits the results into the indexed partition + and then writes them to back to the pvfs device. + + - io (read and write) ops use the larger of the two pre-allocated + pre-partitioned memory buffers. The IO buffer is accessible from + both userspace and the kernel module. The kernel module obtains an + index to a free partition before launching an io op. The kernel module + deposits write data into the indexed partition, to be consumed + directly by userspace. Userspace deposits the results of read + requests into the indexed partition, to be consumed directly + by the kernel module. + +Responses to kernel requests are all packaged in pvfs2_downcall_t +structs. Besides a few other members, pvfs2_downcall_t contains a +union of structs, each of which is associated with a particular +response type. + +The several members outside of the union are: + - int32_t type - type of operation. + - int32_t status - return code for the operation. + - int64_t trailer_size - 0 unless readdir operation. + - char *trailer_buf - initialized to NULL, used during readdir operations. + +The appropriate member inside the union is filled out for any +particular response. + + PVFS2_VFS_OP_FILE_IO + fill a pvfs2_io_response_t + + PVFS2_VFS_OP_LOOKUP + fill a PVFS_object_kref + + PVFS2_VFS_OP_CREATE + fill a PVFS_object_kref + + PVFS2_VFS_OP_SYMLINK + fill a PVFS_object_kref + + PVFS2_VFS_OP_GETATTR + fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) + fill in a string with the link target when the object is a symlink. + + PVFS2_VFS_OP_MKDIR + fill a PVFS_object_kref + + PVFS2_VFS_OP_STATFS + fill a pvfs2_statfs_response_t with useless info <g>. It is hard for + us to know, in a timely fashion, these statistics about our + distributed network filesystem. + + PVFS2_VFS_OP_FS_MOUNT + fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref + except its members are in a different order and "__pad1" is replaced + with "id". + + PVFS2_VFS_OP_GETXATTR + fill a pvfs2_getxattr_response_t + + PVFS2_VFS_OP_LISTXATTR + fill a pvfs2_listxattr_response_t + + PVFS2_VFS_OP_PARAM + fill a pvfs2_param_response_t + + PVFS2_VFS_OP_PERF_COUNT + fill a pvfs2_perf_count_response_t + + PVFS2_VFS_OP_FSKEY + file a pvfs2_fs_key_response_t + + PVFS2_VFS_OP_READDIR + jamb everything needed to represent a pvfs2_readdir_response_t into + the readdir buffer descriptor specified in the upcall. + +Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests +made by the kernel side. + +A buffer_list containing: + - a pointer to the prepared response to the request from the + kernel (struct pvfs2_downcall_t). + - and also, in the case of a readdir request, a pointer to a + buffer containing descriptors for the objects in the target + directory. +... is sent to the function (PINT_dev_write_list) which performs +the writev. + +PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; + +The first four elements of io_array are initialized like this for all +responses: + + io_array[0].iov_base = address of local variable "proto_ver" (int32_t) + io_array[0].iov_len = sizeof(int32_t) + + io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) + io_array[1].iov_len = sizeof(int32_t) + + io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) + io_array[2].iov_len = sizeof(int64_t) + + io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) + of global variable vfs_request (vfs_request_t) + io_array[3].iov_len = sizeof(pvfs2_downcall_t) + +Readdir responses initialize the fifth element io_array like this: + + io_array[4].iov_base = contents of member trailer_buf (char *) + from out_downcall member of global variable + vfs_request + io_array[4].iov_len = contents of member trailer_size (PVFS_size) + from out_downcall member of global variable + vfs_request + + diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt index aa1e0c91e368..7dd95b35cd7c 100644 --- a/Documentation/kasan.txt +++ b/Documentation/kasan.txt @@ -12,8 +12,7 @@ KASAN uses compile-time instrumentation for checking every memory access, therefore you will need a GCC version 4.9.2 or later. GCC 5.0 or later is required for detection of out-of-bounds accesses to stack or global variables. -Currently KASAN is supported only for x86_64 architecture and requires the -kernel to be built with the SLUB allocator. +Currently KASAN is supported only for x86_64 architecture. 1. Usage ======== @@ -27,7 +26,7 @@ inline are compiler instrumentation types. The former produces smaller binary the latter is 1.1 - 2 times faster. Inline instrumentation requires a GCC version 5.0 or later. -Currently KASAN works only with the SLUB memory allocator. +KASAN works with both SLUB and SLAB memory allocators. For better bug detection and nicer reporting, enable CONFIG_STACKTRACE. To disable instrumentation for specific files or directories, add a line |