Diffstat (limited to 'Documentation/block/data-integrity.txt')
-rw-r--r-- | Documentation/block/data-integrity.txt | 281 |
1 files changed, 0 insertions, 281 deletions
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
deleted file mode 100644
index 934c44ea0c57..000000000000
--- a/Documentation/block/data-integrity.txt
+++ /dev/null
@@ -1,281 +0,0 @@
-----------------------------------------------------------------------
-1. INTRODUCTION
-
-Modern filesystems feature checksumming of data and metadata to
-protect against data corruption. However, the detection of the
-corruption is done at read time which could potentially be months
-after the data was written. At that point the original data that the
-application tried to write is most likely lost.
-
-The solution is to ensure that the disk is actually storing what the
-application meant it to. Recent additions to both the SCSI family
-protocols (SBC Data Integrity Field, SCC protection proposal) as well
-as SATA/T13 (External Path Protection) try to remedy this by adding
-support for appending integrity metadata to an I/O. The integrity
-metadata (or protection information in SCSI terminology) includes a
-checksum for each sector as well as an incrementing counter that
-ensures the individual sectors are written in the right order. And
-for some protection schemes also that the I/O is written to the right
-place on disk.
-
-Current storage controllers and devices implement various protective
-measures, for instance checksumming and scrubbing. But these
-technologies are working in their own isolated domains or at best
-between adjacent nodes in the I/O path. The interesting thing about
-DIF and the other integrity extensions is that the protection format
-is well defined and every node in the I/O path can verify the
-integrity of the I/O and reject it if corruption is detected. This
-allows not only corruption prevention but also isolation of the point
-of failure.
-
-----------------------------------------------------------------------
-2. THE DATA INTEGRITY EXTENSIONS
-
-As written, the protocol extensions only protect the path between
-controller and storage device. However, many controllers actually
-allow the operating system to interact with the integrity metadata
-(IMD). We have been working with several FC/SAS HBA vendors to enable
-the protection information to be transferred to and from their
-controllers.
-
-The SCSI Data Integrity Field works by appending 8 bytes of protection
-information to each sector. The data + integrity metadata is stored
-in 520 byte sectors on disk. Data + IMD are interleaved when
-transferred between the controller and target. The T13 proposal is
-similar.
-
-Because it is highly inconvenient for operating systems to deal with
-520 (and 4104) byte sectors, we approached several HBA vendors and
-encouraged them to allow separation of the data and integrity metadata
-scatter-gather lists.
-
-The controller will interleave the buffers on write and split them on
-read. This means that Linux can DMA the data buffers to and from
-host memory without changes to the page cache.
-
-Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
-is somewhat heavy to compute in software. Benchmarks found that
-calculating this checksum had a significant impact on system
-performance for a number of workloads. Some controllers allow a
-lighter-weight checksum to be used when interfacing with the operating
-system. Emulex, for instance, supports the TCP/IP checksum instead.
-The IP checksum received from the OS is converted to the 16-bit CRC
-when writing and vice versa. This allows the integrity metadata to be
-generated by Linux or the application at very low cost (comparable to
-software RAID5).
-
-The IP checksum is weaker than the CRC in terms of detecting bit
-errors. However, the strength is really in the separation of the data
-buffers and the integrity metadata. These two distinct buffers must
-match up for an I/O to complete.
-
-The separation of the data and integrity metadata buffers as well as
-the choice in checksums is referred to as the Data Integrity
-Extensions. As these extensions are outside the scope of the protocol
-bodies (T10, T13), Oracle and its partners are trying to standardize
-them within the Storage Networking Industry Association.
-
-----------------------------------------------------------------------
-3. KERNEL CHANGES
-
-The data integrity framework in Linux enables protection information
-to be pinned to I/Os and sent to/received from controllers that
-support it.
-
-The advantage to the integrity extensions in SCSI and SATA is that
-they enable us to protect the entire path from application to storage
-device. However, at the same time this is also the biggest
-disadvantage. It means that the protection information must be in a
-format that can be understood by the disk.
-
-Generally Linux/POSIX applications are agnostic to the intricacies of
-the storage devices they are accessing. The virtual filesystem switch
-and the block layer make things like hardware sector size and
-transport protocols completely transparent to the application.
-
-However, this level of detail is required when preparing the
-protection information to send to a disk. Consequently, the very
-concept of an end-to-end protection scheme is a layering violation.
-It is completely unreasonable for an application to be aware whether
-it is accessing a SCSI or SATA disk.
-
-The data integrity support implemented in Linux attempts to hide this
-from the application. As far as the application (and to some extent
-the kernel) is concerned, the integrity metadata is opaque information
-that's attached to the I/O.
-
-The current implementation allows the block layer to automatically
-generate the protection information for any I/O. Eventually the
-intent is to move the integrity metadata calculation to userspace for
-user data. Metadata and other I/O that originates within the kernel
-will still use the automatic generation interface.
-
-Some storage devices allow each hardware sector to be tagged with a
-16-bit value. The owner of this tag space is the owner of the block
-device. I.e. the filesystem in most cases. The filesystem can use
-this extra space to tag sectors as they see fit. Because the tag
-space is limited, the block interface allows tagging bigger chunks by
-way of interleaving. This way, 8*16 bits of information can be
-attached to a typical 4KB filesystem block.
-
-This also means that applications such as fsck and mkfs will need
-access to manipulate the tags from user space. A passthrough
-interface for this is being worked on.
-
-
-----------------------------------------------------------------------
-4. BLOCK LAYER IMPLEMENTATION DETAILS
-
-4.1 BIO
-
-The data integrity patches add a new field to struct bio when
-CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a
-pointer to a struct bip which contains the bio integrity payload.
-Essentially a bip is a trimmed down struct bio which holds a bio_vec
-containing the integrity metadata and the required housekeeping
-information (bvec pool, vector count, etc.)
-
-A kernel subsystem can enable data integrity protection on a bio by
-calling bio_integrity_alloc(bio). This will allocate and attach the
-bip to the bio.
-
-Individual pages containing integrity metadata can subsequently be
-attached using bio_integrity_add_page().
-
-bio_free() will automatically free the bip.
-
-
-4.2 BLOCK DEVICE
-
-Because the format of the protection data is tied to the physical
-disk, each block device has been extended with a block integrity
-profile (struct blk_integrity). This optional profile is registered
-with the block layer using blk_integrity_register().
-
-The profile contains callback functions for generating and verifying
-the protection data, as well as getting and setting application tags.
-The profile also contains a few constants to aid in completing,
-merging and splitting the integrity metadata.
-
-Layered block devices will need to pick a profile that's appropriate
-for all subdevices. blk_integrity_compare() can help with that. DM
-and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
-will require extra work due to the application tag.
-
-
-----------------------------------------------------------------------
-5.0 BLOCK LAYER INTEGRITY API
-
-5.1 NORMAL FILESYSTEM
-
-    The normal filesystem is unaware that the underlying block device
-    is capable of sending/receiving integrity metadata. The IMD will
-    be automatically generated by the block layer at submit_bio() time
-    in case of a WRITE. A READ request will cause the I/O integrity
-    to be verified upon completion.
-
-    IMD generation and verification can be toggled using the
-
-      /sys/block/<bdev>/integrity/write_generate
-
-    and
-
-      /sys/block/<bdev>/integrity/read_verify
-
-    flags.
-
-
-5.2 INTEGRITY-AWARE FILESYSTEM
-
-    A filesystem that is integrity-aware can prepare I/Os with IMD
-    attached. It can also use the application tag space if this is
-    supported by the block device.
-
-
-    bool bio_integrity_prep(bio);
-
-      To generate IMD for WRITE and to set up buffers for READ, the
-      filesystem must call bio_integrity_prep(bio).
-
-      Prior to calling this function, the bio data direction and start
-      sector must be set, and the bio should have all data pages
-      added. It is up to the caller to ensure that the bio does not
-      change while I/O is in progress.
-      Complete bio with error if prepare failed for some reason.
-
-
-5.3 PASSING EXISTING INTEGRITY METADATA
-
-    Filesystems that either generate their own integrity metadata or
-    are capable of transferring IMD from user space can use the
-    following calls:
-
-
-    struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
-
-      Allocates the bio integrity payload and hangs it off of the bio.
-      nr_pages indicates how many pages of protection data need to be
-      stored in the integrity bio_vec list (similar to bio_alloc()).
-
-      The integrity payload will be freed at bio_free() time.
-
-
-    int bio_integrity_add_page(bio, page, len, offset);
-
-      Attaches a page containing integrity metadata to an existing
-      bio. The bio must have an existing bip,
-      i.e. bio_integrity_alloc() must have been called. For a WRITE,
-      the integrity metadata in the pages must be in a format
-      understood by the target device with the notable exception that
-      the sector numbers will be remapped as the request traverses the
-      I/O stack. This implies that the pages added using this call
-      will be modified during I/O! The first reference tag in the
-      integrity metadata must have a value of bip->bip_sector.
-
-      Pages can be added using bio_integrity_add_page() as long as
-      there is room in the bip bio_vec array (nr_pages).
-
-      Upon completion of a READ operation, the attached pages will
-      contain the integrity metadata received from the storage device.
-      It is up to the receiver to process them and verify data
-      integrity upon completion.
-
-
-5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
-    METADATA
-
-    To enable integrity exchange on a block device the gendisk must be
-    registered as capable:
-
-    int blk_integrity_register(gendisk, blk_integrity);
-
-      The blk_integrity struct is a template and should contain the
-      following:
-
-        static struct blk_integrity my_profile = {
-            .name        = "STANDARDSBODY-TYPE-VARIANT-CSUM",
-            .generate_fn = my_generate_fn,
-            .verify_fn   = my_verify_fn,
-            .tuple_size  = sizeof(struct my_tuple_size),
-            .tag_size    = <tag bytes per hw sector>,
-        };
-
-      'name' is a text string which will be visible in sysfs. This is
-      part of the userland API so choose it carefully and never change
-      it. The format is standards body-type-variant.
-      E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
-
-      'generate_fn' generates appropriate integrity metadata (for WRITE).
-
-      'verify_fn' verifies that the data buffer matches the integrity
-      metadata.
-
-      'tuple_size' must be set to match the size of the integrity
-      metadata per sector. I.e. 8 for DIF and EPP.
-
-      'tag_size' must be set to identify how many bytes of tag space
-      are available per hardware sector. For DIF this is either 2 or
-      0 depending on the value of the Control Mode Page ATO bit.
-
-----------------------------------------------------------------------
-2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>
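
For readers of this removal, the per-sector tuple and the two guard checksums that the deleted text describes (sections 2 and 5.4) can be illustrated with a small user-space C sketch. It is not kernel code and not the callbacks of any real driver: the names my_tuple, my_generate, my_verify, crc_t10dif and ip_csum are hypothetical, and the tuple layout (2-byte guard, 2-byte application tag, 4-byte reference tag, stored big-endian) is an assumption taken from T10 DIF Type 1 rather than from the text above. The 512-byte sector size is likewise assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512 /* assumed hardware sector size */

/* Hypothetical 8-byte tuple; byte arrays avoid padding and endian issues. */
struct my_tuple {
    uint8_t guard[2];   /* checksum of the sector data */
    uint8_t app_tag[2]; /* 16-bit tag space owned by the block device owner */
    uint8_t ref_tag[4]; /* incrementing counter, here the low 32 bits of the LBA */
};

typedef uint16_t (*csum_fn)(const uint8_t *buf, size_t len);

/* 16-bit CRC with the T10 DIF polynomial 0x8BB7, initial value 0, bitwise. */
static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Internet checksum (RFC 1071): ones' complement of the ones' complement sum. */
static uint16_t ip_csum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                      /* odd trailing byte, padded with zero */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                 /* fold carries back into 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill one tuple per sector: guard from csum(), zero app tag, counting ref tag. */
static void my_generate(const uint8_t *data, struct my_tuple *prot,
                        size_t nr_sectors, uint64_t start_sector, csum_fn csum)
{
    for (size_t i = 0; i < nr_sectors; i++) {
        uint16_t guard = csum(data + i * SECTOR_SIZE, SECTOR_SIZE);
        uint32_t ref = (uint32_t)(start_sector + i);

        prot[i].guard[0] = (uint8_t)(guard >> 8);
        prot[i].guard[1] = (uint8_t)guard;
        memset(prot[i].app_tag, 0, sizeof(prot[i].app_tag));
        prot[i].ref_tag[0] = (uint8_t)(ref >> 24);
        prot[i].ref_tag[1] = (uint8_t)(ref >> 16);
        prot[i].ref_tag[2] = (uint8_t)(ref >> 8);
        prot[i].ref_tag[3] = (uint8_t)ref;
    }
}

/* Return the index of the first sector whose guard or ref tag mismatches, or -1. */
static long my_verify(const uint8_t *data, const struct my_tuple *prot,
                      size_t nr_sectors, uint64_t start_sector, csum_fn csum)
{
    for (size_t i = 0; i < nr_sectors; i++) {
        struct my_tuple expect;

        my_generate(data + i * SECTOR_SIZE, &expect, 1, start_sector + i, csum);
        if (memcmp(expect.guard, prot[i].guard, 2) != 0 ||
            memcmp(expect.ref_tag, prot[i].ref_tag, 4) != 0)
            return (long)i;
    }
    return -1;
}
```

Passing the checksum as a function pointer mirrors the "choice in checksums" that the deleted text calls part of the Data Integrity Extensions: the same generate and verify loops work with either crc_t10dif or ip_csum. Keeping the tuples in a buffer separate from the data follows the separated scatter-gather lists described in section 2.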