From b8d834488fd7c0c5a79cd2bab112c37a3d3292b9 Mon Sep 17 00:00:00 2001 From: Goldwyn Rodrigues Date: Tue, 10 Jun 2014 16:31:01 -0500 Subject: md-cluster: Design Documentation Signed-off-by: Goldwyn Rodrigues --- Documentation/md-cluster.txt | 176 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 176 insertions(+) create mode 100644 Documentation/md-cluster.txt (limited to 'Documentation') diff --git a/Documentation/md-cluster.txt b/Documentation/md-cluster.txt new file mode 100644 index 000000000000..de1af7db3355 --- /dev/null +++ b/Documentation/md-cluster.txt @@ -0,0 +1,176 @@ +The cluster MD is a shared-device RAID for a cluster. + + +1. On-disk format + +Separate write-intent-bitmap are used for each cluster node. +The bitmaps record all writes that may have been started on that node, +and may not yet have finished. The on-disk layout is: + +0 4k 8k 12k +------------------------------------------------------------------- +| idle | md super | bm super [0] + bits | +| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | +| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | +| bm bits [3, contd] | | | + +During "normal" functioning we assume the filesystem ensures that only one +node writes to any given block at a time, so a write +request will + - set the appropriate bit (if not already set) + - commit the write to all mirrors + - schedule the bit to be cleared after a timeout. + +Reads are just handled normally. It is up to the filesystem to +ensure one node doesn't read from a location where another node (or the same +node) is writing. + + +2. DLM Locks for management + +There are two locks for managing the device: + +2.1 Bitmap lock resource (bm_lockres) + + The bm_lockres protects individual node bitmaps. They are named in the + form bitmap001 for node 1, bitmap002 for node and so on. When a node + joins the cluster, it acquires the lock in PW mode and it stays so + during the lifetime the node is part of the cluster. The lock resource + number is based on the slot number returned by the DLM subsystem. Since + DLM starts node count from one and bitmap slots start from zero, one is + subtracted from the DLM slot number to arrive at the bitmap slot number. + +3. Communication + +Each node has to communicate with other nodes when starting or ending +resync, and metadata superblock updates. + +3.1 Message Types + + There are 3 types, of messages which are passed + + 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been + updated, and the node must re-read the md superblock. This is performed + synchronously. + + 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended + so that each node may suspend or resume the region. + +3.2 Communication mechanism + + The DLM LVB is used to communicate within nodes of the cluster. There + are three resources used for the purpose: + + 3.2.1 Token: The resource which protects the entire communication + system. The node having the token resource is allowed to + communicate. + + 3.2.2 Message: The lock resource which carries the data to + communicate. + + 3.2.3 Ack: The resource, acquiring which means the message has been + acknowledged by all nodes in the cluster. The BAST of the resource + is used to inform the receive node that a node wants to communicate. + +The algorithm is: + + 1. receive status + + sender receiver receiver + ACK:CR ACK:CR ACK:CR + + 2. sender get EX of TOKEN + sender get EX of MESSAGE + sender receiver receiver + TOKEN:EX ACK:CR ACK:CR + MESSAGE:EX + ACK:CR + + Sender checks that it still needs to send a message. Messages received + or other events that happened while waiting for the TOKEN may have made + this message inappropriate or redundant. + + 3. sender write LVB. + sender down-convert MESSAGE from EX to CR + sender try to get EX of ACK + [ wait until all receiver has *processed* the MESSAGE ] + + [ triggered by bast of ACK ] + receiver get CR of MESSAGE + receiver read LVB + receiver processes the message + [ wait finish ] + receiver release ACK + + sender receiver receiver + TOKEN:EX MESSAGE:CR MESSAGE:CR + MESSAGE:CR + ACK:EX + + 4. triggered by grant of EX on ACK (indicating all receivers have processed + message) + sender down-convert ACK from EX to CR + sender release MESSAGE + sender release TOKEN + receiver upconvert to EX of MESSAGE + receiver get CR of ACK + receiver release MESSAGE + + sender receiver receiver + ACK:CR ACK:CR ACK:CR + + +4. Handling Failures + +4.1 Node Failure + When a node fails, the DLM informs the cluster with the slot. The node + starts a cluster recovery thread. The cluster recovery thread: + - acquires the bitmap lock of the failed node + - opens the bitmap + - reads the bitmap of the failed node + - copies the set bitmap to local node + - cleans the bitmap of the failed node + - releases bitmap lock of the failed node + - initiates resync of the bitmap on the current node + + The resync process, is the regular md resync. However, in a clustered + environment when a resync is performed, it needs to tell other nodes + of the areas which are suspended. Before a resync starts, the node + send out RESYNC_START with the (lo,hi) range of the area which needs + to be suspended. Each node maintains a suspend_list, which contains + the list of ranges which are currently suspended. On receiving + RESYNC_START, the node adds the range to the suspend_list. Similarly, + when the node performing resync finishes, it send RESYNC_FINISHED + to other nodes and other nodes remove the corresponding entry from + the suspend_list. + + A helper function, should_suspend() can be used to check if a particular + I/O range should be suspended or not. + +4.2 Device Failure + Device failures are handled and communicated with the metadata update + routine. + +5. Adding a new Device +For adding a new device, it is necessary that all nodes "see" the new device +to be added. For this, the following algorithm is used: + + 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues + ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) + 2. Node 1 sends NEWDISK with uuid and slot number + 3. Other nodes issue kobject_uevent_env with uuid and slot number + (Steps 4,5 could be a udev rule) + 4. In userspace, the node searches for the disk, perhaps + using blkid -t SUB_UUID="" + 5. Other nodes issue either of the following depending on whether the disk + was found: + ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and + disc.number set to slot number) + ioctl(CLUSTERED_DISK_NACK) + 6. Other nodes drop lock on no-new-devs (CR) if device is found + 7. Node 1 attempts EX lock on no-new-devs + 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk + as SpareLocal + 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED + 10. Other nodes get the information whether a disk is added or not + by the following METADATA_UPDATED. -- cgit v1.2.3-59-g8ed1b