diff options
Diffstat (limited to 'Documentation/networking/rds.txt')
-rw-r--r-- | Documentation/networking/rds.txt | 423 |
1 files changed, 0 insertions, 423 deletions
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt deleted file mode 100644 index eec61694e894..000000000000 --- a/Documentation/networking/rds.txt +++ /dev/null @@ -1,423 +0,0 @@ - -Overview -======== - -This readme tries to provide some background on the hows and whys of RDS, -and will hopefully help you find your way around the code. - -In addition, please see this email about RDS origins: -http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html - -RDS Architecture -================ - -RDS provides reliable, ordered datagram delivery by using a single -reliable connection between any two nodes in the cluster. This allows -applications to use a single socket to talk to any other process in the -cluster - so in a cluster with N processes you need N sockets, in contrast -to N*N if you use a connection-oriented socket transport like TCP. - -RDS is not Infiniband-specific; it was designed to support different -transports. The current implementation used to support RDS over TCP as well -as IB. - -The high-level semantics of RDS from the application's point of view are - - * Addressing - RDS uses IPv4 addresses and 16bit port numbers to identify - the end point of a connection. All socket operations that involve - passing addresses between kernel and user space generally - use a struct sockaddr_in. - - The fact that IPv4 addresses are used does not mean the underlying - transport has to be IP-based. In fact, RDS over IB uses a - reliable IB connection; the IP address is used exclusively to - locate the remote node's GID (by ARPing for the given IP). - - The port space is entirely independent of UDP, TCP or any other - protocol. - - * Socket interface - RDS sockets work *mostly* as you would expect from a BSD - socket. The next section will cover the details. At any rate, - all I/O is performed through the standard BSD socket API. - Some additions like zerocopy support are implemented through - control messages, while other extensions use the getsockopt/ - setsockopt calls. - - Sockets must be bound before you can send or receive data. - This is needed because binding also selects a transport and - attaches it to the socket. Once bound, the transport assignment - does not change. RDS will tolerate IPs moving around (eg in - a active-active HA scenario), but only as long as the address - doesn't move to a different transport. - - * sysctls - RDS supports a number of sysctls in /proc/sys/net/rds - - -Socket Interface -================ - - AF_RDS, PF_RDS, SOL_RDS - AF_RDS and PF_RDS are the domain type to be used with socket(2) - to create RDS sockets. SOL_RDS is the socket-level to be used - with setsockopt(2) and getsockopt(2) for RDS specific socket - options. - - fd = socket(PF_RDS, SOCK_SEQPACKET, 0); - This creates a new, unbound RDS socket. - - setsockopt(SOL_SOCKET): send and receive buffer size - RDS honors the send and receive buffer size socket options. - You are not allowed to queue more than SO_SNDSIZE bytes to - a socket. A message is queued when sendmsg is called, and - it leaves the queue when the remote system acknowledges - its arrival. - - The SO_RCVSIZE option controls the maximum receive queue length. - This is a soft limit rather than a hard limit - RDS will - continue to accept and queue incoming messages, even if that - takes the queue length over the limit. However, it will also - mark the port as "congested" and send a congestion update to - the source node. The source node is supposed to throttle any - processes sending to this congested port. - - bind(fd, &sockaddr_in, ...) - This binds the socket to a local IP address and port, and a - transport, if one has not already been selected via the - SO_RDS_TRANSPORT socket option - - sendmsg(fd, ...) - Sends a message to the indicated recipient. The kernel will - transparently establish the underlying reliable connection - if it isn't up yet. - - An attempt to send a message that exceeds SO_SNDSIZE will - return with -EMSGSIZE - - An attempt to send a message that would take the total number - of queued bytes over the SO_SNDSIZE threshold will return - EAGAIN. - - An attempt to send a message to a destination that is marked - as "congested" will return ENOBUFS. - - recvmsg(fd, ...) - Receives a message that was queued to this socket. The sockets - recv queue accounting is adjusted, and if the queue length - drops below SO_SNDSIZE, the port is marked uncongested, and - a congestion update is sent to all peers. - - Applications can ask the RDS kernel module to receive - notifications via control messages (for instance, there is a - notification when a congestion update arrived, or when a RDMA - operation completes). These notifications are received through - the msg.msg_control buffer of struct msghdr. The format of the - messages is described in manpages. - - poll(fd) - RDS supports the poll interface to allow the application - to implement async I/O. - - POLLIN handling is pretty straightforward. When there's an - incoming message queued to the socket, or a pending notification, - we signal POLLIN. - - POLLOUT is a little harder. Since you can essentially send - to any destination, RDS will always signal POLLOUT as long as - there's room on the send queue (ie the number of bytes queued - is less than the sendbuf size). - - However, the kernel will refuse to accept messages to - a destination marked congested - in this case you will loop - forever if you rely on poll to tell you what to do. - This isn't a trivial problem, but applications can deal with - this - by using congestion notifications, and by checking for - ENOBUFS errors returned by sendmsg. - - setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) - This allows the application to discard all messages queued to a - specific destination on this particular socket. - - This allows the application to cancel outstanding messages if - it detects a timeout. For instance, if it tried to send a message, - and the remote host is unreachable, RDS will keep trying forever. - The application may decide it's not worth it, and cancel the - operation. In this case, it would use RDS_CANCEL_SENT_TO to - nuke any pending messages. - - setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) - getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) - Set or read an integer defining the underlying - encapsulating transport to be used for RDS packets on the - socket. When setting the option, integer argument may be - one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the - value, RDS_TRANS_NONE will be returned on an unbound socket. - This socket option may only be set exactly once on the socket, - prior to binding it via the bind(2) system call. Attempts to - set SO_RDS_TRANSPORT on a socket for which the transport has - been previously attached explicitly (by SO_RDS_TRANSPORT) or - implicitly (via bind(2)) will return an error of EOPNOTSUPP. - An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will - always return EINVAL. - -RDMA for RDS -============ - - see rds-rdma(7) manpage (available in rds-tools) - - -Congestion Notifications -======================== - - see rds(7) manpage - - -RDS Protocol -============ - - Message header - - The message header is a 'struct rds_header' (see rds.h): - Fields: - h_sequence: - per-packet sequence number - h_ack: - piggybacked acknowledgment of last packet received - h_len: - length of data, not including header - h_sport: - source port - h_dport: - destination port - h_flags: - CONG_BITMAP - this is a congestion update bitmap - ACK_REQUIRED - receiver must ack this packet - RETRANSMITTED - packet has previously been sent - h_credit: - indicate to other end of connection that - it has more credits available (i.e. there is - more send room) - h_padding[4]: - unused, for future use - h_csum: - header checksum - h_exthdr: - optional data can be passed here. This is currently used for - passing RDMA-related information. - - ACK and retransmit handling - - One might think that with reliable IB connections you wouldn't need - to ack messages that have been received. The problem is that IB - hardware generates an ack message before it has DMAed the message - into memory. This creates a potential message loss if the HCA is - disabled for any reason between when it sends the ack and before - the message is DMAed and processed. This is only a potential issue - if another HCA is available for fail-over. - - Sending an ack immediately would allow the sender to free the sent - message from their send queue quickly, but could cause excessive - traffic to be used for acks. RDS piggybacks acks on sent data - packets. Ack-only packets are reduced by only allowing one to be - in flight at a time, and by the sender only asking for acks when - its send buffers start to fill up. All retransmissions are also - acked. - - Flow Control - - RDS's IB transport uses a credit-based mechanism to verify that - there is space in the peer's receive buffers for more data. This - eliminates the need for hardware retries on the connection. - - Congestion - - Messages waiting in the receive queue on the receiving socket - are accounted against the sockets SO_RCVBUF option value. Only - the payload bytes in the message are accounted for. If the - number of bytes queued equals or exceeds rcvbuf then the socket - is congested. All sends attempted to this socket's address - should return block or return -EWOULDBLOCK. - - Applications are expected to be reasonably tuned such that this - situation very rarely occurs. An application encountering this - "back-pressure" is considered a bug. - - This is implemented by having each node maintain bitmaps which - indicate which ports on bound addresses are congested. As the - bitmap changes it is sent through all the connections which - terminate in the local address of the bitmap which changed. - - The bitmaps are allocated as connections are brought up. This - avoids allocation in the interrupt handling path which queues - sages on sockets. The dense bitmaps let transports send the - entire bitmap on any bitmap change reasonably efficiently. This - is much easier to implement than some finer-grained - communication of per-port congestion. The sender does a very - inexpensive bit test to test if the port it's about to send to - is congested or not. - - -RDS Transport Layer -================== - - As mentioned above, RDS is not IB-specific. Its code is divided - into a general RDS layer and a transport layer. - - The general layer handles the socket API, congestion handling, - loopback, stats, usermem pinning, and the connection state machine. - - The transport layer handles the details of the transport. The IB - transport, for example, handles all the queue pairs, work requests, - CM event handlers, and other Infiniband details. - - -RDS Kernel Structures -===================== - - struct rds_message - aka possibly "rds_outgoing", the generic RDS layer copies data to - be sent and sets header fields as needed, based on the socket API. - This is then queued for the individual connection and sent by the - connection's transport. - struct rds_incoming - a generic struct referring to incoming data that can be handed from - the transport to the general code and queued by the general code - while the socket is awoken. It is then passed back to the transport - code to handle the actual copy-to-user. - struct rds_socket - per-socket information - struct rds_connection - per-connection information - struct rds_transport - pointers to transport-specific functions - struct rds_statistics - non-transport-specific statistics - struct rds_cong_map - wraps the raw congestion bitmap, contains rbnode, waitq, etc. - -Connection management -===================== - - Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and - ERROR states. - - The first time an attempt is made by an RDS socket to send data to - a node, a connection is allocated and connected. That connection is - then maintained forever -- if there are transport errors, the - connection will be dropped and re-established. - - Dropping a connection while packets are queued will cause queued or - partially-sent datagrams to be retransmitted when the connection is - re-established. - - -The send path -============= - - rds_sendmsg() - struct rds_message built from incoming data - CMSGs parsed (e.g. RDMA ops) - transport connection alloced and connected if not already - rds_message placed on send queue - send worker awoken - rds_send_worker() - calls rds_send_xmit() until queue is empty - rds_send_xmit() - transmits congestion map if one is pending - may set ACK_REQUIRED - calls transport to send either non-RDMA or RDMA message - (RDMA ops never retransmitted) - rds_ib_xmit() - allocs work requests from send ring - adds any new send credits available to peer (h_credits) - maps the rds_message's sg list - piggybacks ack - populates work requests - post send to connection's queue pair - -The recv path -============= - - rds_ib_recv_cq_comp_handler() - looks at write completions - unmaps recv buffer from device - no errors, call rds_ib_process_recv() - refill recv ring - rds_ib_process_recv() - validate header checksum - copy header to rds_ib_incoming struct if start of a new datagram - add to ibinc's fraglist - if competed datagram: - update cong map if datagram was cong update - call rds_recv_incoming() otherwise - note if ack is required - rds_recv_incoming() - drop duplicate packets - respond to pings - find the sock associated with this datagram - add to sock queue - wake up sock - do some congestion calculations - rds_recvmsg - copy data into user iovec - handle CMSGs - return to application - -Multipath RDS (mprds) -===================== - Mprds is multipathed-RDS, primarily intended for RDS-over-TCP - (though the concept can be extended to other transports). The classical - implementation of RDS-over-TCP is implemented by demultiplexing multiple - PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, - port]) over a single TCP socket between the 2 IP addresses involved. This - has the limitation that it ends up funneling multiple RDS flows over a - single TCP flow, thus it is - (a) upper-bounded to the single-flow bandwidth, - (b) suffers from head-of-line blocking for all the RDS sockets. - - Better throughput (for a fixed small packet size, MTU) can be achieved - by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed - RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp - connection. RDS sockets will be attached to a path based on some hash - (e.g., of local address and RDS port number) and packets for that RDS - socket will be sent over the attached path using TCP to segment/reassemble - RDS datagrams on that path. - - Multipathed RDS is implemented by splitting the struct rds_connection into - a common (to all paths) part, and a per-path struct rds_conn_path. All - I/O workqs and reconnect threads are driven from the rds_conn_path. - Transports such as TCP that are multipath capable may then set up a - TCP socket per rds_conn_path, and this is managed by the transport via - the transport privatee cp_transport_data pointer. - - Transports announce themselves as multipath capable by setting the - t_mp_capable bit during registration with the rds core module. When the - transport is multipath-capable, rds_sendmsg() hashes outgoing traffic - across multiple paths. The outgoing hash is computed based on the - local address and port that the PF_RDS socket is bound to. - - Additionally, even if the transport is MP capable, we may be - peering with some node that does not support mprds, or supports - a different number of paths. As a result, the peering nodes need - to agree on the number of paths to be used for the connection. - This is done by sending out a control packet exchange before the - first data packet. The control packet exchange must have completed - prior to outgoing hash completion in rds_sendmsg() when the transport - is mutlipath capable. - - The control packet is an RDS ping packet (i.e., packet to rds dest - port 0) with the ping packet having a rds extension header option of - type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the - number of paths supported by the sender. The "probe" ping packet will - get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>) - The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately - be able to compute the min(sender_paths, rcvr_paths). The pong - sent in response to a probe-ping should contain the rcvr's npaths - when the rcvr is mprds-capable. - - If the rcvr is not mprds-capable, the exthdr in the ping will be - ignored. In this case the pong will not have any exthdrs, so the sender - of the probe-ping can default to single-path mprds. - |