From bad5b6e223e8409c860c0574d5239ee4348f06b3 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Date: Thu, 30 Apr 2020 18:04:19 +0200
Subject: docs: networking: convert rds.txt to ReST

- add SPDX header;
- add a document title;
- mark code blocks and literals as such;
- mark tables as such;
- mark lists as such;
- adjust identation, whitespaces and blank lines where needed;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 Documentation/networking/index.rst |   1 +
 Documentation/networking/rds.rst   | 448 +++++++++++++++++++++++++++++++++++++
 Documentation/networking/rds.txt   | 423 ----------------------------------
 3 files changed, 449 insertions(+), 423 deletions(-)
 create mode 100644 Documentation/networking/rds.rst
 delete mode 100644 Documentation/networking/rds.txt

(limited to 'Documentation')

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index b7e35b0d905c..e63a2cb2e4cb 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -97,6 +97,7 @@ Contents:
    proc_net_tcp
    radiotap-headers
    ray_cs
+   rds
 
 .. only::  subproject and html
 
diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst
new file mode 100644
index 000000000000..44936c27ab3a
--- /dev/null
+++ b/Documentation/networking/rds.rst
@@ -0,0 +1,448 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+RDS
+===
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports.  The current implementation used to support RDS over TCP as well
+as IB.
+
+The high-level semantics of RDS from the application's point of view are
+
+ *	Addressing
+
+	RDS uses IPv4 addresses and 16bit port numbers to identify
+	the end point of a connection. All socket operations that involve
+	passing addresses between kernel and user space generally
+	use a struct sockaddr_in.
+
+	The fact that IPv4 addresses are used does not mean the underlying
+	transport has to be IP-based. In fact, RDS over IB uses a
+	reliable IB connection; the IP address is used exclusively to
+	locate the remote node's GID (by ARPing for the given IP).
+
+	The port space is entirely independent of UDP, TCP or any other
+	protocol.
+
+ *	Socket interface
+
+	RDS sockets work *mostly* as you would expect from a BSD
+	socket. The next section will cover the details. At any rate,
+	all I/O is performed through the standard BSD socket API.
+	Some additions like zerocopy support are implemented through
+	control messages, while other extensions use the getsockopt/
+	setsockopt calls.
+
+	Sockets must be bound before you can send or receive data.
+	This is needed because binding also selects a transport and
+	attaches it to the socket. Once bound, the transport assignment
+	does not change. RDS will tolerate IPs moving around (eg in
+	a active-active HA scenario), but only as long as the address
+	doesn't move to a different transport.
+
+ *	sysctls
+
+	RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+  AF_RDS, PF_RDS, SOL_RDS
+	AF_RDS and PF_RDS are the domain type to be used with socket(2)
+	to create RDS sockets. SOL_RDS is the socket-level to be used
+	with setsockopt(2) and getsockopt(2) for RDS specific socket
+	options.
+
+  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	This creates a new, unbound RDS socket.
+
+  setsockopt(SOL_SOCKET): send and receive buffer size
+	RDS honors the send and receive buffer size socket options.
+	You are not allowed to queue more than SO_SNDSIZE bytes to
+	a socket. A message is queued when sendmsg is called, and
+	it leaves the queue when the remote system acknowledges
+	its arrival.
+
+	The SO_RCVSIZE option controls the maximum receive queue length.
+	This is a soft limit rather than a hard limit - RDS will
+	continue to accept and queue incoming messages, even if that
+	takes the queue length over the limit. However, it will also
+	mark the port as "congested" and send a congestion update to
+	the source node. The source node is supposed to throttle any
+	processes sending to this congested port.
+
+  bind(fd, &sockaddr_in, ...)
+	This binds the socket to a local IP address and port, and a
+	transport, if one has not already been selected via the
+	SO_RDS_TRANSPORT socket option
+
+  sendmsg(fd, ...)
+	Sends a message to the indicated recipient. The kernel will
+	transparently establish the underlying reliable connection
+	if it isn't up yet.
+
+	An attempt to send a message that exceeds SO_SNDSIZE will
+	return with -EMSGSIZE
+
+	An attempt to send a message that would take the total number
+	of queued bytes over the SO_SNDSIZE threshold will return
+	EAGAIN.
+
+	An attempt to send a message to a destination that is marked
+	as "congested" will return ENOBUFS.
+
+  recvmsg(fd, ...)
+	Receives a message that was queued to this socket. The sockets
+	recv queue accounting is adjusted, and if the queue length
+	drops below SO_SNDSIZE, the port is marked uncongested, and
+	a congestion update is sent to all peers.
+
+	Applications can ask the RDS kernel module to receive
+	notifications via control messages (for instance, there is a
+	notification when a congestion update arrived, or when a RDMA
+	operation completes). These notifications are received through
+	the msg.msg_control buffer of struct msghdr. The format of the
+	messages is described in manpages.
+
+  poll(fd)
+	RDS supports the poll interface to allow the application
+	to implement async I/O.
+
+	POLLIN handling is pretty straightforward. When there's an
+	incoming message queued to the socket, or a pending notification,
+	we signal POLLIN.
+
+	POLLOUT is a little harder. Since you can essentially send
+	to any destination, RDS will always signal POLLOUT as long as
+	there's room on the send queue (ie the number of bytes queued
+	is less than the sendbuf size).
+
+	However, the kernel will refuse to accept messages to
+	a destination marked congested - in this case you will loop
+	forever if you rely on poll to tell you what to do.
+	This isn't a trivial problem, but applications can deal with
+	this - by using congestion notifications, and by checking for
+	ENOBUFS errors returned by sendmsg.
+
+  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+	This allows the application to discard all messages queued to a
+	specific destination on this particular socket.
+
+	This allows the application to cancel outstanding messages if
+	it detects a timeout. For instance, if it tried to send a message,
+	and the remote host is unreachable, RDS will keep trying forever.
+	The application may decide it's not worth it, and cancel the
+	operation. In this case, it would use RDS_CANCEL_SENT_TO to
+	nuke any pending messages.
+
+  ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
+	Set or read an integer defining  the underlying
+	encapsulating transport to be used for RDS packets on the
+	socket. When setting the option, integer argument may be
+	one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
+	value, RDS_TRANS_NONE will be returned on an unbound socket.
+	This socket option may only be set exactly once on the socket,
+	prior to binding it via the bind(2) system call. Attempts to
+	set SO_RDS_TRANSPORT on a socket for which the transport has
+	been previously attached explicitly (by SO_RDS_TRANSPORT) or
+	implicitly (via bind(2)) will return an error of EOPNOTSUPP.
+	An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
+	always return EINVAL.
+
+RDMA for RDS
+============
+
+  see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+  see rds(7) manpage
+
+
+RDS Protocol
+============
+
+  Message header
+
+    The message header is a 'struct rds_header' (see rds.h):
+
+    Fields:
+
+      h_sequence:
+	  per-packet sequence number
+      h_ack:
+	  piggybacked acknowledgment of last packet received
+      h_len:
+	  length of data, not including header
+      h_sport:
+	  source port
+      h_dport:
+	  destination port
+      h_flags:
+	  Can be:
+
+	  =============  ==================================
+	  CONG_BITMAP    this is a congestion update bitmap
+	  ACK_REQUIRED   receiver must ack this packet
+	  RETRANSMITTED  packet has previously been sent
+	  =============  ==================================
+
+      h_credit:
+	  indicate to other end of connection that
+	  it has more credits available (i.e. there is
+	  more send room)
+      h_padding[4]:
+	  unused, for future use
+      h_csum:
+	  header checksum
+      h_exthdr:
+	  optional data can be passed here. This is currently used for
+	  passing RDMA-related information.
+
+  ACK and retransmit handling
+
+      One might think that with reliable IB connections you wouldn't need
+      to ack messages that have been received.  The problem is that IB
+      hardware generates an ack message before it has DMAed the message
+      into memory.  This creates a potential message loss if the HCA is
+      disabled for any reason between when it sends the ack and before
+      the message is DMAed and processed.  This is only a potential issue
+      if another HCA is available for fail-over.
+
+      Sending an ack immediately would allow the sender to free the sent
+      message from their send queue quickly, but could cause excessive
+      traffic to be used for acks. RDS piggybacks acks on sent data
+      packets.  Ack-only packets are reduced by only allowing one to be
+      in flight at a time, and by the sender only asking for acks when
+      its send buffers start to fill up. All retransmissions are also
+      acked.
+
+  Flow Control
+
+      RDS's IB transport uses a credit-based mechanism to verify that
+      there is space in the peer's receive buffers for more data. This
+      eliminates the need for hardware retries on the connection.
+
+  Congestion
+
+      Messages waiting in the receive queue on the receiving socket
+      are accounted against the sockets SO_RCVBUF option value.  Only
+      the payload bytes in the message are accounted for.  If the
+      number of bytes queued equals or exceeds rcvbuf then the socket
+      is congested.  All sends attempted to this socket's address
+      should return block or return -EWOULDBLOCK.
+
+      Applications are expected to be reasonably tuned such that this
+      situation very rarely occurs.  An application encountering this
+      "back-pressure" is considered a bug.
+
+      This is implemented by having each node maintain bitmaps which
+      indicate which ports on bound addresses are congested.  As the
+      bitmap changes it is sent through all the connections which
+      terminate in the local address of the bitmap which changed.
+
+      The bitmaps are allocated as connections are brought up.  This
+      avoids allocation in the interrupt handling path which queues
+      sages on sockets.  The dense bitmaps let transports send the
+      entire bitmap on any bitmap change reasonably efficiently.  This
+      is much easier to implement than some finer-grained
+      communication of per-port congestion.  The sender does a very
+      inexpensive bit test to test if the port it's about to send to
+      is congested or not.
+
+
+RDS Transport Layer
+===================
+
+  As mentioned above, RDS is not IB-specific. Its code is divided
+  into a general RDS layer and a transport layer.
+
+  The general layer handles the socket API, congestion handling,
+  loopback, stats, usermem pinning, and the connection state machine.
+
+  The transport layer handles the details of the transport. The IB
+  transport, for example, handles all the queue pairs, work requests,
+  CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+  struct rds_message
+    aka possibly "rds_outgoing", the generic RDS layer copies data to
+    be sent and sets header fields as needed, based on the socket API.
+    This is then queued for the individual connection and sent by the
+    connection's transport.
+
+  struct rds_incoming
+    a generic struct referring to incoming data that can be handed from
+    the transport to the general code and queued by the general code
+    while the socket is awoken. It is then passed back to the transport
+    code to handle the actual copy-to-user.
+
+  struct rds_socket
+    per-socket information
+
+  struct rds_connection
+    per-connection information
+
+  struct rds_transport
+    pointers to transport-specific functions
+
+  struct rds_statistics
+    non-transport-specific statistics
+
+  struct rds_cong_map
+    wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+  ERROR states.
+
+  The first time an attempt is made by an RDS socket to send data to
+  a node, a connection is allocated and connected. That connection is
+  then maintained forever -- if there are transport errors, the
+  connection will be dropped and re-established.
+
+  Dropping a connection while packets are queued will cause queued or
+  partially-sent datagrams to be retransmitted when the connection is
+  re-established.
+
+
+The send path
+=============
+
+  rds_sendmsg()
+    - struct rds_message built from incoming data
+    - CMSGs parsed (e.g. RDMA ops)
+    - transport connection alloced and connected if not already
+    - rds_message placed on send queue
+    - send worker awoken
+
+  rds_send_worker()
+    - calls rds_send_xmit() until queue is empty
+
+  rds_send_xmit()
+    - transmits congestion map if one is pending
+    - may set ACK_REQUIRED
+    - calls transport to send either non-RDMA or RDMA message
+      (RDMA ops never retransmitted)
+
+  rds_ib_xmit()
+    - allocs work requests from send ring
+    - adds any new send credits available to peer (h_credits)
+    - maps the rds_message's sg list
+    - piggybacks ack
+    - populates work requests
+    - post send to connection's queue pair
+
+The recv path
+=============
+
+  rds_ib_recv_cq_comp_handler()
+    - looks at write completions
+    - unmaps recv buffer from device
+    - no errors, call rds_ib_process_recv()
+    - refill recv ring
+
+  rds_ib_process_recv()
+    - validate header checksum
+    - copy header to rds_ib_incoming struct if start of a new datagram
+    - add to ibinc's fraglist
+    - if competed datagram:
+	 - update cong map if datagram was cong update
+	 - call rds_recv_incoming() otherwise
+	 - note if ack is required
+
+  rds_recv_incoming()
+    - drop duplicate packets
+    - respond to pings
+    - find the sock associated with this datagram
+    - add to sock queue
+    - wake up sock
+    - do some congestion calculations
+  rds_recvmsg
+    - copy data into user iovec
+    - handle CMSGs
+    - return to application
+
+Multipath RDS (mprds)
+=====================
+  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
+  (though the concept can be extended to other transports). The classical
+  implementation of RDS-over-TCP is implemented by demultiplexing multiple
+  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
+  port]) over a single TCP socket between the 2 IP addresses involved. This
+  has the limitation that it ends up funneling multiple RDS flows over a
+  single TCP flow, thus it is
+  (a) upper-bounded to the single-flow bandwidth,
+  (b) suffers from head-of-line blocking for all the RDS sockets.
+
+  Better throughput (for a fixed small packet size, MTU) can be achieved
+  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
+  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
+  connection. RDS sockets will be attached to a path based on some hash
+  (e.g., of local address and RDS port number) and packets for that RDS
+  socket will be sent over the attached path using TCP to segment/reassemble
+  RDS datagrams on that path.
+
+  Multipathed RDS is implemented by splitting the struct rds_connection into
+  a common (to all paths) part, and a per-path struct rds_conn_path. All
+  I/O workqs and reconnect threads are driven from the rds_conn_path.
+  Transports such as TCP that are multipath capable may then set up a
+  TCP socket per rds_conn_path, and this is managed by the transport via
+  the transport privatee cp_transport_data pointer.
+
+  Transports announce themselves as multipath capable by setting the
+  t_mp_capable bit during registration with the rds core module. When the
+  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
+  across multiple paths. The outgoing hash is computed based on the
+  local address and port that the PF_RDS socket is bound to.
+
+  Additionally, even if the transport is MP capable, we may be
+  peering with some node that does not support mprds, or supports
+  a different number of paths. As a result, the peering nodes need
+  to agree on the number of paths to be used for the connection.
+  This is done by sending out a control packet exchange before the
+  first data packet. The control packet exchange must have completed
+  prior to outgoing hash completion in rds_sendmsg() when the transport
+  is mutlipath capable.
+
+  The control packet is an RDS ping packet (i.e., packet to rds dest
+  port 0) with the ping packet having a rds extension header option  of
+  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
+  number of paths supported by the sender. The "probe" ping packet will
+  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>)
+  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
+  be able to compute the min(sender_paths, rcvr_paths). The pong
+  sent in response to a probe-ping should contain the rcvr's npaths
+  when the rcvr is mprds-capable.
+
+  If the rcvr is not mprds-capable, the exthdr in the ping will be
+  ignored.  In this case the pong will not have any exthdrs, so the sender
+  of the probe-ping can default to single-path mprds.
+
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
deleted file mode 100644
index eec61694e894..000000000000
--- a/Documentation/networking/rds.txt
+++ /dev/null
@@ -1,423 +0,0 @@
-
-Overview
-========
-
-This readme tries to provide some background on the hows and whys of RDS,
-and will hopefully help you find your way around the code.
-
-In addition, please see this email about RDS origins:
-http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
-
-RDS Architecture
-================
-
-RDS provides reliable, ordered datagram delivery by using a single
-reliable connection between any two nodes in the cluster. This allows
-applications to use a single socket to talk to any other process in the
-cluster - so in a cluster with N processes you need N sockets, in contrast
-to N*N if you use a connection-oriented socket transport like TCP.
-
-RDS is not Infiniband-specific; it was designed to support different
-transports.  The current implementation used to support RDS over TCP as well
-as IB.
-
-The high-level semantics of RDS from the application's point of view are
-
- *	Addressing
-        RDS uses IPv4 addresses and 16bit port numbers to identify
-        the end point of a connection. All socket operations that involve
-        passing addresses between kernel and user space generally
-        use a struct sockaddr_in.
-
-        The fact that IPv4 addresses are used does not mean the underlying
-        transport has to be IP-based. In fact, RDS over IB uses a
-        reliable IB connection; the IP address is used exclusively to
-        locate the remote node's GID (by ARPing for the given IP).
-
-        The port space is entirely independent of UDP, TCP or any other
-        protocol.
-
- *	Socket interface
-        RDS sockets work *mostly* as you would expect from a BSD
-        socket. The next section will cover the details. At any rate,
-        all I/O is performed through the standard BSD socket API.
-        Some additions like zerocopy support are implemented through
-        control messages, while other extensions use the getsockopt/
-        setsockopt calls.
-
-        Sockets must be bound before you can send or receive data.
-        This is needed because binding also selects a transport and
-        attaches it to the socket. Once bound, the transport assignment
-        does not change. RDS will tolerate IPs moving around (eg in
-        a active-active HA scenario), but only as long as the address
-        doesn't move to a different transport.
-
- *	sysctls
-        RDS supports a number of sysctls in /proc/sys/net/rds
-
-
-Socket Interface
-================
-
-  AF_RDS, PF_RDS, SOL_RDS
-	AF_RDS and PF_RDS are the domain type to be used with socket(2)
-	to create RDS sockets. SOL_RDS is the socket-level to be used
-	with setsockopt(2) and getsockopt(2) for RDS specific socket
-	options.
-
-  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
-        This creates a new, unbound RDS socket.
-
-  setsockopt(SOL_SOCKET): send and receive buffer size
-        RDS honors the send and receive buffer size socket options.
-        You are not allowed to queue more than SO_SNDSIZE bytes to
-        a socket. A message is queued when sendmsg is called, and
-        it leaves the queue when the remote system acknowledges
-        its arrival.
-
-        The SO_RCVSIZE option controls the maximum receive queue length.
-        This is a soft limit rather than a hard limit - RDS will
-        continue to accept and queue incoming messages, even if that
-        takes the queue length over the limit. However, it will also
-        mark the port as "congested" and send a congestion update to
-        the source node. The source node is supposed to throttle any
-        processes sending to this congested port.
-
-  bind(fd, &sockaddr_in, ...)
-        This binds the socket to a local IP address and port, and a
-        transport, if one has not already been selected via the
-	SO_RDS_TRANSPORT socket option
-
-  sendmsg(fd, ...)
-        Sends a message to the indicated recipient. The kernel will
-        transparently establish the underlying reliable connection
-        if it isn't up yet.
-
-        An attempt to send a message that exceeds SO_SNDSIZE will
-        return with -EMSGSIZE
-
-        An attempt to send a message that would take the total number
-        of queued bytes over the SO_SNDSIZE threshold will return
-        EAGAIN.
-
-        An attempt to send a message to a destination that is marked
-        as "congested" will return ENOBUFS.
-
-  recvmsg(fd, ...)
-        Receives a message that was queued to this socket. The sockets
-        recv queue accounting is adjusted, and if the queue length
-        drops below SO_SNDSIZE, the port is marked uncongested, and
-        a congestion update is sent to all peers.
-
-        Applications can ask the RDS kernel module to receive
-        notifications via control messages (for instance, there is a
-        notification when a congestion update arrived, or when a RDMA
-        operation completes). These notifications are received through
-        the msg.msg_control buffer of struct msghdr. The format of the
-        messages is described in manpages.
-
-  poll(fd)
-        RDS supports the poll interface to allow the application
-        to implement async I/O.
-
-        POLLIN handling is pretty straightforward. When there's an
-        incoming message queued to the socket, or a pending notification,
-        we signal POLLIN.
-
-        POLLOUT is a little harder. Since you can essentially send
-        to any destination, RDS will always signal POLLOUT as long as
-        there's room on the send queue (ie the number of bytes queued
-        is less than the sendbuf size).
-
-        However, the kernel will refuse to accept messages to
-        a destination marked congested - in this case you will loop
-        forever if you rely on poll to tell you what to do.
-        This isn't a trivial problem, but applications can deal with
-        this - by using congestion notifications, and by checking for
-        ENOBUFS errors returned by sendmsg.
-
-  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
-        This allows the application to discard all messages queued to a
-        specific destination on this particular socket.
-
-        This allows the application to cancel outstanding messages if
-        it detects a timeout. For instance, if it tried to send a message,
-        and the remote host is unreachable, RDS will keep trying forever.
-        The application may decide it's not worth it, and cancel the
-        operation. In this case, it would use RDS_CANCEL_SENT_TO to
-        nuke any pending messages.
-
-  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
-  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
-	Set or read an integer defining  the underlying
-	encapsulating transport to be used for RDS packets on the
-	socket. When setting the option, integer argument may be
-	one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
-	value, RDS_TRANS_NONE will be returned on an unbound socket.
-	This socket option may only be set exactly once on the socket,
-	prior to binding it via the bind(2) system call. Attempts to
-	set SO_RDS_TRANSPORT on a socket for which the transport has
-	been previously attached explicitly (by SO_RDS_TRANSPORT) or
-	implicitly (via bind(2)) will return an error of EOPNOTSUPP.
-	An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
-	always return EINVAL.
-
-RDMA for RDS
-============
-
-  see rds-rdma(7) manpage (available in rds-tools)
-
-
-Congestion Notifications
-========================
-
-  see rds(7) manpage
-
-
-RDS Protocol
-============
-
-  Message header
-
-    The message header is a 'struct rds_header' (see rds.h):
-    Fields:
-      h_sequence:
-          per-packet sequence number
-      h_ack:
-          piggybacked acknowledgment of last packet received
-      h_len:
-          length of data, not including header
-      h_sport:
-          source port
-      h_dport:
-          destination port
-      h_flags:
-          CONG_BITMAP - this is a congestion update bitmap
-          ACK_REQUIRED - receiver must ack this packet
-          RETRANSMITTED - packet has previously been sent
-      h_credit:
-          indicate to other end of connection that
-          it has more credits available (i.e. there is
-          more send room)
-      h_padding[4]:
-          unused, for future use
-      h_csum:
-          header checksum
-      h_exthdr:
-          optional data can be passed here. This is currently used for
-          passing RDMA-related information.
-
-  ACK and retransmit handling
-
-      One might think that with reliable IB connections you wouldn't need
-      to ack messages that have been received.  The problem is that IB
-      hardware generates an ack message before it has DMAed the message
-      into memory.  This creates a potential message loss if the HCA is
-      disabled for any reason between when it sends the ack and before
-      the message is DMAed and processed.  This is only a potential issue
-      if another HCA is available for fail-over.
-
-      Sending an ack immediately would allow the sender to free the sent
-      message from their send queue quickly, but could cause excessive
-      traffic to be used for acks. RDS piggybacks acks on sent data
-      packets.  Ack-only packets are reduced by only allowing one to be
-      in flight at a time, and by the sender only asking for acks when
-      its send buffers start to fill up. All retransmissions are also
-      acked.
-
-  Flow Control
-
-      RDS's IB transport uses a credit-based mechanism to verify that
-      there is space in the peer's receive buffers for more data. This
-      eliminates the need for hardware retries on the connection.
-
-  Congestion
-
-      Messages waiting in the receive queue on the receiving socket
-      are accounted against the sockets SO_RCVBUF option value.  Only
-      the payload bytes in the message are accounted for.  If the
-      number of bytes queued equals or exceeds rcvbuf then the socket
-      is congested.  All sends attempted to this socket's address
-      should return block or return -EWOULDBLOCK.
-
-      Applications are expected to be reasonably tuned such that this
-      situation very rarely occurs.  An application encountering this
-      "back-pressure" is considered a bug.
-
-      This is implemented by having each node maintain bitmaps which
-      indicate which ports on bound addresses are congested.  As the
-      bitmap changes it is sent through all the connections which
-      terminate in the local address of the bitmap which changed.
-
-      The bitmaps are allocated as connections are brought up.  This
-      avoids allocation in the interrupt handling path which queues
-      sages on sockets.  The dense bitmaps let transports send the
-      entire bitmap on any bitmap change reasonably efficiently.  This
-      is much easier to implement than some finer-grained
-      communication of per-port congestion.  The sender does a very
-      inexpensive bit test to test if the port it's about to send to
-      is congested or not.
-
-
-RDS Transport Layer
-==================
-
-  As mentioned above, RDS is not IB-specific. Its code is divided
-  into a general RDS layer and a transport layer.
-
-  The general layer handles the socket API, congestion handling,
-  loopback, stats, usermem pinning, and the connection state machine.
-
-  The transport layer handles the details of the transport. The IB
-  transport, for example, handles all the queue pairs, work requests,
-  CM event handlers, and other Infiniband details.
-
-
-RDS Kernel Structures
-=====================
-
-  struct rds_message
-    aka possibly "rds_outgoing", the generic RDS layer copies data to
-    be sent and sets header fields as needed, based on the socket API.
-    This is then queued for the individual connection and sent by the
-    connection's transport.
-  struct rds_incoming
-    a generic struct referring to incoming data that can be handed from
-    the transport to the general code and queued by the general code
-    while the socket is awoken. It is then passed back to the transport
-    code to handle the actual copy-to-user.
-  struct rds_socket
-    per-socket information
-  struct rds_connection
-    per-connection information
-  struct rds_transport
-    pointers to transport-specific functions
-  struct rds_statistics
-    non-transport-specific statistics
-  struct rds_cong_map
-    wraps the raw congestion bitmap, contains rbnode, waitq, etc.
-
-Connection management
-=====================
-
-  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
-  ERROR states.
-
-  The first time an attempt is made by an RDS socket to send data to
-  a node, a connection is allocated and connected. That connection is
-  then maintained forever -- if there are transport errors, the
-  connection will be dropped and re-established.
-
-  Dropping a connection while packets are queued will cause queued or
-  partially-sent datagrams to be retransmitted when the connection is
-  re-established.
-
-
-The send path
-=============
-
-  rds_sendmsg()
-    struct rds_message built from incoming data
-    CMSGs parsed (e.g. RDMA ops)
-    transport connection alloced and connected if not already
-    rds_message placed on send queue
-    send worker awoken
-  rds_send_worker()
-    calls rds_send_xmit() until queue is empty
-  rds_send_xmit()
-    transmits congestion map if one is pending
-    may set ACK_REQUIRED
-    calls transport to send either non-RDMA or RDMA message
-    (RDMA ops never retransmitted)
-  rds_ib_xmit()
-    allocs work requests from send ring
-    adds any new send credits available to peer (h_credits)
-    maps the rds_message's sg list
-    piggybacks ack
-    populates work requests
-    post send to connection's queue pair
-
-The recv path
-=============
-
-  rds_ib_recv_cq_comp_handler()
-    looks at write completions
-    unmaps recv buffer from device
-    no errors, call rds_ib_process_recv()
-    refill recv ring
-  rds_ib_process_recv()
-    validate header checksum
-    copy header to rds_ib_incoming struct if start of a new datagram
-    add to ibinc's fraglist
-    if competed datagram:
-      update cong map if datagram was cong update
-      call rds_recv_incoming() otherwise
-      note if ack is required
-  rds_recv_incoming()
-    drop duplicate packets
-    respond to pings
-    find the sock associated with this datagram
-    add to sock queue
-    wake up sock
-    do some congestion calculations
-  rds_recvmsg
-    copy data into user iovec
-    handle CMSGs
-    return to application
-
-Multipath RDS (mprds)
-=====================
-  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
-  (though the concept can be extended to other transports). The classical
-  implementation of RDS-over-TCP is implemented by demultiplexing multiple
-  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
-  port]) over a single TCP socket between the 2 IP addresses involved. This
-  has the limitation that it ends up funneling multiple RDS flows over a
-  single TCP flow, thus it is
-  (a) upper-bounded to the single-flow bandwidth,
-  (b) suffers from head-of-line blocking for all the RDS sockets.
-
-  Better throughput (for a fixed small packet size, MTU) can be achieved
-  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
-  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
-  connection. RDS sockets will be attached to a path based on some hash
-  (e.g., of local address and RDS port number) and packets for that RDS
-  socket will be sent over the attached path using TCP to segment/reassemble
-  RDS datagrams on that path.
-
-  Multipathed RDS is implemented by splitting the struct rds_connection into
-  a common (to all paths) part, and a per-path struct rds_conn_path. All
-  I/O workqs and reconnect threads are driven from the rds_conn_path.
-  Transports such as TCP that are multipath capable may then set up a
-  TCP socket per rds_conn_path, and this is managed by the transport via
-  the transport privatee cp_transport_data pointer.
-
-  Transports announce themselves as multipath capable by setting the
-  t_mp_capable bit during registration with the rds core module. When the
-  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
-  across multiple paths. The outgoing hash is computed based on the
-  local address and port that the PF_RDS socket is bound to.
-
-  Additionally, even if the transport is MP capable, we may be
-  peering with some node that does not support mprds, or supports
-  a different number of paths. As a result, the peering nodes need
-  to agree on the number of paths to be used for the connection.
-  This is done by sending out a control packet exchange before the
-  first data packet. The control packet exchange must have completed
-  prior to outgoing hash completion in rds_sendmsg() when the transport
-  is mutlipath capable.
-
-  The control packet is an RDS ping packet (i.e., packet to rds dest
-  port 0) with the ping packet having a rds extension header option  of
-  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
-  number of paths supported by the sender. The "probe" ping packet will
-  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>)
-  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
-  be able to compute the min(sender_paths, rcvr_paths). The pong
-  sent in response to a probe-ping should contain the rcvr's npaths
-  when the rcvr is mprds-capable.
-
-  If the rcvr is not mprds-capable, the exthdr in the ping will be
-  ignored.  In this case the pong will not have any exthdrs, so the sender
-  of the probe-ping can default to single-path mprds.
-
-- 
cgit v1.2.3-59-g8ed1b