1 files changed, 0 insertions, 423 deletions
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
deleted file mode 100644
index eec61694e894..000000000000
--- a/Documentation/networking/rds.txt
+++ /dev/null
@@ -1,423 +0,0 @@
-
-Overview
-========
-
-This readme tries to provide some background on the hows and whys of RDS,
-and will hopefully help you find your way around the code.
-
-In addition, please see this email about RDS origins:
-http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
-
-RDS Architecture
-================
-
-RDS provides reliable, ordered datagram delivery by using a single
-reliable connection between any two nodes in the cluster. This allows
-applications to use a single socket to talk to any other process in the
-cluster - so in a cluster with N processes you need N sockets, in contrast
-to N*N if you use a connection-oriented socket transport like TCP.
-
-RDS is not Infiniband-specific; it was designed to support different
-transports.  The current implementation used to support RDS over TCP as well
-as IB.
-
-The high-level semantics of RDS from the application's point of view are
-
- *	Addressing
-        RDS uses IPv4 addresses and 16bit port numbers to identify
-        the end point of a connection. All socket operations that involve
-        passing addresses between kernel and user space generally
-        use a struct sockaddr_in.
-
-        The fact that IPv4 addresses are used does not mean the underlying
-        transport has to be IP-based. In fact, RDS over IB uses a
-        reliable IB connection; the IP address is used exclusively to
-        locate the remote node's GID (by ARPing for the given IP).
-
-        The port space is entirely independent of UDP, TCP or any other
-        protocol.
-
- *	Socket interface
-        RDS sockets work *mostly* as you would expect from a BSD
-        socket. The next section will cover the details. At any rate,
-        all I/O is performed through the standard BSD socket API.
-        Some additions like zerocopy support are implemented through
-        control messages, while other extensions use the getsockopt/
-        setsockopt calls.
-
-        Sockets must be bound before you can send or receive data.
-        This is needed because binding also selects a transport and
-        attaches it to the socket. Once bound, the transport assignment
-        does not change. RDS will tolerate IPs moving around (eg in
-        a active-active HA scenario), but only as long as the address
-        doesn't move to a different transport.
-
- *	sysctls
-        RDS supports a number of sysctls in /proc/sys/net/rds
-
-
-Socket Interface
-================
-
-  AF_RDS, PF_RDS, SOL_RDS
-	AF_RDS and PF_RDS are the domain type to be used with socket(2)
-	to create RDS sockets. SOL_RDS is the socket-level to be used
-	with setsockopt(2) and getsockopt(2) for RDS specific socket
-	options.
-
-  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
-        This creates a new, unbound RDS socket.
-
-  setsockopt(SOL_SOCKET): send and receive buffer size
-        RDS honors the send and receive buffer size socket options.
-        You are not allowed to queue more than SO_SNDSIZE bytes to
-        a socket. A message is queued when sendmsg is called, and
-        it leaves the queue when the remote system acknowledges
-        its arrival.
-
-        The SO_RCVSIZE option controls the maximum receive queue length.
-        This is a soft limit rather than a hard limit - RDS will
-        continue to accept and queue incoming messages, even if that
-        takes the queue length over the limit. However, it will also
-        mark the port as "congested" and send a congestion update to
-        the source node. The source node is supposed to throttle any
-        processes sending to this congested port.
-
-  bind(fd, &sockaddr_in, ...)
-        This binds the socket to a local IP address and port, and a
-        transport, if one has not already been selected via the
-	SO_RDS_TRANSPORT socket option
-
-  sendmsg(fd, ...)
-        Sends a message to the indicated recipient. The kernel will
-        transparently establish the underlying reliable connection
-        if it isn't up yet.
-
-        An attempt to send a message that exceeds SO_SNDSIZE will
-        return with -EMSGSIZE
-
-        An attempt to send a message that would take the total number
-        of queued bytes over the SO_SNDSIZE threshold will return
-        EAGAIN.
-
-        An attempt to send a message to a destination that is marked
-        as "congested" will return ENOBUFS.
-
-  recvmsg(fd, ...)
-        Receives a message that was queued to this socket. The sockets
-        recv queue accounting is adjusted, and if the queue length
-        drops below SO_SNDSIZE, the port is marked uncongested, and
-        a congestion update is sent to all peers.
-
-        Applications can ask the RDS kernel module to receive
-        notifications via control messages (for instance, there is a
-        notification when a congestion update arrived, or when a RDMA
-        operation completes). These notifications are received through
-        the msg.msg_control buffer of struct msghdr. The format of the
-        messages is described in manpages.
-
-  poll(fd)
-        RDS supports the poll interface to allow the application
-        to implement async I/O.
-
-        POLLIN handling is pretty straightforward. When there's an
-        incoming message queued to the socket, or a pending notification,
-        we signal POLLIN.
-
-        POLLOUT is a little harder. Since you can essentially send
-        to any destination, RDS will always signal POLLOUT as long as
-        there's room on the send queue (ie the number of bytes queued
-        is less than the sendbuf size).
-
-        However, the kernel will refuse to accept messages to
-        a destination marked congested - in this case you will loop
-        forever if you rely on poll to tell you what to do.
-        This isn't a trivial problem, but applications can deal with
-        this - by using congestion notifications, and by checking for
-        ENOBUFS errors returned by sendmsg.
-
-  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
-        This allows the application to discard all messages queued to a
-        specific destination on this particular socket.
-
-        This allows the application to cancel outstanding messages if
-        it detects a timeout. For instance, if it tried to send a message,
-        and the remote host is unreachable, RDS will keep trying forever.
-        The application may decide it's not worth it, and cancel the
-        operation. In this case, it would use RDS_CANCEL_SENT_TO to
-        nuke any pending messages.
-
-  setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
-  getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
-	Set or read an integer defining  the underlying
-	encapsulating transport to be used for RDS packets on the
-	socket. When setting the option, integer argument may be
-	one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
-	value, RDS_TRANS_NONE will be returned on an unbound socket.
-	This socket option may only be set exactly once on the socket,
-	prior to binding it via the bind(2) system call. Attempts to
-	set SO_RDS_TRANSPORT on a socket for which the transport has
-	been previously attached explicitly (by SO_RDS_TRANSPORT) or
-	implicitly (via bind(2)) will return an error of EOPNOTSUPP.
-	An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
-	always return EINVAL.
-
-RDMA for RDS
-============
-
-  see rds-rdma(7) manpage (available in rds-tools)
-
-
-Congestion Notifications
-========================
-
-  see rds(7) manpage
-
-
-RDS Protocol
-============
-
-  Message header
-
-    The message header is a 'struct rds_header' (see rds.h):
-    Fields:
-      h_sequence:
-          per-packet sequence number
-      h_ack:
-          piggybacked acknowledgment of last packet received
-      h_len:
-          length of data, not including header
-      h_sport:
-          source port
-      h_dport:
-          destination port
-      h_flags:
-          CONG_BITMAP - this is a congestion update bitmap
-          ACK_REQUIRED - receiver must ack this packet
-          RETRANSMITTED - packet has previously been sent
-      h_credit:
-          indicate to other end of connection that
-          it has more credits available (i.e. there is
-          more send room)
-      h_padding[4]:
-          unused, for future use
-      h_csum:
-          header checksum
-      h_exthdr:
-          optional data can be passed here. This is currently used for
-          passing RDMA-related information.
-
-  ACK and retransmit handling
-
-      One might think that with reliable IB connections you wouldn't need
-      to ack messages that have been received.  The problem is that IB
-      hardware generates an ack message before it has DMAed the message
-      into memory.  This creates a potential message loss if the HCA is
-      disabled for any reason between when it sends the ack and before
-      the message is DMAed and processed.  This is only a potential issue
-      if another HCA is available for fail-over.
-
-      Sending an ack immediately would allow the sender to free the sent
-      message from their send queue quickly, but could cause excessive
-      traffic to be used for acks. RDS piggybacks acks on sent data
-      packets.  Ack-only packets are reduced by only allowing one to be
-      in flight at a time, and by the sender only asking for acks when
-      its send buffers start to fill up. All retransmissions are also
-      acked.
-
-  Flow Control
-
-      RDS's IB transport uses a credit-based mechanism to verify that
-      there is space in the peer's receive buffers for more data. This
-      eliminates the need for hardware retries on the connection.
-
-  Congestion
-
-      Messages waiting in the receive queue on the receiving socket
-      are accounted against the sockets SO_RCVBUF option value.  Only
-      the payload bytes in the message are accounted for.  If the
-      number of bytes queued equals or exceeds rcvbuf then the socket
-      is congested.  All sends attempted to this socket's address
-      should return block or return -EWOULDBLOCK.
-
-      Applications are expected to be reasonably tuned such that this
-      situation very rarely occurs.  An application encountering this
-      "back-pressure" is considered a bug.
-
-      This is implemented by having each node maintain bitmaps which
-      indicate which ports on bound addresses are congested.  As the
-      bitmap changes it is sent through all the connections which
-      terminate in the local address of the bitmap which changed.
-
-      The bitmaps are allocated as connections are brought up.  This
-      avoids allocation in the interrupt handling path which queues
-      sages on sockets.  The dense bitmaps let transports send the
-      entire bitmap on any bitmap change reasonably efficiently.  This
-      is much easier to implement than some finer-grained
-      communication of per-port congestion.  The sender does a very
-      inexpensive bit test to test if the port it's about to send to
-      is congested or not.
-
-
-RDS Transport Layer
-==================
-
-  As mentioned above, RDS is not IB-specific. Its code is divided
-  into a general RDS layer and a transport layer.
-
-  The general layer handles the socket API, congestion handling,
-  loopback, stats, usermem pinning, and the connection state machine.
-
-  The transport layer handles the details of the transport. The IB
-  transport, for example, handles all the queue pairs, work requests,
-  CM event handlers, and other Infiniband details.
-
-
-RDS Kernel Structures
-=====================
-
-  struct rds_message
-    aka possibly "rds_outgoing", the generic RDS layer copies data to
-    be sent and sets header fields as needed, based on the socket API.
-    This is then queued for the individual connection and sent by the
-    connection's transport.
-  struct rds_incoming
-    a generic struct referring to incoming data that can be handed from
-    the transport to the general code and queued by the general code
-    while the socket is awoken. It is then passed back to the transport
-    code to handle the actual copy-to-user.
-  struct rds_socket
-    per-socket information
-  struct rds_connection
-    per-connection information
-  struct rds_transport
-    pointers to transport-specific functions
-  struct rds_statistics
-    non-transport-specific statistics
-  struct rds_cong_map
-    wraps the raw congestion bitmap, contains rbnode, waitq, etc.
-
-Connection management
-=====================
-
-  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
-  ERROR states.
-
-  The first time an attempt is made by an RDS socket to send data to
-  a node, a connection is allocated and connected. That connection is
-  then maintained forever -- if there are transport errors, the
-  connection will be dropped and re-established.
-
-  Dropping a connection while packets are queued will cause queued or
-  partially-sent datagrams to be retransmitted when the connection is
-  re-established.
-
-
-The send path
-=============
-
-  rds_sendmsg()
-    struct rds_message built from incoming data
-    CMSGs parsed (e.g. RDMA ops)
-    transport connection alloced and connected if not already
-    rds_message placed on send queue
-    send worker awoken
-  rds_send_worker()
-    calls rds_send_xmit() until queue is empty
-  rds_send_xmit()
-    transmits congestion map if one is pending
-    may set ACK_REQUIRED
-    calls transport to send either non-RDMA or RDMA message
-    (RDMA ops never retransmitted)
-  rds_ib_xmit()
-    allocs work requests from send ring
-    adds any new send credits available to peer (h_credits)
-    maps the rds_message's sg list
-    piggybacks ack
-    populates work requests
-    post send to connection's queue pair
-
-The recv path
-=============
-
-  rds_ib_recv_cq_comp_handler()
-    looks at write completions
-    unmaps recv buffer from device
-    no errors, call rds_ib_process_recv()
-    refill recv ring
-  rds_ib_process_recv()
-    validate header checksum
-    copy header to rds_ib_incoming struct if start of a new datagram
-    add to ibinc's fraglist
-    if competed datagram:
-      update cong map if datagram was cong update
-      call rds_recv_incoming() otherwise
-      note if ack is required
-  rds_recv_incoming()
-    drop duplicate packets
-    respond to pings
-    find the sock associated with this datagram
-    add to sock queue
-    wake up sock
-    do some congestion calculations
-  rds_recvmsg
-    copy data into user iovec
-    handle CMSGs
-    return to application
-
-Multipath RDS (mprds)
-=====================
-  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
-  (though the concept can be extended to other transports). The classical
-  implementation of RDS-over-TCP is implemented by demultiplexing multiple
-  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
-  port]) over a single TCP socket between the 2 IP addresses involved. This
-  has the limitation that it ends up funneling multiple RDS flows over a
-  single TCP flow, thus it is
-  (a) upper-bounded to the single-flow bandwidth,
-  (b) suffers from head-of-line blocking for all the RDS sockets.
-
-  Better throughput (for a fixed small packet size, MTU) can be achieved
-  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
-  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
-  connection. RDS sockets will be attached to a path based on some hash
-  (e.g., of local address and RDS port number) and packets for that RDS
-  socket will be sent over the attached path using TCP to segment/reassemble
-  RDS datagrams on that path.
-
-  Multipathed RDS is implemented by splitting the struct rds_connection into
-  a common (to all paths) part, and a per-path struct rds_conn_path. All
-  I/O workqs and reconnect threads are driven from the rds_conn_path.
-  Transports such as TCP that are multipath capable may then set up a
-  TCP socket per rds_conn_path, and this is managed by the transport via
-  the transport privatee cp_transport_data pointer.
-
-  Transports announce themselves as multipath capable by setting the
-  t_mp_capable bit during registration with the rds core module. When the
-  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
-  across multiple paths. The outgoing hash is computed based on the
-  local address and port that the PF_RDS socket is bound to.
-
-  Additionally, even if the transport is MP capable, we may be
-  peering with some node that does not support mprds, or supports
-  a different number of paths. As a result, the peering nodes need
-  to agree on the number of paths to be used for the connection.
-  This is done by sending out a control packet exchange before the
-  first data packet. The control packet exchange must have completed
-  prior to outgoing hash completion in rds_sendmsg() when the transport
-  is mutlipath capable.
-
-  The control packet is an RDS ping packet (i.e., packet to rds dest
-  port 0) with the ping packet having a rds extension header option  of
-  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
-  number of paths supported by the sender. The "probe" ping packet will
-  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>)
-  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
-  be able to compute the min(sender_paths, rcvr_paths). The pong
-  sent in response to a probe-ping should contain the rcvr's npaths
-  when the rcvr is mprds-capable.
-
-  If the rcvr is not mprds-capable, the exthdr in the ping will be
-  ignored.  In this case the pong will not have any exthdrs, so the sender
-  of the probe-ping can default to single-path mprds.
-