Merge branch 'kcm'

Tom Herbert says: ==================== kcm: Kernel Connection Multiplexor (KCM) Kernel Connection Multiplexor (KCM) is a facility that provides a message based interface over TCP for generic application protocols. The motivation for this is based on the observation that although TCP is byte stream transport protocol with no concept of message boundaries, a common use case is to implement a framed application layer protocol running over TCP. To date, most TCP stacks offer byte stream API for applications, which places the burden of message delineation, message I/O operation atomicity, and load balancing in the application. With KCM an application can efficiently send and receive application protocol messages over TCP using a datagram interface. In order to delineate message in a TCP stream for receive in KCM, the kernel implements a message parser. For this we chose to employ BPF which is applied to the TCP stream. BPF code parses application layer messages and returns a message length. Nearly all binary application protocols are parsable in this manner, so KCM should be applicable across a wide range of applications. Other than message length determination in receive, KCM does not require any other application specific awareness. KCM does not implement any other application protocol semantics-- these are are provided in userspace or could be implemented in a kernel module layered above KCM. KCM implements an NxM multiplexor in the kernel as diagrammed below: +------------+ +------------+ +------------+ +------------+ | KCM socket | | KCM socket | | KCM socket | | KCM socket | +------------+ +------------+ +------------+ +------------+ | | | | +-----------+ | | +----------+ | | | | +----------------------------------+ | Multiplexor | +----------------------------------+ | | | | | +---------+ | | | ------------+ | | | | | +----------+ +----------+ +----------+ +----------+ +----------+ | Psock | | Psock | | Psock | | Psock | | Psock | +----------+ +----------+ +----------+ +----------+ +----------+ | | | | | +----------+ +----------+ +----------+ +----------+ +----------+ | TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | +----------+ +----------+ +----------+ +----------+ +----------+ The KCM sockets provide the datagram interface to applications, Psocks are the state for each attached TCP connection (i.e. where message delineation is performed on receive). A description of the APIs and design can be found in the included Documentation/networking/kcm.txt. In this patch set: - Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to indicate that more messages will be sent on the socket. The stack may batch messages up if it is beneficial for transmission. - In sendmmsg, set MSG_BATCH in all sub messages except for the last one. - In order to allow sendmmsg to contain multiple messages with SOCK_SEQPAKET we allow each msg_hdr in the sendmmsg to set MSG_EOR. - Add KCM module - This supports SOCK_DGRAM and SOCK_SEQPACKET. - KCM documentation v2: - Added splice and page operations. - Assemble receive messages in place on TCP socket (don't have a separate assembly queue. - Based on above, enforce maxmimum receive message to be the size of the recceive socket buffer. - Support message assembly timeout. Use the timeout value in sk_rcvtimeo on the TCP socket. - Tested some with a couple of other production applications, see ~5% improvement in application latency. Testing: Dave Watson has integrated KCM into Thrift and we intend to put these changes into open source. Example of this is in: https://github.com/djwatson/fbthrift/commit/ dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc Some initial KCM Thrift benchmark numbers (comment from Dave) Thrift by default ties a single connection to a single thread. KCM is instead able to load balance multiple connections across multiple epoll loops easily. A test sending ~5k bytes of data to a kcm thrift server, dropping the bytes on recv: QPS Latency / std dev Latency without KCM 70336 209/123 with KCM 70353 191/124 A test sending a small request, then doing work in the epoll thread, before serving more requests: QPS Latency / std dev Latency without KCM 14282 559/602 with KCM 23192 344/234 At the high end, there's definitely some additional kernel overhead: Cranking the pipelining way up, with lots of small requests QPS Latency / std dev Latency without KCM 1863429 127/119 with KCM 1337713 192/241 --- So for a "realistic" workload, KCM performs pretty well (second case). Under extreme conditions of highest tps we still have some work to do. In its nature a multiplexor will spread work between CPUs which is logically good for load balancing but coan conflict with the goal promoting affinity. Batching messages on both send and receive are the means to recoup performance. Future support: - Integration with TLS (TLS-in-kernel is a separate initiative). - Page operations/splice support - Unconnected KCM sockets. Will be able to attach sockets to different destinations, AF_KCM addresses with be used in sendmsg and recvmsg to indicate destination - Explore more utility in performing BPF inline with a TCP data stream (setting SO_MARK, rxhash for messages being sent received on KCM sockets). - Performance work - Diagnose performance issues under high message load FAQ (Questions posted on LWN) Q: Why do this in the kernel? A: Because the kernel is good at scheduling threads and steering packets to threads. KCM fits well into this model since it allows the unit of work for scheduling and steering to be the application layer messages themselves. KCM should be thought of as generic application protocol acceleration. It to the philosophy that the kernel provides generic and extensible interfaces. Q: How can adding code in the path yield better performance? A: It is true that for just sending receiving a single message there would be some performance loss since the code path is longer (for instance comparing netperf to KCM). But for real production applications performance takes on many dynamics. Parallelism, context switching, affinity, granularity of locking, and load balancing are all relevant. The theory of KCM is that by an application-centric interface, the kernel can provide better support for these performance characteristics. Q: Why not use an existing message-oriented protocol such as RUDP, DCCP, SCTP, RDS, and others? A: Because that would entail using a completely new transport protocol. Deploying a new protocol at scale is either a huge undertaking or fundamentally infeasible. This is true in either the Internet and in the data center due in a large part to protocol ossification. Besides, KCM we want KCM to work existing, well deployed application protocols that we couldn't change even if we wanted to (e.g. http/2). KCM simply defines a new interface method, it does not redefine any aspect of the transport protocol nor application protocol, nor set any new requirements on these. Neither does KCM attempt to implement any application protocol logic other than message deliniation in the stream. These are fundamental requirement of KCM. Q: How does this affect TCP? A: It doesn't, not in the slightest. The use of KCM can be one-sided, KCM has no effect on the wire. Q: Why force TCP into doing something it's not designed for? A: TCP is defined as transport protocol and there is no standard that says the API into TCP must be stream based sockets, or for that matter sockets at all (or even that TCP needs to be implemented in a kernel). KCM is not inconsistent with the design of TCP just because to makes an message based interface over TCP, if it were then every application protocol sending messages over TCP would also be! :-) Q: What about the problem of a connections with very slow rate of incoming data? As a result your application can get storms of very short reads. And it actually happens a lot with connection from mobile devices and it is a problem for servers handling a lot of connections. A: The storm of short reads will occur regardless of whether KCM is used or not. KCM does have one advantage in this scenario though, it will only wake up the application when a full message has been received, not for each packet that makes up part of a bigger messages. If a bunch of small messages are received, the application can receive messages in batches using recvmmsg. Q: Why not just use DPDK, or at least provide KCM like functionality in DPDK? A: DPDK, or more generally OS bypass presumably with a TCP stack in userland, presents a different model of load balancing than that of KCM (and the kernel). KCM implements load balancing of messages across the threads of an application, whereas DPDK load balances based on queues which are more static and coarse-grained since multiple connections are bound to queues. DPDK works best when processing of packets is silo'ed in a thread on the CPU processing a queue, and packet processing (for both the stack and application) is fairly uniform. KCM works well for applications where the amount of work to process messages varies an application work is commonly delegated to worker threads often on different CPUs. The message based interface over TCP is something that could be provide by a DPDK or OS bypass library. Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web sockets? A: Yes. KCM is most appropriate for message based protocols over TCP where is easy to deduce the message length (e.g. a length field) and the protocol implements its own message ordering semantics. Fortunately this encompasses many modern protocols. Q: How is memory limited and controlled? A: In v2 all data for messages is now kept in socket buffers, either those for TCP or KCM, so socket buffer limits are applicable. This includes receive messages assembly which is now done ont teh TCP socket buffer instead of a separate queue-- this has the consequence that the TCP socket buffer limit provides an enforceable maxmimum message size. Additionally, a timeout may be set for messages assembly. The value used for this is taken from sk_rcvtimeo of the TCP socket. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
author: David S. Miller <davem@davemloft.net> 2016-03-09 16:36:16 -0500
committer: David S. Miller <davem@davemloft.net> 2016-03-09 16:36:16 -0500
commit: 9531ab65f4ec066a6e6617a08a293c60397a161b (patch)
tree: 18b025fb9daf230bf9d0be894c24aab69361748f /include
parent: samples/bpf: add map performance test (diff)
parent: kcm: Add description in Documentation (diff)
download: linux-dev-9531ab65f4ec066a6e6617a08a293c60397a161b.tar.xz
linux-dev-9531ab65f4ec066a6e6617a08a293c60397a161b.zip
6 files changed, 318 insertions, 1 deletions
diff --git a/include/linux/net.h b/include/linux/net.h
index 0b4ac7da583a..49175e4ced11 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -215,6 +215,7 @@ int __sock_create(struct net *net, int family, int type, int proto,
 int sock_create(int family, int type, int proto, struct socket **res);
 int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res);
 int sock_create_lite(int family, int type, int proto, struct socket **res);
+struct socket *sock_alloc(void);
 void sock_release(struct socket *sock);
 int sock_sendmsg(struct socket *sock, struct msghdr *msg);
 int sock_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 14ec1652daf4..17d4f849c65e 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -319,6 +319,27 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 })
 
 /**
+ * list_next_or_null_rcu - get the first element from a list
+ * @head:	the head for the list.
+ * @ptr:        the list head to take the next element from.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ *
+ * Note that if the ptr is at the end of the list, NULL is returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rcu(head, ptr, type, member) \
+({ \
+	struct list_head *__head = (head); \
+	struct list_head *__ptr = (ptr); \
+	struct list_head *__next = READ_ONCE(__ptr->next); \
+	likely(__next != __head) ? list_entry_rcu(__next, type, \
+						  member) : NULL; \
+})
+
+/**
  * list_for_each_entry_rcu	-	iterate over rcu list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5bf59c8493b7..73bf6c6a833b 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -200,7 +200,9 @@ struct ucred {
 #define AF_ALG		38	/* Algorithm sockets		*/
 #define AF_NFC		39	/* NFC sockets			*/
 #define AF_VSOCK	40	/* vSockets			*/
-#define AF_MAX		41	/* For now.. */
+#define AF_KCM		41	/* Kernel Connection Multiplexor*/
+
+#define AF_MAX		42	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -246,6 +248,7 @@ struct ucred {
 #define PF_ALG		AF_ALG
 #define PF_NFC		AF_NFC
 #define PF_VSOCK	AF_VSOCK
+#define PF_KCM		AF_KCM
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -274,6 +277,7 @@ struct ucred {
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
+#define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
@@ -322,6 +326,7 @@ struct ucred {
 #define SOL_CAIF	278
 #define SOL_ALG		279
 #define SOL_NFC		280
+#define SOL_KCM		281
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/include/net/kcm.h b/include/net/kcm.h
new file mode 100644
index 000000000000..95c425ca97b6
--- /dev/null
+++ b/include/net/kcm.h
@@ -0,0 +1,226 @@
+/*
+ * Kernel Connection Multiplexor
+ *
+ * Copyright (c) 2016 Tom Herbert <tom@herbertland.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#ifndef __NET_KCM_H_
+#define __NET_KCM_H_
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <uapi/linux/kcm.h>
+
+extern unsigned int kcm_net_id;
+
+#define KCM_STATS_ADD(stat, count) ((stat) += (count))
+#define KCM_STATS_INCR(stat) ((stat)++)
+
+struct kcm_psock_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+	unsigned int rx_aborts;
+	unsigned int rx_mem_fail;
+	unsigned int rx_need_more_hdr;
+	unsigned int rx_msg_too_big;
+	unsigned int rx_msg_timeouts;
+	unsigned int rx_bad_hdr_len;
+	unsigned long long reserved;
+	unsigned long long unreserved;
+	unsigned int tx_aborts;
+};
+
+struct kcm_mux_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+	unsigned int rx_ready_drops;
+	unsigned int tx_retries;
+	unsigned int psock_attach;
+	unsigned int psock_unattach_rsvd;
+	unsigned int psock_unattach;
+};
+
+struct kcm_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+};
+
+struct kcm_tx_msg {
+	unsigned int sent;
+	unsigned int fragidx;
+	unsigned int frag_offset;
+	unsigned int msg_flags;
+	struct sk_buff *frag_skb;
+	struct sk_buff *last_skb;
+};
+
+struct kcm_rx_msg {
+	int full_len;
+	int accum_len;
+	int offset;
+	int early_eaten;
+};
+
+/* Socket structure for KCM client sockets */
+struct kcm_sock {
+	struct sock sk;
+	struct kcm_mux *mux;
+	struct list_head kcm_sock_list;
+	int index;
+	u32 done : 1;
+	struct work_struct done_work;
+
+	struct kcm_stats stats;
+
+	/* Transmit */
+	struct kcm_psock *tx_psock;
+	struct work_struct tx_work;
+	struct list_head wait_psock_list;
+	struct sk_buff *seq_skb;
+
+	/* Don't use bit fields here, these are set under different locks */
+	bool tx_wait;
+	bool tx_wait_more;
+
+	/* Receive */
+	struct kcm_psock *rx_psock;
+	struct list_head wait_rx_list; /* KCMs waiting for receiving */
+	bool rx_wait;
+	u32 rx_disabled : 1;
+};
+
+struct bpf_prog;
+
+/* Structure for an attached lower socket */
+struct kcm_psock {
+	struct sock *sk;
+	struct kcm_mux *mux;
+	int index;
+
+	u32 tx_stopped : 1;
+	u32 rx_stopped : 1;
+	u32 done : 1;
+	u32 unattaching : 1;
+
+	void (*save_state_change)(struct sock *sk);
+	void (*save_data_ready)(struct sock *sk);
+	void (*save_write_space)(struct sock *sk);
+
+	struct list_head psock_list;
+
+	struct kcm_psock_stats stats;
+
+	/* Receive */
+	struct sk_buff *rx_skb_head;
+	struct sk_buff **rx_skb_nextp;
+	struct sk_buff *ready_rx_msg;
+	struct list_head psock_ready_list;
+	struct work_struct rx_work;
+	struct delayed_work rx_delayed_work;
+	struct bpf_prog *bpf_prog;
+	struct kcm_sock *rx_kcm;
+	unsigned long long saved_rx_bytes;
+	unsigned long long saved_rx_msgs;
+	struct timer_list rx_msg_timer;
+	unsigned int rx_need_bytes;
+
+	/* Transmit */
+	struct kcm_sock *tx_kcm;
+	struct list_head psock_avail_list;
+	unsigned long long saved_tx_bytes;
+	unsigned long long saved_tx_msgs;
+};
+
+/* Per net MUX list */
+struct kcm_net {
+	struct mutex mutex;
+	struct kcm_psock_stats aggregate_psock_stats;
+	struct kcm_mux_stats aggregate_mux_stats;
+	struct list_head mux_list;
+	int count;
+};
+
+/* Structure for a MUX */
+struct kcm_mux {
+	struct list_head kcm_mux_list;
+	struct rcu_head rcu;
+	struct kcm_net *knet;
+
+	struct list_head kcm_socks;	/* All KCM sockets on MUX */
+	int kcm_socks_cnt;		/* Total KCM socket count for MUX */
+	struct list_head psocks;	/* List of all psocks on MUX */
+	int psocks_cnt;		/* Total attached sockets */
+
+	struct kcm_mux_stats stats;
+	struct kcm_psock_stats aggregate_psock_stats;
+
+	/* Receive */
+	spinlock_t rx_lock ____cacheline_aligned_in_smp;
+	struct list_head kcm_rx_waiters; /* KCMs waiting for receiving */
+	struct list_head psocks_ready;	/* List of psocks with a msg ready */
+	struct sk_buff_head rx_hold_queue;
+
+	/* Transmit */
+	spinlock_t  lock ____cacheline_aligned_in_smp;	/* TX and mux locking */
+	struct list_head psocks_avail;	/* List of available psocks */
+	struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */
+};
+
+#ifdef CONFIG_PROC_FS
+int kcm_proc_init(void);
+void kcm_proc_exit(void);
+#else
+static int kcm_proc_init(void) { return 0; }
+static void kcm_proc_exit(void) { }
+#endif
+
+static inline void aggregate_psock_stats(struct kcm_psock_stats *stats,
+					 struct kcm_psock_stats *agg_stats)
+{
+	/* Save psock statistics in the mux when psock is being unattached. */
+
+#define SAVE_PSOCK_STATS(_stat) (agg_stats->_stat += stats->_stat)
+	SAVE_PSOCK_STATS(rx_msgs);
+	SAVE_PSOCK_STATS(rx_bytes);
+	SAVE_PSOCK_STATS(rx_aborts);
+	SAVE_PSOCK_STATS(rx_mem_fail);
+	SAVE_PSOCK_STATS(rx_need_more_hdr);
+	SAVE_PSOCK_STATS(rx_msg_too_big);
+	SAVE_PSOCK_STATS(rx_msg_timeouts);
+	SAVE_PSOCK_STATS(rx_bad_hdr_len);
+	SAVE_PSOCK_STATS(tx_msgs);
+	SAVE_PSOCK_STATS(tx_bytes);
+	SAVE_PSOCK_STATS(reserved);
+	SAVE_PSOCK_STATS(unreserved);
+	SAVE_PSOCK_STATS(tx_aborts);
+#undef SAVE_PSOCK_STATS
+}
+
+static inline void aggregate_mux_stats(struct kcm_mux_stats *stats,
+				       struct kcm_mux_stats *agg_stats)
+{
+	/* Save psock statistics in the mux when psock is being unattached. */
+
+#define SAVE_MUX_STATS(_stat) (agg_stats->_stat += stats->_stat)
+	SAVE_MUX_STATS(rx_msgs);
+	SAVE_MUX_STATS(rx_bytes);
+	SAVE_MUX_STATS(tx_msgs);
+	SAVE_MUX_STATS(tx_bytes);
+	SAVE_MUX_STATS(rx_ready_drops);
+	SAVE_MUX_STATS(psock_attach);
+	SAVE_MUX_STATS(psock_unattach_rsvd);
+	SAVE_MUX_STATS(psock_unattach);
+#undef SAVE_MUX_STATS
+}
+
+#endif /* __NET_KCM_H_ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e90db8546806..0302636af98c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1816,4 +1816,28 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff *skb)
 	skb->truesize = 2;
 }
 
+static inline int tcp_inq(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int answ;
+
+	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+		answ = 0;
+	} else if (sock_flag(sk, SOCK_URGINLINE) ||
+		   !tp->urg_data ||
+		   before(tp->urg_seq, tp->copied_seq) ||
+		   !before(tp->urg_seq, tp->rcv_nxt)) {
+
+		answ = tp->rcv_nxt - tp->copied_seq;
+
+		/* Subtract 1, if FIN was received */
+		if (answ && sock_flag(sk, SOCK_DONE))
+			answ--;
+	} else {
+		answ = tp->urg_seq - tp->copied_seq;
+	}
+
+	return answ;
+}
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/kcm.h b/include/uapi/linux/kcm.h
new file mode 100644
index 000000000000..a5a530940b99
--- /dev/null
+++ b/include/uapi/linux/kcm.h
@@ -0,0 +1,40 @@
+/*
+ * Kernel Connection Multiplexor
+ *
+ * Copyright (c) 2016 Tom Herbert <tom@herbertland.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * User API to clone KCM sockets and attach transport socket to a KCM
+ * multiplexor.
+ */
+
+#ifndef KCM_KERNEL_H
+#define KCM_KERNEL_H
+
+struct kcm_attach {
+	int fd;
+	int bpf_fd;
+};
+
+struct kcm_unattach {
+	int fd;
+};
+
+struct kcm_clone {
+	int fd;
+};
+
+#define SIOCKCMATTACH	(SIOCPROTOPRIVATE + 0)
+#define SIOCKCMUNATTACH	(SIOCPROTOPRIVATE + 1)
+#define SIOCKCMCLONE	(SIOCPROTOPRIVATE + 2)
+
+#define KCMPROTO_CONNECTED	0
+
+/* Socket options */
+#define KCM_RECV_DISABLE	1
+
+#endif
+
author	David S. Miller <davem@davemloft.net>	2016-03-09 16:36:16 -0500
committer	David S. Miller <davem@davemloft.net>	2016-03-09 16:36:16 -0500
commit	9531ab65f4ec066a6e6617a08a293c60397a161b (patch)
tree	18b025fb9daf230bf9d0be894c24aab69361748f /include
parent	samples/bpf: add map performance test (diff)
parent	kcm: Add description in Documentation (diff)
download	linux-dev-9531ab65f4ec066a6e6617a08a293c60397a161b.tar.xz linux-dev-9531ab65f4ec066a6e6617a08a293c60397a161b.zip