From afe4fd062416b158a8a8538b23adc1930a9b88dc Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Thu, 29 Aug 2013 15:49:55 -0700
Subject: pkt_sched: fq: Fair Queue packet scheduler

- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
- Uses the new_flow/old_flow separation from FQ_codel
- New flows get an initial credit allowing IW10 without added delay.
- Special FIFO queue for high prio packets (no need for PRIO + FQ)
- Uses a hash table of RB trees to locate the flows at enqueue() time
- Smart on demand gc (at enqueue() time, RB tree lookup evicts old
  unused flows)
- Dynamic memory allocations.
- Designed to allow millions of concurrent flows per Qdisc.
- Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
- Single high resolution timer for throttled flows (if any).
- One RB tree to link throttled flows.
- Ability to have a max rate per flow. We might add a socket option
  to add per socket limitation.
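To make the flow classification concrete, here is a minimal sketch of a
perfect-match lookup in one RB tree bucket of the hash table. Struct and
function names are hypothetical simplifications for illustration, not the
actual sch_fq.c code :

#include <linux/rbtree.h>
#include <linux/slab.h>
#include <net/sock.h>

/* Hypothetical, simplified flow state ; the real per-flow structure
 * also carries the packet queue, pacing state, linkage into the
 * new/old flow lists, etc.
 */
struct fq_flow_sketch {
	struct rb_node	node;	/* linked into one hash bucket's RB tree */
	struct sock	*sk;	/* perfect match key : the flow's socket */
	int		credit;	/* bytes this flow may send this RR round */
};

/* Find or create the flow for @sk in @root, one of the 'buckets'
 * RB trees of the hash table. Walking the tree at enqueue() time
 * is also the natural point to garbage collect old unused flows.
 */
static struct fq_flow_sketch *fq_flow_lookup(struct rb_root *root,
					     struct sock *sk,
					     int initial_credit)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;
	struct fq_flow_sketch *f;

	while (*p) {
		parent = *p;
		f = rb_entry(parent, struct fq_flow_sketch, node);
		if (f->sk == sk)
			return f;	/* exact match, no stochastic collision */
		if ((unsigned long)sk < (unsigned long)f->sk)
			p = &parent->rb_left;
		else
			p = &parent->rb_right;
	}
	f = kzalloc(sizeof(*f), GFP_ATOMIC);
	if (!f)
		return NULL;
	f->sk = sk;
	f->credit = initial_credit;	/* new flows start with initial_quantum */
	rb_link_node(&f->node, parent, p);
	rb_insert_color(&f->node, root);
	return f;
}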
Attempts have been made to add TCP pacing to the TCP stack itself, but
this seems to add complex code to an already complex stack.

TCP pacing is welcome for flows having idle times, as the cwnd permits
the TCP stack to queue a possibly large number of packets.

This removes the 'slow start after idle' choice, which badly hits
large BDP flows and applications delivering chunks of data such as
video streams.

Nicely spaced packets :
Here the interface is 10Gbit, but the flow bottleneck is ~20Mbit

cwnd is big, yet FQ avoids the typical bursts generated by TCP
(as in netperf TCP_RR -- -r 100000,100000)

15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125
15:01:23.545394 IP B > A: . ack 81089 win 3668
15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125
15:01:23.546565 IP B > A: . ack 83985 win 3668
15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125
15:01:23.547778 IP B > A: . ack 86881 win 3668
15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125
15:01:23.548949 IP B > A: . ack 89777 win 3668
15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125
15:01:23.550182 IP B > A: . ack 92673 win 3668
15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125
15:01:23.551406 IP B > A: . ack 95569 win 3668
15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125
15:01:23.552576 IP B > A: . ack 98465 win 3668
15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125
15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125
15:01:23.554204 IP B > A: . ack 100001 win 3668
15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668
15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668
15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668
15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668
15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668
15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668
15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668
15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668
15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668
15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668
15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668
15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668
15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668
15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668
15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668
15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668
15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668
15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668

TCP timestamps show that most packets from B were queued in the same ms
timeframe (TSval 1159799{3,4}), but FQ managed to send them right in
time to avoid a big burst.

In slow start or steady state, very few packets are throttled [1].

FQ gets a bunch of tunables :

  limit : max number of packets on whole Qdisc (default 10000)

  flow_limit : max number of packets per flow (default 100)

  quantum : the credit per RR round (default is 2 MTU)

  initial_quantum : initial credit for new flows (default is 10 MTU)

  maxrate : max per flow rate (default : unlimited)

  buckets : number of RB trees in the hash table (default : 1024,
            consumes 8 bytes per bucket)

  [no]pacing : disable/enable pacing (default is enable)

All of them can be changed on a live qdisc.

$ tc qd add dev eth0 root fq help
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
              [ quantum BYTES ] [ initial_quantum BYTES ]
              [ maxrate RATE ] [ buckets NUMBER ]
              [ [no]pacing ]

$ tc -s -d qd
qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
 buckets 256 quantum 3028 initial_quantum 15140
 Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
 backlog 0b 0p requeues 14
  511 flows, 511 inactive, 0 throttled
  110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit

[1] Except if the initial srtt is overestimated, as when using a
cached srtt from tcp metrics. We'll provide a fix for this issue.

Signed-off-by: Eric Dumazet
Cc: Yuchung Cheng
Cc: Neal Cardwell
Signed-off-by: David S. Miller
---
 net/sched/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sched/Makefile b/net/sched/Makefile
index 978cbf004e80..e5f9abe9a5db 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_NET_SCH_CHOKE) += sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ) += sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL) += sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL) += sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_FQ) += sch_fq.o
 
 obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
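For illustration, a rough sketch of the pacing computation described
above (a hypothetical helper in plain C, not the actual kernel code) :
a flow capped at a given rate may not send its next packet before
now + len/rate.

#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper : earliest departure time of the next packet
 * for a flow capped at @rate bytes per second. Flows whose departure
 * time lies in the future are throttled : parked in the RB tree of
 * throttled flows and released by the single high resolution timer.
 */
static uint64_t fq_next_departure_ns(uint64_t now_ns, uint32_t len,
				     uint64_t rate)
{
	if (!rate)		/* maxrate unlimited : no pacing delay */
		return now_ns;
	return now_ns + (uint64_t)len * NSEC_PER_SEC / rate;
}

At the ~20Mbit bottleneck rate of the trace above (~2.5 Mbytes/sec), a
2896 byte segment earns a gap of roughly 1.2 ms, the same order as the
~1.4 ms inter-packet spacing visible in the tcpdump output.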