aboutsummaryrefslogtreecommitdiffstats
path: root/include/uapi (follow)
AgeCommit message (Collapse)AuthorFilesLines
2013-08-29pkt_sched: fq: Fair Queue packet schedulerEric Dumazet1-0/+41
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel) - Uses the new_flow/old_flow separation from FQ_codel - New flows get an initial credit allowing IW10 without added delay. - Special FIFO queue for high prio packets (no need for PRIO + FQ) - Uses a hash table of RB trees to locate the flows at enqueue() time - Smart on demand gc (at enqueue() time, RB tree lookup evicts old unused flows) - Dynamic memory allocations. - Designed to allow millions of concurrent flows per Qdisc. - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow. - Single high resolution timer for throttled flows (if any). - One RB tree to link throttled flows. - Ability to have a max rate per flow. We might add a socket option to add per socket limitation. Attempts have been made to add TCP pacing in TCP stack, but this seems to add complex code to an already complex stack. TCP pacing is welcomed for flows having idle times, as the cwnd permits TCP stack to queue a possibly large number of packets. This removes the 'slow start after idle' choice, hitting badly large BDP flows, and applications delivering chunks of data as video streams. Nicely spaced packets : Here interface is 10Gbit, but flow bottleneck is ~20Mbit cwin is big, yet FQ avoids the typical bursts generated by TCP (as in netperf TCP_RR -- -r 100000,100000) 15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115> 15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115> 15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115> 15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115> 15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115> 15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115> 15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115> 15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> TCP timestamps show that most packets from B were queued in the same ms timeframe (TSval 1159799{3,4}), but FQ managed to send them right in time to avoid a big burst. In slow start or steady state, very few packets are throttled [1] FQ gets a bunch of tunables as : limit : max number of packets on whole Qdisc (default 10000) flow_limit : max number of packets per flow (default 100) quantum : the credit per RR round (default is 2 MTU) initial_quantum : initial credit for new flows (default is 10 MTU) maxrate : max per flow rate (default : unlimited) buckets : number of RB trees (default : 1024) in hash table. (consumes 8 bytes per bucket) [no]pacing : disable/enable pacing (default is enable) All of them can be changed on a live qdisc. $ tc qd add dev eth0 root fq help Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ] [ quantum BYTES ] [ initial_quantum BYTES ] [ maxrate RATE ] [ buckets NUMBER ] [ [no]pacing ] $ tc -s -d qd qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140 Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14) backlog 0b 0p requeues 14 511 flows, 511 inactive, 0 throttled 110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit [1] Except if initial srtt is overestimated, as if using cached srtt in tcp metrics. We'll provide a fix for this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: packet: add randomized fanout schedulerDaniel Borkmann1-0/+1
We currently allow for different fanout scheduling policies in pf_packet such as scheduling by skb's rxhash, round-robin, by cpu, and rollover. Also allow for a random, equidistributed selection of the socket from the fanout process group. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29ipv6: drop fragmented ndisc packets by default (RFC 6980)Hannes Frederic Sowa1-0/+1
This patch implements RFC6980: Drop fragmented ndisc packets by default. If a fragmented ndisc packet is received the user is informed that it is possible to disable the check. Cc: Fernando Gont <fernando@gont.com.ar> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitchDavid S. Miller1-1/+14
Jesse Gross says: ==================== A number of significant new features and optimizations for net-next/3.12. Highlights are: * "Megaflows", an optimization that allows userspace to specify which flow fields were used to compute the results of the flow lookup. This allows for a major reduction in flow setups (the major performance bottleneck in Open vSwitch) without reducing flexibility. * Converting netlink dump operations to use RCU, allowing for additional parallelism in userspace. * Matching and modifying SCTP protocol fields. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-28netfilter: add SYNPROXY core/targetPatrick McHardy1-0/+16
Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy core with common functions and an address family specific target. The SYNPROXY receives the connection request from the client, responds with a SYN/ACK containing a SYN cookie and announcing a zero window and checks whether the final ACK from the client contains a valid cookie. It then establishes a connection to the original destination and, if successful, sends a window update to the client with the window size announced by the server. Support for timestamps, SACK, window scaling and MSS options can be statically configured as target parameters if the features of the server are known. If timestamps are used, the timestamp value sent back to the client in the SYN/ACK will be different from the real timestamp of the server. In order to now break PAWS, the timestamps are translated in the direction server->client. Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-28netfilter: nf_conntrack: make sequence number adjustments usuable without NATPatrick McHardy2-3/+15
Split out sequence number adjustments from NAT and move them to the conntrack core to make them usable for SYN proxying. The sequence number adjustment information is moved to a seperate extend. The extend is added to new conntracks when a NAT mapping is set up for a connection using a helper. As a side effect, this saves 24 bytes per connection with NAT in the common case that a connection does not have a helper assigned. Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Martin Topholm <mph@one.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-26openvswitch: Add SCTP supportJoe Stringer1-0/+6
This patch adds support for rewriting SCTP src,dst ports similar to the functionality already available for TCP/UDP. Rewriting SCTP ports is expensive due to double-recalculation of the SCTP checksums; this is performed to ensure that packets traversing OVS with invalid checksums will continue to the destination with any checksum corruption intact. Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Ben Pfaff <blp@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+36
Conflicts: drivers/net/wireless/iwlwifi/pcie/trans.c include/linux/inetdevice.h The inetdevice.h conflict involves moving the IPV4_DEVCONF values into a UAPI header, overlapping additions of some new entries. The iwlwifi conflict is a context overlap. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-23openvswitch: Mega flow implementationAndy Zhou1-1/+8
Add wildcarded flow support in kernel datapath. Wildcarded flow can improve OVS flow set up performance by avoid sending matching new flows to the user space program. The exact performance boost will largely dependent on wildcarded flow hit rate. In case all new flows hits wildcard flows, the flow set up rate is within 5% of that of linux bridge module. Pravin has made significant contributions to this patch. Including API clean ups and bug fixes. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-22ipv4: expose IPV4_DEVCONFstephen hemminger1-0/+34
IP sends device configuration (see inet_fill_link_af) as an array in the netlink information, but the indices in that array are not exposed to userspace through any current santized header file. It was available back in 2.6.32 (in /usr/include/linux/sysctl.h) but was broken by: commit 02291680ffba92e5b5865bc0c5e7d1f3056b80ec Author: Eric W. Biederman <ebiederm@xmission.com> Date: Sun Feb 14 03:25:51 2010 +0000 net ipv4: Decouple ipv4 interface parameters from binary sysctl numbers Eric was solving the sysctl problem but then the indices were re-exposed by a later addition of devconf support for IPV4 commit 9f0f7272ac9506f4c8c05cc597b7e376b0b9f3e4 Author: Thomas Graf <tgraf@infradead.org> Date: Tue Nov 16 04:32:48 2010 +0000 ipv4: AF_INET link address family Putting them in /usr/include/linux/ip.h seemed the logical match for the DEVCONF_ definitions for IPV6 in /usr/include/linux/ip6.h Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21tun: Get skfilter layoutPavel Emelyanov1-0/+1
The only thing we may have from tun device is the fprog, whic contains the number of filter elements and a pointer to (user-space) memory where the elements are. The program itself may not be available if the device is persistent and detached. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21tun: Allow to skip filter on attachPavel Emelyanov1-0/+1
There's a small problem with sk-filters on tun devices. Consider an application doing this sequence of steps: fd = open("/dev/net/tun"); ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" }); ioctl(fd, TUNATTACHFILTER, &my_filter); ioctl(fd, TUNSETPERSIST, 1); close(fd); At that point the tun0 will remain in the system and will keep in mind that there should be a socket filter at address '&my_filter'. If after that we do fd = open("/dev/net/tun"); ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" }); we most likely receive the -EFAULT error, since tun_attach() would try to connect the filter back. But (!) if we provide a filter at address &my_filter, then tun0 will be created and the "new" filter would be attached, but application may not know about that. This may create certain problems to anyone using tun-s, but it's critical problem for c/r -- if we meet a persistent tun device with a filter in mind, we will not be able to attach to it to dump its state (flags, owner, address, vnethdr size, etc.). The proposal is to allow to attach to tun device (with TUNSETIFF) w/o attaching the filter to the tun-file's socket. After this attach app may e.g clean the device by dropping the filter, it doesn't want to have one, or (in case of c/r) get information about the device with tun ioctls. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21tun: Add ability to create tun device with given indexPavel Emelyanov1-0/+1
Tun devices cannot be created with ifidex user wants, but it's required by checkpoint-restore project. Long time ago such ability was implemented for rtnl_ops-based interface for creating links (9c7dafbf net: Allow to create links with given ifindex), but the only API for creating and managing tuntap devices is ioctl-based and is evolving with adding new ones (cde8b15f tuntap: add ioctl to attach or detach a file form tuntap device). Following that trend, here's how a new ioctl that sets the ifindex for device, that _will_ be created by TUNSETIFF ioctl looks like. So those who want a tuntap device with the ifindex N, should open the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF. If the index N is busy, then the register_netdev will find this out and the ioctl would be failed with -EBUSY. If setifindex is not called, then it will be generated as before. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller4-0/+76
Conflicts: net/netfilter/nf_conntrack_proto_tcp.c The conflict had to do with overlapping changes dealing with fixing the use of an "s32" to hold the value returned by NAT_OFFSET(). Pablo Neira Ayuso says: ==================== The following batch contains Netfilter/IPVS updates for your net-next tree. More specifically, they are: * Trivial typo fix in xt_addrtype, from Phil Oester. * Remove net_ratelimit in the conntrack logging for consistency with other logging subsystem, from Patrick McHardy. * Remove unneeded includes from the recently added xt_connlabel support, from Florian Westphal. * Allow to update conntracks via nfqueue, don't need NFQA_CFG_F_CONNTRACK for this, from Florian Westphal. * Remove tproxy core, now that we have socket early demux, from Florian Westphal. * A couple of patches to refactor conntrack event reporting to save a good bunch of lines, from Florian Westphal. * Fix missing locking in NAT sequence adjustment, it did not manifested in any known bug so far, from Patrick McHardy. * Change sequence number adjustment variable to 32 bits, to delay the possible early overflow in long standing connections, also from Patrick. * Comestic cleanups for IPVS, from Dragos Foianu. * Fix possible null dereference in IPVS in the SH scheduler, from Daniel Borkmann. * Allow to attach conntrack expectations via nfqueue. Before this patch, you had to use ctnetlink instead, thus, we save the conntrack lookup. * Export xt_rpfilter and xt_HMARK header files, from Nicolas Dichtel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20openvswitch: Add vxlan tunneling support.Pravin B Shelar1-0/+11
Following patch adds vxlan vport type for openvswitch using vxlan api. So now there is vxlan dependency for openvswitch. CC: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-2/+10
2013-08-15net_sched: restore "linklayer atm" handlingJesper Dangaard Brouer1-1/+9
commit 56b765b79 ("htb: improved accuracy at high rates") broke the "linklayer atm" handling. tc class add ... htb rate X ceil Y linklayer atm The linklayer setting is implemented by modifying the rate table which is send to the kernel. No direct parameter were transferred to the kernel indicating the linklayer setting. The commit 56b765b79 ("htb: improved accuracy at high rates") removed the use of the rate table system. To keep compatible with older iproute2 utils, this patch detects the linklayer by parsing the rate table. It also supports future versions of iproute2 to send this linklayer parameter to the kernel directly. This is done by using the __reserved field in struct tc_ratespec, to convey the choosen linklayer option, but only using the lower 4 bits of this field. Linklayer detection is limited to speeds below 100Mbit/s, because at high rates the rtab is gets too inaccurate, so bad that several fields contain the same values, this resembling the ATM detect. Fields even start to contain "0" time to send, e.g. at 1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have been more broken than we first realized. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-14netfilter: export xt_HMARK.h to userlandNicolas Dichtel2-0/+51
This file contains the API for the target "HMARK", hence it should be exported to userland. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-14netfilter: export xt_rpfilter.h to userlandNicolas Dichtel2-0/+24
This file contains the API for the match "rpfilter", hence it should be exported to userland. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-13ipv6: make unsolicited report intervals configurable for mldHannes Frederic Sowa1-0/+2
Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp: Reduce Unsolicited report interval to 1s when using IGMPv3") and 2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space configuration of igmp unsolicited report interval") by William Manley made igmp unsolicited report intervals configurable per interface and corrected the interval of unsolicited igmpv3 report messages resendings to 1s. Same needs to be done for IPv6: MLDv1 (RFC2710 7.10.): 10 seconds MLDv2 (RFC3810 9.11.): 1 second Both intervals are configurable via new procfs knobs mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval. (also added .force_mld_version to ipv6_devconf_dflt to bring structs in line without semantic changes) v2: a) Joined documentation update for IPv4 and IPv6 MLD/IGMP unsolicited_report_interval procfs knobs. b) incorporate stylistic feedback from William Manley v3: a) add new DEVCONF_* values to the end of the enum (thanks to David Miller) Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: William Manley <william.manley@youview.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-nextDavid S. Miller1-20/+163
John W. Linville says: ==================== This is a batch of updates intended for 3.12. It is mostly driver stuff, although Johannes Berg and Simon Wunderlich make a good showing with mac80211 bits (particularly some work on 5/10 MHz channel support). The usual suspects are mostly represented. There are lots of updates to iwlwifi, ath9k, ath10k, mwifiex, rt2x00, wil6210, as usual. The bcma bus gets some love this time, as do cw1200, iwl4965, and a few other bits here and there. I don't think there is much unusual here, FWIW. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13pptp: fix byte order warningsstephen hemminger1-1/+1
Pptp driver has lots of byte order warnings from sparse. This was because the on-the-wire header is in network byte order (obviously) but the definition did not reflect that. Also, the address structure to user space actually put the call id in host order. Rather than break ABI compatibility, just acknowledge the existing design. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13netfilter: nfnetlink_queue: allow to attach expectations to conntracksPablo Neira Ayuso1-0/+1
This patch adds the capability to attach expectations via nfnetlink_queue. This is required by conntrack helpers that trigger expectations based on the first packet seen like the TFTP and the DHCPv6 user-space helpers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-08-12Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davemJohn W. Linville1-20/+163
Conflicts: drivers/net/ethernet/broadcom/Kconfig
2013-08-09Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-nextJohn W. Linville1-0/+30
2013-08-09Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wirelessJohn W. Linville1-3/+3
2013-08-09net: rename busy poll MIB counterEliezer Tamir1-1/+1
Rename mib counter from "low latency" to "busy poll" v1 also moved the counter to the ip MIB (suggested by Shawn Bohrer) Eric Dumazet suggested that the current location is better. So v2 just renames the counter to fit the new naming convention. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-08net: add SNMP counters tracking incoming ECN bitsEric Dumazet1-0/+4
With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP counters do not count the number of frames, but number of aggregated segments. Its probably too late to change this now. This patch adds four new counters, tracking number of frames, regardless of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6. Ip[6]NoECTPkts : Number of packets received with NOECT Ip[6]ECT1Pkts : Number of packets received with ECT(1) Ip[6]ECT0Pkts : Number of packets received with ECT(0) Ip[6]CEPkts : Number of packets received with Congestion Experienced lph37:~# nstat | egrep "Pkts|InReceive" IpInReceives 1634137 0.0 Ip6InReceives 3714107 0.0 Ip6InNoECTPkts 19205 0.0 Ip6InECT0Pkts 52651828 0.0 IpExtInNoECTPkts 33630 0.0 IpExtInECT0Pkts 15581379 0.0 IpExtInCEPkts 6 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-7/+18
Merge net into net-next to setup some infrastructure Eric Dumazet needs for usbnet changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03fib_rules: fix suppressor names and default valuesStefan Tomanek1-1/+1
This change brings the suppressor attribute names into line; it also changes the data types to provide a more consistent interface. While -1 indicates that the suppressor is not enabled, values >= 0 for suppress_prefixlen or suppress_ifgroup reject routing decisions violating the constraint. This changes the previously presented behaviour of suppress_prefixlen, where a prefix length _less_ than the attribute value was rejected. After this change, a prefix length less than *or* equal to the value is considered a violation of the rule constraint. It also changes the default values for default and newly added rules (disabling any suppression for those). Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02fib_rules: add route suppression based on ifgroupStefan Tomanek1-1/+1
This change adds the ability to suppress a routing decision based upon the interface group the selected interface belongs to. This allows it to exclude specific devices from a routing decision. Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-02uapi: Convert some uses of 6 to ETH_ALENJoe Perches6-8/+12
Use the #define where appropriate. Add #include <linux/if_ether.h> where appropriate too. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davemJohn W. Linville1-3/+3
2013-08-01nl80211/cfg80211: add channel switch commandSimon Wunderlich1-0/+30
To allow channel switch announcements within beacons, add the channel switch command to nl80211/cfg80211. This is implementation is intended for AP and (later) IBSS mode. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-07-31fib_rules: add .suppress operationStefan Tomanek1-1/+1
This change adds a new operation to the fib_rules_ops struct; it allows the suppression of routing decisions if certain criteria are not met by its results. The first implemented constraint is a minimum prefix length added to the structures of routing rules. If a rule is added with a minimum prefix length >0, only routes meeting this threshold will be considered. Any other (more general) routing table entries will be ignored. When configuring a system with multiple network uplinks and default routes, it is often convinient to reference the main routing table multiple times - but omitting the default route. Using this patch and a modified "ip" utility, this can be achieved by using the following command sequence: $ ip route add table secuplink default via 10.42.23.1 $ ip rule add pref 100 table main prefixlength 1 $ ip rule add pref 150 fwmark 0xA table secuplink With this setup, packets marked 0xA will be processed by the additional routing table "secuplink", but only if no suitable route in the main routing table can be found. By using a minimal prefixlength of 1, the default route (/0) of the table "main" is hidden to packets processed by rule 100; packets traveling to destinations with more specific routing entries are processed as usual. Signed-off-by: Stefan Tomanek <stefan.tomanek@wertarbyte.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30rtnl: export physical port id via RT netlinkJiri Pirko1-0/+1
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Narendra K <narendra_k@dell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOADSamuel Ortiz1-3/+3
Loading a firmware into a target is typically called firmware download, not firmware upload. So we rename the netlink API to NFC_CMD_FW_DOWNLOAD in order to avoid any terminology confusion from userspace. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2013-07-29Merge tag 'firewire-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394Linus Torvalds1-2/+2
Pull firewire regression fix from Stefan Richter: "This fixes corrupted video capture, seen with IIDC/DCAM video and certain buffer settings. (Regression since v3.4 inclusive.)" * tag 'firewire-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: firewire: fix libdc1394/FlyCap2 iso event regression
2013-07-27virtio-net: put virtio net header inline with dataMichael S. Tsirkin1-1/+3
For small packets we can simplify xmit processing by linearizing buffers with the header: most packets seem to have enough head room we can use for this purpose. Since existing hypervisors require that header is the first s/g element, we need a feature bit for this. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27firewire: fix libdc1394/FlyCap2 iso event regressionClemens Ladisch1-2/+2
Commit 18d627113b83 (firewire: prevent dropping of completed iso packet header data) was intended to be an obvious bug fix, but libdc1394 and FlyCap2 depend on the old behaviour by ignoring all returned information and thus not noticing that not all packets have been received yet. The result was that the video frame buffers would be saved before they contained the correct data. Reintroduce the old behaviour for old clients. Tested-by: Stepan Salenikovich <stepan.salenikovich@gmail.com> Tested-by: Josep Bosch <jep250@gmail.com> Cc: <stable@vger.kernel.org> # 3.4+ Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
2013-07-26Merge tag 'usb-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usbLinus Torvalds1-0/+11
Pull USB fixes from Greg KH: "Here are a number of USB fixes for 3.11-rc3. Lots of little things, nothing major. A number of new device ids, build fixes for DMA, and a bunch of other minor things. All of these have been in the linux-next tree" * tag 'usb-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (40 commits) usb: Clear both buffers when clearing a control transfer TT buffer. usb/gadget: free opts struct on error recovery USB: mos7840: fix memory leak in open usb: serial: option.c: remove ONDA MT825UP product ID fromdriver usb: serial: option: add Olivetti Olicard 200 usb: serial: option: blacklist ONDA MT689DC QMI interface xhci: fix null pointer dereference on ring_doorbell_for_active_rings usb: host: xhci: Enable XHCI_SPURIOUS_SUCCESS for all controllers with xhci 1.0 usb: fix build warning in pci-quirks.h when CONFIG_PCI is not enabled usb: xhci: Mark two functions __maybe_unused xhci: Avoid NULL pointer deref when host dies. usb: serial: option: Add ONYX 3G device support USB: ti_usb_3410_5052: fix dynamic-id matching usb: option: add TP-LINK MA260 USB: option: add D-Link DWM-152/C1 and DWM-156/C1 USB: EHCI: Fix resume signalling on remote wakeup USB: cp210x: add MMB and PI ZigBee USB Device Support usb: cp210x support SEL C662 Vendor/Device USB: option: append Petatel NP10T device to GSM modems list USB: misc: Add Manhattan Hi-Speed USB DVI Converter to sisusbvga ...
2013-07-24tcp: TCP_NOTSENT_LOWAT socket optionEric Dumazet1-0/+1
Idea of this patch is to add optional limitation of number of unsent bytes in TCP sockets, to reduce usage of kernel memory. TCP receiver might announce a big window, and TCP sender autotuning might allow a large amount of bytes in write queue, but this has little performance impact if a large part of this buffering is wasted : Write queue needs to be large only to deal with large BDP, not necessarily to cope with scheduling delays (incoming ACKS make room for the application to queue more bytes) For most workloads, using a value of 128 KB or less is OK to give applications enough time to react to POLLOUT events in time (or being awaken in a blocking sendmsg()) This patch adds two ways to set the limit : 1) Per socket option TCP_NOTSENT_LOWAT 2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets not using TCP_NOTSENT_LOWAT socket option (or setting a zero value) Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect. This changes poll()/select()/epoll() to report POLLOUT only if number of unsent bytes is below tp->nosent_lowat Note this might increase number of sendmsg()/sendfile() calls when using non blocking sockets, and increase number of context switches for blocking sockets. Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is defined as : Specify the minimum number of bytes in the buffer until the socket layer will pass the data to the protocol) Tested: netperf sessions, and watching /proc/net/protocols "memory" column for TCP With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458) lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y Using 128KB has no bad effect on the throughput or cpu usage of a single flow, although there is an increase of context switches. A bonus is that we hold socket lock for a shorter amount of time and should improve latencies of ACK processing. lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 412,514 context-switches 200.034645535 seconds time elapsed lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 2,675,818 context-switches 200.029651391 seconds time elapsed Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-By: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24net: sctp: trivial: update mailing list addressDaniel Borkmann1-1/+1
The SCTP mailing list address to send patches or questions to is linux-sctp@vger.kernel.org and not lksctp-developers@lists.sourceforge.net anymore. Therefore, update all occurences. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22tun: Support software transmit time stamping.Richard Cochran1-0/+3
This patch adds transmit time stamping to the tun/tap driver. Similar support already exists for UDP, can, and raw packets. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-20allow O_TMPFILE to work with O_WRONLYAl Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-16USB: move the definition of USB_MAXCHILDRENAlan Stern1-0/+11
The USB_MAXCHILDREN symbol is used in include/uapi/linux/usb/ch11.h, a user-mode header, even though it is defined in include/linux/usb.h, which is kernel-only. This causes compile-time errors when user programs try to #include linux/usb/ch11.h. This patch fixes the problem by moving the definition of USB_MAXCHILDREN into ch11.h. It also gets rid of unneeded parentheses. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-16nl80211: add scan width to bss and scan request structsSimon Wunderlich1-0/+18
To allow scanning and working with 5 MHz and 10 MHz BSS, extend the inform bss commands and add wrappers to take 5 and 10 MHz bss into account. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2013-07-16cfg80211/nl80211: Add packet coalesce supportAmitkumar Karwar1-2/+88
In most cases, host that receives IPv4 and IPv6 multicast/broadcast packets does not do anything with these packets. Therefore the reception of these unwanted packets causes unnecessary processing and power consumption. Packet coalesce feature helps to reduce number of received interrupts to host by buffering these packets in firmware/hardware for some predefined time. Received interrupt will be generated when one of the following events occur. a) Expiration of hardware timer whose expiration time is set to maximum coalescing delay of matching coalesce rule. b) Coalescing buffer in hardware reaches it's limit. c) Packet doesn't match any of the configured coalesce rules. This patch adds set/get configuration support for packet coalesce. User needs to configure following parameters for creating a coalesce rule. a) Maximum coalescing delay b) List of packet patterns which needs to be matched c) Condition for coalescence. pattern 'match' or 'no match' Multiple such rules can be created. This feature needs to be advertised during driver initialization. Drivers are supposed to do required firmware/hardware settings based on user configuration. Signed-off-by: Amitkumar Karwar <akarwar@marvell.com> Signed-off-by: Bing Zhao <bzhao@marvell.com> [fix kernel-doc, change free function, fix copy/paste error] Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2013-07-16cfg80211/nl80211: rename packet pattern related structures and enumsAmitkumar Karwar1-18/+27
Currently packet patterns and it's enum/structures are used only for WoWLAN feature. As we intend to reuse them for new feature packet coalesce, they are renamed in this patch. Older names are kept for backward compatibility purpose. Signed-off-by: Amitkumar Karwar <akarwar@marvell.com> Signed-off-by: Bing Zhao <bzhao@marvell.com> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2013-07-14Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+6
Pull more vfs stuff from Al Viro: "O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including making simple_lookup() usable for filesystems with non-NULL s_d_op, which allows us to get rid of quite a bit of ugliness" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: sunrpc: now we can just set ->s_d_op cgroup: we can use simple_lookup() now efivarfs: we can use simple_lookup() now make simple_lookup() usable for filesystems that set ->s_d_op configfs: don't open-code d_alloc_name() __rpc_lookup_create_exclusive: pass string instead of qstr rpc_create_*_dir: don't bother with qstr llist: llist_add() can use llist_add_batch() llist: fix/simplify llist_add() and llist_add_batch() fput: turn "list_head delayed_fput_list" into llist_head fs/file_table.c:fput(): add comment Safer ABI for O_TMPFILE