aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv6 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2012-06-25tcp: heed result of security_inet_conn_request() in tcp_v6_conn_request()Neal Cardwell1-1/+2
If security_inet_conn_request() returns non-zero then TCP/IPv6 should drop the request, just as in TCP/IPv4 and DCCP in both IPv4 and IPv6. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-25ipv6: fib: fix fib dump restartEric Dumazet1-2/+2
Commit 2bec5a369ee79576a3 (ipv6: fib: fix crash when changing large fib while dumping it) introduced ability to restart the dump at tree root, but failed to skip correctly a count of already dumped entries. Code didn't match Patrick intent. We must skip exactly the number of already dumped entries. Note that like other /proc/net files or netlink producers, we could still dump some duplicates entries. Reported-by: Debabrata Banerjee <dbavatar@gmail.com> Reported-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-18ipv6: Move ipv6 proc file registration to end of init orderThomas Graf1-10/+31
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc handler is installed in ip6_route_net_init() whereas fib_table_hash is allocated in fib6_net_init() _after_ the proc handler has been installed. This opens up a short time frame to access fib_table_hash with its pants down. Move the registration of the proc files to a later point in the init order to avoid the race. Tested :-) Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-16Revert "ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_route"David S. Miller2-22/+12
This reverts commit 2a0c451ade8e1783c5d453948289e4a978d417c9. It causes crashes, because now ip6_null_entry is used before it is initialized. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-15ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_routeThomas Graf2-12/+22
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc handler is installed in ip6_route_net_init() whereas fib_table_hash is allocated in fib6_net_init() _after_ the proc handler has been installed. This opens up a short time frame to access fib_table_hash with its pants down. fib6_init() as a whole can't be moved to an earlier position as it also registers the rtnetlink message handlers which should be registered at the end. Therefore split it into fib6_init() which is run early and fib6_init_late() to register the rtnetlink message handlers. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-07snmp: fix OutOctets counter to include forwarded datagramsVincent Bernat2-0/+3
RFC 4293 defines ipIfStatsOutOctets (similar definition for ipSystemStatsOutOctets): The total number of octets in IP datagrams delivered to the lower layers for transmission. Octets from datagrams counted in ipIfStatsOutTransmits MUST be counted here. And ipIfStatsOutTransmits: The total number of IP datagrams that this entity supplied to the lower layers for transmission. This includes datagrams generated locally and those forwarded by this entity. Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing IPSTATS_MIB_OUTFORWDATAGRAMS. IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not include forwarded datagrams: The total number of IP datagrams that local IP user-protocols (including ICMP) supplied to IP in requests for transmission. Note that this counter does not include any datagrams counted in ipIfStatsOutForwDatagrams. Signed-off-by: Vincent Bernat <bernat@luffy.cx> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-07ipv6: fib: Restore NTF_ROUTER exception in fib6_age()Thomas Graf1-1/+1
Commit 5339ab8b1dd82 (ipv6: fib: Convert fib6_age() to dst_neigh_lookup().) seems to have mistakenly inverted the exception for cached NTF_ROUTER routes. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-01tcp: reflect SYN queue_mapping into SYNACK packetsEric Dumazet1-3/+6
While testing how linux behaves on SYNFLOOD attack on multiqueue device (ixgbe), I found that SYNACK messages were dropped at Qdisc level because we send them all on a single queue. Obvious choice is to reflect incoming SYN packet @queue_mapping to SYNACK packet. Under stress, my machine could only send 25.000 SYNACK per second (for 200.000 incoming SYN per second). NIC : ixgbe with 16 rx/tx queues. After patch, not a single SYNACK is dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Schillstrom <hans.schillstrom@ericsson.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-27ipv6: fix incorrect ipsec fragmentGao feng1-18/+50
Since commit ad0081e43a "ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed" the fragment of packets is incorrect. because tunnel mode needs IPsec headers and trailer for all fragments, while on transport mode it is sufficient to add the headers to the first fragment and the trailer to the last. so modify mtu and maxfraglen base on ipsec mode and if fragment is first or last. with my test,it work well(every fragment's size is the mtu) and does not trigger slow fragment path. Changes from v1: though optimization, mtu_prev and maxfraglen_prev can be delete. replace xfrm mode codes with dst_entry's new frag DST_XFRM_TUNNEL. add fuction ip6_append_data_mtu to make codes clearer. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-27xfrm: take net hdr len into account for esp payload size calculationBenjamin Poirier1-11/+7
Corrects the function that determines the esp payload size. The calculations done in esp{4,6}_get_mtu() lead to overlength frames in transport mode for certain mtu values and suboptimal frames for others. According to what is done, mainly in esp{,6}_output() and tcp_mtu_to_mss(), net_header_len must be taken into account before doing the alignment calculation. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-20ipv6/exthdrs: strict Pad1 and PadN checkEldad Zack1-1/+14
The following tightens the padding check from commit c1412fce7eccae62b4de22494f6ab3ff8a90c0c6 : * Take into account combinations of consecutive Pad1 and PadN. * Catch the corner case of when only padding is present in the header, when the extention header length is 0 (i.e., 8 bytes). In this case, the header would have exactly 6 bytes of padding: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : Next Header : Hdr Ext Len=0 : : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + : Padding (Pad1 or PadN) : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19ipv6: use skb coalescing in reassemblyEric Dumazet1-6/+20
ip6_frag_reasm() can use skb_try_coalesce() to build optimized skb, reducing memory used by them (truesize), and reducing number of cache line misses and overhead for the consumer. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19net:ipv6:fixed space issues relating to operators.Jeffrin Jose1-2/+2
Fixed space issues relating to operators found by checkpatch.pl tool in net/ipv6/udp.c Signed-off-by: Jeffrin Jose <ahiliation@yahoo.co.in> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19net:ipv6:fixed a trailing white space issue.Jeffrin Jose1-1/+1
Fixed a trailing white space issue found by checkpatch.pl tool in net/ipv6/udp.c Signed-off-by: Jeffrin Jose <ahiliation@yahoo.co.in> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19ipv6: disable GSO on sockets hitting dst_allfragEric Dumazet1-1/+4
If the allfrag feature has been set on a host route (due to an ICMPv6 Packet Too Big received indicating a MTU of less than 1280), we hit a very slow behavior in TCP stack, because all big packets are dropped and only a retransmit timer is able to push one MSS frame every 200 ms. One way to handle this is to disable GSO on the socket the first time a super packet is dropped. Adding a specific dst_allfrag() in the fast path is probably overkill since the dst_allfrag() case almost never happen. Result on netperf TCP_STREAM, one flow : Before : 60 kbit/sec After : 1.6 Gbit/sec Reported-by: Tore Anderson <tore@fud.no> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19ipv6: bool/const conversions phase2Eric Dumazet13-118/+119
Mostly bool conversions, some inline removals and const additions. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-18ipv6: ip6_fragment() should check CHECKSUM_PARTIALEric Dumazet1-0/+4
Quoting Tore Anderson from : If the allfrag feature has been set on a host route (due to an ICMPv6 Packet Too Big received indicating a MTU of less than 1280), TCP SYN/ACK packets to that destination appears to get an incorrect TCP checksum. This in turn means they are thrown away as invalid. In the case of an IPv4 client behind a link with a MTU of less than 1260, accessing an IPv6 server through a stateless translator, this means that the client can only download a single large file from the server, because once it is in the server's routing cache with the allfrag feature set, new TCP connections can no longer be established. </endquote> It appears ip6_fragment() doesn't handle CHECKSUM_PARTIAL properly. As network drivers are not prepared to fetch correct transport header, a safe fix is to call skb_checksum_help() before fragmenting packet. Reported-by: Tore Anderson <tore@fud.no> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-18ipv6: remove csummode in ip6_append_data()Eric Dumazet1-3/+1
csummode variable is always CHECKSUM_NONE in ip6_append_data() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-18ipv6: bool conversions phase1Eric Dumazet2-6/+6
ipv6_opt_accepted() returns a bool, and can use const pointers ipv6_addr_equal(), ipv6_addr_any(), ipv6_addr_loopback(), ipv6_addr_orchid() return a bool. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-18ip_frag: struct inet_frags match() method returns a boolEric Dumazet1-4/+5
- match() method returns a boolean - return (A && B && C && D) -> return A && B && C && D - fix indentation Signed-off-by: Eric Dumazet <edumazet@google.com>
2012-05-17ipv6: correct the ipv6 option name - Pad0 to Pad1Eldad Zack3-5/+5
The padding destination or hop-by-hop option is called Pad1 and not Pad0. See RFC2460 (4.2) or the IANA ipv6-parameters registry: http://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xml Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-17tcp: bool conversionsEric Dumazet1-2/+2
bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-17net: ipv6: ndisc: Neaten ND_PRINTx macrosJoe Perches1-125/+91
Why use several macros when one will do? Convert the multiple ND_PRINTKx macros to a single ND_PRINTK macro. Use the new net_<level>_ratelimited mechanism too. Add pr_fmt with "ICMPv6: " as prefix. Remove embedded ICMPv6 prefixes from messages. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-16Merge branch 'delete-tokenring' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linuxDavid S. Miller2-19/+0
2012-05-16net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debugJoe Perches5-16/+16
Use the current debugging style and enable dynamic_debug. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-16net: ipv6: Standardize prefixes for message loggingJoe Perches16-120/+116
Add #define pr_fmt(fmt) as appropriate. Add "IPv6: " to appropriate files. Convert printk(KERN_<LEVEL> to pr_<level> (but not KERN_DEBUG). Standardize on "%s: " not "%s(): " when emitting __func__. Use "%s: ", __func__ instead of embedding function name. Coalesce formats, align arguments. ADDRCONF output is now prefixed with "IPv6: " Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15net: delete all instances of special processing for token ringPaul Gortmaker2-19/+0
We are going to delete the Token ring support. This removes any special processing in the core networking for token ring, (aside from net/tr.c itself), leaving the drivers and remaining tokenring support present but inert. The mass removal of the drivers and net/tr.c will be in a separate commit, so that the history of these files that we still care about won't have the giant deletion tied into their history. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-05-15net: Convert net_ratelimit uses to net_<level>_ratelimitedJoe Perches14-75/+41
Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15xfrm: make xfrm_algo.c a moduleJan Beulich1-2/+2
By making this a standalone config option (auto-selected as needed), selecting CRYPTO from here rather than from XFRM (which is boolean) allows the core crypto code to become a module again even when XFRM=y. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-11net/ipv6/af_inet6.c: checkpatch cleanupEldad Zack1-18/+11
af_inet6.c:80: ERROR: do not initialise statics to 0 or NULL af_inet6.c:259: ERROR: spaces required around that '=' (ctx:VxV) af_inet6.c:394: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:412: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:422: ERROR: do not use assignment in if condition af_inet6.c:425: ERROR: do not use assignment in if condition af_inet6.c:433: ERROR: do not use assignment in if condition af_inet6.c:437: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:446: ERROR: spaces required around that '=' (ctx:VxV) af_inet6.c:478: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:485: ERROR: that open brace { should be on the previous line af_inet6.c:485: ERROR: space required before the open parenthesis '(' af_inet6.c:513: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:629: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:647: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:687: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:709: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable af_inet6.c:1073: ERROR: space required before the open parenthesis '(' Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-106lowpan: IPv6 link local addressalex.bluesman.smirnov@gmail.com1-1/+13
According to the RFC4944 (Transmission of IPv6 Packets over IEEE 802.15.4 Networks), chapter 7: The IPv6 link-local address [RFC4291] for an IEEE 802.15.4 interface is formed by appending the Interface Identifier, as defined above, to the prefix FE80::/64. 10 bits 54 bits 64 bits +----------+-----------------------+----------------------------+ |1111111010| (zeros) | Interface Identifier | +----------+-----------------------+----------------------------+ This patch adds IPv6 address generation support for the 6lowpan interfaces. Signed-off-by: Alexander Smirnov <alex.bluesman.smirnov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-09netfilter: ip6_tables: add flags parameter to ipv6_find_hdr()Hans Schillstrom5-13/+39
This patch adds the flags parameter to ipv6_find_hdr. This flags allows us to: * know if this is a fragment. * stop at the AH header, so the information contained in that header can be used for some specific packet handling. This patch also adds the offset parameter for inspection of one inner IPv6 header that is contained in error messages. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08netfilter: remove ip_queue supportPablo Neira Ayuso3-664/+0
This patch removes ip_queue support which was marked as obsolete years ago. The nfnetlink_queue modules provides more advanced user-space packet queueing mechanism. This patch also removes capability code included in SELinux that refers to ip_queue. Otherwise, we break compilation. Several warning has been sent regarding this to the mailing list in the past month without anyone rising the hand to stop this with some strong argument. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-04tcp: be more strict before accepting ECN negociationEric Dumazet1-1/+1
It appears some networks play bad games with the two bits reserved for ECN. This can trigger false congestion notifications and very slow transferts. Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can disable TCP ECN negociation if it happens we receive mangled CT bits in the SYN packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Perry Lorier <perryl@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Wilmer van der Gaast <wilmer@google.com> Cc: Ankur Jain <jankur@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Dave Täht <dave.taht@bufferbloat.net> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01ipv6: Export ipv6 functions for use by other protocolsChris Elston4-0/+9
For implementing other protocols on top of IPv6, such as L2TPv3's IP encapsulation over ipv6, we'd like to call some IPv6 functions which are not currently exported. This patch exports them. Signed-off-by: Chris Elston <celston@katalix.com> Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-28net/ipv6/udp: UDP encapsulation: introduce encap_rcv hook into IPv6Benjamin LaHaise1-0/+39
Now that the sematics of udpv6_queue_rcv_skb() match IPv4's udp_queue_rcv_skb(), introduce the UDP encap_rcv() hook for IPv6. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-28net/ipv6/udp: UDP encapsulation: move socket locking into udpv6_queue_rcv_skb()Benjamin LaHaise1-53/+44
In order to make sure that when the encap_rcv() hook is introduced it is not called with the socket lock held, move socket locking from callers into udpv6_queue_rcv_skb(), matching what happens in IPv4. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-28net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skbBenjamin LaHaise1-15/+27
This is the first step in reworking the IPv6 UDP code to be structured more like the IPv4 UDP code. This patch creates __udpv6_queue_rcv_skb() with the equivalent sematics to __udp_queue_rcv_skb(), and wires it up to the backlog_rcv method. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-27ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizingEric Dumazet1-0/+1
Quoting Tore Anderson from : https://bugzilla.kernel.org/show_bug.cgi?id=42572 When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment size does not take into account the size of the IPv6 Fragmentation header that needs to be included in outbound packets, causing every transmitted TCP segment to be fragmented across two IPv6 packets, the latter of which will only contain 8 bytes of actual payload. RTAX_FEATURE_ALLFRAG is typically set on a route in response to receving a ICMPv6 Packet Too Big message indicating a Path MTU of less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an IPv6 packet is sent to an IPv4 destination through a stateless translator. Any ICMPv4 Need To Fragment packets originated from the IPv4 part of the path will be translated to ICMPv6 PTB which may then indicate an MTU of less than 1280. The Linux kernel refuses to reduce the effective MTU to anything below 1280 bytes, instead it sets it to exactly 1280 bytes, and RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header), instead of 1232 (additionally taking into account the 8 bytes required by the IPv6 Fragmentation extension header). This in turn results in rather inefficient transmission, as every transmitted TCP segment now is split in two fragments containing 1232+8 bytes of payload. After this patch, all the outgoing packets that includes a Fragmentation header all are "atomic" or "non-fragmented" fragments, i.e., they both have Offset=0 and More Fragments=0. With help from David S. Miller Reported-by: Tore Anderson <tore@fud.no> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Tested-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-25ipv6: call consume_skb() in frag/reassemblyEric Dumazet2-3/+3
Some kfree_skb() calls should be replaced by consume_skb() to avoid drop_monitor/dropwatch false positives. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+4
Fix merge between commit 3adadc08cc1e ("net ax25: Reorder ax25_exit to remove races") and commit 0ca7a4c87d27 ("net ax25: Simplify and cleanup the ax25 sysctl handling") The former moved around the sysctl register/unregister calls, the later simply removed them. With help from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23tcp: sk_add_backlog() is too agressive for TCPEric Dumazet1-1/+2
While investigating TCP performance problems on 10Gb+ links, we found a tcp sender was dropping lot of incoming ACKS because of sk_rcvbuf limit in sk_add_backlog(), especially if receiver doesnt use GRO/LRO and sends one ACK every two MSS segments. A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default value (87380), allowing a too small backlog. A TCP ACK, even being small, can consume nearly same truesize space than outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and is fast to compute. Performance results on netperf, single flow, receiver with disabled GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop increments at sender. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23net: add a limit parameter to sk_add_backlog()Eric Dumazet2-5/+5
sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the memory limit. We need to make this limit a parameter for TCP use. No functional change expected in this patch, all callers still using the old sk_rcvbuf limit. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23tcp: Fix build warning after tcp_{v4,v6}_init_sock consolidation.David S. Miller1-2/+1
net/ipv4/tcp_ipv4.c: In function 'tcp_v4_init_sock': net/ipv4/tcp_ipv4.c:1891:19: warning: unused variable 'tp' [-Wunused-variable] net/ipv6/tcp_ipv6.c: In function 'tcp_v6_init_sock': net/ipv6/tcp_ipv6.c:1836:19: warning: unused variable 'tp' [-Wunused-variable] Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-22tcp: fix TCP_MAXSEG for established IPv6 passive socketsNeal Cardwell1-0/+4
Commit f5fff5d forgot to fix TCP_MAXSEG behavior IPv6 sockets, so IPv6 TCP server sockets that used TCP_MAXSEG would find that the advmss of child sockets would be incorrect. This commit mirrors the advmss logic from tcp_v4_syn_recv_sock in tcp_v6_syn_recv_sock. Eventually this logic should probably be shared between IPv4 and IPv6, but this at least fixes this issue. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock()Neal Cardwell1-49/+1
This commit moves the (substantial) common code shared between tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family independent function, tcp_init_sock(). Centralizing this functionality should help avoid drift issues, e.g. where the IPv4 side is updated without a corresponding update to IPv6. There was already some drift: IPv4 initialized snd_cwnd to TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to 2 (in this case it should not matter, since snd_cwnd is also initialized in tcp_init_metrics(), but the general risks and maintenance overhead remain). When diffing the old and new code, note that new tcp_init_sock() function uses the order of steps from the tcp_v4_init_sock() implementation (the order is slightly different in tcp_v6_init_sock()). Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21sock: Introduce named constants for sk_reusePavel Emelyanov1-1/+1
Name them in a "backward compatible" manner, i.e. reuse or not are still 1 and 0 respectively. The reuse value of 2 means that the socket with it will forcibly reuse everyone else's port. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20net: Delete all remaining instances of ctl_pathEric W. Biederman1-7/+0
We don't use struct ctl_path anymore so delete the exported constants. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20net: Convert all sysctl registrations to register_net_sysctlEric W. Biederman4-6/+6
This results in code with less boiler plate that is a bit easier to read. Additionally stops us from using compatibility code in the sysctl core, hastening the day when the compatibility code can be removed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-20net ipv6: Convert addrconf to use register_net_sysctlEric W. Biederman1-28/+4
Using an ascii path to register_net_sysctl as opposed to the slightly awkward ctl_path allows for much simpler code. We no longer need to malloc dev_name to keep it alive the length of our sysctl register instead we can use a small temporary buffer on the stack. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>