aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2009-12-03tcp: diag: Dont report negative values for rx queueEric Dumazet2-3/+11
Both netlink and /proc/net/tcp interfaces can report transient negative values for rx queue. ss -> State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB -6 6 127.0.0.1:45956 127.0.0.1:3333 netstat -> tcp 4294967290 6 127.0.0.1:37784 127.0.0.1:3333 ESTABLISHED This is because we dont lock socket while computing tp->rcv_nxt - tp->copied_seq, and another CPU can update copied_seq before rcv_next in RX path. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6David S. Miller15-107/+102
2009-12-03net: Batch inet_twsk_purgeEric W. Biederman2-8/+13
This function walks the whole hashtable so there is no point in passing it a network namespace. Instead I purge all timewait sockets from dead network namespaces that I find. If the namespace is one of the once I am trying to purge I am guaranteed no new timewait sockets can be formed so this will get them all. If the namespace is one I am not acting for it might form a few more but I will call inet_twsk_purge again and shortly to get rid of them. In any even if the network namespace is dead timewait sockets are useless. Move the calls of inet_twsk_purge into batch_exit routines so that if I am killing a bunch of namespaces at once I will just call inet_twsk_purge once and save a lot of redundant unnecessary work. My simple 4k network namespace exit test the cleanup time dropped from roughly 8.2s to 1.6s. While the time spent running inet_twsk_purge fell to about 2ms. 1ms for ipv4 and 1ms for ipv6. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03net: Use rcu lookups in inet_twsk_purge.Eric W. Biederman1-15/+24
While we are looking up entries to free there is no reason to take the lock in inet_twsk_purge. We have to drop locks and restart occassionally anyway so adding a few more in case we get on the wrong list because of a timewait move is no big deal. At the same time not taking the lock for long periods of time is much more polite to the rest of the users of the hash table. In my test configuration of killing 4k network namespaces this change causes 4k back to back runs of inet_twsk_purge on an empty hash table to go from roughly 20.7s to 3.3s, and the total time to destroy 4k network namespaces goes from roughly 44s to 3.3s. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03net: Allow fib_rule_unregister to batchEric W. Biederman1-9/+3
Refactor the code so fib_rules_register always takes a template instead of the actual fib_rules_ops structure that will be used. This is required for network namespace support so 2 out of the 3 callers already do this, it allows the error handling to be made common, and it allows fib_rules_unregister to free the template for hte caller. Modify fib_rules_unregister to use call_rcu instead of syncrhonize_rcu to allw multiple namespaces to be cleaned up in the same rcu grace period. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03ipv4 05/05: add sysctl to accept packets with local source addressesPatrick McHardy2-4/+8
commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8 Author: Patrick McHardy <kaber@trash.net> Date: Thu Dec 3 12:16:35 2009 +0100 ipv4: add sysctl to accept packets with local source addresses Change fib_validate_source() to accept packets with a local source address when the "accept_local" sysctl is set for the incoming inet device. Combined with the previous patches, this allows to communicate between multiple local interfaces over the wire. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03net 04/05: fib_rules: allow to delete local rulePatrick McHardy1-1/+1
commit d124356ce314fff22a047ea334379d5105b2d834 Author: Patrick McHardy <kaber@trash.net> Date: Thu Dec 3 12:16:35 2009 +0100 net: fib_rules: allow to delete local rule Allow to delete the local rule and recreate it with a higher priority. This can be used to force packets with a local destination out on the wire instead of routing them to loopback. Additionally this patch allows to recreate rules with a priority of 0. Combined with the previous patch to allow oif classification, a socket can be bound to the desired interface and packets routed to the wire like this: # move local rule to lower priority ip rule add pref 1000 lookup local ip rule del pref 0 # route packets of sockets bound to eth0 to the wire independant # of the destination address ip rule add pref 100 oif eth0 lookup 100 ip route add default dev eth0 table 100 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02tcp: sysctl_tcp_cookie_size needs to be exported to modules.David S. Miller1-0/+1
Otherwise: ERROR: "sysctl_tcp_cookie_size" [net/ipv6/ipv6.ko] undefined! make[1]: *** [__modpost] Error 1 Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02tcp: Fix warning on 64-bit.David S. Miller1-1/+1
net/ipv4/tcp_output.c: In function ‘tcp_make_synack’: net/ipv4/tcp_output.c:2488: warning: cast from pointer to integer of different size Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1g: Responder Cookie => InitiatorWilliam Allen Simpson5-39/+205
Parse incoming TCP_COOKIE option(s). Calculate <SYN,ACK> TCP_COOKIE option. Send optional <SYN,ACK> data. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1f: Initiator Cookie => Responder Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1f: Initiator Cookie => ResponderWilliam Allen Simpson1-30/+163
Calculate and format <SYN> TCP_COOKIE option. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 Requires: TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONSWilliam Allen Simpson1-2/+131
Provide per socket control of the TCP cookie option and SYN/SYNACK data. This is a straightforward re-implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 The principle difference is using a TCP option to carry the cookie nonce, instead of a user configured offset in the data. Allocations have been rearranged to avoid requiring GFP_ATOMIC. Requires: net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1d: define TCP cookie option, extend existing struct'sWilliam Allen Simpson2-8/+58
Data structures are carefully composed to require minimal additions. For example, the struct tcp_options_received cookie_plus variable fits between existing 16-bit and 8-bit variables, requiring no additional space (taking alignment into consideration). There are no additions to tcp_request_sock, and only 1 pointer in tcp_sock. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 The principle difference is using a TCP option to carry the cookie nonce, instead of a user configured offset in the data. This is more flexible and less subject to user configuration error. Such a cookie option has been suggested for many years, and is also useful without SYN data, allowing several related concepts to use the same extension option. "Re: SYN floods (was: does history repeat itself?)", September 9, 1996. http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html "Re: what a new TCP header might look like", May 12, 1998. ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail These functions will also be used in subsequent patches that implement additional features. Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONSWilliam Allen Simpson2-0/+11
Define sysctl (tcp_cookie_size) to turn on and off the cookie option default globally, instead of a compiled configuration option. Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant data values, retrieving variable cookie values, and other facilities. Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h, near its corresponding struct tcp_options_received (prior to changes). This is a straightforward re-implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 These functions will also be used in subsequent patches that implement additional features. Requires: net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1b: generate Responder Cookie secretWilliam Allen Simpson1-0/+140
Define (missing) hash message size for SHA1. Define hashing size constants specific to TCP cookies. Add new function: tcp_cookie_generator(). Maintain global secret values for tcp_cookie_generator(). This is a significantly revised implementation of earlier (15-year-old) Photuris [RFC-2522] code for the KA9Q cooperative multitasking platform. Linux RCU technique appears to be well-suited to this application, though neither of the circular queue items are freed. These functions will also be used in subsequent patches that implement additional features. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-02TCPCT part 1a: add request_values parameter for sending SYNACKWilliam Allen Simpson4-11/+14
Add optional function parameters associated with sending SYNACK. These parameters are not needed after sending SYNACK, and are not used for retransmission. Avoids extending struct tcp_request_sock, and avoids allocating kernel memory. Also affects DCCP as it uses common struct request_sock_ops, but this parameter is currently reserved for future use. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6David S. Miller3-2/+3
Conflicts: net/mac80211/ht.c
2009-12-01net: Simplify ipip pernet operations.Eric W. Biederman1-18/+6
Take advantage of the new pernet automatic storage management, and stop using compatibility network namespace functions. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01net: Simplify ip_gre pernet operations.Eric W. Biederman1-18/+6
Take advantage of the new pernet automatic storage management, and stop using compatibility network namespace functions. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCHEric W. Biederman2-1/+9
The motivation for an additional notifier in batched netdevice notification (rt_do_flush) only needs to be called once per batch not once per namespace. For further batching improvements I need a guarantee that the netdevices are unregistered in order allowing me to unregister an all of the network devices in a network namespace at the same time with the guarantee that the loopback device is really and truly unregistered last. Additionally it appears that we moved the route cache flush after the final synchronize_net, which seems wrong and there was no explanation. So I have restored the original location of the final synchronize_net. Cc: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-01ip_fragment: also adjust skb->truesize for packets not owned by a socketPatrick McHardy1-1/+1
When a large packet gets reassembled by ip_defrag(), the head skb accounts for all the fragments in skb->truesize. If this packet is refragmented again, skb->truesize is not re-adjusted to reflect only the head size since its not owned by a socket. If the head fragment then gets recycled and reused for another received fragment, it might exceed the defragmentation limits due to its large truesize value. skb_recycle_check() explicitly checks for linear skbs, so any recycled skb should reflect its true size in skb->truesize. Change ip_fragment() to also adjust the truesize value of skbs not owned by a socket. Reported-and-tested-by: Ben Menchaca <ben@bigfootnetworks.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-30tcp: tcp_disconnect() should clear window_clampEric Dumazet1-0/+1
NFS can reuse its TCP socket after calling tcp_disconnect(). We noticed window scaling was not negotiated in SYN packet of next connection request. Fix is to clear tp->window_clamp in tcp_disconnect(). Reported-by: Krzysztof Oledzki <ole@ans.pl> Tested-by: Krzysztof Oledzki <ole@ans.pl> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL ptr OOPSDavid Ford1-1/+1
ipv4 ip_frag_reasm(), fully replace 'dev_net(dev)' with 'net', defined previously patched into 2.6.29. Between 2.6.28.10 and 2.6.29, net/ipv4/ip_fragment.c was patched, changing from dev_net(dev) to container_of(...). Unfortunately the goto section (out_fail) on oversized packets inside ip_frag_reasm() didn't get touched up as well. Oversized IP packets cause a NULL pointer dereference and immediate hang. I discovered this running openvasd and my previous email on this is titled: NULL pointer dereference at 2.6.32-rc8:net/ipv4/ip_fragment.c:566 Signed-off-by: David Ford <david@blue-labs.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29net: Move && and || to end of previous lineJoe Perches2-6/+7
Not including net/atm/ Compiled tested x86 allyesconfig only Added a > 80 column line or two, which I ignored. Existing checkpatch plaints willfully, cheerfully ignored. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25xfrm: Use the user specified truncation length in ESP and AHMartin Willi2-2/+2
Instead of using the hardcoded truncation for authentication algorithms, use the truncation length specified on xfrm_state. Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25net: convert /proc/net/rt_acct to seq_fileAlexey Dobriyan1-35/+33
Rewrite statistics accumulation to be in terms of structure fields, not raw u32 additions. Keep them in same order, though. This is the last user of create_proc_read_entry() in net/, please NAK all new ones as well as all new ->write_proc, ->read_proc and create_proc_entry() users. Cc me if there are problems. :-) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-25net: use net_eq to compare netsOctavian Purdila9-14/+15
Generated with the following semantic patch @@ struct net *n1; struct net *n2; @@ - n1 == n2 + net_eq(n1, n2) @@ struct net *n1; struct net *n2; @@ - n1 != n2 + !net_eq(n1, n2) applied over {include,net,drivers/net}. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-23netfilter: net/ipv[46]/netfilter: Move && and || to end of previous lineJoe Perches13-91/+91
Compile tested only. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-11-23net/ipv4: Move && and || to end of previous lineJoe Perches12-72/+73
On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote: > It should be of the form: > if (x && > y) > > or: > if (x && y) > > Fix patches, rather than complaints, for existing cases where things > do not follow this pattern are certainly welcome. Also collapsed some multiple tabs to single space. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-21tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTLDavid S. Miller1-7/+1
That's extremely non-intuitive, noticed by William Allen Simpson. And let's make the default be on, it's been suggested by a lot of people so we'll give it a try. Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-18netns: net_identifiers should be read_mostlyEric Dumazet2-2/+2
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-18ipv4: factorize cache clearing for batched unregister operationsOctavian Purdila1-5/+6
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-17Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6David S. Miller2-8/+15
Conflicts: drivers/net/can/Kconfig
2009-11-13inetpeer: Optimize inet_getid()Eric Dumazet3-13/+10
While investigating for network latencies, I found inet_getid() was a contention point for some workloads, as inet_peer_idlock is shared by all inet_getid() users regardless of peers. One way to fix this is to make ip_id_count an atomic_t instead of __u16, and use atomic_add_return(). In order to keep sizeof(struct inet_peer) = 64 on 64bit arches tcp_ts_stamp is also converted to __u32 instead of "unsigned long". Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13ipv4: speedup inet_dump_ifaddr()Eric Dumazet1-23/+38
Stephen Hemminger a écrit : > On Thu, 12 Nov 2009 15:11:36 +0100 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> When handling large number of netdevices, inet_dump_ifaddr() >> is very slow because it has O(N^2) complexity. >> >> Instead of scanning one single list, we can use the NETDEV_HASHENTRIES >> sub lists of the dev_index hash table, and RCU lookups. >> >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > > You might be able to make RCU critical section smaller by moving > it into loop. > Indeed. But we dump at most one skb (<= 8192 bytes ?), so rcu_read_lock holding time is small, unless we meet many netdevices without addresses. I wonder if its really common... Thanks [PATCH net-next-2.6] ipv4: speedup inet_dump_ifaddr() When handling large number of netdevices, inet_dump_ifaddr() is very slow because it has O(N2) complexity. Instead of scanning one single list, we can use the NETDEV_HASHENTRIES sub lists of the dev_index hash table, and RCU lookups. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13igmp: Use next_net_device_rcu()Eric Dumazet1-16/+11
We need to use next_det_device_rcu() in RCU protected section. We also can avoid in_dev_get()/in_dev_put() overhead (code size mainly) in rcu_read_lock() sections. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13net: TCP_MSS_DEFAULT, TCP_MSS_DESIREDWilliam Allen Simpson3-6/+6
Define two symbols needed in both kernel and user space. Remove old (somewhat incorrect) kernel variant that wasn't used in most cases. Default should apply to both RMSS and SMSS (RFC2581). Replace numeric constants with defined symbols. Stand-alone patch, originally developed for TCPCT. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13ipmr: missing dev_put() on error path in vif_add()Dan Carpenter1-1/+3
The other error paths in front of this one have a dev_put() but this one got missed. Found by smatch static checker. Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Wang Chen <ellre923@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13tcp: provide more information on the tcp receive_queue bugsIlpo Järvinen1-7/+12
The addition of rcv_nxt allows to discern whether the skb was out of place or tp->copied. Also catch fancy combination of flags if necessary (sadly we might miss the actual causer flags as it might have already returned). Btw, we perhaps would want to forward copied_seq in somewhere or otherwise we might have some nice loop with WARN stuff within but where to do that safely I don't know at this stage until more is known (but it is not made significantly worse by this patch). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-10IPV4: use rcu to walk list of devices in IGMPstephen hemminger1-8/+10
This also needs to be optimized for large number of devices. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-10udp: bind() optimisationEric Dumazet1-8/+65
UDP bind() can be O(N^2) in some pathological cases. Thanks to secondary hash tables, we can make it O(N) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6David S. Miller1-15/+17
Conflicts: drivers/net/can/usb/ems_usb.c
2009-11-08udp: multicast RX should increment SNMP/sk_drops counter in allocation failuresEric Dumazet1-1/+11
When skb_clone() fails, we should increment sk_drops and SNMP counters. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08ipv4: udp: Optimise multicast receptionEric Dumazet1-26/+50
UDP multicast rx path is a bit complex and can hold a spinlock for a long time. Using a small (32 or 64 entries) stack of socket pointers can help to perform expensive operations (skb_clone(), udp_queue_rcv_skb()) outside of the lock, in most cases. It's also a base for a future RCU conversion of multicast recption. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08ipv4: udp: optimize unicast RX pathEric Dumazet1-3/+112
We first locate the (local port) hash chain head If few sockets are in this chain, we proceed with previous lookup algo. If too many sockets are listed, we take a look at the secondary (port, address) hash chain we added in previous patch. We choose the shortest chain and proceed with a RCU lookup on the elected chain. But, if we chose (port, address) chain, and fail to find a socket on given address, we must try another lookup on (port, INADDR_ANY) chain to find socket not bound to a particular IP. -> No extra cost for typical setups, where the first lookup will probabbly be performed. RCU lookups everywhere, we dont acquire spinlock. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08udp: secondary hash on (local port, local address)Eric Dumazet1-5/+26
Extends udp_table to contain a secondary hash table. socket anchor for this second hash is free, because UDP doesnt use skc_bind_node : We define an union to hold both skc_bind_node & a new hlist_nulls_node udp_portaddr_node udp_lib_get_port() inserts sockets into second hash chain (additional cost of one atomic op) udp_lib_unhash() deletes socket from second hash chain (additional cost of one atomic op) Note : No spinlock lockdep annotation is needed, because lock for the secondary hash chain is always get after lock for primary hash chain. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08udp: split sk_hash into two u16 hashesEric Dumazet1-6/+19
Union sk_hash with two u16 hashes for udp (no extra memory taken) One 16 bits hash on (local port) value (the previous udp 'hash') One 16 bits hash on (local address, local port) values, initialized but not yet used. This second hash is using jenkin hash for better distribution. Because the 'port' is xored later, a partial hash is performed on local address + net_hash_mix(net) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08udp: add a counter into udp_hslotEric Dumazet1-0/+3
Adds a counter in udp_hslot to keep an accurate count of sockets present in chain. This will permit to upcoming UDP lookup algo to chose the shortest chain when secondary hash is added. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-08net: Support specifying the network namespace upon device creation.Eric W. Biederman1-1/+1
There is no good reason to not support userspace specifying the network namespace during device creation, and it makes it easier to create a network device and pass it to a child network namespace with a well known name. We have to be careful to ensure that the target network namespace for the new device exists through the life of the call. To keep that logic clear I have factored out the network namespace grabbing logic into rtnl_link_get_net. In addtion we need to continue to pass the source network namespace to the rtnl_link_ops.newlink method so that we can find the base device source network namespace. Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2009-11-06ipip: Fix handling of DF packets when pmtudisc is OFFHerbert Xu1-15/+17
RFC 2003 requires the outer header to have DF set if DF is set on the inner header, even when PMTU discovery is off for the tunnel. Our implementation does exactly that. For this to work properly the IPIP gateway also needs to engate in PMTU when the inner DF bit is set. As otherwise the original host would not be able to carry out its PMTU successfully since part of the path is only visible to the gateway. Unfortunately when the tunnel PMTU discovery setting is off, we do not collect the necessary soft state, resulting in blackholes when the original host tries to perform PMTU discovery. This problem is not reproducible on the IPIP gateway itself as the inner packet usually has skb->local_df set. This is not correctly cleared (an unrelated bug) when the packet passes through the tunnel, which allows fragmentation to occur. For hosts behind the IPIP gateway it is readily visible with a simple ping. This patch fixes the problem by performing PMTU discovery for all packets with the inner DF bit set, regardless of the PMTU discovery setting on the tunnel itself. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>