path: root/net/ipv4
Age | Commit message | Author | Files | Lines
2010-06-28 | tcp: tso_fragment() might avoid GFP_ATOMIC | Eric Dumazet | 1 | -3/+3
    We can pass a gfp argument to tso_fragment() and avoid GFP_ATOMIC allocations sometimes.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28 | net: use this_cpu_ptr() | Eric Dumazet | 2 | -2/+2
    Use this_cpu_ptr(p) instead of per_cpu_ptr(p, smp_processor_id()).

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-26 | syncookies: add support for ECN | Florian Westphal | 2 | -9/+12
    Allows use of ECN when syncookies are in effect by encoding ecn_ok into the syn-ack tcp timestamp.

    While at it, remove an unneeded #ifdef CONFIG_SYN_COOKIES. With CONFIG_SYN_COOKIES=n, want_cookie is ifdef'd to 0 and gcc removes the "if (0)".

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-26 | syncookies: do not store rcv_wscale in tcp timestamp | Florian Westphal | 1 | -21/+14
    As pointed out by Fernando Gont there is no need to encode rcv_wscale into the cookie. We did not use the restored rcv_wscale anyway; it is recomputed via tcp_select_initial_window().

    Thus we can save 4 bits in the ts option space by removing rcv_wscale. In case window scaling was not supported, we set the (invalid) wscale value 0xf.

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
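The two syncookie commits above describe packing connection options into the low bits of the SYN-ACK timestamp. The following userspace sketch shows one way such packing could look. The bit layout (4 bits of window scale, one SACK bit, one ECN bit), the macro names, and the helpers `ts_encode()` and `ts_decode()` are assumptions made for illustration and are not taken from the kernel source; only the use of 0xf as the "window scaling unsupported" marker comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the low timestamp bits (illustrative only). */
#define TS_OPT_WSCALE_MASK 0xfu       /* bits 0-3: window scale to use */
#define TS_OPT_SACK        (1u << 4)  /* bit 4: SACK permitted */
#define TS_OPT_ECN         (1u << 5)  /* bit 5: ECN negotiated */
#define TS_OPT_BITS        6
#define TS_OPT_MASK        ((1u << TS_OPT_BITS) - 1)
#define WSCALE_UNSUPPORTED 0xfu       /* invalid scale: no window scaling */

struct cookie_opts {
    bool    wscale_ok;
    uint8_t snd_wscale;   /* valid range 0..14 when wscale_ok is set */
    bool    sack_ok;
    bool    ecn_ok;
};

/* Replace the low TS_OPT_BITS bits of the clock value with option bits. */
uint32_t ts_encode(uint32_t now, const struct cookie_opts *o)
{
    uint32_t bits = o->wscale_ok ? (o->snd_wscale & TS_OPT_WSCALE_MASK)
                                 : WSCALE_UNSUPPORTED;

    if (o->sack_ok)
        bits |= TS_OPT_SACK;
    if (o->ecn_ok)
        bits |= TS_OPT_ECN;
    return (now & ~TS_OPT_MASK) | bits;
}

/* Recover the options from the timestamp echoed back by the peer. */
struct cookie_opts ts_decode(uint32_t tsecr)
{
    struct cookie_opts o;
    uint32_t ws = tsecr & TS_OPT_WSCALE_MASK;

    o.wscale_ok  = (ws != WSCALE_UNSUPPORTED);
    o.snd_wscale = o.wscale_ok ? (uint8_t)ws : 0;
    o.sack_ok    = (tsecr & TS_OPT_SACK) != 0;
    o.ecn_ok     = (tsecr & TS_OPT_ECN) != 0;
    return o;
}
```

Because the server keeps no state while cookies are in use, everything it needs to restore must survive the round trip through the peer's timestamp echo, which is why saving 4 bits matters.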
2010-06-25 | snmp: add align parameter to snmp_mib_init() | Eric Dumazet | 1 | -10/+17
    In preparation for 64bit snmp counters for some mibs, add an 'align' parameter to snmp_mib_init(), instead of assuming mibs only contain 'unsigned long' fields.

    Callers can use __alignof__(type) to provide correct alignment.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    CC: Herbert Xu <herbert@gondor.apana.org.au>
    CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
    CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
    CC: Vlad Yasevich <vladislav.yasevich@hp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25 | arp: RCU change in arp_solicit() | Eric Dumazet | 1 | -5/+7
    Avoid two atomic ops in arp_solicit()

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24 | tcp: do not send reset to already closed sockets | Konstantin Khorenko | 1 | -0/+4
    I've found that tcp_close() can be called for an already closed socket, but still sends a reset in this case (tcp_send_active_reset()), which seems to be incorrect. Moreover, the packet with the reset is sent with a different source port, as the original port number has already been cleared on the socket. Besides that, incrementing the stat counter for LINUX_MIB_TCPABORTONCLOSE also does not look correct in this case.

    Initially this issue was found on a 2.6.18-x RHEL5 kernel, but the same seems to be true for the current mainstream kernel (checked on 2.6.35-rc3). Please correct me if I missed something.

    How that happens:

    1) the server receives a packet for a socket in TCP_CLOSE_WAIT state that triggers a tcp_reset():

    Call Trace:
     <IRQ>
     [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8
     [<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08
     [<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a
     [<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43
     [<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259
     [<ffffffff80037131>] ip_local_deliver+0x200/0x2f4
     [<ffffffff8003843c>] ip_rcv+0x64c/0x69f
     [<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa
     [<ffffffff80032eca>] process_backlog+0x90/0xec
     [<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1
     [<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce
     [<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0
     [<ffffffff8006334c>] call_softirq+0x1c/0x28
     [<ffffffff80070476>] do_softirq+0x2c/0x85
     [<ffffffff80070441>] do_IRQ+0x149/0x152
     [<ffffffff80062665>] ret_from_intr+0x0/0xa
     <EOI>
     [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303
     [<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303
     [<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e
     [<ffffffff8006a263>] do_page_fault+0x49a/0x7dc
     [<ffffffff80066487>] thread_return+0x89/0x174
     [<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c
     [<ffffffff80062e39>] error_exit+0x0/0x84

    tcp_rcv_state_process()
    ...
    // (sk_state == TCP_CLOSE_WAIT here)
    ...
        /* step 2: check RST bit */
        if(th->rst) {
            tcp_reset(sk);
            goto discard;
        }
    ...

    ---------------------------------
    tcp_rcv_state_process
     tcp_reset
      tcp_done
       tcp_set_state(sk, TCP_CLOSE);
         inet_put_port
          __inet_put_port
           inet_sk(sk)->num = 0;

       sk->sk_shutdown = SHUTDOWN_MASK;

    2) After that the process (socket owner) tries to write something to that socket and "inet_autobind" sets a _new_ (which differs from the original!) port number for the socket:

    Call Trace:
     [<ffffffff80255a12>] inet_bind_hash+0x33/0x5f
     [<ffffffff80257180>] inet_csk_get_port+0x216/0x268
     [<ffffffff8026bcc9>] inet_autobind+0x22/0x8f
     [<ffffffff80049140>] inet_sendmsg+0x27/0x57
     [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
     [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
     [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
     [<ffffffff8001fb49>] __pollwait+0x0/0xdd
     [<ffffffff8008d533>] default_wake_function+0x0/0xe
     [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
     [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
     [<ffffffff80066538>] thread_return+0x13a/0x174
     [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
     [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
     [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
     [<ffffffff800622dd>] tracesys+0xd5/0xe0

    3) sendmsg fails at last with -EPIPE (=> 'write' returns -EPIPE in userspace):

    F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3

    Call Trace:
     [<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87
     [<ffffffff80033300>] release_sock+0x10/0xae
     [<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7
     [<ffffffff8026bd32>] inet_autobind+0x8b/0x8f
     [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
     [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
     [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
     [<ffffffff8001fb49>] __pollwait+0x0/0xdd
     [<ffffffff8008d533>] default_wake_function+0x0/0xe
     [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
     [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
     [<ffffffff80066538>] thread_return+0x13a/0x174
     [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
     [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
     [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
     [<ffffffff800622dd>] tracesys+0xd5/0xe0

    tcp_sendmsg()
    ...
        /* Wait for a connection to finish. */
        if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
            int old_state = sk->sk_state;
            if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) {
                if (f_d && (err == -EPIPE)) {
                    printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, "
                           "sk_err=%d, sk_shutdown=%d\n",
                           sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state,
                           sk->sk_err, sk->sk_shutdown);
                    dump_stack();
                }
                goto out_err;
            }
        }
    ...

    4) Then the process (socket owner) understands that it's time to close that socket and does that (and thus triggers sending a reset packet):

    Call Trace:
     ...
     [<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6
     [<ffffffff80034698>] ip_output+0x351/0x384
     [<ffffffff80251ae9>] dst_output+0x0/0xe
     [<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2
     [<ffffffff80095700>] vprintk+0x21/0x33
     [<ffffffff800070f0>] check_poison_obj+0x2e/0x206
     [<ffffffff80013587>] poison_obj+0x36/0x45
     [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
     [<ffffffff80023481>] dbg_redzone1+0x1c/0x25
     [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
     [<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8
     [<ffffffff80023405>] tcp_transmit_skb+0x764/0x786
     [<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d
     [<ffffffff80258ff1>] tcp_close+0x39a/0x960
     [<ffffffff8026be12>] inet_release+0x69/0x80
     [<ffffffff80059b31>] sock_release+0x4f/0xcf
     [<ffffffff80059d4c>] sock_close+0x2c/0x30
     [<ffffffff800133c9>] __fput+0xac/0x197
     [<ffffffff800252bc>] filp_close+0x59/0x61
     [<ffffffff8001eff6>] sys_close+0x85/0xc7
     [<ffffffff800622dd>] tracesys+0xd5/0xe0

    So, in brief:

    * a received packet for a socket in TCP_CLOSE_WAIT state triggers tcp_reset(), which clears inet_sk(sk)->num and puts the socket into TCP_CLOSE state
    * an attempt to write to that socket forces inet_autobind() to get a new port (but the write itself fails with -EPIPE)
    * tcp_close() called for a socket in TCP_CLOSE state sends an active reset via the socket with the newly allocated port

    This adds an additional check in tcp_close() for already closed sockets. We do not want to send anything to closed sockets.

    Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-23 | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 | David S. Miller | 1 | -3/+6
    Conflicts:
        net/ipv4/ip_output.c
2010-06-23 | net - IP_NODEFRAG option for IPv4 socket | Jiri Olsa | 3 | -1/+15
    This patch implements the IP_NODEFRAG option for IPv4 sockets. The reason is that there is no other way to send out a packet with a user-customized header for the reassembly part.

    Signed-off-by: Jiri Olsa <jolsa@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-21 | udp: Fix bogus UFO packet generation | Herbert Xu | 1 | -3/+6
    It has been reported that the new UFO software fallback path fails under certain conditions with NFS. I tracked the problem down to the generation of UFO packets that are smaller than the MTU. The software fallback path simply discards these packets.

    This patch fixes the problem by not generating such packets on the UFO path.

    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-16 | syncookies: check decoded options against sysctl settings | Florian Westphal | 1 | -6/+19
    Discard the ACK if we find options that do not match current sysctl settings.

    Previously it was possible to create a connection with sack, wscale, etc. enabled even if the feature was disabled via sysctl.

    Also remove an unneeded call to tcp_sack_reset() in cookie_check_timestamp: Both call sites (cookie_v4_check, cookie_v6_check) zero "struct tcp_options_received", hand it to tcp_parse_options() (which does not change tcp_opt->num_sacks/dsack) and then call cookie_check_timestamp().

    Even if num_sacks/dsacks were changed, the structure is allocated on the stack and after cookie_check_timestamp returns only a few selected members are copied to the inet_request_sock.

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
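The check described in this commit amounts to rejecting a cookie ACK whose decoded options claim a feature that is currently switched off. A minimal userspace sketch follows; the struct names and the function `cookie_opts_allowed()` are hypothetical, and the three booleans merely mirror the tcp_timestamps, tcp_sack and tcp_window_scaling sysctls rather than reading them.

```c
#include <stdbool.h>

/* Options recovered from the cookie (hypothetical struct for illustration). */
struct decoded_opts {
    bool tstamp_ok;
    bool sack_ok;
    bool wscale_ok;
};

/* Snapshot of the relevant sysctl switches (hypothetical struct). */
struct tcp_sysctls {
    bool tcp_timestamps;
    bool tcp_sack;
    bool tcp_window_scaling;
};

/* Return false when the ACK should be discarded because it asks for a
 * feature that the administrator has disabled. */
bool cookie_opts_allowed(const struct decoded_opts *o,
                         const struct tcp_sysctls *s)
{
    if (o->tstamp_ok && !s->tcp_timestamps)
        return false;
    if (o->sack_ok && !s->tcp_sack)
        return false;
    if (o->wscale_ok && !s->tcp_window_scaling)
        return false;
    return true;
}
```

The point of validating at ACK time is that the sysctl may have changed, or the peer may have forged the bits, between the SYN-ACK and the final ACK.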
2010-06-16 | inetpeer: restore small inet_peer structures | Eric Dumazet | 3 | -6/+10
    Addition of rcu_head to struct inet_peer added 16 bytes on 64bit arches.

    That's a bit unfortunate, since the old size was exactly 64 bytes.

    This can be solved using a union between this rcu_head and four fields that are normally used only when a refcount is taken on inet_peer. rcu_head is used only when refcnt=-1, right before structure freeing.

    Add an inet_peer_refcheck() function to check this assertion for a while.

    We can bring back the SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15 | inetpeer: do not use zero refcnt for freed entries | Eric Dumazet | 1 | -2/+8
    Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)

    Unused inet_peer entries have a null refcnt.

    Using atomic_inc_not_zero() in rcu lookups is not going to work for them, and slow path is taken.

    Fix this using -1 marker instead of 0 for deleted entries.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
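The -1 marker idea can be illustrated with C11 atomics in userspace. This is only a sketch under stated assumptions: the helper names `ref_get_unless_deleted()` and `ref_mark_deleted()` are hypothetical, and the first one plays roughly the role that the kernel helper atomic_add_unless(&refcnt, 1, -1) would play. An entry with refcnt 0 is merely unused and may still be picked up by a lockless reader; only -1 means it is being freed.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define REF_DELETED (-1)   /* marker: entry is being freed, do not touch */

/* Reader side: take a reference unless the entry is marked deleted.
 * A refcnt of 0 (unused but still valid) is accepted. */
bool ref_get_unless_deleted(atomic_int *refcnt)
{
    int old = atomic_load(refcnt);

    while (old != REF_DELETED) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Writer side: claim an unused entry (refcnt == 0) for deletion.
 * Fails if a reader grabbed a reference in the meantime. */
bool ref_mark_deleted(atomic_int *refcnt)
{
    int expected = 0;

    return atomic_compare_exchange_strong(refcnt, &expected, REF_DELETED);
}
```

With 0 as the deletion marker, a reader could not distinguish "unused" from "dying" and would have to fall back to the locked path for every unused entry.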
2010-06-15 | ipfrag : frag_kfree_skb() cleanup | Eric Dumazet | 1 | -6/+3
    Third param (work) is unused, remove it.

    Remove __inline__ and inline qualifiers.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15 | ip_frag: Remove some atomic ops | Eric Dumazet | 1 | -2/+1
    Instead of doing one atomic operation per frag, we can factorize them.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
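Factorizing atomic operations means summing locally and touching the shared counter once. The sketch below shows the pattern in userspace C11; the function `mem_uncharge_all()` and its parameters are hypothetical names for illustration, not the kernel code touched by this commit.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Release the memory accounted for n fragments.
 * Rather than one atomic subtraction per fragment, accumulate the
 * total in a plain local variable and issue a single atomic op. */
void mem_uncharge_all(atomic_int *mem, const int *truesize, size_t n)
{
    int sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += truesize[i];       /* cheap, no cache-line bouncing */

    atomic_fetch_sub(mem, sum);   /* one contended operation in total */
}
```

The saving comes from the shared cache line being written once per call instead of once per fragment.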
2010-06-15 | inetpeer: RCU conversion | Eric Dumazet | 1 | -69/+95
    inetpeer currently uses an AVL tree protected by an rwlock.

    It's possible to make most lookups use RCU:

    1) Add a struct rcu_head to struct inet_peer.

    2) Add a lookup_rcu_bh() helper to perform lockless and opportunistic lookup. This is a normal function, not a macro like lookup().

    3) Add a limit to the number of links followed by lookup_rcu_bh(). This is needed in case we fall into a loop.

    4) Add an smp_wmb() in link_to_pool() right before node insert.

    5) Make unlink_from_pool() use atomic_cmpxchg() to make sure it can take the last reference to an inet_peer, since lockless readers could increase refcount, even while we hold peers.lock.

    6) Delay struct inet_peer freeing until after an rcu grace period so that lookup_rcu_bh() cannot crash.

    7) inet_getpeer() first attempts a lockless lookup. Note this lookup can fail even if the target is in the AVL tree, because a concurrent writer can leave the tree in an inconsistent state. If this attempt fails, the lock is taken and a regular lookup is performed again.

    8) Convert peers.lock from rwlock to a spinlock.

    9) Remove SLAB_HWCACHE_ALIGN when peer_cachep is created, because rcu_head adds 16 bytes on 64bit arches, doubling effective size (64 -> 128 bytes). In a future patch, it is probably possible to revert this part, if the rcu field is put in a union to share space with rid, ip_id_count, tcp_ts & tcp_ts_stamp, these fields being manipulated only with refcnt > 0.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15 | Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 | David S. Miller | 7 | -62/+62
2010-06-15 | tcp: unify tcp flag macros | Changli Gao | 3 | -35/+34
    Unify tcp flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST, TCPHDR_PSH, TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR. TCPCB_FLAG_* are replaced with the corresponding TCPHDR_*.

    Signed-off-by: Changli Gao <xiaosuo@gmail.com>
    ----
     include/net/tcp.h                      |   24 ++++++-------
     net/ipv4/tcp.c                         |    8 ++--
     net/ipv4/tcp_input.c                   |    2 -
     net/ipv4/tcp_output.c                  |   59 ++++++++++++++++-----------------
     net/netfilter/nf_conntrack_proto_tcp.c |   32 ++++++-----------
     net/netfilter/xt_TCPMSS.c              |    4 --
     6 files changed, 58 insertions(+), 71 deletions(-)
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15 | Merge branch 'master' of /repos/git/net-next-2.6 | Patrick McHardy | 28 | -490/+601
    Conflicts:
        include/net/netfilter/xt_rateest.h
        net/bridge/br_netfilter.c
        net/netfilter/nf_conntrack_core.c

    Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-06-15 | netfilter: CLUSTERIP: RCU conversion | Eric Dumazet | 1 | -19/+29
    - clusterip_lock becomes a spinlock
    - lockless lookups
    - kfree() deferred after RCU grace period
    - rcu_barrier_bh() inserted in clusterip_tg_exit()

    v2)
    - As Patrick pointed out, we use atomic_inc_not_zero() in clusterip_config_find_get().
    - list_add_rcu() and list_del_rcu() variants are used.
    - atomic_dec_and_lock() used in clusterip_config_entry_put()

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-06-14 | inetpeer: various changes | Eric Dumazet | 1 | -38/+56
    Try to reduce cache line contentions in peer management, to reduce IP defragmentation overhead.

    - peer_fake_node is marked 'const' to make sure it's not modified. (tested with CONFIG_DEBUG_RODATA=y)
    - Group variables in two structures to reduce number of dirtied cache lines. One named "peers" for avl tree root, its number of entries, and associated lock. (candidate for RCU conversion)
    - A second one named "unused_peers" for unused list and its lock
    - Add a !list_empty() test in unlink_from_unused() to avoid taking lock when entry is not unused.
    - Use atomic_dec_and_lock() in inet_putpeer() to avoid taking lock in some cases.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-14 | netfilter: kill redundant check code in which setting ip_summed value | Shan Wei | 1 | -3/+1
    If the returned csum value is 0, we have already set ip_summed to CHECKSUM_UNNECESSARY in __skb_checksum_complete_head().

    So this patch kills the check and returns to the upper caller directly.

    Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
    Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-06-11 | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 | David S. Miller | 1 | -1/+3
2010-06-10 | net-next: remove useless union keyword | Changli Gao | 19 | -305/+305
    Remove useless union keyword in rtable, rt6_info and dn_route.

    Since there is only one member in a union, the union keyword isn't useful.

    Signed-off-by: Changli Gao <xiaosuo@gmail.com>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-10 | ip: ip_ra_control() rcu fix | Eric Dumazet | 1 | -4/+15
    commit 66018506e15b (ip: Router Alert RCU conversion) introduced RCU lookups to ip_call_ra_chain(). It missed a proper deinit phase: when ip_ra_control() deletes an ip_ra_chain, it should make sure ip_call_ra_chain() users can not start to use the socket during the rcu grace period. It should also delay the sock_put() until after the grace period, or we risk premature socket freeing and corruption, as raw sockets are not rcu protected yet.

    This delay avoids using expensive atomic_inc_not_zero() in ip_call_ra_chain().

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-09 | icmp: RCU conversion in icmp_address_reply() | Eric Dumazet | 1 | -7/+5
    - rcu_read_lock() already held by caller
    - use __in_dev_get_rcu() instead of in_dev_get() / in_dev_put()
    - remove goto out;

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-09 | netfilter: ip_queue: rwlock to spinlock conversion | Eric Dumazet | 1 | -32/+25
    Converts queue_lock rwlock to a spinlock.

    (readlocked part can be changed by reads of integer values)

    One atomic operation instead of four per ipq_enqueue_packet() call.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-06-08 | netfilter: nf_conntrack: IPS_UNTRACKED bit | Eric Dumazet | 2 | -2/+2
    NOTRACK makes all cpus share a cache line on nf_conntrack_untracked twice per packet. This is bad for performance. __read_mostly annotation is also a bad choice.

    This patch introduces IPS_UNTRACKED bit so that we can use later a per_cpu untrack structure more easily.

    A new helper, nf_ct_untracked_get() returns a pointer to nf_conntrack_untracked.

    Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add IPS_NAT_DONE_MASK bits to untracked status.

    nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-06-07 | net: avoid two atomic ops in ip_rcv_options() | Eric Dumazet | 1 | -4/+2
    in_dev_get() -> __in_dev_get_rcu() in a rcu protected function.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 | ipv4: avoid two atomic ops in ip_rt_redirect() | Eric Dumazet | 1 | -7/+3
    in_dev_get() -> __in_dev_get_rcu() in a rcu protected function.

    [ Fix build with CONFIG_IP_ROUTE_VERBOSE disabled. -DaveM ]

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 | igmp: avoid two atomic ops in igmp_rcv() | Eric Dumazet | 1 | -6/+4
    in_dev_get() -> __in_dev_get_rcu() in a rcu protected function.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 | ip: Router Alert RCU conversion | Eric Dumazet | 2 | -17/+17
    Straightforward conversion to RCU.

    One rwlock becomes a spinlock, and is static.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 | ipmr: dont corrupt lists | Eric Dumazet | 1 | -1/+3
    ipmr_rules_exit() and ip6mr_rules_exit() free a list of items, but forget to properly remove these items from the list. The list head is not changed and still points to freed memory.

    This can trigger a fault later when icmpv6_sk_exit() is called.

    The fix is to either reinit the list, or use list_del() to properly remove items from the list before freeing them.

    bugzilla report : https://bugzilla.kernel.org/show_bug.cgi?id=16120

    Introduced by commit d1db275dd3f6e4 (ipv6: ip6mr: support multiple tables) and commit f0ad0860d01e (ipv4: ipmr: support multiple tables)

    Reported-by: Alex Zhavnerchik <alex.vizor@gmail.com>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    CC: Patrick McHardy <kaber@trash.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
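The unlink-before-free rule from this commit can be shown with a small userspace model of a circular doubly linked list. Everything here is a hypothetical stand-in: `struct list_node`, the `list_*` helpers, `struct mr_table`, `table_add()` and `tables_destroy()` are written for this sketch and only imitate the shape of the kernel's list API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_node { struct list_node *next, *prev; };

void list_init(struct list_node *h) { h->next = h->prev = h; }

void list_add(struct list_node *n, struct list_node *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

void list_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
}

int list_empty(const struct list_node *h) { return h->next == h; }

/* Hypothetical table object; 'link' is the first member so a node
 * pointer can be converted back to the containing struct. */
struct mr_table { struct list_node link; int id; };

int table_add(struct list_node *head, int id)
{
    struct mr_table *t = malloc(sizeof(*t));

    if (!t)
        return -1;
    t->id = id;
    list_add(&t->link, head);
    return 0;
}

/* Tear down every table. Unlinking before free() keeps the head from
 * pointing at freed memory once this function returns. */
void tables_destroy(struct list_node *head)
{
    while (!list_empty(head)) {
        struct mr_table *t = (struct mr_table *)head->next;

        list_del(&t->link);   /* unlink first ... */
        free(t);              /* ... then release the memory */
    }
}
```

After `tables_destroy()` the head points back at itself, so any later code that walks the list sees it as empty instead of dereferencing freed entries.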
2010-06-07 | raw: avoid two atomics in xmit | Eric Dumazet | 1 | -3/+5
    Avoid two atomic ops per raw_send_hdrinc() call

    Avoid two atomic ops per raw6_send_hdrinc() call

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-07 | tcp: Fix slowness in read /proc/net/tcp | Tom Herbert | 1 | -8/+84
    This patch addresses a serious performance issue in reading the TCP sockets table (/proc/net/tcp).

    Reading the full table is done by a number of sequential read operations. At each read operation, a seek is done to find the last socket that was previously read. This seek operation requires that the sockets in the table be counted up to the current file position, and counting each of these requires taking a lock for each non-empty bucket. The whole algorithm is O(n^2).

    The fix is to cache the last bucket value, the offset within the bucket, and the file position returned by the last read operation. On the next sequential read, the bucket and offset are used to find the last read socket immediately, without needing to scan the previous buckets of the table. The algorithm to read the whole table is then O(n).

    The improvement offered by this patch is easily shown by cat'ing /proc/net/tcp on a machine with a lot of connections. With about 182K connections in the table, I see the following:

    - Without patch
    time cat /proc/net/tcp > /dev/null
    real 1m56.729s
    user 0m0.214s
    sys 1m56.344s

    - With patch
    time cat /proc/net/tcp > /dev/null
    real 0m0.894s
    user 0m0.290s
    sys 0m0.594s

    Signed-off-by: Tom Herbert <therbert@google.com>
    Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
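The caching scheme described in this commit can be modelled in userspace with a tiny hash table. The sketch below is an illustration under stated assumptions, not the kernel implementation: `struct sock_entry`, `struct table`, `struct iter`, `seek_from_start()` and `seek()` are hypothetical names, locking is left out entirely, and the `steps` counter exists only to make the difference in work visible.

```c
#include <stddef.h>

#define NBUCKETS 8

struct sock_entry { int id; struct sock_entry *next; };
struct table      { struct sock_entry *bucket[NBUCKETS]; };

/* Iterator state cached between sequential reads. */
struct iter {
    size_t bucket;        /* bucket of the last entry returned */
    size_t offset;        /* index of that entry within its bucket */
    size_t last_pos;      /* file position of the last entry returned */
    int    valid;         /* nonzero once the cache holds a position */
    unsigned long steps;  /* entries visited, for illustration only */
};

/* Slow path: count entries from the start of the table up to pos. */
struct sock_entry *seek_from_start(const struct table *t, struct iter *it,
                                   size_t pos)
{
    size_t seen = 0;

    for (size_t b = 0; b < NBUCKETS; b++) {
        size_t off = 0;

        for (struct sock_entry *e = t->bucket[b]; e; e = e->next, off++) {
            it->steps++;
            if (seen++ == pos) {
                it->bucket = b;
                it->offset = off;
                it->last_pos = pos;
                it->valid = 1;
                return e;
            }
        }
    }
    it->valid = 0;
    return NULL;
}

/* Fast path: for a sequential read (pos == last_pos + 1), resume from
 * the cached bucket and offset instead of rescanning earlier buckets. */
struct sock_entry *seek(const struct table *t, struct iter *it, size_t pos)
{
    if (!it->valid || pos != it->last_pos + 1)
        return seek_from_start(t, it, pos);

    size_t b = it->bucket;
    size_t off = it->offset + 1;

    while (b < NBUCKETS) {
        struct sock_entry *e = t->bucket[b];

        for (size_t i = 0; e && i < off; i++) {
            e = e->next;
            it->steps++;
        }
        if (e) {
            it->steps++;
            it->bucket = b;
            it->offset = off;
            it->last_pos = pos;
            return e;
        }
        b++;          /* bucket exhausted: continue with the next one */
        off = 0;
    }
    it->valid = 0;
    return NULL;
}
```

Reading positions 0, 1, 2, ... through `seek()` only walks within the current bucket on each call, whereas calling `seek_from_start()` for every position recounts all earlier entries, which is the quadratic behaviour the commit removes.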
2010-06-06 | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 | David S. Miller | 7 | -18/+17
    Conflicts:
        drivers/net/sfc/net_driver.h
        drivers/net/sfc/siena.c
2010-06-05 | syncookies: update mss tables | Florian Westphal | 1 | -19/+19
    - ipv6 msstab: account for ipv6 header size
    - ipv4 msstab: add mss for Jumbograms.

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-05 | syncookies: avoid unneeded tcp header flag double check | Florian Westphal | 2 | -2/+2
    caller: if (!th->rst && !th->syn && th->ack)
    callee: if (!th->ack)

    Make the caller only check for !syn (common for 3whs), and move the !rst / ack test to the callee.

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-05 | syncookies: make v4/v6 synflood warning behaviour the same | Florian Westphal | 1 | -11/+13
    Both syn_flood_warning functions print a message, but the ipv4 version only prints a warning if CONFIG_SYN_COOKIES=y.

    Make the v4 one behave like the v6 one.

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-04 | tcp: use correct net ns in cookie_v4_check() | Eric Dumazet | 1 | -1/+1
    It's better to make a route lookup in the appropriate namespace.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-04 | rps: tcp: fix rps_sock_flow_table table updates | Eric Dumazet | 1 | -3/+4
    I believe a moderate SYN flood attack can corrupt the RFS flow table (rps_sock_flow_table), making RPS/RFS much less effective.

    Even in a normal situation, servers handling short lived sessions suffer from bad steering for the first data packet of a session, if another SYN packet is received for another session.

    We do the following action in tcp_v4_rcv():

        sock_rps_save_rxhash(sk, skb->rxhash);

    We should _not_ do this if sk is a LISTEN socket, as nearly every packet received on a LISTEN socket has a different rxhash than the previous one.
     -> RPS_NO_CPU markers are spread all over rps_sock_flow_table.

    Also, it makes sense to protect sk->rxhash field changes with the socket lock. (We currently can change it even if a user thread owns the lock and might use rxhash.)

    This patch moves sock_rps_save_rxhash() to a sock locked section, and only for non LISTEN sockets.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-04 | syncookies: remove Kconfig text line about disabled-by-default | Florian Westphal | 1 | -5/+5
    syncookies default to on since e994b7c901ded7200b525a707c6da71f2cf6d4bb (tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTL).

    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-04 | netfilter: vmalloc_node cleanup | Eric Dumazet | 2 | -6/+5
    Using vmalloc_node(size, numa_node_id()) for temporary storage is not needed. vmalloc(size) is more respectful of user NUMA policy.

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-06-03 | arp: RCU changes | Eric Dumazet | 1 | -7/+4
    Avoid two atomic ops in arp_fwd_proxy()

    Avoid two atomic ops in arp_process()

    These are valid optimizations since arp_rcv() is run under rcu_read_lock()

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-03 | ipv4: RCU changes in __mkroute_input() | Eric Dumazet | 1 | -5/+3
    Avoid two atomic ops on output device refcount

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-03 | ipv4: RCU conversion of ip_route_input_slow/ip_route_input_mc | Eric Dumazet | 1 | -18/+17
    Avoid two atomic ops on struct in_device refcount per incoming packet, if slow path taken (or route cache disabled).

    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-03 | ipv4: add LINUX_MIB_IPRPFILTER snmp counter | Eric Dumazet | 4 | -15/+26
    Christoph Lameter mentioned that packets could be dropped in the input path because of rp_filter settings, without any SNMP counter being incremented. A system administrator can have a hard time tracking down the problem.

    This patch introduces a new counter, LINUX_MIB_IPRPFILTER, incremented each time we drop a packet because the Reverse Path Filter triggers. (We receive an IPv4 datagram on a given interface, and find that the route to send an answer would use another interface.)

    netstat -s | grep IPReversePathFilter
        IPReversePathFilter: 21714

    Reported-by: Christoph Lameter <cl@linux-foundation.org>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-02 | TCP: tcp_hybla: Fix integer overflow in slow start increment | Daniele Lacamera | 1 | -2/+2
    For large values of rtt, 2^rho operation may overflow u32. Clamp down the increment to 2^16.

    Signed-off-by: Daniele Lacamera <root@danielinux.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
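The clamp described here can be written as a bounded shift. The function below is a minimal sketch: the name `slow_start_increment()` and the constant `MAX_RHO_SHIFT` are hypothetical, and the real Hybla code combines this power of two with further terms that are omitted.

```c
#include <stdint.h>

#define MAX_RHO_SHIFT 16u   /* the increment is clamped to 2^16 */

/* Return 2^rho, clamped so that the shift count never approaches the
 * width of a 32-bit integer, where the result would overflow. */
uint32_t slow_start_increment(uint32_t rho)
{
    uint32_t shift = rho < MAX_RHO_SHIFT ? rho : MAX_RHO_SHIFT;

    return (uint32_t)1 << shift;
}
```

For example, a rho of 3 yields 8, while any rho of 16 or more yields 65536 instead of wrapping or invoking an undefined shift.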
2010-06-02 | ipconfig: send host-name in DHCP requests | Wu Fengguang | 1 | -0/+7
    Normally dhclient can be configured to send the "host-name" option in DHCP requests to update the client's DNS record. However for an NFSROOT system, dhclient shall never be called (which may change the IP addr and therefore lose your root NFS mount connection).

    So enable updating the DNS record with kernel parameter

        ip=::::$HOST_NAME::dhcp

    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-01 | net/ipv4/igmp.c: Remove unnecessary kmalloc casts | Joe Perches | 1 | -2/+1
    Signed-off-by: Joe Perches <joe@perches.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>