aboutsummaryrefslogtreecommitdiffstats
path: root/net/core (follow)
AgeCommit message (Collapse)AuthorFilesLines
2010-10-11neigh: Protect neigh->ha[] with a seqlockEric Dumazet1-17/+30
Add a seqlock in struct neighbour to protect neigh->ha[], and avoid dirtying neighbour in stress situation (many different flows / dsts) Dirtying takes place because of read_lock(&n->lock) and n->used writes. Switching to a seqlock, and writing n->used only on jiffies changes permits less dirtying. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-11Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6David S. Miller1-3/+3
Conflicts: net/core/ethtool.c
2010-10-11net: clear heap allocations for privileged ethtool actionsKees Cook1-3/+3
Several other ethtool functions leave heap uncleared (potentially) by drivers. Some interfaces appear safe (eeprom, etc), in that the sizes are well controlled. In some situations (e.g. unchecked error conditions), the heap will remain unchanged in areas before copying back to userspace. Note that these are less of an issue since these all require CAP_NET_ADMIN. Cc: stable@kernel.org Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-11neigh: speedup neigh_hh_init()Eric Dumazet2-40/+63
When a new dst is used to send a frame, neigh_resolve_output() tries to associate an struct hh_cache to this dst, calling neigh_hh_init() with the neigh rwlock write locked. Most of the time, hh_cache is already known and linked into neighbour, so we find it and increment its refcount. This patch changes the logic so that we call neigh_hh_init() with neighbour lock read locked only, so that fast path can be run in parallel by concurrent cpus. This brings part of the speedup we got with commit c7d4426a98a5f (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones, removing one of the contention point that routers hit on multiqueue enabled machines. Further improvements would need to use a seqlock instead of an rwlock to protect neigh->ha[], to not dirty neigh too often and remove two atomic ops. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-08net: Fix rxq ref countingTom Herbert2-1/+2
The rx->count reference is used to track reference counts to the number of rx-queue kobjects created for the device. This patch eliminates initialization of the counter in netif_alloc_rx_queues and instead increments the counter each time a kobject is created. This is now symmetric with the decrement that is done when an object is released. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-08net: clear heap allocation for ETHTOOL_GRXCLSRLALLKees Cook1-1/+1
Calling ETHTOOL_GRXCLSRLALL with a large rule_cnt will allocate kernel heap without clearing it. For the one driver (niu) that implements it, it will leave the unused portion of heap unchanged and copy the full contents back to userspace. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-08net: Update kernel-doc for netif_set_real_num_rx_queues()Ben Hutchings1-2/+2
Synchronise the comment with the preceding implementation change. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-06net: netif_set_real_num_rx_queues may cap num_rx_queues at init timeJohn Fastabend1-2/+0
Do not set num_rx_queues in netif_set_real_num_rx_queues() some drivers will increase the real_num_rx_queues later due to a feature changes or available interrupts increasing. By setting num_rx_queues here this ends up creating a cap on the number of rx queues available. For example the ixgbe driver sets the max number of queues it intends to use ever then sets the current number in use with the netif_set_num_{rx|tx}_queues calls. With the current implementation the number of rx queues gets limited so when a feature such as DCB or FCoE is enabled the queues are no longer available. kobjects will only be allocated for real_num_rx_queues so the waste in memory is minimal. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-06neigh: RCU conversion of struct neighbourEric Dumazet1-52/+85
This is the second step for neighbour RCU conversion. (first was commit d6bf7817 : RCU conversion of neigh hash table) neigh_lookup() becomes lockless, but still take a reference on found neighbour. (no more read_lock()/read_unlock() on tbl->lock) struct neighbour gets an additional rcu_head field and is freed after an RCU grace period. Future work would need to eventually not take a reference on neighbour for temporary dst (DST_NOCACHE), but this would need dst->_neighbour to use a noref bit like we did for skb->_dst. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05fib: RCU conversion of fib_lookup()Eric Dumazet1-1/+2
fib_lookup() converted to be called in RCU protected context, no reference taken and released on a contended cache line (fib_clntref) fib_table_lookup() and fib_semantic_match() get an additional parameter. struct fib_info gets an rcu_head field, and is freed after an rcu grace period. Stress test : (Sending 160.000.000 UDP frames on same neighbour, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_HASH) (about same results for FIB_TRIE) Before patch : real 1m31.199s user 0m13.761s sys 23m24.780s After patch: real 1m5.375s user 0m14.997s sys 15m50.115s Before patch Profile : 13044.00 15.4% __ip_route_output_key vmlinux 8438.00 10.0% dst_destroy vmlinux 5983.00 7.1% fib_semantic_match vmlinux 5410.00 6.4% fib_rules_lookup vmlinux 4803.00 5.7% neigh_lookup vmlinux 4420.00 5.2% _raw_spin_lock vmlinux 3883.00 4.6% rt_set_nexthop vmlinux 3261.00 3.9% _raw_read_lock vmlinux 2794.00 3.3% fib_table_lookup vmlinux 2374.00 2.8% neigh_resolve_output vmlinux 2153.00 2.5% dst_alloc vmlinux 1502.00 1.8% _raw_read_lock_bh vmlinux 1484.00 1.8% kmem_cache_alloc vmlinux 1407.00 1.7% eth_header vmlinux 1406.00 1.7% ipv4_dst_destroy vmlinux 1298.00 1.5% __copy_from_user_ll vmlinux 1174.00 1.4% dev_queue_xmit vmlinux 1000.00 1.2% ip_output vmlinux After patch Profile : 13712.00 15.8% dst_destroy vmlinux 8548.00 9.9% __ip_route_output_key vmlinux 7017.00 8.1% neigh_lookup vmlinux 4554.00 5.3% fib_semantic_match vmlinux 4067.00 4.7% _raw_read_lock vmlinux 3491.00 4.0% dst_alloc vmlinux 3186.00 3.7% neigh_resolve_output vmlinux 3103.00 3.6% fib_table_lookup vmlinux 2098.00 2.4% _raw_read_lock_bh vmlinux 2081.00 2.4% kmem_cache_alloc vmlinux 2013.00 2.3% _raw_spin_lock vmlinux 1763.00 2.0% __copy_from_user_ll vmlinux 1763.00 2.0% ip_output vmlinux 1761.00 2.0% ipv4_dst_destroy vmlinux 1631.00 1.9% eth_header vmlinux 1440.00 1.7% _raw_read_unlock_bh vmlinux Reference results, if IP route cache is enabled : real 0m29.718s user 0m10.845s sys 7m37.341s 25213.00 29.5% __ip_route_output_key vmlinux 9011.00 10.5% dst_release vmlinux 4817.00 5.6% ip_push_pending_frames vmlinux 4232.00 5.0% ip_finish_output vmlinux 3940.00 4.6% udp_sendmsg vmlinux 3730.00 4.4% __copy_from_user_ll vmlinux 3716.00 4.4% ip_route_output_flow vmlinux 2451.00 2.9% __xfrm_lookup vmlinux 2221.00 2.6% ip_append_data vmlinux 1718.00 2.0% _raw_spin_lock_bh vmlinux 1655.00 1.9% __alloc_skb vmlinux 1572.00 1.8% sock_wfree vmlinux 1345.00 1.6% kfree vmlinux Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05net neigh: RCU conversion of neigh hash tableEric Dumazet1-82/+137
David This is the first step for RCU conversion of neigh code. Next patches will convert hash_buckets[] and "struct neighbour" to RCU protected objects. Thanks [PATCH net-next] net neigh: RCU conversion of neigh hash table Instead of storing hash_buckets, hash_mask and hash_rnd in "struct neigh_table", a new structure is defined : struct neigh_hash_table { struct neighbour **hash_buckets; unsigned int hash_mask; __u32 hash_rnd; struct rcu_head rcu; }; And "struct neigh_table" has an RCU protected pointer to such a neigh_hash_table. This means the signature of (*hash)() function changed: We need to add a third parameter with the actual hash_rnd value, since this is not anymore a neigh_table field. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05net neigh: neigh_delete() and neigh_add() changesEric Dumazet1-22/+17
neigh_delete() and neigh_add() dont need to touch device refcount, we hold RTNL when calling them, so device cannot disappear under us. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05net: add a core netdev->rx_dropped counterEric Dumazet1-8/+11
In various situations, a device provides a packet to our stack and we drop it before it enters protocol stack : - softnet backlog full (accounted in /proc/net/softnet_stat) - bad vlan tag (not accounted) - unknown/unregistered protocol (not accounted) We can handle a per-device counter of such dropped frames at core level, and automatically adds it to the device provided stats (rx_dropped), so that standard tools can be used (ifconfig, ip link, cat /proc/net/dev) This is a generalization of commit 8990f468a (net: rx_dropped accounting), thus reverting it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05fib: fib_rules_cleanup can be staticstephen hemminger1-2/+1
fib_rules_cleanup_ups is only defined and used in one place. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05net: dynamic ingress_queue allocationEric Dumazet1-8/+26
ingress being not used very much, and net_device->ingress_queue being quite a big object (128 or 256 bytes), use a dynamic allocation if needed (tc qdisc add dev eth0 ingress ...) dev_ingress_queue(dev) helper should be used only with RTNL taken. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-04Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6David S. Miller2-6/+7
Conflicts: net/ipv4/Kconfig net/ipv4/tcp_timer.c
2010-10-03net: introduce DST_NOCACHE flagEric Dumazet1-1/+3
While doing stress tests with IP route cache disabled, and multi queue devices, I noticed a very high contention on one rwlock used in neighbour code. When many cpus are trying to send frames (possibly using a high performance multiqueue device) to the same neighbour, they fight for the neigh->lock rwlock in order to call neigh_hh_init(), and fight on hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test()) But we dont need to call neigh_hh_init() for dst that are used only once. It costs four atomic operations at least, on two contended cache lines, plus the high contention on neigh->lock rwlock. Introduce a new dst flag, DST_NOCACHE, that is set when dst was not inserted in route cache. With the stress test bench, sending 160000000 frames on one neighbour, results are : Before patch: real 2m28.406s user 0m11.781s sys 36m17.964s After patch: real 1m26.532s user 0m12.185s sys 20m3.903s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-03net: Fix the condition passed to sk_wait_event()Nagendra Tomar1-4/+4
This patch fixes the condition (3rd arg) passed to sk_wait_event() in sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory() causes the following soft lockup in tcp_sendmsg() when the global tcp memory pool has exhausted. >>> snip <<< localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429] localhost kernel: CPU 3: localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200] [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200 localhost kernel: localhost kernel: Call Trace: localhost kernel: [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200 localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40 localhost kernel: [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0 localhost kernel: [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140 localhost kernel: [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130 localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40 localhost kernel: [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170 localhost kernel: [vfs_write+0x185/0x190] vfs_write+0x185/0x190 localhost kernel: [sys_write+0x50/0x90] sys_write+0x50/0x90 localhost kernel: [system_call+0x7e/0x83] system_call+0x7e/0x83 >>> snip <<< What is happening is, that the sk_wait_event() condition passed from sk_stream_wait_memory() evaluates to true for the case of tcp global memory exhaustion. This is because both sk_stream_memory_free() and vm_wait are true which causes sk_wait_event() to *not* call schedule_timeout(). Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping. This causes the caller to again try allocation, which again fails and again calls sk_stream_wait_memory(), and so on. [ Bug introduced by commit c1cbe4b7ad0bc4b1d98ea708a3fecb7362aa4088 ("[NET]: Avoid atomic xchg() for non-error case") -DaveM ] Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-29net: rename netdev rx_queue to ingress_queueEric Dumazet1-4/+4
There is some confusion with rx_queue name after RPS, and net drivers private rx_queue fields. I suggest to rename "struct net_device"->rx_queue to ingress_queue. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-29net: add a recursion limit in xmit pathEric Dumazet1-1/+11
As tunnel devices are going to be lockless, we need to make sure a misconfigured machine wont enter an infinite loop. Add a percpu variable, and limit to three the number of stacked xmits. Reported-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-28ipv4: Allow configuring subnets as local addressesTom Herbert1-1/+2
This patch allows a host to be configured to respond to any address in a specified range as if it were local, without actually needing to configure the address on an interface. This is done through routing table configuration. For instance, to configure a host to respond to any address in 10.1/16 received on eth0 as a local address we can do: ip rule add from all iif eth0 lookup 200 ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200 This host is now reachable by any 10.1/16 address (route lookup on input for packets received on eth0 can find the route). On output, the rule will not be matched so that this host can still send packets to 10.1/16 (not sent on loopback). Presumably, external routing can be configured to make sense out of this. To make this work, we needed to modify the logic in finding the interface which is assigned a given source address for output (dev_ip_find). We perform a normal fib_lookup instead of just a lookup on the local table, and in the lookup we ignore the input interface for matching. This patch is useful to implement IP-anycast for subnets of virtual addresses. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27net: Allow changing number of RX queues after device allocationBen Hutchings3-18/+63
For RPS, we create a kobject for each RX queue based on the number of queues passed to alloc_netdev_mq(). However, drivers generally do not determine the numbers of hardware queues to use until much later, so this usually represents the maximum number the driver may use and not the actual number in use. For TX queues, drivers can update the actual number using netif_set_real_num_tx_queues(). Add a corresponding function for RX queues, netif_set_real_num_rx_queues(). Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27net: sk_{detach|attach}_filter() rcu fixesEric Dumazet1-6/+4
sk_attach_filter() and sk_detach_filter() are run with socket locked. Use the appropriate rcu_dereference_protected() instead of blocking BH, and rcu_dereference_bh(). There is no point adding BH prevention and memory barrier. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27fib: use atomic_inc_not_zero() in fib_rules_lookupEric Dumazet1-3/+5
It seems we dont use appropriate refcount increment in an rcu_read_lock() protected section. fib_rule_get() might increment a null refcount and bad things could happen. While fib_nl_delrule() respects an rcu grace period before calling fib_rule_put(), fib_rules_cleanup_ops() calls fib_rule_put() without a grace period. Note : after this patch, we might avoid the synchronize_rcu() call done in fib_nl_delrule() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27tcp: Fix >4GB writes on 64-bit.David S. Miller1-2/+3
Fixes kernel bugzilla #16603 tcp_sendmsg() truncates iov_len to an 'int' which a 4GB write to write zero bytes, for example. There is also the problem higher up of how verify_iovec() works. It wants to prevent the total length from looking like an error return value. However it does this using 'int', but syscalls return 'long' (and thus signed 64-bit on 64-bit machines). So it could trigger false-positives on 64-bit as written. So fix it to use 'long'. Reported-by: Olaf Bonorden <bono@onlinehome.de> Reported-by: Daniel Büse <dbuese@gmx.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6David S. Miller2-5/+5
Conflicts: drivers/net/qlcnic/qlcnic_init.c net/ipv4/ip_output.c
2010-09-26rps: allocate rx queues in register_netdevice onlyEric Dumazet1-44/+32
Instead of having two places were we allocate dev->_rx, introduce netif_alloc_rx_queues() helper and call it only from register_netdevice(), not from alloc_netdev_mq() Goal is to let drivers change dev->num_rx_queues after allocating netdev and before registering it. This also removes a lot of ifdefs in net/core/dev.c Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-26net: propagate NETIF_F_HIGHDMA to vlansEric Dumazet1-3/+4
Automatically allows vlans to get NETIF_F_HIGHDMA if underlying device supports it. On 32bit arches (and more precisely if CONFIG_HIGHMEM is enabled), it can help to reduce cost of illegal_highdma() and __skb_linearize() calls. Tested on tg3 , bnx2, bonding, this worked very well. This is a generalization of a patch provided by Yi Zou & Jeff Kirsher. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-24net: fix a lockdep splatEric Dumazet1-4/+4
We have for each socket : One spinlock (sk_slock.slock) One rwlock (sk_callback_lock) Possible scenarios are : (A) (this is used in net/sunrpc/xprtsock.c) read_lock(&sk->sk_callback_lock) (without blocking BH) <BH> spin_lock(&sk->sk_slock.slock); ... read_lock(&sk->sk_callback_lock); ... (B) write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) (C) spin_lock_bh(&sk->sk_slock) ... write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) spin_unlock_bh(&sk->sk_slock) This (C) case conflicts with (A) : CPU1 [A] CPU2 [C] read_lock(callback_lock) <BH> spin_lock_bh(slock) <wait to spin_lock(slock)> <wait to write_lock_bh(callback_lock)> We have one problematic (C) use case in inet_csk_listen_stop() : local_bh_disable(); bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock) WARN_ON(sock_owned_by_user(child)); ... sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock) lockdep is not happy with this, as reported by Tetsuo Handa It seems only way to deal with this is to use read_lock_bh(callbacklock) everywhere. Thanks to Jarek for pointing a bug in my first attempt and suggesting this solution. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-23net: return operator cleanupEric Dumazet3-6/+6
Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-21net: core: use kernel's converter from hex to binAndy Shevchenko2-12/+11
Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-21ethtool: Fix build due to lack of ethtool.h include.David S. Miller1-0/+1
net/core/ethtool.c: In function 'ethtool_get_regs': net/core/ethtool.c:818:2: error: implicit declaration of function 'vmalloc' net/core/ethtool.c:818:9: warning: assignment makes pointer from integer without a cast net/core/ethtool.c:833:2: error: implicit declaration of function 'vfree' Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-21ethtool: Allocate register dump buffer with vmalloc()Ben Hutchings1-2/+2
Some NICs have huge register files which exceed the maximum heap allocation size. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-17ethtool, ixgbe: Move RX n-tuple mask fixup to ethtoolBen Hutchings1-0/+34
The ethtool utility does not set masks for flow parameters that are not specified, so if both value and mask are 0 then this must be treated as equivalent to a mask with all bits set. Currently that is done in the only driver that implements RX n-tuple filtering, ixgbe. Move it to the ethtool core. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-17netns: keep vlan slaves on master netns moveDavid Lamparter1-0/+4
previously, if a vlan master device was moved from one network namespace to another, all 802.1q and macvlan slaves were deleted. we can use dev->reg_state to figure out whether dev_change_net_namespace is happening, since that won't set dev->reg_state NETREG_UNREGISTERING. so, this changes 8021q and macvlan to ignore NETDEV_UNREGISTER when reg_state is not NETREG_UNREGISTERING. Signed-off-by: David Lamparter <equinox@diac24.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-17ethtool: change ethtool_set_gro() to use ethtool_op_get_rx_csumEric Dumazet1-2/+5
To be able to switch on GRO on a device, ethtool_set_gro() checks this device provides a get_rx_csum() method. Some devices dont provide this method, while they do support RX checksumming. This patch allows bonding to support GRO : ethtool -K bond0 gro on Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-16net: include inetdevice.h for rcu_dereference_raw api changeStephen Rothwell1-0/+1
rcu_dereference_raw() now needs to know the type of its argument. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-15net: enable GRO by default for vlan devicesBrandon Philips1-0/+5
Currently vlan devices don't have GRO by default as none of the Ethernet drivers add NETIF_F_GRO to their vlan_features. As GRO is a software feature add GRO to dev->vlan_features in register_netdevice() and let vlan_dev_init() take care that it gets enabled only when dev->features has NETIF_F_GRO too. Signed-off-by: Brandon Philips <bphilips@suse.de> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-15ipv4: ip_ptr cleanupsEric Dumazet1-1/+1
dev->ip_ptr is protected by rtnl and rcu. Yet some places dont use appropriate primitives and/or locking rules. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-15ethtool: Remove unimplemented flow specification typesBen Hutchings1-4/+4
struct ethtool_rawip4_spec and struct ethtool_ether_spec are neither commented nor used by any driver, so remove them. Adjust padding in the user-visible unions that included these structures. Fix references to struct ethtool_rawip4_spec in ethtool_get_rx_ntuple(), which should use struct ethtool_usrip4_spec. struct ethtool_usrip4_spec cannot hold IPv6 host addresses and there is no separate structure that can, so remove ETH_RX_NFC_IP6 and the reference to it in niu. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-14net: use rcu_barrier() in rollback_registered_manyEric Dumazet1-1/+1
netdev_wait_allrefs() waits that all references to a device vanishes. It currently uses a _very_ pessimistic 250 ms delay between each probe. Some users reported that no more than 4 devices can be dismantled per second, this is a pretty serious problem for some setups. Most of the time, a refcount is about to be released by an RCU callback, that is still in flight because rollback_registered_many() uses a synchronize_rcu() call instead of rcu_barrier(). Problem is visible if number of online cpus is one, because synchronize_rcu() is then a no op. time to remove 50 ipip tunnels on a UP machine : before patch : real 11.910s after patch : real 1.250s Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Octavian Purdila <opurdila@ixiacom.com> Reported-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-13flow: better memory managementEric Dumazet1-36/+42
Allocate hash tables for every online cpus, not every possible ones. NUMA aware allocations. Dont use a full page on arches where PAGE_SIZE > 1024*sizeof(void *) misc: __percpu , __read_mostly, __cpuinit annotations flow_compare_t is just an "unsigned long" Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-10pkt_sched: remov unnecessary bh_disablestephen hemminger1-2/+2
Now that est_tree_lock is acquired with BH protection, the other call is unnecessary. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-09Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6David S. Miller3-15/+19
Conflicts: net/mac80211/main.c
2010-09-09net/core: add lock context change annotations in net/core/sock.cNamhyung Kim1-0/+4
__lock_sock() and __release_sock() releases and regrabs lock but were missing proper annotations. Add it. This removes following warning from sparse. (Currently __lock_sock() does not emit any warning about it but I think it is better to add also.) net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-09net/core: remove address space warnings on verify_iovec()Namhyung Kim1-2/+4
move_addr_to_kernel() and copy_from_user() requires their argument as __user pointer but were missing proper markups. Add it. This removes following warnings from sparse. net/core/iovec.c:44:52: warning: incorrect type in argument 1 (different address spaces) net/core/iovec.c:44:52: expected void [noderef] <asn:1>*uaddr net/core/iovec.c:44:52: got void *msg_name net/core/iovec.c:55:34: warning: incorrect type in argument 2 (different address spaces) net/core/iovec.c:55:34: expected void const [noderef] <asn:1>*from net/core/iovec.c:55:34: got struct iovec *msg_iov Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08net: rps: add the shortcut for one rps_cpusChangli Gao1-3/+11
When there is only one rps_cpus, skb_get_rxhash() can be eliminated. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08gro: Re-fix different skb headroomsJarek Poplawski1-1/+1
The patch: "gro: fix different skb headrooms" in its part: "2) allocate a minimal skb for head of frag_list" is buggy. The copied skb has p->data set at the ip header at the moment, and skb_gro_offset is the length of ip + tcp headers. So, after the change the length of mac header is skipped. Later skb_set_mac_header() sets it into the NET_SKB_PAD area (if it's long enough) and ip header is misaligned at NET_SKB_PAD + NET_IP_ALIGN offset. There is no reason to assume the original skb was wrongly allocated, so let's copy it as it was. bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626 fixes commit: 3d3be4333fdf6faa080947b331a6a19bce1a4f57 Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-07net: fix tx queue selection for bridged devices implementing select_queueHelmut Schaa1-8/+8
When a net device is implementing the select_queue callback and is part of a bridge, frames coming from the bridge already have a tx queue associated to the socket (introduced in commit a4ee3ce3293dc931fab19beb472a8bde1295aebe, "net: Use sk_tx_queue_mapping for connected sockets"). The call to sk_tx_queue_get will then return the tx queue used by the bridge instead of calling the select_queue callback. In case of mac80211 this broke QoS which is implemented by using the select_queue callback. Furthermore it introduced problems with rt2x00 because frames with the same TID and RA sometimes appeared on different tx queues which the hw cannot handle correctly. Fix this by always calling select_queue first if it is available and only afterwards use the socket tx queue mapping. Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-06net: poll() optimizationsEric Dumazet1-3/+2
No need to test twice sk->sk_shutdown & RCV_SHUTDOWN Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>