aboutsummaryrefslogtreecommitdiffstats
path: root/net/core/dev.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2013-10-19ipv4: gso: make inet_gso_segment() stackableEric Dumazet1-0/+2
In order to support GSO on IPIP, we need to make inet_gso_segment() stackable. It should not assume network header starts right after mac header. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+2
Conflicts: include/linux/netdevice.h net/core/sock.c Trivial merge issues. Removal of "extern" for functions declaration in netdevice.h at the same time "const" was added to an argument. Two parallel line additions in net/core/sock.c Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07net: Separate the close_list and the unreg_list v2Eric W. Biederman1-11/+14
Separate the unreg_list and the close_list in dev_close_many preventing dev_close_many from permuting the unreg_list. The permutations of the unreg_list have resulted in cases where the loopback device is accessed it has been freed in code such as dst_ifdown. Resulting in subtle memory corruption. This is the second bug from sharing the storage between the close_list and the unreg_list. The issues that crop up with sharing are apparently too subtle to show up in normal testing or usage, so let's forget about being clever and use two separate lists. v2: Make all callers pass in a close_list to dev_close_many Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07netif_set_xps_queue: make cpu mask constMichael S. Tsirkin1-1/+2
virtio wants to pass in cpumask_of(cpu), make parameter const to avoid build warnings. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+48
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/usb/qmi_wwan.c drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h include/net/netfilter/nf_conntrack_synproxy.h include/net/secure_seq.h The conflicts are of two varieties: 1) Conflicts with Joe Perches's 'extern' removal from header file function declarations. Usually it's an argument signature change or a function being added/removed. The resolutions are trivial. 2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds a new value, another changes an existing value. That sort of thing. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30dev: always advertise rx_flags changes via netlinkNicolas Dichtel1-23/+37
When flags IFF_PROMISC and IFF_ALLMULTI are changed, netlink messages are not consistent. For example, if a multicast daemon is running (flag IFF_ALLMULTI set in dev->flags but not dev->gflags, ie not exported to userspace) and then a user sets it via netlink (flag IFF_ALLMULTI set in dev->flags and dev->gflags, ie exported to userspace), no netlink message is sent. Same for IFF_PROMISC and because dev->promiscuity is exported via IFLA_PROMISCUITY, we may send a netlink message after each change of this counter. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30dev: update __dev_notify_flags() to send rtnl msgNicolas Dichtel1-5/+6
This patch only prepares the next one, there is no functional change. Now, __dev_notify_flags() can also be used to notify flags changes via rtnetlink. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28net: Delay default_device_exit_batch until no devices are unregistering v2Eric W. Biederman1-1/+48
There is currently serialization network namespaces exiting and network devices exiting as the final part of netdev_run_todo does not happen under the rtnl_lock. This is compounded by the fact that the only list of devices unregistering in netdev_run_todo is local to the netdev_run_todo. This lack of serialization in extreme cases results in network devices unregistering in netdev_run_todo after the loopback device of their network namespace has been freed (making dst_ifdown unsafe), and after the their network namespace has exited (making the NETDEV_UNREGISTER, and NETDEV_UNREGISTER_FINAL callbacks unsafe). Add the missing serialization by a per network namespace count of how many network devices are unregistering and having a wait queue that is woken up whenever the count is decreased. The count and wait queue allow default_device_exit_batch to wait until all of the unregistration activity for a network namespace has finished before proceeding to unregister the loopback device and then allowing the network namespace to exit. Only a single global wait queue is used because there is a single global lock, and there is a single waiter, per network namespace wait queues would be a waste of resources. The per network namespace count of unregistering devices gives a progress guarantee because the number of network devices unregistering in an exiting network namespace must ultimately drop to zero (assuming network device unregistration completes). The basic logic remains the same as in v1. This patch is now half comment and half rtnl_lock_unregistering an expanded version of wait_event performs no extra work in the common case where no network devices are unregistering when we get to default_device_exit_batch. Reported-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: create sysfs symlinks for neighbour devicesVeaceslav Falico1-1/+34
Also, remove the same functionality from bonding - it will be already done for any device that links to its lower/upper neighbour. The links will be created for dev's kobject, and will look like lower_eth0 for lower device eth0 and upper_bridge0 for upper device bridge0. CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: expose the master link to sysfs, and remove it from bondVeaceslav Falico1-2/+17
Currently, we can have only one master upper neighbour, so it would be useful to create a symlink to it in the sysfs device directory, the way that bonding now does it, for every device. Lower devices from bridge/team/etc will automagically get it, so we could rely on it. Also, remove the same functionality from bonding. CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add a possibility to get private from netdev_adjacent->listVeaceslav Falico1-0/+10
It will be useful to get first/last element. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add for_each iterators through neighbour lower link's privateVeaceslav Falico1-1/+59
Add a possibility to iterate through netdev_adjacent's private, currently only for lower neighbours. Add both RCU and RTNL/other locking variants of iterators, and make the non-rcu variant to be safe from removal. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add netdev_adjacent->private and allow to use itVeaceslav Falico1-11/+57
Currently, even though we can access any linked device, we can't attach anything to it, which is vital to properly manage them. To fix this, add a new void *private to netdev_adjacent and functions setting/getting it (per link), so that we can save, per example, bonding's slave structures there, per slave device. netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to upper dev and populates the neighbour link only with private. netdev_lower_dev_get_private{,_rcu}() returns the private, if found. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add RCU variant to search for netdev_adjacent linkVeaceslav Falico1-0/+13
Currently we have only the RTNL flavour, however we can traverse it while holding only RCU, so add the RCU search. Add an RCU variant that uses list_head * as an argument, so that it can be universally used afterwards. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add adj_list to save only neighboursVeaceslav Falico1-100/+103
Currently, we distinguish neighbours (first-level linked devices) from non-neighbours by the neighbour bool in the netdev_adjacent. This could be quite time-consuming in case we would like to traverse *only* through neighbours - cause we'd have to traverse through all devices and check for this flag, and in a (quite common) scenario where we have lots of vlans on top of bridge, which is on top of a bond - the bonding would have to go through all those vlans to get its upper neighbour linked devices. This situation is really unpleasant, cause there are already a lot of cases when a device with slaves needs to go through them in hot path. To fix this, introduce a new upper/lower device lists structure - adj_list, which contains only the neighbours. It works always in pair with the all_adj_list structure (renamed from upper/lower_dev_list), i.e. both of them contain the same links, only that all_adj_list contains also non-neighbour device links. It's really a small change visible, currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't change the main linked logic at all. Also, add some comments a fix a name collision in netdev_for_each_upper_dev_rcu() and rework the naming by the following rules: netdev_(all_)(upper|lower)_* If "all_" is present, then we work with the whole list of upper/lower devices, otherwise - only with direct neighbours. Uninline functions - to get better stack traces. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: use lists as arguments instead of bool upperVeaceslav Falico1-32/+22
Currently we make use of bool upper when we want to specify if we want to work with upper/lower list. It's, however, harder to read, debug and occupies a lot more code. Fix this by just passing the correct upper/lower_dev_list list_head pointer instead of bool upper, and work internally with it. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04net: correctly interlink lower/upper devicesVeaceslav Falico1-2/+2
Currently we're linking upper devices to lower ones, which results in upside-down relationship: upper devices seeing lower devices via its upper lists. Fix this by correctly linking lower devices to the upper ones. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04skb: allow skb_scrub_packet() to be used by tunnelsNicolas Dichtel1-1/+1
This function was only used when a packet was sent to another netns. Now, it can also be used after tunnel encapsulation or decapsulation. Only skb_orphan() should not be done when a packet is not crossing netns. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add netdev_upper_get_next_dev_rcu(dev, iter)Veaceslav Falico1-0/+25
This function returns the next dev in the dev->upper_dev_list after the struct list_head **iter position, and updates *iter accordingly. Returns NULL if there are no devices left. Caller must hold RCU read lock. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: remove search_list from netdev_adjacentVeaceslav Falico1-36/+1
We already don't need it cause we see every upper/lower device in the list already. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add lower_dev_list to net_device and make a full meshVeaceslav Falico1-27/+258
This patch adds lower_dev_list list_head to net_device, which is the same as upper_dev_list, only for lower devices, and begins to use it in the same way as the upper list. It also changes the way the whole adjacent device lists work - now they contain *all* of upper/lower devices, not only the first level. The first level devices are distinguished by the bool neighbour field in netdev_adjacent, also added by this patch. There are cases when a device can be added several times to the adjacent list, the simplest would be: /---- eth0.10 ---\ eth0- --- bond0 \---- eth0.20 ---/ where both bond0 and eth0 'see' each other in the adjacent lists two times. To avoid duplication of netdev_adjacent structures ref_nr is being kept as the number of times the device was added to the list. The 'full view' is achieved by adding, on link creation, all of the upper_dev's upper_dev_list devices as upper devices to all of the lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice versa. On unlink they are removed using the same logic. I've tested it with thousands vlans/bonds/bridges, everything works ok and no observable lags even on a huge number of interfaces. Memory footprint for 128 devices interconnected with each other via both upper and lower (which is impossible, but for the comparison) lists would be: 128*128*2*sizeof(netdev_adjacent) = 1.5MB but in the real world we usualy have at most several devices with slaves and a lot of vlans, so the footprint will be much lower. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: rename netdev_upper to netdev_adjacentVeaceslav Falico1-12/+12
Rename the structure to reflect the upcoming addition of lower_dev_list. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15dev: move skb_scrub_packet() after eth_type_trans()Nicolas Dichtel1-3/+3
skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans() set pkt_type. In fact, we should force pkt_type to PACKET_HOST, so move the call after eth_type_trans(). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30net: add ndo to get id of physical port of the deviceJiri Pirko1-0/+18
This patch adds a ndo for getting physical port of the device. Driver which is aware of being virtual function of some physical port should implement this ndo. This is applicable not only for IOV, but for other solutions (NPAR, multichannel) as well. Basically if there is possible to have multiple netdevs on the single hw port. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24net: Make devnet_rename_seq staticThomas Gleixner1-1/+1
No users outside net/core/dev.c. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-18vlan: mask vlan prio bitsEric Dumazet1-2/+9
In commit 48cc32d38a52d0b68f91a171a8d00531edc6a46e ("vlan: don't deliver frames for unknown vlans to protocols") Florian made sure we set pkt_type to PACKET_OTHERHOST if the vlan id is set and we could find a vlan device for this particular id. But we also have a problem if prio bits are set. Steinar reported an issue on a router receiving IPv6 frames with a vlan tag of 4000 (id 0, prio 2), and tunneled into a sit device, because skb->vlan_tci is set. Forwarded frame is completely corrupted : We can see (8100:4000) being inserted in the middle of IPv6 source address : 16:48:00.780413 IP6 2001:16d8:8100:4000:ee1c:0:9d9:bc87 > 9f94:4d95:2001:67c:29f4::: ICMP6, unknown icmp6 type (0), length 64 0x0000: 0000 0029 8000 c7c3 7103 0001 a0ae e651 0x0010: 0000 0000 ccce 0b00 0000 0000 1011 1213 0x0020: 1415 1617 1819 1a1b 1c1d 1e1f 2021 2223 0x0030: 2425 2627 2829 2a2b 2c2d 2e2f 3031 3233 It seems we are not really ready to properly cope with this right now. We can probably do better in future kernels : vlan_get_ingress_priority() should be a netdev property instead of a per vlan_dev one. For stable kernels, lets clear vlan_tci to fix the bugs. Reported-by: Steinar H. Gunderson <sesse@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-11gso: Update tunnel segmentation to support Tx checksum offloadAlexander Duyck1-8/+6
This change makes it so that the GRE and VXLAN tunnels can make use of Tx checksum offload support provided by some drivers via the hw_enc_features. Without this fix enabling GSO means sacrificing Tx checksum offload and this actually leads to a performance regression as shown below: Utilization Send Throughput local GSO 10^6bits/s % S state 6276.51 8.39 enabled 7123.52 8.42 disabled To resolve this it was necessary to address two items. First netif_skb_features needed to be updated so that it would correctly handle the Trans Ether Bridging protocol without impacting the need to check for Q-in-Q tagging. To do this it was necessary to update harmonize_features so that it used skb_network_protocol instead of just using the outer protocol. Second it was necessary to update the GRE and UDP tunnel segmentation offloads so that they would reset the encapsulation bit and inner header offsets after the offload was complete. As a result of this change I have seen the following results on a interface with Tx checksum enabled for encapsulated frames: Utilization Send Throughput local GSO 10^6bits/s % S state 7123.52 8.42 disabled 8321.75 5.43 enabled v2: Instead of replacing refrence to skb->protocol with skb_network_protocol just replace the protocol reference in harmonize_features to allow for double VLAN tag checks. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+34
Conflicts: drivers/net/ethernet/freescale/fec_main.c drivers/net/ethernet/renesas/sh_eth.c net/ipv4/gre.c The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list) and the splitting of the gre.c code into seperate files. The FEC conflict was two sets of changes adding ethtool support code in an "!CONFIG_M5272" CPP protected block. Finally the sh_eth.c conflict was between one commit add bits set in the .eesr_err_check mask whilst another commit removed the .tx_error_check member and assignments. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-02core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()Isaku Yamahata1-0/+6
The dev_forward_skb() assignment of pkt_type should be done after the call to eth_type_trans(). ip-encapsulated packets can be handled by localhost. But skb->pkt_type can be PACKET_OTHERHOST when packet comes via veth into ip tunnel device. In that case, the packet is dropped by ip_rcv(). Although this example uses gretap. l2tp-eth also has same issue. For l2tp-eth case, add dummy device for ip address and ip l2tp command. netns A | root netns | netns B veth<->veth=bridge=gretap <-loop back-> gretap=bridge=veth<->veth arp packet -> pkt_type BROADCAST------------>ip_rcv()------------------------> <- arp reply pkt_type ip_rcv()<-----------------OTHERHOST drop sample operations ip link add tapa type gretap remote 172.17.107.4 local 172.17.107.3 ip link add tapb type gretap remote 172.17.107.3 local 172.17.107.4 ip link set tapa up ip link set tapb up ip address add 172.17.107.3 dev tapa ip address add 172.17.107.4 dev tapb ip route get 172.17.107.3 > local 172.17.107.3 dev lo src 172.17.107.3 > cache <local> ip route get 172.17.107.4 > local 172.17.107.4 dev lo src 172.17.107.4 > cache <local> ip link add vetha type veth peer name vetha-peer ip link add vethb type veth peer name vethb-peer brctl addbr bra brctl addbr brb brctl addif bra tapa brctl addif bra vetha-peer brctl addif brb tapb brctl addif brb vethb-peer brctl show > bridge name bridge id STP enabled interfaces > bra 8000.6ea21e758ff1 no tapa > vetha-peer > brb 8000.420020eb92d5 no tapb > vethb-peer ip link set vetha-peer up ip link set vethb-peer up ip link set bra up ip link set brb up ip netns add a ip netns add b ip link set vetha netns a ip link set vethb netns b ip netns exec a ip address add 10.0.0.3/24 dev vetha ip netns exec b ip address add 10.0.0.4/24 dev vethb ip netns exec a ip link set vetha up ip netns exec b ip link set vethb up ip netns exec a arping -I vetha 10.0.0.4 ARPING 10.0.0.4 from 10.0.0.3 vetha ^CSent 2 probes (2 broadcast(s)) Received 0 response(s) Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Hong Zhiguo <honkiko@gmail.com> Cc: Rami Rosen <ramirose@gmail.com> Cc: Tom Parkin <tparkin@katalix.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Jesse Gross <jesse@nicira.com> Cc: dev@openvswitch.org Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-27dev: introduce skb_scrub_packet()Nicolas Dichtel1-10/+1
The goal of this new function is to perform all needed cleanup before sending an skb into another netns. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26net: fix kernel deadlock with interface rename and netdev name retrieval.Nicolas Schichan1-0/+34
When the kernel (compiled with CONFIG_PREEMPT=n) is performing the rename of a network interface, it can end up waiting for a workqueue to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to the fact that read_secklock_begin() will spin forever waiting for the writer process (the one doing the interface rename) to update the devnet_rename_seq sequence. This patch fixes the problem by adding a helper (netdev_get_name()) and using it in the code handling the SIOCGIFNAME ioctl and SO_BINDTODEVICE setsockopt. The netdev_get_name() helper uses raw_seqcount_begin() to avoid spinning forever, waiting for devnet_rename_seq->sequence to become even. cond_resched() is used in the contended case, before retrying the access to give the writer process a chance to finish. The use of raw_seqcount_begin() will incur some unneeded work in the reader process in the contended case, but this is better than deadlocking the system. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-23net: allow large number of tx queuesEric Dumazet1-7/+19
netif_alloc_netdev_queues() uses kcalloc() to allocate memory for the "struct netdev_queue *_tx" array. For large number of tx queues, kcalloc() might fail, so this patch does a fallback to vzalloc(). As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10net: add napi_id and hashEliezer Tamir1-0/+59
Adds a napi_id and a hashing mechanism to lookup a napi by id. This will be used by subsequent patches to implement low latency Ethernet device polling. Based on a code sample by Eric Dumazet. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-04net: mark netdev_create_hash __net_initBaruch Siach1-1/+1
netdev_create_hash() is only called from netdev_init() which is marked __net_init. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28net: Correct comparisons and calculations using skb->tail and skb-transport_headerSimon Horman1-2/+2
This corrects an regression introduced by "net: Use 16bits for *_headers fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In that case skb->tail will be a pointer whereas skb->transport_header will be an offset from head. This is corrected by using wrappers that ensure that comparisons and calculations are always made using pointers. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28net: always pass struct netdev_notifier_info to netdevice notifiersCong Wang1-6/+0
commit 351638e7deeed2ec8ce451b53d3 (net: pass info struct via netdevice notifier) breaks booting of my KVM guest, this is due to we still forget to pass struct netdev_notifier_info in several places. This patch completes it. Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28net: pass changed flags along with NETDEV_CHANGE eventJiri Pirko1-2/+7
Use new netdevice notifier infrastructure to pass along changed flags. Signed-off-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: shortened notifier_info struct name Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-28net: pass info struct via netdevice notifierJiri Pirko1-10/+46
So far, only net_device * could be passed along with netdevice notifier event. This patch provides a possibility to pass custom structure able to provide info that event listener needs to know. Signed-off-by: Jiri Pirko <jiri@resnulli.us> v2->v3: fix typo on simeth shortened dev_getter shortened notifier_info struct name v1->v2: fix notifier_call parameter in call_netdevice_notifier() Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-27netpoll: remove return value from netpoll_rx_disable()dingtianhong1-11/+4
The netpoll_rx_disable() will always return 0, it is no use and looks wordy, so remove the unnecessary code and get rid of it in _dev_open and _dev_close. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-27MPLS: Add limited GSO supportSimon Horman1-0/+4
In the case where a non-MPLS packet is received and an MPLS stack is added it may well be the case that the original skb is GSO but the NIC used for transmit does not support GSO of MPLS packets. The aim of this code is to provide GSO in software for MPLS packets whose skbs are GSO. SKB Usage: When an implementation adds an MPLS stack to a non-MPLS packet it should do the following to skb metadata: * Set skb->inner_protocol to the old non-MPLS ethertype of the packet. skb->inner_protocol is added by this patch. * Set skb->protocol to the new MPLS ethertype of the packet. * Set skb->network_header to correspond to the end of the L3 header, including the MPLS label stack. I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to kernel" which adds MPLS support to the kernel datapath of Open vSwtich. That patch sets the above requirements in datapath/actions.c:push_mpls() and was used to exercise this code. The datapath patch is against the Open vSwtich tree but it is intended that it be added to the Open vSwtich code present in the mainline Linux kernel at some point. Features: I believe that the approach that I have taken is at least partially consistent with the handling of other protocols. Jesse, I understand that you have some ideas here. I am more than happy to change my implementation. This patch adds dev->mpls_features which may be used by devices to advertise features supported for MPLS packets. A new NETIF_F_MPLS_GSO feature is added for devices which support hardware MPLS GSO offload. Currently no devices support this and MPLS GSO always falls back to software. Alternate Implementation: One possible alternate implementation is to teach netif_skb_features() and skb_network_protocol() about MPLS, in a similar way to their understanding of VLANs. I believe this would avoid the need for net/mpls/mpls_gso.c and in particular the calls to __skb_push() and __skb_push() in mpls_gso_segment(). I have decided on the implementation in this patch as it should not introduce any overhead in the case where mpls_gso is not compiled into the kernel or inserted as a module. MPLS GSO suggested by Jesse Gross. Based in part on "v4 GRE: Add TCP segmentation offload for GRE" by Pravin B Shelar. Cc: Jesse Gross <jesse@nicira.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-25net: add netnotifier event for upper device changeJiri Pirko1-1/+2
Now when upper device is changed, event is not propagated via RT Netlink to userspace. Userspace might never now about the change. Fix this by adding upper-device-change notifier event. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-20rps: selective flow shedding during softnet overflowWillem de Bruijn1-1/+47
A cpu executing the network receive path sheds packets when its input queue grows to netdev_max_backlog. A single high rate flow (such as a spoofed source DoS) can exceed a single cpu processing rate and will degrade throughput of other flows hashed onto the same cpu. This patch adds a more fine grained hashtable. If the netdev backlog is above a threshold, IRQ cpus track the ratio of total traffic of each flow (using 4096 buckets, configurable). The ratio is measured by counting the number of packets per flow over the last 256 packets from the source cpu. Any flow that occupies a large fraction of this (set at 50%) will see packet drop while above the threshold. Tested: Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0, kernel receive (RPS) on cpu0 and application threads on cpus 2--7 each handling 20k req/s. Throughput halves when hit with a 400 kpps antagonist storm. With this patch applied, antagonist overload is dropped and the server processes its complete load. The patch is effective when kernel receive processing is the bottleneck. The above RPS scenario is a extreme, but the same is reached with RFS and sufficient kernel processing (iptables, packet socket tap, ..). Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-17dev: remove duplicate 'skb->dev = dev' in dev_forward_skb()Nicolas Dichtel1-1/+0
This was added by commit 59b9997baba5 (Revert "net: maintain namespace isolation between vlan and real device"). In fact, before the initial commit - the one that is reverted -, this statement was not present. 'skb->dev = dev' is already done in eth_type_trans(), which is call just after. Spotted-by: Alain Ritoux <alain.ritoux@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-08gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()Pravin B Shelar1-0/+11
Rather than having logic to calculate inner protocol in every tunnel gso handler move it to gso code. This simplifies code. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <amwang@redhat.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-02net: use netdev_features_t in skb_needs_linearize()Patrick McHardy1-1/+1
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30net: Use consume_skb() to free gso segmented skbSridhar Samudrala1-1/+4
Use consume_skb() to free the original skb that is successfully transmitted as gso segmented skbs so that it is not treated as a drop due to an error. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-25net: remove redundant code in dev_hard_start_xmit()Eric Dumazet1-7/+0
This reverts commit 068a2de57ddf4f4 (net: release dst entry while cache-hot for GSO case too) Before GSO packet segmentation, we already take care of skb->dst if it can be released. There is no point adding extra test for every segment in the gso loop. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+3
Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22net: Remove return value from list_netdevice()dingtianhong1-3/+1
The return value from list_netdevice() is not used and no need, so remove it. Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: rate-limit warn-bad-offload splats.Ben Greear1-0/+3
If one does do something unfortunate and allow a bad offload bug into the kernel, this the skb_warn_bad_offload can effectively live-lock the system, filling the logs with the same error over and over. Add rate limitation to this so that box remains otherwise functional in this case. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>