aboutsummaryrefslogtreecommitdiffstats
path: root/certs (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2016-10-29net: ip, diag: include net/inet_sock.hArnd Bergmann1-0/+1
The newly added raw_diag.c fails to build in some configurations unless we include this header: In file included from net/ipv4/raw_diag.c:6:0: include/net/raw.h:71:21: error: field 'inet' has incomplete type net/ipv4/raw_diag.c: In function 'raw_diag_dump': net/ipv4/raw_diag.c:166:29: error: implicit declaration of function 'inet_sk' Fixes: 432490f9d455 ("net: ip, diag -- Add diag interface for raw sockets") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29driver: tun: Move tun check into the block of TUNSETIFF condition checkGao Feng1-1/+5
When cmd is TUNSETIFF and tun is not null, the original codes go ahead, then reach the default case of switch(cmd) and set the ret is -EINVAL. It is not clear for readers. Now move the tun check into the block of TUNSETIFF condition check, and return -EEXIST instead of -EINVAL when the tfile already owns one tun. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29virtio-net: Update the mtu code to match virtio specAaron Conole1-2/+4
The virtio committee recently ratified a change, VIRTIO-152, which defines the mtu field to be 'max' MTU, not simply desired MTU. This commit brings the virtio-net device in compliance with VIRTIO-152. Additionally, drop the max_mtu branch - it cannot be taken since the u16 returned by virtio_cread16 will never exceed the initial value of max_mtu. Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: "Michael S. Tsirkin" <mst@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28net caif: insert missing spaces in pr_* messages and unbreak multi-line stringsColin Ian King1-6/+3
Some of the pr_* messages are missing spaces, so insert these and also unbreak multi-line literal strings in pr_* messages Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28mlxsw: switchx2: Set physical device for port netdeviceJiri Pirko1-0/+1
Do this so the sysfs has "device" link correctly set. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28mlxsw: spectrum: Set physical device for port netdeviceJiri Pirko1-0/+1
Do this so the sysfs has "device" link correctly set. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28mlxsw: Move PCI id table definitions into driver modulesJiri Pirko8-71/+135
So far, mlxsw_pci.ko is the module that registers PCI table for all drivers (spectrum and switchx2). That is problematic for example with dracut. Since mlxsw_spectrum.ko and mlxsw_switchx2.ko are loaded dynamically from within mlxsw_core.ko, dracut does not have track of them and avoids them from being included in initramfs. So make this in an ordinary way and define the PCI tables in individual driver modules, so it can be properly loaded and included in dracut initramfs image. As a side effect, this patch could remove no longer necessary driver "kind" strings which were used to link PCI ids with individual mlxsw drivers. Suggested-by: Ivan Vecera <ivecera@redhat.com> Tested-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28mlxsw: pci: Rename header with HW definitionsJiri Pirko2-6/+6
pci.h needs to be used for inner function declarations. So move the original one to more appropriate name, pci_hw.h. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-28mlxsw: spectrum: Remove extra whitespaceIdo Schimmel1-1/+1
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27net: skip genenerating uevents for network namespaces that are exitingAndrey Vagin1-3/+11
No one can see these events, because a network namespace can not be destroyed, if it has sockets. Unlike other devices, uevent-s for network devices are generated only inside their network namespaces. They are filtered in kobj_bcast_filter() My experiments shows that net namespaces are destroyed more 30% faster with this optimization. Here is a perf output for destroying network namespaces without this patch. - 94.76% 0.02% kworker/u48:1 [kernel.kallsyms] [k] cleanup_net - 94.74% cleanup_net - 94.64% ops_exit_list.isra.4 - 41.61% default_device_exit_batch - 41.47% unregister_netdevice_many - rollback_registered_many - 40.36% netdev_unregister_kobject - 14.55% device_del + 13.71% kobject_uevent - 13.04% netdev_queue_update_kobjects + 12.96% kobject_put - 12.72% net_rx_queue_update_kobjects kobject_put - kobject_release + 12.69% kobject_uevent + 0.80% call_netdevice_notifiers_info + 19.57% nfsd_exit_net + 11.15% tcp_net_metrics_exit + 8.25% rpcsec_gss_exit_net It's very critical to optimize the exit path for network namespaces, because they are destroyed under net_mutex and many namespaces can be destroyed for one iteration. v2: use dev_set_uevent_suppress() Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27ethernet: fix min/max MTU typosStefan Richter2-2/+2
Fixes: d894be57ca92('ethernet: use net core MTU range checking in more drivers') CC: Jarod Wilson <jarod@redhat.com> CC: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: mark families as __ro_after_initJohannes Berg35-51/+51
Now genl_register_family() is the only thing (other than the users themselves, perhaps, but I didn't find any doing that) writing to the family struct. In all families that I found, genl_register_family() is only called from __init functions (some indirectly, in which case I've add __init annotations to clarifly things), so all can actually be marked __ro_after_init. This protects the data structure from accidental corruption. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: use idr to track familiesJohannes Berg3-181/+123
Since generic netlink family IDs are small integers, allocated densely, IDR is an ideal match for lookups. Replace the existing hand-written hash-table with IDR for allocation and lookup. This lets the families only be written to once, during register, since the list_head can be removed and removal of a family won't cause any writes. It also slightly reduces the code size (by about 1.3k on x86-64). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: statically initialize familiesJohannes Berg37-337/+414
Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: no longer support using static family IDsJohannes Berg37-69/+24
Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27genetlink: introduce and use genl_family_attrbuf()Johannes Berg6-35/+55
This helper function allows family implementations to access their family's attrbuf. This gets rid of the attrbuf usage in families, and also adds locking validation, since it's not valid to use the attrbuf with parallel_ops or outside of the dumpit callback. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27skbedit: allow the user to specify bitmask for markAntonio Quartulli3-3/+21
The user may want to use only some bits of the skb mark in his skbedit rules because the remaining part might be used by something else. Introduce the "mask" parameter to the skbedit actor in order to implement such functionality. When the mask is specified, only those bits selected by the latter are altered really changed by the actor, while the rest is left untouched. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26devlink: Prevent port_type_set() callback when it's not neededElad Raz1-0/+2
When a port_type_set() is been called and the new port type set is the same as the old one, just return success. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26firewire: net: set initial MTU = 1500 unconditionally, fix IPv6 on some CardBus cardsStefan Richter1-7/+1
firewire-net, like the older eth1394 driver, reduced the initial MTU to less than 1500 octets if the local link layer controller's asynchronous packet reception limit was lower. This is bogus, since this reception limit does not have anything to do with the transmission limit. Neither did this reduction affect the TX path positively, nor could it prevent link fragmentation at the RX path. Many FireWire CardBus cards have a max_rec of 9, causing an initial MTU of 1024 - 16 = 1008. RFC 2734 and RFC 3146 allow a minimum max_rec = 8, which would result in an initial MTU of 512 - 16 = 496. On such cards, IPv6 could only be employed if the MTU was manually increased to 1280 or more, i.e. IPv6 would not work without intervention from userland. We now always initialize the MTU to 1500, which is the default according to RFC 2734 and RFC 3146. On a VIA VT6316 based CardBus card which was affected by this, changing the MTU from 1008 to 1500 also increases TX bandwidth by 6 %. RX remains unaffected. CC: netdev@vger.kernel.org CC: linux1394-devel@lists.sourceforge.net CC: Jarod Wilson <jarod@redhat.com> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26firewire: net: fix maximum possible MTUStefan Richter1-3/+4
Commit b3e3893e1253 ("net: use core MTU range checking in misc drivers") mistakenly introduced an upper limit for firewire-net's MTU based on the local link layer controller's reception capability. Revert this. Neither RFC 2734 nor our implementation impose any particular upper limit. Actually, to be on the safe side and to make the code explicit, set ETH_MAX_MTU = 65535 as upper limit now. (I replaced sizeof(struct rfc2734_header) by the equivalent RFC2374_FRAG_HDR_SIZE in order to avoid distracting long/int conversions.) Fixes: b3e3893e1253('net: use core MTU range checking in misc drivers') CC: netdev@vger.kernel.org CC: linux1394-devel@lists.sourceforge.net CC: Jarod Wilson <jarod@redhat.com> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: netcp: add missing of_node_put() in netcp_probe()Wei Yongjun1-0/+4
This node pointer is returned by of_get_child_by_name() with refcount incremented in this function. of_node_put() on it before exitting this function. This is detected by Coccinelle semantic patch. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: ena: use setup_timer() and mod_timer()Wei Yongjun1-6/+3
Use setup_timer() instead of init_timer(), being the preferred/standard way to set a timer up. Also, quoting the mod_timer() function comment: -> mod_timer() is a more efficient way to update the expire field of an active timer (if the timer is inactive it will be activated). Use setup_timer and mod_timer to setup and arm a timer, to make the code cleaner and easier to read. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26amd-xgbe: Fix error return code in xgbe_probe()Wei Yongjun1-0/+1
Fix to return error code -ENODEV from the DMA is not supported error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: ns83820: use dev_kfree_skb_irq instead of kfree_skbWei Yongjun1-1/+1
It is not allowed to call kfree_skb() from hardware interrupt context or with interrupts being disabled, spin_lock_irqsave() make sure always in irq disable context. So the kfree_skb() should be replaced with dev_kfree_skb_irq(). This is detected by Coccinelle semantic patch. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: eth: altera: Fix error return code in altera_tse_probe()Wei Yongjun1-0/+2
Fix to return error code -EINVAL from the error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: dsa: mv88e6xxx: use setup_timer to simplify the codeWei Yongjun1-3/+2
Use setup_timer function instead of initializing timer with the function and data fields. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: netcp: drop kfree for memory allocated with devm_kzallocWei Yongjun1-2/+0
It's not necessary to free memory allocated with devm_kzalloc in the remove path and using kfree leads to a double free. Fixes: 84640e27f230 ("net: netcp: Add Keystone NetCP core ethernet driver") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26batman-adv: Revert "use core MTU range checking in misc drivers"Sven Eckelmann1-1/+12
The maximum MTU is defined via the slave devices of an batman-adv interface. Thus it is not possible to calculate the max_mtu during the creation of the batman-adv device when no slave devices are attached. Doing so would for example break non-fragmentation setups which then (incorrectly) allow an MTU of 1500 even when underlying device cannot transport 1500 bytes + batman-adv headers. Checking the dynamically calculated max_mtu via the minimum of the slave devices MTU during .ndo_change_mtu is also used by the bridge interface. Cc: Jarod Wilson <jarod@redhat.com> Fixes: b3e3893e1253 ("net: use core MTU range checking in misc drivers") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: phy: broadcom: Add support for BCM54612EXo Wang2-0/+49
This PHY has internal delays enabled after reset. This clears the internal delay enables unless the interface specifically requests them. Signed-off-by: Xo Wang <xow@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26net: phy: broadcom: Update Auxiliary Control Register macrosXo Wang1-1/+2
Add the RXD-to-RXC skew (delay) time bit in the Miscellaneous Control shadow register and a mask for the shadow selector field. Remove a re-definition of MII_BCM54XX_AUXCTL_SHDWSEL_AUXCTL. Signed-off-by: Xo Wang <xow@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26sch_htb: do not report fake rate estimatorsEric Dumazet1-1/+1
When I prepared commit d250a5f90e53 ("pkt_sched: gen_estimator: Dont report fake rate estimators"), htb still had an implicit rate estimator for all its classes. Then later, I made this rate estimator optional in commit 64153ce0a7b6 ("net_sched: htb: do not setup default rate estimators"), but I forgot to update htb use of gnet_stats_copy_rate_est() After this patch, "tc -s qdisc ..." no longer report fake rate estimators for HTB classes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23net: ip, diag -- Add diag interface for raw socketsCyrill Gorcunov9-4/+333
In criu we are actively using diag interface to collect sockets present in the system when dumping applications. And while for unix, tcp, udp[lite], packet, netlink it works as expected, the raw sockets do not have. Thus add it. v2: - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@) - implement @destroy for diag requests (by dsa@) v3: - add export of raw_abort for IPv6 (by dsa@) - pass net-admin flag into inet_sk_diag_fill due to changes in net-next branch (by dsa@) v4: - use @pad in struct inet_diag_req_v2 for raw socket protocol specification: raw module carries sockets which may have custom protocol passed from socket() syscall and sole @sdiag_protocol is not enough to match underlied ones - start reporting protocol specifed in socket() call when sockets are raw ones for the same reason: user space tools like ss may parse this attribute and use it for socket matching v5 (by eric.dumazet@): - use sock_hold in raw_sock_get instead of atomic_inc, we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock when looking up so counter won't be zero here. v6: - use sdiag_raw_protocol() helper which will access @pad structure used for raw sockets protocol specification: we can't simply rename this member without breaking uapi v7: - sine sdiag_raw_protocol() helper is not suitable for uapi lets rather make an alias structure with proper names. __check_inet_diag_req_raw helper will catch if any of structure unintentionally changed. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23lwt: Remove unused len fieldThomas Graf3-9/+3
The field is initialized by ILA and MPLS but never used. Remove it. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23net: allow to kill a task which waits net_mutex in copy_new_nsAndrey Vagin1-1/+8
net_mutex can be locked for a long time. It may be because many namespaces are being destroyed or many processes decide to create a network namespace. Both these operations are heavy, so it is better to have an ability to kill a process which is waiting net_mutex. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23net/sched: em_meta: Fix 'meta vlan' to correctly recognize zero VID framesShmulik Ladkani1-4/+5
META_COLLECTOR int_vlan_tag() assumes that if the accel tag (vlan_tci) is zero, then no vlan accel tag is present. This is incorrect for zero VID vlan accel packets, making the following match fail: tc filter add ... basic match 'meta(vlan mask 0xfff eq 0)' ... Apparently 'int_vlan_tag' was implemented prior VLAN_TAG_PRESENT was introduced in 05423b2 "vlan: allow null VLAN ID to be used" (and at time introduced, the 'vlan_tx_tag_get' call in em_meta was not adapted). Fix, testing skb_vlan_tag_present instead of testing skb_vlan_tag_get's value. Fixes: 05423b2413 ("vlan: allow null VLAN ID to be used") Fixes: 1a31f2042e ("netsched: Allow meta match on vlan tag on receive") Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23mlxsw: Convert resources into arrayJiri Pirko8-190/+223
Since the number of resources is going to get much bigger, ease up the addition by simly defining IDs. Convert the existing structure members to a set array, one for validity, one for values. Introduce a set of getters and setters for easy access. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23mlxsw: cmd: Push resource query defines to cmd.hJiri Pirko2-6/+9
Push cmd resource query related defines to cmd.h where they belong. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23mlxsw: reg: Generare register names automaticallyJiri Pirko1-123/+73
Extend the MLXSW_REG_DEFINE macro to store register name in string form. Use this string later on instead of hard coded string values. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23mlxsw: reg: Use helper macro to define registersJiri Pirko1-240/+66
Save some code and also prepare to easily carry name in string form. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23mlxsw: item: Make char *buf arg constant for gettersJiri Pirko4-19/+23
Enforce const for getter buf args. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23mlxsw: item: Make struct mlxsw_item args constJiri Pirko1-12/+15
These should be const, so enforce it. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22reuseport, bpf: add test case for bpf_get_numa_node_idDaniel Borkmann3-4/+263
The test case is very similar to reuseport_bpf_cpu, only that here we select socket members based on current numa node id. # numactl -H available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17 node 0 size: 128867 MB node 0 free: 120080 MB node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23 node 1 size: 96765 MB node 1 free: 87504 MB node distances: node 0 1 0: 10 20 1: 20 10 # ./reuseport_bpf_numa ---- IPv4 UDP ---- send node 0, receive socket 0 send node 1, receive socket 1 send node 1, receive socket 1 send node 0, receive socket 0 ---- IPv6 UDP ---- send node 0, receive socket 0 send node 1, receive socket 1 send node 1, receive socket 1 send node 0, receive socket 0 ---- IPv4 TCP ---- send node 0, receive socket 0 send node 1, receive socket 1 send node 1, receive socket 1 send node 0, receive socket 0 ---- IPv6 TCP ---- send node 0, receive socket 0 send node 1, receive socket 1 send node 1, receive socket 1 send node 0, receive socket 0 SUCCESS Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22bpf: add helper for retrieving current numa node idDaniel Borkmann6-0/+24
Use case is mainly for soreuseport to select sockets for the local numa node, but since generic, lets also add this for other networking and tracing program types. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22udp: use it's own memory accounting schemaPaolo Abeni4-66/+33
Completely avoid default sock memory accounting and replace it with udp-specific accounting. Since the new memory accounting model encapsulates completely the required locking, remove the socket lock on both enqueue and dequeue, and avoid using the backlog on enqueue. Be sure to clean-up rx queue memory on socket destruction, using udp its own sk_destruct. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr readers Kpps (vanilla) Kpps (patched) 1 170 440 3 1250 2150 6 3000 3650 9 4200 4450 12 5700 6250 v4 -> v5: - avoid unneeded test in first_packet_length v3 -> v4: - remove useless sk_rcvqueues_full() call v2 -> v3: - do not set the now unsed backlog_rcv callback v1 -> v2: - add memory pressure support - fixed dropwatch accounting for ipv6 Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22udp: implement memory accounting helpersPaolo Abeni2-0/+110
Avoid using the generic helpers. Use the receive queue spin lock to protect the memory accounting operation, both on enqueue and on dequeue. On dequeue perform partial memory reclaiming, trying to leave a quantum of forward allocated memory. On enqueue use a custom helper, to allow some optimizations: - use a plain spin_lock() variant instead of the slightly costly spin_lock_irqsave(), - avoid dst_force check, since the calling code has already dropped the skb dst - avoid orphaning the skb, since skb_steal_sock() already did the work for us The above needs custom memory reclaiming on shutdown, provided by the udp_destruct_sock(). v5 -> v6: - don't orphan the skb on enqueue v4 -> v5: - replace the mem_lock with the receive queue spin lock - ensure that the bh is always allowed to enqueue at least a skb, even if sk_rcvbuf is exceeded v3 -> v4: - reworked memory accunting, simplifying the schema - provide an helper for both memory scheduling and enqueuing v1 -> v2: - use a udp specific destrctor to perform memory reclaiming - remove a couple of helpers, unneeded after the above cleanup - do not reclaim memory on dequeue if not under memory pressure - reworked the fwd accounting schema to avoid potential integer overflow Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-22net/socket: factor out helpers for memory and queue manipulationPaolo Abeni3-33/+71
Basic sock operations that udp code can use with its own memory accounting schema. No functional change is introduced in the existing APIs. v4 -> v5: - avoid whitespace changes v2 -> v4: - avoid exporting __sock_enqueue_skb v1 -> v2: - avoid export sock_rmem_free Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21net: remove MTU limits on a few ether_setup callersJarod Wilson5-2/+11
These few drivers call ether_setup(), but have no ndo_change_mtu, and thus were overlooked for changes to MTU range checking behavior. They previously had no range checks, so for feature-parity, set their min_mtu to 0 and max_mtu to ETH_MAX_MTU (65535), instead of the 68 and 1500 inherited from the ether_setup() changes. Fine-tuning can come after we get back to full feature-parity here. CC: netdev@vger.kernel.org Reported-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st> CC: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st> CC: R Parameswaran <parameswaran.r7@gmail.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-21hv_netvsc: fix a race between netvsc_send() and netvsc_init_buf()Vitaly Kuznetsov1-0/+7
Fix in commit 880988348270 ("hv_netvsc: set nvdev link after populating chn_table") turns out to be incomplete. A crash in netvsc_get_next_send_section() is observed on mtu change when the device is under load. The race I identified is: if we get to netvsc_send() after we set net_device_ctx->nvdev link in netvsc_device_add() but before we finish netvsc_connect_vsp()->netvsc_init_buf() send_section_map is not allocated and we crash. Unfortunately we can't set net_device_ctx->nvdev link after the netvsc_init_buf() call as during the negotiation we need to receive packets and on the receive path we check for it. It would probably be possible to split nvdev into a pair of nvdev_in and nvdev_out links and check them accordingly in get_outbound_net_device()/ get_inbound_net_device() but this looks like an overkill. Check that send_section_map is allocated in netvsc_send(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20ipv4/6: use core net MTU range checkingJarod Wilson4-33/+12
ipv4/ip_tunnel: - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len - t_hlen - preserve all ndo_change_mtu checks for now to prevent regressions ipv6/ip6_tunnel: - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len - preserve all ndo_change_mtu checks for now to prevent regressions ipv6/ip6_vti: - min_mtu = 1280, max_mtu = 65535 - remove redundant vti6_change_mtu ipv6/sit: - min_mtu = 1280, max_mtu = 0xFFF8 - t_hlen - remove redundant ipip6_tunnel_change_mtu CC: netdev@vger.kernel.org CC: "David S. Miller" <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20s390/net: use net core MTU range checkingJarod Wilson5-29/+8
ctcm: - min_mtu = 576, max_mtu = 65527 netiucv: - min_mtu = 576, max_mtu = 65535 qeth: - min_mtu = 64, max_mtu = 65535 CC: netdev@vger.kernel.org CC: linux-s390@vger.kernel.org CC: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>