linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2017-03-17	igb: Add support for DMA_ATTR_WEAK_ORDERING	Alexander Duyck	2	-3/+6
	Since we are already using DMA attributes in igb for Rx there is no reason why we can't also apply DMA_ATTR_WEAK_ORDERING which is needed on some platforms to improve performance. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-03-17	netfilter: refcounter conversions	Reshetova, Elena	21	-75/+85
	refcount_t type and corresponding API (see include/linux/refcount.h) should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-16	liquidio: fix wrong information about link modes reported to ethtool	Manish Awasthi	1	-4/+10
	Information reported to ethtool about link modes is wrong for 25G NIC. Fix it by checking for presence of 25G NIC, checking the link speed reported by NIC firmware, and then assigning proper values to the ethtool_link_ksettings struct. Signed-off-by: Manish Awasthi <manish.awasthi@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	Merge branch 'netvsc-small-changes'	David S. Miller	3	-22/+27
	Stephen Hemminger says: ==================== netvsc: small changes for net-next One bugfix, and two non-code patches ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	netvsc: remove unused #define	stephen hemminger	1	-3/+0
	Not used anywhere. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	netvsc: add comments about callback's and NAPI	stephen hemminger	1	-1/+12
	Add some short description of how callback's and NAPI interoperate. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	netvsc: avoid race with callback	stephen hemminger	2	-18/+15
	Change the argument to channel callback from the channel pointer to the internal data structure containing per-channel info. This avoids any possible races when callback happens during initialization and makes IRQ code simpler. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	Merge branch 'bpf-inline-lookups'	David S. Miller	9	-65/+261
	Alexei Starovoitov says: ==================== bpf: inline bpf_map_lookup_elem() bpf_map_lookup_elem() is one of the most frequently used helper functions. Improve JITed program performance by inlining this helper. bpf_map_type before after hash 58M 74M array 174M 280M The values are number of lookups per second in ideal conditions measured by micro-benchmark in patch 6. The 'perf report' for HASH map type: before: 54.23% map_perf_test [kernel.kallsyms] [k] __htab_map_lookup_elem 14.24% map_perf_test [kernel.kallsyms] [k] lookup_elem_raw 8.84% map_perf_test [kernel.kallsyms] [k] htab_map_lookup_elem 5.93% map_perf_test [kernel.kallsyms] [k] bpf_map_lookup_elem 2.30% map_perf_test [kernel.kallsyms] [k] bpf_prog_da4fc6a3f41761a2 1.49% map_perf_test [kernel.kallsyms] [k] kprobe_ftrace_handler after: 60.03% map_perf_test [kernel.kallsyms] [k] __htab_map_lookup_elem 18.07% map_perf_test [kernel.kallsyms] [k] lookup_elem_raw 2.91% map_perf_test [kernel.kallsyms] [k] bpf_prog_da4fc6a3f41761a2 1.94% map_perf_test [kernel.kallsyms] [k] _einittext 1.90% map_perf_test [kernel.kallsyms] [k] __audit_syscall_exit 1.72% map_perf_test [kernel.kallsyms] [k] kprobe_ftrace_handler so the cost of htab_map_lookup_elem() and bpf_map_lookup_elem() is gone after inlining. 'per-cpu' and 'lru' map types can be optimized similarly in the future. Note the sparse will complain that bpf is addictive ;) kernel/bpf/hashtab.c:438:19: sparse: subtraction of functions? Share your drugs kernel/bpf/verifier.c:3342:38: sparse: subtraction of functions? Share your drugs it's not a new warning, just in new places. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	samples/bpf: add map_lookup microbenchmark	Alexei Starovoitov	2	-0/+65
	$ map_perf_test 128 speed of HASH bpf_map_lookup_elem() in lookups per second w/o JIT w/JIT before 46M 58M after 42M 74M perf report before: 54.23% map_perf_test [kernel.kallsyms] [k] __htab_map_lookup_elem 14.24% map_perf_test [kernel.kallsyms] [k] lookup_elem_raw 8.84% map_perf_test [kernel.kallsyms] [k] htab_map_lookup_elem 5.93% map_perf_test [kernel.kallsyms] [k] bpf_map_lookup_elem 2.30% map_perf_test [kernel.kallsyms] [k] bpf_prog_da4fc6a3f41761a2 1.49% map_perf_test [kernel.kallsyms] [k] kprobe_ftrace_handler after: 60.03% map_perf_test [kernel.kallsyms] [k] __htab_map_lookup_elem 18.07% map_perf_test [kernel.kallsyms] [k] lookup_elem_raw 2.91% map_perf_test [kernel.kallsyms] [k] bpf_prog_da4fc6a3f41761a2 1.94% map_perf_test [kernel.kallsyms] [k] _einittext 1.90% map_perf_test [kernel.kallsyms] [k] __audit_syscall_exit 1.72% map_perf_test [kernel.kallsyms] [k] kprobe_ftrace_handler Notice that bpf_map_lookup_elem() and htab_map_lookup_elem() are trivial functions, yet they take sizeable amount of cpu time. htab_map_gen_lookup() removes bpf_map_lookup_elem() and converts htab_map_lookup_elem() into three BPF insns which causing cpu time for bpf_prog_da4fc6a3f41761a2() slightly increase. $ map_perf_test 256 speed of ARRAY bpf_map_lookup_elem() in lookups per second w/o JIT w/JIT before 97M 174M after 64M 280M before: 37.33% map_perf_test [kernel.kallsyms] [k] array_map_lookup_elem 13.95% map_perf_test [kernel.kallsyms] [k] bpf_map_lookup_elem 6.54% map_perf_test [kernel.kallsyms] [k] bpf_prog_da4fc6a3f41761a2 4.57% map_perf_test [kernel.kallsyms] [k] kprobe_ftrace_handler after: 32.86% map_perf_test [kernel.kallsyms] [k] bpf_prog_da4fc6a3f41761a2 6.54% map_perf_test [kernel.kallsyms] [k] kprobe_ftrace_handler array_map_gen_lookup() removes calls to array_map_lookup_elem() and bpf_map_lookup_elem() and replaces them with 7 bpf insns. The performance without JIT is slower, since executing extra insns in the interpreter is slower than running native C code, but with JIT the performance gains are obvious, since native C->x86 code is replaced with fewer bpf->x86 instructions. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	bpf: inline htab_map_lookup_elem()	Alexei Starovoitov	1	-1/+30
	Optimize: bpf_call bpf_map_lookup_elem map->ops->map_lookup_elem htab_map_lookup_elem __htab_map_lookup_elem into: bpf_call __htab_map_lookup_elem to improve performance of JITed programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	bpf: add helper inlining infra and optimize map_array lookup	Alexei Starovoitov	5	-4/+77
	Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem() into a sequence of bpf instructions. When JIT is on the sequence of bpf instructions is the sequence of native cpu instructions with significantly faster performance than indirect call and two function's prologue/epilogue. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	bpf: adjust insn_aux_data when patching insns	Alexei Starovoitov	1	-5/+39
	convert_ctx_accesses() replaces single bpf instruction with a set of instructions. Adjust corresponding insn_aux_data while patching. It's needed to make sure subsequent 'for(all insn)' loops have matching insn and insn_aux_data. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	bpf: refactor fixup_bpf_calls()	Alexei Starovoitov	1	-41/+35
	reduce indent and make it iterate over instructions similar to convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	bpf: move fixup_bpf_calls() function	Alexei Starovoitov	2	-56/+57
	no functional change. move fixup_bpf_calls() to verifier.c it's being refactored in the next patch Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	tcp: remove tcp_tw_recycle	Soheil Hassas Yeganeh	9	-59/+9
	The tcp_tw_recycle was already broken for connections behind NAT, since the per-destination timestamp is not monotonically increasing for multiple machines behind a single destination address. After the randomization of TCP timestamp offsets in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection), the tcp_tw_recycle is broken for all types of connections for the same reason: the timestamps received from a single machine is not monotonically increasing, anymore. Remove tcp_tw_recycle, since it is not functional. Also, remove the PAWSPassive SNMP counter since it is only used for tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req since the strict argument is only set when tcp_tw_recycle is enabled. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	tcp: remove per-destination timestamp cache	Soheil Hassas Yeganeh	6	-179/+11
	Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection) randomizes TCP timestamps per connection. After this commit, there is no guarantee that the timestamps received from the same destination are monotonically increasing. As a result, the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts in struct tcp_metrics_block) is broken and cannot be relied upon. Remove the per-destination timestamp cache and all related code paths. Note that this cache was already broken for caching timestamps of multiple machines behind a NAT sharing the same address. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Lutz Vieweg <lvml@5t9.de> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	Merge branch 'sunvnet-better-connection-management'	David S. Miller	4	-25/+201
	Shannon Nelson says: ==================== sunvnet: better connection management These patches remove some problems in handling of carrier state with the ldmvsw vswitch, remove an xoff misuse in sunvnet, and add stats for debug and tracking of point-to-point connections between the ldom VMs. v2: - added ldmvsw ndo_open to reset the LDC channel - updated copyrights ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	sunvnet: xoff not needed when removing port link	Shannon Nelson	1	-4/+0
	The sunvnet netdev is connected to the controlling ldom's vswitch for network bridging. However, for higher performance between ldoms, there also is a channel between each client ldom. These connections are represented in the sunvnet driver by a queue for each ldom. The driver uses select_queue to tell the stack which queue to use by tracking the mac addresses on the other end of each port. When a connected ldom shuts down, the driver receives an LDC_EVENT_RESET and the port is removed from the driver, thus a queue with no ldom on the other end will never be selected for Tx. The driver was trying to reinforce the "don't use this queue" notion with netif_tx_stop_queue() and netif_tx_wake_queue(), which really should only be used to signal a Tx queue is full (aka XOFF). This misuse of queue state resulted in NETDEV WATCHDOG messages and lots of unnecessary calls into the driver's tx_timeout handler. Simply removing these takes care of the problem. Orabug: 25190537 Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	sunvnet: count multicast packets	Shannon Nelson	1	-0/+2
	Make sure multicast packets get counted in the device. Orabug: 25190537 Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	sunvnet: track port queues correctly	Shannon Nelson	2	-13/+22
	Track our used and unused queue indexies correctly. Otherwise, as ports dropped out and returned, they all eventually ended up with the same queue index. Orabug: 25190537 Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	sunvnet: add stats to track ldom to ldom packets and bytes	Shannon Nelson	3	-1/+136
	In this driver, there is a "port" created for the connection to each of the other ldoms; a netdev queue is mapped to each port, and they are collected under a single netdev. The generic netdev statistics show us all the traffic in and out of our network device, but don't show individual queue/port stats. This patch breaks out the traffic counts for the individual ports and gives us a little view into the state of those connections. Orabug: 25190537 Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	ldmvsw: better use of link up and down on ldom vswitch	Shannon Nelson	3	-7/+41
	When an ldom VM is bound, the network vswitch infrastructure is set up for it, but was being forced 'UP' by the userland switch configuration script. When 'UP' but not actually connected to a running VM, the ipv6 neighbor probes fail (not a horrible thing) and start cluttering up the kernel logs. Funny thing: these are debug messages that never actually show up, but we do see the net_ratelimited messages that say N callbacks were suppressed. This patch defers the netif_carrier_on() until an actual link has been established with the VM, as indicated by receiving an LDC_EVENT_UP from the underlying LDC protocol. Similarly, we take the link down when we see the LDC_EVENT_RESET. Now when we see the ndo_open(), we reset the link to get things talking again. Orabug: 25525312 Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	bonding: add 802.3ad support for 25G speeds	Jarod Wilson	1	-0/+9
	Cut-n-paste enablement of 802.3ad bonding on 25G NICs, which currently report 0 as their bandwidth. CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Acked-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	tcp_westwood: fix tcp_westwood_info() style mistakes	chun Long	1	-2/+2
	replace comma to semi colons in tcp_westwood_info(). Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	liquidio: use meaningful names for IRQs	Rick Farrington	4	-16/+101
	All IRQs owned by the PF and VF drivers share the same nondescript name "octeon"; this makes it difficult to setup interrupt affinity. Change the IRQ names to reflect their specific purpose: LiquidIO<id>-<func>-<type>-<queue pair num> Examples: LiquidIO0-pf0-rxtx-3 LiquidIO1-vf1-rxtx-0 LiquidIO0-pf0-aux We cannot use netdev->name for naming the IRQs because: 1. Early during init, the PF and VF drivers require interrupts to send/receive control data from the NIC firmware; so the PF and VF must request IRQs long before the netdev struct is registered. 2. The IRQ name can only be specified at the time it is requested. It cannot be changed after that. Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: Satanand Burla <satananda.burla@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	liquidio: remove/replace invalid code	Rick Farrington	1	-16/+10
	Remove invalid call to dma_sync_single_for_cpu() because previous DMA allocation was coherent--not streaming. Remove code that references fields in struct list_head; replace it with calls to list_empty() and list_first_entry(). Also, add comment to clarify complicated if statement. Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: Derek Chickles <derek.chickles@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	netem: apply correct delay when rate throttling	Nik Unger	1	-8/+18
	I recently reported on the netem list that iperf network benchmarks show unexpected results when a bandwidth throttling rate has been configured for netem. Specifically: 1) The measured link bandwidth increases when a higher delay is added 2) The measured link bandwidth appears higher than the specified limit 3) The measured link bandwidth for the same very slow settings varies significantly across machines The issue can be reproduced by using tc to configure netem with a 512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a veth pair between network namespaces, and then using iperf (or any other network benchmarking tool) to test throughput. Complete detailed instructions are in the original email chain here: https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html There appear to be two underlying bugs causing these effects: - The first issue causes long delays when the rate is slow and no delay is configured (e.g., "rate 512kbit"). This is because SKBs are not orphaned when no delay is configured, so orphaning does not occur until after the rate-induced delay has been applied. For this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us") dramatically increases the measured bandwidth. - The second issue is that rate-induced delays are not correctly applied, allowing SKB delays to occur in parallel. The indended approach is to compute the delay for an SKB and to add this delay to the end of the current queue. However, the code does not detect existing SKBs in the queue due to improperly testing sch->q.qlen, which is nonzero even when packets exist only in the rbtree. Consequently, new SKBs do not wait for the current queue to empty. When packet delays vary significantly (e.g., if packet sizes are different), then this also causes unintended reordering. I modified the code to expect a delay (and orphan the SKB) when a rate is configured. I also added some defensive tests that correctly find the latest scheduled delivery time, even if it is (unexpectedly) for a packet in sch->q. I have tested these changes on the latest kernel (4.11.0-rc1+) and the iperf / ping test results are as expected. Signed-off-by: Nik Unger <njunger@uwaterloo.ca> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	Merge branch 'sched-cleanups'	David S. Miller	2	-4/+2
	Or Gerlitz says: ==================== small set of sched cleanups Just two cleanups -- but for the 2nd one I think we need ack from Cong Wang to make sure this isn't actually a bug report.. changes from V1: - addressed comment from Sergei to use 12 hex digits etc ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	net/sched: fq_codel: Avoid set-but-unused variable	Or Gerlitz	1	-2/+0
	The code introduced by commit 2ccccf5fb43f ("net_sched: update hierarchical backlog too") only sets prev_backlog in fq_codel_dequeue() but not using that anywhere, remove that setting. Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	net/sched: act_ife: Staticfy find_decode_metaid()	Or Gerlitz	1	-2/+2
	As it's used only on that file. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	net: ethernet: bgmac: Allow MAC address to be specified in DTB	Steve Lin	1	-16/+23
	Allows the BCMA version of the bgmac driver to obtain MAC address from the device tree. If no MAC address is specified there, then the previous behavior (obtaining MAC address from SPROM) is used. Signed-off-by: Steve Lin <steven.lin1@broadcom.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Jon Mason <jon.mason@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	net: ethernet: fs_enet: Remove useless includes	Christophe Leroy	2	-12/+0
	CONFIG_8xx is being deprecated. Since the includes dependent on CONFIG_8xx are useless, just drop them. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	isdn: hardware: mISDN: Remove reference to CONFIG_8xx	Christophe Leroy	2	-4/+4
	CONFIG_8xx is deprecated and should soon be removed in favor of CONFIG_PPC_8xx. Anyway, hfc_multi_8xx.h only uses 8xx I/O ports which are linked to the CPM1 communication processor included in the 8xx rather than the 8xx itself. This patch therefore makes it dependent on CONFIG_CPM1 instead, like several other drivers. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	net: mvneta: support suspend and resume	Jane Li	1	-4/+57
	Add basic support for handling suspend and resume. Signed-off-by: Jane Li <jiel@marvell.com> Reviewed-by: Jisheng Zhang <jszhang@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	Merge branch 'mlxsw-vrf'	David S. Miller	9	-20/+254
	Jiri Pirko says: ==================== mlxsw: Enable VRF offload Ido says: Packets received from netdevs enslaved to different VRF devices are forwarded using different FIB tables. In the Spectrum ASIC this is achieved by binding different router interfaces (RIFs) to different virtual routers (VRs). Each RIF represents an enslaved netdev and each VR has its own FIB table according to which packets are forwarded. The first three patches add an helper to check if a FIB rule is a default rule and extend the FIB notification chain to include the rule's info as part of the RULE_{ADD,DEL} events. This allows offloading drivers to sanitize the rules they don't support and flush their tables. The fourth patch introduces a small change in the VRF driver to allow capable drivers to more easily offload VRFs. Finally, the last patches gradually add support for VRFs in the mlxsw driver. First, on top of port netdevs, stacked LAG and VLAN devices and then on top of bridges. Some limitations I would like to point out: 1) The old model where 'oif' / 'iif' rules were programmed for each L3 master device isn't supported. Upon insertion of these rules the driver will flush its tables and forwarding will be done by the kernel instead. It's inferior in every way to the single 'l3mdev' rule, so this shouldn't be an issue. 2) Inter-VRF routes pointing to a VRF device aren't offloaded. Packets hitting these routes will be forwarded by the kernel. Inter-VRF routes pointing to netdevs enslaved to a different VRF are offloaded. 3) There's a small discrepancy between the kernel's datapath and the device's. By default, packets forwarded by the kernel first do a lookup in the local table and then in the VRF's table (assuming no match). In the device, lookup is done only in the VRF's table, which is probably the intended behavior. Changes in v2 allow user to properly re-order the default rules without triggering the abort mechanism. Changes in v3: * Remove 'l3mdev' from the matchall list, as it's related to the action and not the selector (David Ahern). * Use container_of() instead of typecasting (David Ahern). * Add David's Acked-by to the second patch. * Add an helper in IPv4 code to check if rule is a default rule (David Ahern). Changes in v2: * Drop default rule indication and allow re-ordering of default rules (David Ahern). * Remove ifdef around 'struct fib_rule_notifier_info' and drop redundant dependency on IP_MULTIPLE_TABLES from rocker and mlxsw. * Add David's Acked-by to the fourth patch. * Remove netif_is_vrf_master() and use netif_is_l3_master() instead (David Ahern). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	mlxsw: spectrum_router: Don't abort on l3mdev rules	Ido Schimmel	1	-1/+1
	Now that port netdevs can be enslaved to a VRF master we need to make sure the device's routing tables won't be flushed upon the insertion of a l3mdev rule. Note that we assume the notified l3mdev rule is a simple rule as used by the VRF master. We don't check for the presence of other selectors such as 'iif' and 'oif'. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	mlxsw: spectrum_router: Add support for VRFs on top of bridges	Ido Schimmel	3	-1/+81
	In a similar fashion to the previous patch, allow bridges and VLAN devices on top of bridges to be enslaved to a VRF master device. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	mlxsw: spectrum_router: Add support for VRFs	Ido Schimmel	3	-3/+61
	Allow port netdevs, LAG and VLAN devices stacked on top of these to be enslaved to a VRF master device. Upon enslavement, create a router interface (RIF) for the enslaved netdev and associate it with a virtual router (VR) based on the VRF's table ID. If a RIF already exists for the netdev (f.e., due to the existence of an IP address), then it's deleted and a new one is created with the appropriate VR binding. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	mlxsw: spectrum_router: Don't destroy RIF if L3 slave	Ido Schimmel	1	-1/+2
	We usually destroy the netdev's router interface (RIF) when the last IP address is removed from it. However, we shouldn't do that if it's enslaved to an L3 master device. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	mlxsw: spectrum_router: Associate RIFs with correct VR	Ido Schimmel	1	-2/+5
	When a router interface (RIF) is created due to a netdev being enslaved to a VRF master, then it should be associated with the appropriate virtual router (VR) and not the default one. If netdev is a VRF slave, lookup the VR based on the VRF's table ID. Otherwise default to the MAIN table. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	net: vrf: Set slave's private flag before linking	Ido Schimmel	1	-2/+6
	Allow listeners of the subsequent CHANGEUPPER notification to retrieve the VRF's table ID by calling l3mdev_fib_table() with the slave netdev. Without this change, the netdev won't be considered an L3 slave and the function would return 0. This is consistent with other master device such as bridge and bond that set the slave's private flag before linking. It also makes do_vrf_{add,del}_slave() symmetric. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	ipv4: fib_rules: Dump FIB rules when registering FIB notifier	Ido Schimmel	3	-6/+43
	In commit c3852ef7f2f8 ("ipv4: fib: Replay events when registering FIB notifier") we dumped the FIB tables and replayed the events to the passed notification block. However, we merely sent a RULE_ADD notification in case custom rules were in use. As explained in previous patches, this approach won't work anymore. Instead, we should notify the caller about all the FIB rules and let it act accordingly. Upon registration to the FIB notification chain, replay a RULE_ADD notification for each programmed FIB rule, custom or not. The integrity of the dump is ensured by the mechanism introduced in the above mentioned commit. Prevent regressions by making sure current listeners correctly sanitize the notified rules. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	ipv4: fib_rules: Add notifier info to FIB rules notifications	Ido Schimmel	2	-5/+13
	Whenever a FIB rule is added or removed, a notification is sent in the FIB notification chain. However, listeners don't have a way to tell which rule was added or removed. This is problematic as we would like to give listeners the ability to decide which action to execute based on the notified rule. Specifically, offloading drivers should be able to determine if they support the reflection of the notified FIB rule and flush their LPM tables in case they don't. Do that by adding a notifier info to these notifications and embed the common FIB rule struct in it. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	ipv4: fib_rules: Check if rule is a default rule	Ido Schimmel	4	-0/+43
	Currently, when non-default (custom) FIB rules are used, devices capable of layer 3 offloading flush their tables and let the kernel do the forwarding instead. When these devices' drivers are loaded they register to the FIB notification chain, which lets them know about the existence of any custom FIB rules. This is done by sending a RULE_ADD notification based on the value of 'net->ipv4.fib_has_custom_rules'. This approach is problematic when VRF offload is taken into account, as upon the creation of the first VRF netdev, a l3mdev rule is programmed to direct skbs to the VRF's table. Instead of merely reading the above value and sending a single RULE_ADD notification, we should iterate over all the FIB rules and send a detailed notification for each, thereby allowing offloading drivers to sanitize the rules they don't support and potentially flush their tables. While l3mdev rules are uniquely marked, the default rules are not. Therefore, when they are being notified they might invoke offloading drivers to unnecessarily flush their tables. Solve this by adding an helper to check if a FIB rule is a default rule. Namely, its selector should match all packets and its action should point to the local, main or default tables. As noted by David Ahern, uniquely marking the default rules is insufficient. When using VRFs, it's common to avoid false hits by moving the rule for the local table to just before the main table: Default configuration: $ ip rule show 0: from all lookup local 32766: from all lookup main 32767: from all lookup default Common configuration with VRFs: $ ip rule show 1000: from all lookup [l3mdev-table] 32765: from all lookup local 32766: from all lookup main 32767: from all lookup default Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	r8152: simply the arguments	hayeswang	1	-17/+26
	Replace &tp->napi with napi and tp->netdev with netdev. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16	ipvs: Document sysctl pmtu_disc	Hangbin Liu	1	-0/+8
	Document sysctl pmtu_disc based on commit 3654e61137db ("ipvs: add pmtu_disc option to disable IP DF for TUN packets"). Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-16	ipvs: Document sysctl sync_ports	Hangbin Liu	1	-0/+8
	Document sysctl sync_ports based on commit f73181c8288f ("ipvs: add support for sync threads"). Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-16	ipvs: Document sysctl sync_qlen_max and sync_sock_size	Hangbin Liu	1	-0/+14
	Document sysctl sync_qlen_max and sync_sock_size based on commit 1c003b1580e2 ("ipvs: wakeup master thread"). Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-16	ipvs: fix sync_threshold description and add sync_refresh_period, sync_retries	Hangbin Liu	1	-9/+31
	Fix sync_threshold description which should have two values. Also add sync_refresh_period and sync_retries based on commit 749c42b620a9 ("ipvs: reduce sync rate with time thresholds"). Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-16	ipvs: remove an annoying printk in netns init	Cong Wang	1	-2/+0
	At most it is used for debugging purpose, but I don't think it is even useful for debugging, just remove it. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>