aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-04-26neigh: align nlattr properly when neededNicolas Dichtel1-1/+2
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26rtnl: align nlattr properly when neededNicolas Dichtel1-2/+2
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26ovs: align nlattr properly when neededNicolas Dichtel1-12/+15
I also fix commit 8b32ab9e6ef1: use nla_total_size_64bit() for OVS_FLOW_ATTR_USED in ovs_flow_cmd_msg_size(). Fixes: 8b32ab9e6ef1 ("ovs: use nla_put_u64_64bit()") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26sock_diag: align nlattr properly when neededNicolas Dichtel3-6/+10
I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f190 which is only in net-next right now, thus I didn't make a separate patch. Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26ila: add checksum neutral ILA translationsTom Herbert4-15/+105
Support checksum neutral ILA as described in the ILA draft. The low order 16 bits of the identifier are used to contain the checksum adjustment value. The csum-mode parameter is added to described checksum processing. There are three values: - adjust transport checksum (previous behavior) - do checksum neutral mapping - do nothing On output the csum-mode in the ila_params is checked and acted on. If mode is checksum neutral mapping then to mapping and set C-bit. On input, C-bit is checked. If it is set checksum-netural mapping is done (regardless of csum-mode in ila params) and C-bit will be cleared. If it is not set then action in csum-mode is taken. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26ila: xlat changesTom Herbert1-69/+34
Change model of xlat to be used only for input where lookup is done on the locator part of an address (comparing to locator_match as key in rhashtable). This is needed for checksum neutral translation which obfuscates the low order 16 bits of the identifier. It also permits hosts to be in muliple ILA domains (each locator can map to a different SIR address). A check is also added to disallow translating non-ILA addresses (check of type in identifier). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26ila: Add struct definitions and helpersTom Herbert4-82/+161
Add structures for identifiers, locators, and an ila address which is composed of a locator and identifier and in6_addr can be cast to it. This includes a three bit type field and enums for the types defined in ILA I-D. In ILA lwt don't allow user to set a translation for a non-ILA address (type of identifier is zero meaning it is an IID). This also requires that the destination prefix is at least 65 bytes (64 bit locator and first byte of identifier). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25RDS: TCP: Call pskb_extract() helper functionSowmini Varadhan1-11/+3
rds-stress experiments with request size 256 bytes, 8K acks, using 16 threads show a 40% improvment when pskb_extract() replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);} pattern in the Rx path, so we leverage the perf gain with this commit. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25skbuff: Add pskb_extract() helper functionSowmini Varadhan1-0/+242
A pattern of skb usage seen in modules such as RDS-TCP is to extract `to_copy' bytes from the received TCP segment, starting at some offset `off' into a new skb `clone'. This is done in the ->data_ready callback, where the clone skb is queued up for rx on the PF_RDS socket, while the parent TCP segment is returned unchanged back to the TCP engine. The existing code uses the sequence clone = skb_clone(..); pskb_pull(clone, off, ..); pskb_trim(clone, to_copy, ..); with the intention of discarding the first `off' bytes. However, skb_clone() + pskb_pull() implies pksb_expand_head(), which ends up doing a redundant memcpy of bytes that will then get discarded in __pskb_pull_tail(). To avoid this inefficiency, this commit adds pskb_extract() that creates the clone, and memcpy's only the relevant header/frag/frag_list to the start of `clone'. pskb_trim() is then invoked to trim clone down to the requested to_copy bytes. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25codel: split into multiple filesMichal Kazior2-0/+4
It was impossible to include codel.h for the purpose of having access to codel_params or codel_vars structure definitions and using them for embedding in other more complex structures. This splits allows codel.h itself to be treated like any other header file while codel_qdisc.h and codel_impl.h contain function definitions with logic that was previously in codel.h. This copies over copyrights and doesn't involve code changes other than adding a few additional include directives to net/sched/sch*codel.c. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25codel: generalize the implementationMichal Kazior2-7/+32
This strips out qdisc specific bits from the code and makes it slightly more reusable. Codel will be used by wireless/mac80211 in the future. Signed-off-by: Michal Kazior <michal.kazior@tieto.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25net: better drop monitoring in ip{6}_recv_error()Eric Dumazet2-10/+10
We should call consume_skb(skb) when skb is properly consumed, or kfree_skb(skb) when skb must be dropped in error case. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25tcp: SYN packets are now simply consumedEric Dumazet1-18/+1
We now have proper per-listener but also per network namespace counters for SYN packets that might be dropped. We replace the kfree_skb() by consume_skb() to be drop monitor [1] friendly, and remove an obsolete comment. FastOpen SYN packets can carry payload in them just fine. [1] perf record -a -g -e skb:kfree_skb sleep 1; perf report Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25wireless: use nla_put_u64_64bit()Nicolas Dichtel1-36/+55
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25netfilter/ipvs: use nla_put_u64_64bit()Nicolas Dichtel1-12/+24
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25ieee802154: use nla_put_u64_64bit()Nicolas Dichtel2-7/+13
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25l2tp: use nla_put_u64_64bit()Nicolas Dichtel1-32/+48
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25bridge: use nla_put_u64_64bit()Nicolas Dichtel1-26/+36
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25ovs: use nla_put_u64_64bit()Nicolas Dichtel1-1/+2
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25ipv6: use nla_put_u64_64bit()Nicolas Dichtel2-7/+11
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25sched: use nla_put_u64_64bit()Nicolas Dichtel3-5/+10
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25rtnl: use nla_put_u64_64bit()Nicolas Dichtel1-18/+18
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25soreuseport: Resolve merge conflict for v4/v6 ordering fixCraig Gallek1-1/+5
d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets") was merged as a bug fix to the net tree. Two conflicting changes were committed to net-next before the above fix was merged back to net-next: ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") These changes switched the datastructure used for TCP and UDP sockets from hlist_nulls to hlist. This patch applies the necessary parts of the net tree fix to net-next which were not automatic as part of the merge. Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24tcp-tso: do not split TSO packets at retransmit timeEric Dumazet3-38/+32
Linux TCP stack painfully segments all TSO/GSO packets before retransmits. This was fine back in the days when TSO/GSO were emerging, with their bugs, but we believe the dark age is over. Keeping big packets in write queues, but also in stack traversal has a lot of benefits. - Less memory overhead, because write queues have less skbs - Less cpu overhead at ACK processing. - Better SACK processing, as lot of studies mentioned how awful linux was at this ;) - Less cpu overhead to send the rtx packets (IP stack traversal, netfilter traversal, drivers...) - Better latencies in presence of losses. - Smaller spikes in fq like packet schedulers, as retransmits are not constrained by TCP Small Queues. 1 % packet losses are common today, and at 100Gbit speeds, this translates to ~80,000 losses per second. Losses are often correlated, and we see many retransmit events leading to 1-MSS train of packets, at the time hosts are already under stress. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24tipc: fix stale links after re-enabling bearerParthasarathy Bhuvaragan1-2/+1
Commit 42b18f605fea ("tipc: refactor function tipc_link_timeout()"), introduced a bug which prevents sending of probe messages during link synchronization phase. This leads to hanging links, if the bearer is disabled/enabled after links are up. In this commit, we send the probe messages correctly. Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24tcp: Merge txstamp_ack in tcp_skb_collapse_tstampMartin KaFai Lau1-0/+2
When collapsing skbs, txstamp_ack also needs to be merged. Retrans Collapse Test: ~~~~~~ 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 730) = 730 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 730) = 730 +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0 0.200 write(4, ..., 11680) = 11680 0.200 > P. 1:731(730) ack 1 0.200 > P. 731:1461(730) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:13141(4380) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 13141 win 257 BPF Output Before: ~~~~~ <No output due to missing SCM_TSTAMP_ACK timestamp> BPF Output After: ~~~~~ <...>-2027 [007] d.s. 79.765921: : ee_data:1459 Sacks Collapse Test: ~~~~~ 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 1460) = 1460 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 13140) = 13140 +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0 0.200 > P. 1:1461(1460) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:14601(5840) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 14601 win 257 BPF Output Before: ~~~~~ <No output due to missing SCM_TSTAMP_ACK timestamp> BPF Output After: ~~~~~ <...>-2049 [007] d.s. 89.185538: : ee_data:14599 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24tcp: Carry txstamp_ack in tcp_fragment_tstampMartin KaFai Lau1-0/+2
When a tcp skb is sliced into two smaller skbs (e.g. in tcp_fragment() and tso_fragment()), it does not carry the txstamp_ack bit to the newly created skb if it is needed. The end result is a timestamping event (SCM_TSTAMP_ACK) will be missing from the sk->sk_error_queue. This patch carries this bit to the new skb2 in tcp_fragment_tstamp(). BPF Output Before: ~~~~~~ <No output due to missing SCM_TSTAMP_ACK timestamp> BPF Output After: ~~~~~~ <...>-2050 [000] d.s. 100.928763: : ee_data:14599 Packetdrill Script: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 14600) = 14600 +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0 0.200 > . 1:7301(7300) ack 1 0.200 > P. 7301:14601(7300) ack 1 0.300 < . 1:1(0) ack 14601 win 257 0.300 close(4) = 0 0.300 > F. 14601:14601(0) ack 1 0.400 < F. 1:1(0) ack 16062 win 257 0.400 > . 14602:14602(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller11-814/+578
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, mostly from Florian Westphal to sort out the lack of sufficient validation in x_tables and connlabel preparation patches to add nf_tables support. They are: 1) Ensure we don't go over the ruleset blob boundaries in mark_source_chains(). 2) Validate that target jumps land on an existing xt_entry. This extra sanitization comes with a performance penalty when loading the ruleset. 3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables. 4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables. 5) Make sure the minimal possible target size in x_tables. 6) Similar to #3, add xt_compat_check_entry_offsets() for compat code. 7) Check that standard target size is valid. 8) More sanitization to ensure that the target_offset field is correct. 9) Add xt_check_entry_match() to validate that matches are well-formed. 10-12) Three patch to reduce the number of parameters in translate_compat_table() for {arp,ip,ip6}tables by using a container structure. 13) No need to return value from xt_compat_match_from_user(), so make it void. 14) Consolidate translate_table() so it can be used by compat code too. 15) Remove obsolete check for compat code, so we keep consistent with what was already removed in the native layout code (back in 2007). 16) Get rid of target jump validation from mark_source_chains(), obsoleted by #2. 17) Introduce xt_copy_counters_from_user() to consolidate counter copying, and use it from {arp,ip,ip6}tables. 18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump functions. 19) Move nf_connlabel_match() to xt_connlabel. 20) Skip event notification if connlabel did not change. 21) Update of nf_connlabels_get() to make the upcoming nft connlabel support easier. 23) Remove spinlock to read protocol state field in conntrack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23xfrm: align nlattr properly when neededNicolas Dichtel1-4/+6
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23libnl: nla_put_msecs(): align on a 64-bit areaNicolas Dichtel3-12/+16
nla_data() is now aligned on a 64-bit area. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23libnl: nla_put_be64(): align on a 64-bit areaNicolas Dichtel11-34/+60
nla_data() is now aligned on a 64-bit area. A temporary version (nla_put_be64_32bit()) is added for nla_put_net64(). This function is removed in the next patch. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23libnl: nla_put_le64(): align on a 64-bit areaNicolas Dichtel1-5/+8
nla_data() is now aligned on a 64-bit area. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller31-136/+306
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds29-133/+298
Pull networking fixes from David Miller: 1) Fix memory leak in iwlwifi, from Matti Gottlieb. 2) Add missing registration of netfilter arp_tables into initial namespace, from Florian Westphal. 3) Fix potential NULL deref in DecNET routing code. 4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry Ivanov. 5) Fix dst ref counting in VRF, from David Ahern. 6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck. 7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause. 8) Ravalidate IPV6 datagram socket cached routes properly, particularly with UDP, from Martin KaFai Lau. 9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang. 10) Fix stats typing in bcmgenet driver, from Eric Dumazet. 11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handing, from Joe Stringer. 12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown. 13) atl2 doesn't actually support scatter-gather, so don't advertise the feature. From Ben Hucthings. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits) openvswitch: use flow protocol when recalculating ipv6 checksums Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets atl2: Disable unimplemented scatter/gather feature net/mlx4_en: Split SW RX dropped counter per RX ring net/mlx4_core: Don't allow to VF change global pause settings net/mlx4_core: Avoid repeated calls to pci enable/disable net/mlx4_core: Implement pci_resume callback net: phy: spi_ks8895: Don't leak references to SPI devices net: ethernet: davinci_emac: Fix platform_data overwrite net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable qede: Fix single MTU sized packet from firmware GRO flow qede: Fix setting Skb network header qede: Fix various memory allocation error flows for fastpath tcp: Merge tx_flags and tskey in tcp_shifted_skb tcp: Merge tx_flags and tskey in tcp_collapse_retrans drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks openvswitch: Orphan skbs before IPv6 defrag Revert "Prevent NUll pointer dereference with two PHYs on cpsw" VSOCK: Only check error on skb_recv_datagram when skb is NULL ...
2016-04-21openvswitch: use flow protocol when recalculating ipv6 checksumsSimon Horman1-2/+2
When using masked actions the ipv6_proto field of an action to set IPv6 fields may be zero rather than the prevailing protocol which will result in skipping checksum recalculation. This patch resolves the problem by relying on the protocol in the flow key rather than that in the set field action. Fixes: 83d2b9ba1abc ("net: openvswitch: Support masked set actions.") Cc: Jarno Rajahalme <jrajahalme@nicira.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21net: Add support for IP ID mangling TSO in cases that require encapsulationAlexander Duyck1-0/+11
This patch adds support for NETIF_F_TSO_MANGLEID if a given tunnel supports NETIF_F_TSO. This way if needed a device can then later enable the TSO with IP ID mangling and the tunnels on top of that device can then also make use of the IP ID mangling as well. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21tcp: Merge tx_flags and tskey in tcp_shifted_skbMartin KaFai Lau2-2/+3
After receiving sacks, tcp_shifted_skb() will collapse skbs if possible. tx_flags and tskey also have to be merged. This patch reuses the tcp_skb_collapse_tstamp() to handle them. BPF Output Before: ~~~~~ <no-output-due-to-missing-tstamp-event> BPF Output After: ~~~~~ <...>-2024 [007] d.s. 88.644374: : ee_data:14599 Packetdrill Script: ~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 1460) = 1460 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 13140) = 13140 0.200 > P. 1:1461(1460) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:14601(5840) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 14601 win 257 0.400 close(4) = 0 0.400 > F. 14601:14601(0) ack 1 0.500 < F. 1:1(0) ack 14602 win 257 0.500 > . 14602:14602(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21tcp: Merge tx_flags and tskey in tcp_collapse_retransMartin KaFai Lau1-0/+16
If two skbs are merged/collapsed during retransmission, the current logic does not merge the tx_flags and tskey. The end result is the SCM_TSTAMP_ACK timestamp could be missing for a packet. The patch: 1. Merge the tx_flags 2. Overwrite the prev_skb's tskey with the next_skb's tskey BPF Output Before: ~~~~~~ <no-output-due-to-missing-tstamp-event> BPF Output After: ~~~~~~ packetdrill-2092 [001] d.s. 453.998486: : ee_data:1459 Packetdrill Script: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 730) = 730 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 730) = 730 +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0 0.200 write(4, ..., 11680) = 11680 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 > P. 1:731(730) ack 1 0.200 > P. 731:1461(730) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:13141(4380) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 13141 win 257 0.400 close(4) = 0 0.400 > F. 13141:13141(0) ack 1 0.500 < F. 1:1(0) ack 13142 win 257 0.500 > . 13142:13142(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21ip6mr: align RTA_MFC_STATS on 64-bitNicolas Dichtel1-2/+2
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21ipmr: align RTA_MFC_STATS on 64-bitNicolas Dichtel1-2/+2
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21rtnl: use the new API to align IFLA_STATS*Nicolas Dichtel1-17/+5
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21NLA_BINARY misuse bug in HSRPeter Heise1-4/+3
Removed .type field from NLA to do proper length checking. Reported by Daniel Borkmann and Julia Lawall. Signed-off-by: Peter Heise <peter.heise@airbus.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21net: use jiffies_to_msecs to replace EXPIRES_IN_MS in inet/sctp_diagXin Long2-10/+8
EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates back to before jiffies_to_msecs() has been introduced. Now we can remove it and use jiffies_to_msecs(). Suggested-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acksMartin KaFai Lau1-1/+2
Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received, it could incorrectly think that a skb has already been acked and queue a SCM_TSTAMP_ACK cmsg to the sk->sk_error_queue. In tcp_ack_tstamp(), it checks 'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'. If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill script, between() returns true but the tskey is actually not acked. e.g. try between(3, 2, 1). The fix is to replace between() with one before() and one !before(). By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be removed. A packetdrill script is used to reproduce the dup ack scenario. Due to the lacking cmsg support in packetdrill (may be I cannot find it), a BPF prog is used to kprobe to sock_queue_err_skb() and print out the value of serr->ee.ee_data. Both the packetdrill and the bcc BPF script is attached at the end of this commit message. BPF Output Before Fix: ~~~~~~ <...>-2056 [001] d.s. 433.927987: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 433.929563: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 433.930765: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 434.028177: : ee_data:1459 packetdrill-2056 [001] d.s. 434.029686: : ee_data:14599 BPF Output After Fix: ~~~~~~ <...>-2049 [000] d.s. 113.517039: : ee_data:1459 <...>-2049 [000] d.s. 113.517253: : ee_data:14599 BCC BPF Script: ~~~~~~ #!/usr/bin/env python from __future__ import print_function from bcc import BPF bpf_text = """ #include <uapi/linux/ptrace.h> #include <net/sock.h> #include <bcc/proto.h> #include <linux/errqueue.h> #ifdef memset #undef memset #endif int trace_err_skb(struct pt_regs *ctx) { struct sk_buff *skb = (struct sk_buff *)ctx->si; struct sock *sk = (struct sock *)ctx->di; struct sock_exterr_skb *serr; u32 ee_data = 0; if (!sk || !skb) return 0; serr = SKB_EXT_ERR(skb); bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data); bpf_trace_printk("ee_data:%u\\n", ee_data); return 0; }; """ b = BPF(text=bpf_text) b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb") print("Attached to kprobe") b.trace_print() Packetdrill Script: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 1460) = 1460 0.200 write(4, ..., 13140) = 13140 0.200 > P. 1:1461(1460) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:14601(5840) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 14601 win 257 0.400 close(4) = 0 0.400 > F. 14601:14601(0) ack 1 0.500 < F. 1:1(0) ack 14602 win 257 0.500 > . 14602:14602(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21net: dsa: remove tag_protocol from dsa_switchVivien Didelot1-3/+2
Having the tag protocol in dsa_switch_driver for setup time and in dsa_switch_tree for runtime is enough. Remove dsa_switch's one. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21openvswitch: Orphan skbs before IPv6 defragJoe Stringer1-0/+1
This is the IPv6 counterpart to commit 8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()"). Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be cloned (implicitly orphaning) prior to queueing for reassembly. As such, when the IPv6 message is eventually reassembled, the skb->sk for all fragments would be NULL. After that commit was introduced, rather than cloning, the original skbs were queued directly without orphaning. The end result is that all frags except for the first and last may have a socket attached. This commit explicitly orphans such skbs during nf_ct_frag6_gather() to prevent BUG_ON(skb->sk) during a later call to ip6_fragment(). kernel BUG at net/ipv6/ip6_output.c:631! [...] Call Trace: <IRQ> [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0 [<ffffffffa042c7c0>] ? do_output.isra.28+0x1b0/0x1b0 [openvswitch] [<ffffffff810bb8a2>] ? __lock_is_held+0x52/0x70 [<ffffffffa042c587>] ovs_fragment+0x1f7/0x280 [openvswitch] [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50 [<ffffffff81697ea0>] ? dst_discard_out+0x20/0x20 [<ffffffff81697e80>] ? dst_ifdown+0x80/0x80 [<ffffffffa042c703>] do_output.isra.28+0xf3/0x1b0 [openvswitch] [<ffffffffa042d279>] do_execute_actions+0x709/0x12c0 [openvswitch] [<ffffffffa04340a4>] ? ovs_flow_stats_update+0x74/0x1e0 [openvswitch] [<ffffffffa04340d1>] ? ovs_flow_stats_update+0xa1/0x1e0 [openvswitch] [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40 [<ffffffffa042de75>] ovs_execute_actions+0x45/0x120 [openvswitch] [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch] [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40 [<ffffffffa042def4>] ovs_execute_actions+0xc4/0x120 [openvswitch] [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch] [<ffffffffa04337f2>] ? key_extract+0x442/0xc10 [openvswitch] [<ffffffffa043b26d>] ovs_vport_receive+0x5d/0xb0 [openvswitch] [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50 [<ffffffffa043c11d>] internal_dev_xmit+0x6d/0x150 [openvswitch] [<ffffffffa043c0b5>] ? internal_dev_xmit+0x5/0x150 [openvswitch] [<ffffffff8168fb5f>] dev_hard_start_xmit+0x2df/0x660 [<ffffffff8168f5ea>] ? validate_xmit_skb.isra.105.part.106+0x1a/0x2b0 [<ffffffff81690925>] __dev_queue_xmit+0x8f5/0x950 [<ffffffff81690080>] ? __dev_queue_xmit+0x50/0x950 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0 [<ffffffff81690990>] dev_queue_xmit+0x10/0x20 [<ffffffff8169a418>] neigh_resolve_output+0x178/0x220 [<ffffffff81752759>] ? ip6_finish_output2+0x219/0x7b0 [<ffffffff81752759>] ip6_finish_output2+0x219/0x7b0 [<ffffffff817525a5>] ? ip6_finish_output2+0x65/0x7b0 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50 [<ffffffff817796cc>] icmpv6_push_pending_frames+0xac/0xe0 [<ffffffff8177a4be>] icmpv6_echo_reply+0x42e/0x500 [<ffffffff8177acbf>] icmpv6_rcv+0x4cf/0x580 [<ffffffff81755ac7>] ip6_input_finish+0x1a7/0x690 [<ffffffff81755925>] ? ip6_input_finish+0x5/0x690 [<ffffffff817567a0>] ip6_input+0x30/0xa0 [<ffffffff81755920>] ? ip6_rcv_finish+0x1a0/0x1a0 [<ffffffff817557ce>] ip6_rcv_finish+0x4e/0x1a0 [<ffffffff8175640f>] ipv6_rcv+0x45f/0x7c0 [<ffffffff81755fe6>] ? ipv6_rcv+0x36/0x7c0 [<ffffffff81755780>] ? ip6_make_skb+0x1c0/0x1c0 [<ffffffff8168b649>] __netif_receive_skb_core+0x229/0xb80 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0 [<ffffffff8168c07f>] ? process_backlog+0x6f/0x230 [<ffffffff8168bfb6>] __netif_receive_skb+0x16/0x70 [<ffffffff8168c088>] process_backlog+0x78/0x230 [<ffffffff8168c0ed>] ? process_backlog+0xdd/0x230 [<ffffffff8168db43>] net_rx_action+0x203/0x480 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0 [<ffffffff817c156e>] __do_softirq+0xde/0x49f [<ffffffff81752768>] ? ip6_finish_output2+0x228/0x7b0 [<ffffffff817c070c>] do_softirq_own_stack+0x1c/0x30 <EOI> [<ffffffff8106f88b>] do_softirq.part.18+0x3b/0x40 [<ffffffff8106f946>] __local_bh_enable_ip+0xb6/0xc0 [<ffffffff81752791>] ip6_finish_output2+0x251/0x7b0 [<ffffffff81754af1>] ? ip6_fragment+0xba1/0xc50 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50 [<ffffffff81778558>] rawv6_sendmsg+0xa28/0xe30 [<ffffffff81719097>] ? inet_sendmsg+0xc7/0x1d0 [<ffffffff817190d6>] inet_sendmsg+0x106/0x1d0 [<ffffffff81718fd5>] ? inet_sendmsg+0x5/0x1d0 [<ffffffff8166d078>] sock_sendmsg+0x38/0x50 [<ffffffff8166d4d6>] SYSC_sendto+0xf6/0x170 [<ffffffff8100201b>] ? trace_hardirqs_on_thunk+0x1b/0x1d [<ffffffff8166e38e>] SyS_sendto+0xe/0x10 [<ffffffff817bebe5>] entry_SYSCALL_64_fastpath+0x18/0xa8 Code: 06 48 83 3f 00 75 26 48 8b 87 d8 00 00 00 2b 87 d0 00 00 00 48 39 d0 72 14 8b 87 e4 00 00 00 83 f8 01 75 09 48 83 7f 18 00 74 9a <0f> 0b 41 8b 86 cc 00 00 00 49 8# RIP [<ffffffff8175468a>] ip6_fragment+0x73a/0xc50 RSP <ffff880072803120> Fixes: 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free clone operations") Reported-by: Daniele Di Proietto <diproiettod@vmware.com> Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20rtnetlink: add new RTM_GETSTATS message to dump link statsRoopa Prabhu1-0/+158
This patch adds a new RTM_GETSTATS message to query link stats via netlink from the kernel. RTM_NEWLINK also dumps stats today, but RTM_NEWLINK returns a lot more than just stats and is expensive in some cases when frequent polling for stats from userspace is a common operation. RTM_GETSTATS is an attempt to provide a light weight netlink message to explicity query only link stats from the kernel on an interface. The idea is to also keep it extensible so that new kinds of stats can be added to it in the future. This patch adds the following attribute for NETDEV stats: struct nla_policy ifla_stats_policy[IFLA_STATS_MAX + 1] = { [IFLA_STATS_LINK_64] = { .len = sizeof(struct rtnl_link_stats64) }, }; Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of a single interface or all interfaces with NLM_F_DUMP. Future possible new types of stat attributes: link af stats: - IFLA_STATS_LINK_IPV6 (nested. for ipv6 stats) - IFLA_STATS_LINK_MPLS (nested. for mpls/mdev stats) extended stats: - IFLA_STATS_LINK_EXTENDED (nested. extended software netdev stats like bridge, vlan, vxlan etc) - IFLA_STATS_LINK_HW_EXTENDED (nested. extended hardware stats which are available via ethtool today) This patch also declares a filter mask for all stat attributes. User has to provide a mask of stats attributes to query. filter mask can be specified in the new hdr 'struct if_stats_msg' for stats messages. Other important field in the header is the ifindex. This api can also include attributes for global stats (eg tcp) in the future. When global stats are included in a stats msg, the ifindex in the header must be zero. A single stats message cannot contain both global and netdev specific stats. To easily distinguish them, netdev specific stat attributes name are prefixed with IFLA_STATS_LINK_ Without any attributes in the filter_mask, no stats will be returned. This patch has been tested with mofified iproute2 ifstat. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19VSOCK: Only check error on skb_recv_datagram when skb is NULLJorgen Hansen1-5/+2
If skb_recv_datagram returns an skb, we should ignore the err value returned. Otherwise, datagram receives will return EAGAIN when they have to wait for a datagram. Acked-by: Adit Ranadive <aditr@vmware.com> Signed-off-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19net: dsa: kill circular reference with slave privVivien Didelot2-10/+4
The dsa_slave_priv structure does not need a pointer to its net_device. Kill it. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19bpf: add event output helper for notifications/sampling/loggingDaniel Borkmann1-0/+2
This patch adds a new helper for cls/act programs that can push events to user space applications. For networking, this can be f.e. for sampling, debugging, logging purposes or pushing of arbitrary wake-up events. The idea is similar to a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example"). The eBPF program utilizes a perf event array map that user space populates with fds from perf_event_open(), the eBPF program calls into the helper f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw)) so that the raw data is pushed into the fd f.e. at the map index of the current CPU. User space can poll/mmap/etc on this and has a data channel for receiving events that can be post-processed. The nice thing is that since the eBPF program and user space application making use of it are tightly coupled, they can define their own arbitrary raw data format and what/when they want to push. While f.e. packet headers could be one part of the meta data that is being pushed, this is not a substitute for things like packet sockets as whole packet is not being pushed and push is only done in a single direction. Intention is more of a generically usable, efficient event pipe to applications. Workflow is that tc can pin the map and applications can attach themselves e.g. after cls/act setup to one or multiple map slots, demuxing is done by the eBPF program. Adding this facility is with minimal effort, it reuses the helper introduced in a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") and we get its functionality for free by overloading its BPF_FUNC_ identifier for cls/act programs, ctx is currently unused, but will be made use of in future. Example will be added to iproute2's BPF example files. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>