aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv6 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-04-27net: rename ICMP6_INC_STATS_BH()Eric Dumazet3-9/+9
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: tcp: rename TCP_INC_STATS_BHEric Dumazet1-7/+7
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: udp: rename UDP_INC_STATS_BH()Eric Dumazet1-19/+19
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(), and UDP6_INC_STATS_BH() to __UDP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: snmp: kill various STATS_USER() helpersEric Dumazet1-25/+24
In the old days (before linux-3.0), SNMP counters were duplicated, one for user context, and one for BH context. After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%") we have a single copy, and what really matters is preemption being enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc() respectively. We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(), NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(), SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(), UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER() Following patches will rename __BH helpers to make clear their usage is not tied to BH being disabled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-33/+15
Minor overlapping changes in the conflicts. In the macsec case, the change of the default ID macro name overlapped with the 64-bit netlink attribute alignment fixes in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: ipv6: Use passed in table for nexthop lookupsDavid Ahern1-2/+38
Similar to 3bfd847203c6 ("net: Use passed in table for nexthop lookups") for IPv4, if the route spec contains a table id use that to lookup the next hop first and fall back to a full lookup if it fails (per the fix 4c9bcd117918b ("net: Fix nexthop lookups")). Example: root@kenny:~# ip -6 ro ls table red local 2100:1::1 dev lo proto none metric 0 pref medium 2100:1::/120 dev eth1 proto kernel metric 256 pref medium local 2100:2::1 dev lo proto none metric 0 pref medium 2100:2::/120 dev eth2 proto kernel metric 256 pref medium local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium fe80::/64 dev eth1 proto kernel metric 256 pref medium fe80::/64 dev eth2 proto kernel metric 256 pref medium ff00::/8 dev red metric 256 pref medium ff00::/8 dev eth1 metric 256 pref medium ff00::/8 dev eth2 metric 256 pref medium unreachable default dev lo metric 240 error -113 pref medium root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64 RTNETLINK answers: No route to host Route add fails even though 2100:1::64 is a reachable next hop: root@kenny:~# ping6 -I red 2100:1::64 ping6: Warning: source address might be selected on device other than red. PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes 64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms With this patch: root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64 root@kenny:~# ip -6 ro ls table red local 2100:1::1 dev lo proto none metric 0 pref medium 2100:1::/120 dev eth1 proto kernel metric 256 pref medium local 2100:2::1 dev lo proto none metric 0 pref medium 2100:2::/120 dev eth2 proto kernel metric 256 pref medium 2100:3::/64 via 2100:1::64 dev eth1 metric 1024 pref medium local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium fe80::/64 dev eth1 proto kernel metric 256 pref medium fe80::/64 dev eth2 proto kernel metric 256 pref medium ff00::/8 dev red metric 256 pref medium ff00::/8 dev eth1 metric 256 pref medium ff00::/8 dev eth2 metric 256 pref medium unreachable default dev lo metric 240 error -113 pref medium Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26net: ipv6: Delete host routes on an ifdownDavid Ahern1-33/+15
It was a simple idea -- save IPv6 configured addresses on a link down so that IPv6 behaves similar to IPv4. As always the devil is in the details and the IPv6 stack as too many behavioral differences from IPv4 making the simple idea more complicated than it needs to be. The current implementation for keeping IPv6 addresses can panic or spit out a warning in one of many paths: 1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in rt6_fill_node while handling a route dump request. 2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in fib6_del 3. Panic in fib6_purge_rt because rt6i_ref count is not 1. The root cause of all these is references related to the host route for an address that is retained. So, this patch deletes the host route every time the ifdown loop runs. Since the host route is deleted and will be re-generated an up there is no longer a need for the l3mdev fix up. On the 'admin up' side move addrconf_permanent_addr into the NETDEV_UP event handling so that it runs only once versus on UP and CHANGE events. All of the current panics and warnings appear to be related to addresses on the loopback device, but given the catastrophic nature when a bug is triggered this patch takes the conservative approach and evicts all host routes rather than trying to determine when it can be re-used and when it can not. That can be a later optimizaton if desired. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26Revert "ipv6: Revert optional address flusing on ifdown."David S. Miller1-12/+150
This reverts commit 841645b5f2dfceac69b78fcd0c9050868d41ea61. Ok, this puts the feature back. I've decided to apply David A.'s bug fix and run with that rather than make everyone wait another whole release for this feature. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26ila: add checksum neutral ILA translationsTom Herbert4-15/+105
Support checksum neutral ILA as described in the ILA draft. The low order 16 bits of the identifier are used to contain the checksum adjustment value. The csum-mode parameter is added to described checksum processing. There are three values: - adjust transport checksum (previous behavior) - do checksum neutral mapping - do nothing On output the csum-mode in the ila_params is checked and acted on. If mode is checksum neutral mapping then to mapping and set C-bit. On input, C-bit is checked. If it is set checksum-netural mapping is done (regardless of csum-mode in ila params) and C-bit will be cleared. If it is not set then action in csum-mode is taken. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26ila: xlat changesTom Herbert1-69/+34
Change model of xlat to be used only for input where lookup is done on the locator part of an address (comparing to locator_match as key in rhashtable). This is needed for checksum neutral translation which obfuscates the low order 16 bits of the identifier. It also permits hosts to be in muliple ILA domains (each locator can map to a different SIR address). A check is also added to disallow translating non-ILA addresses (check of type in identifier). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26ila: Add struct definitions and helpersTom Herbert4-82/+161
Add structures for identifiers, locators, and an ila address which is composed of a locator and identifier and in6_addr can be cast to it. This includes a three bit type field and enums for the types defined in ILA I-D. In ILA lwt don't allow user to set a translation for a non-ILA address (type of identifier is zero meaning it is an IID). This also requires that the destination prefix is at least 65 bytes (64 bit locator and first byte of identifier). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25net: better drop monitoring in ip{6}_recv_error()Eric Dumazet1-5/+5
We should call consume_skb(skb) when skb is properly consumed, or kfree_skb(skb) when skb must be dropped in error case. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25ipv6: Revert optional address flusing on ifdown.David S. Miller1-150/+12
This reverts the following three commits: 70af921db6f8835f4b11c65731116560adb00c14 799977d9aafbf0ca0b9c39b04cbfb16db71302c9 f1705ec197e705b79ea40fe7a2cc5acfa1d3bfac The feature was ill conceived, has terrible semantics, and has added nothing but regressions to the already fragile ipv6 stack. Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25ipv6: use nla_put_u64_64bit()Nicolas Dichtel2-7/+11
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-241/+79
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, mostly from Florian Westphal to sort out the lack of sufficient validation in x_tables and connlabel preparation patches to add nf_tables support. They are: 1) Ensure we don't go over the ruleset blob boundaries in mark_source_chains(). 2) Validate that target jumps land on an existing xt_entry. This extra sanitization comes with a performance penalty when loading the ruleset. 3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables. 4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables. 5) Make sure the minimal possible target size in x_tables. 6) Similar to #3, add xt_compat_check_entry_offsets() for compat code. 7) Check that standard target size is valid. 8) More sanitization to ensure that the target_offset field is correct. 9) Add xt_check_entry_match() to validate that matches are well-formed. 10-12) Three patch to reduce the number of parameters in translate_compat_table() for {arp,ip,ip6}tables by using a container structure. 13) No need to return value from xt_compat_match_from_user(), so make it void. 14) Consolidate translate_table() so it can be used by compat code too. 15) Remove obsolete check for compat code, so we keep consistent with what was already removed in the native layout code (back in 2007). 16) Get rid of target jump validation from mark_source_chains(), obsoleted by #2. 17) Introduce xt_copy_counters_from_user() to consolidate counter copying, and use it from {arp,ip,ip6}tables. 18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump functions. 19) Move nf_connlabel_match() to xt_connlabel. 20) Skip event notification if connlabel did not change. 21) Update of nf_connlabels_get() to make the upcoming nft connlabel support easier. 23) Remove spinlock to read protocol state field in conntrack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-68/+143
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21ip6mr: align RTA_MFC_STATS on 64-bitNicolas Dichtel1-2/+2
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19net/ipv6/addrconf: fix sysctl table indentationKonstantin Khlebnikov1-309/+307
Separated from previous patch for readability. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-19net/ipv6/addrconf: simplify sysctl registrationKonstantin Khlebnikov1-26/+17
Struct ctl_table_header holds pointer to sysctl table which could be used for freeing it after unregistration. IPv4 sysctls already use that. Remove redundant NULL assignment: ndev allocated using kzalloc. This also saves some bytes: sysctl table could be shorter than DEVCONF_MAX+1 if some options are disable in config. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16ip6gre: Add support for GSOAlexander Duyck1-23/+33
This patch adds code borrowed from bits and pieces of other protocols to the IPv6 GRE path so that we can support GSO over IPv6 based GRE tunnels. By adding this support we are able to significantly improve the throughput for GRE tunnels as we are able to make use of GSO. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16ip6gre: Add support for basic offloads offloads excluding GSOAlexander Duyck1-2/+21
This patch adds support for the basic offloads we support on most devices. Specifically with this patch set we can support checksum offload, basic scatter-gather, and highdma. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16ip6gretap: Fix MTU to allow for Ethernet headerAlexander Duyck1-0/+2
When we were creating an ip6gretap interface the MTU was about 6 bytes short of what was needed. It turns out we were not taking the Ethernet header into account and as a result we were eating into the 8 bytes reserved for the encap limit. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16ip_tunnel_core: iptunnel_handle_offloads returns int and doesn't free skbAlexander Duyck1-8/+6
This patch updates the IP tunnel core function iptunnel_handle_offloads so that we return an int and do not free the skb inside the function. This actually allows us to clean up several paths in several tunnels so that we can free the skb at one point in the path without having to have a secondary path if we are supporting tunnel offloads. In addition it should resolve some double-free issues I have found in the tunnels paths as I believe it is possible for us to end up triggering such an event in the case of fou or gue. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15tcp: do not mess with listener sk_wmem_allocEric Dumazet1-2/+2
When removing sk_refcnt manipulation on synflood, I missed that using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already transitioned to 0. We should hold sk_refcnt instead, but this is a big deal under attack. (Doing so increase performance from 3.2 Mpps to 3.8 Mpps only) In this patch, I chose to not attach a socket to syncookies skb. Performance is now 5 Mpps instead of 3.2 Mpps. Following patch will remove last known false sharing in tcp_rcv_state_process() Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14ipv6: udp: Do a route lookup and update during release_cbMartin KaFai Lau2-0/+21
This patch adds a release_cb for UDPv6. It does a route lookup and updates sk->sk_dst_cache if it is needed. It picks up the left-over job from ip6_sk_update_pmtu() if the sk was owned by user during the pmtu update. It takes a rcu_read_lock to protect the __sk_dst_get() operations because another thread may do ip6_dst_store() without taking the sk lock (e.g. sendmsg). Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reported-by: Wei Wang <weiwan@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14ipv6: datagram: Update dst cache of a connected datagram sk during pmtu updateMartin KaFai Lau2-9/+23
There is a case in connected UDP socket such that getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible sequence could be the following: 1. Create a connected UDP socket 2. Send some datagrams out 3. Receive a ICMPV6_PKT_TOOBIG 4. No new outgoing datagrams to trigger the sk_dst_check() logic to update the sk->sk_dst_cache. 5. getsockopt(IPV6_MTU) returns the mtu from the invalid sk->sk_dst_cache instead of the newly created RTF_CACHE clone. This patch updates the sk->sk_dst_cache for a connected datagram sk during pmtu-update code path. Note that the sk->sk_v6_daddr is used to do the route lookup instead of skb->data (i.e. iph). It is because a UDP socket can become connected after sending out some datagrams in un-connected state. or It can be connected multiple times to different destinations. Hence, iph may not be related to where sk is currently connected to. It is done under '!sock_owned_by_user(sk)' condition because the user may make another ip6_datagram_connect() (i.e changing the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update code path. For the sock_owned_by_user(sk) == true case, the next patch will introduce a release_cb() which will update the sk->sk_dst_cache. Test: Server (Connected UDP Socket): ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Route Details: [root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac' 2fac::/64 dev eth0 proto kernel metric 256 pref medium 2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium A simple python code to create a connected UDP socket: import socket import errno HOST = '2fac::1' PORT = 8080 s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) s.bind((HOST, PORT)) s.connect(('2fac:face::face', 53)) print("connected") while True: try: data = s.recv(1024) except socket.error as se: if se.errno == errno.EMSGSIZE: pmtu = s.getsockopt(41, 24) print("PMTU:%d" % pmtu) break s.close() Python program output after getting a ICMPV6_PKT_TOOBIG: [root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py connected PMTU:1300 Cache routes after recieving TOOBIG: [root@arch-fb-vm1 ~]# ip -6 r show table cache 2fac:face::face via 2fac::face dev eth0 metric 0 cache expires 463sec mtu 1300 pref medium Client (Send the ICMPV6_PKT_TOOBIG): ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ scapy is used to generate the TOOBIG message. Here is the scapy script I have used: >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac:: 1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53) >>> sendp(p, iface='qemubr0') Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reported-by: Wei Wang <weiwan@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14ipv6: datagram: Refactor dst lookup and update codes to a new functionMartin KaFai Lau1-46/+57
This patch moves the route lookup and update codes for connected datagram sk to a newly created function ip6_datagram_dst_update() It will be reused during the pmtu update in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14ipv6: datagram: Refactor flowi6 init codes to a new functionMartin KaFai Lau1-20/+30
Move flowi6 init codes for connected datagram sk to a newly created function ip6_datagram_flow_key_init(). Notes: 1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect 2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of (addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init() This new function will be reused during pmtu update in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14GSO: Support partial segmentation offloadAlexander Duyck1-1/+9
This patch adds support for something I am referring to as GSO partial. The basic idea is that we can support a broader range of devices for segmentation if we use fixed outer headers and have the hardware only really deal with segmenting the inner header. The idea behind the naming is due to the fact that everything before csum_start will be fixed headers, and everything after will be the region that is handled by hardware. With the current implementation it allows us to add support for the following GSO types with an inner TSO_MANGLEID or TSO6 offload: NETIF_F_GSO_GRE NETIF_F_GSO_GRE_CSUM NETIF_F_GSO_IPIP NETIF_F_GSO_SIT NETIF_F_UDP_TUNNEL NETIF_F_UDP_TUNNEL_CSUM In the case of hardware that already supports tunneling we may be able to extend this further to support TSO_TCPV4 without TSO_MANGLEID if the hardware can support updating inner IPv4 headers. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID valuesAlexander Duyck1-2/+6
This patch does two things. First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As a result we should now be able to aggregate flows that were converted from IPv6 to IPv4. In addition this allows us more flexibility for future implementations of segmentation as we may be able to use a fixed IP ID when segmenting the flow. The second thing this does is that it places limitations on the outer IPv4 ID header in the case of tunneled frames. Specifically it forces the IP ID to be incrementing by 1 unless the DF bit is set in the outer IPv4 header. This way we can avoid creating overlapping series of IP IDs that could possibly be fragmented if the frame goes through GRO and is then resegmented via GSO. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14GSO: Add GSO type for fixed IPv4 IDAlexander Duyck1-1/+2
This patch adds support for TSO using IPv4 headers with a fixed IP ID field. This is meant to allow us to do a lossless GRO in the case of TCP flows that use a fixed IP ID such as those that convert IPv6 header to IPv4 headers. In addition I am adding a feature that for now I am referring to TSO with IP ID mangling. Basically when this flag is enabled the device has the option to either output the flow with incrementing IP IDs or with a fixed IP ID regardless of what the original IP ID ordering was. This is useful in cases where the DF bit is set and we do not care if the original IP ID value is maintained. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13net: ipv6: Do not keep linklocal and loopback addressesDavid Ahern1-2/+10
f1705ec197e7 added the option to retain user configured addresses on an admin down. A comment to one of the later revisions suggested using the IFA_F_PERMANENT flag rather than adding a user_managed boolean to the ifaddr struct. A side effect of this change is that link local and loopback addresses are also retained which is not part of the objective of f1705ec197e7. Add check to drop those addresses. Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13ipv6, token: allow for clearing the current device tokenDaniel Borkmann1-4/+6
The original tokenized iid support implemented via f53adae4eae5 ("net: ipv6: add tokenized interface identifier support") didn't allow for clearing a device token as it was intended that this addressing mode was the only one active for globally scoped IPv6 addresses. Later we relaxed that restriction via 617fe29d45bd ("net: ipv6: only invalidate previously tokenized addresses"), and we should also allow for clearing tokens as there's no good reason why it shouldn't be allowed. Fixes: 617fe29d45bd ("net: ipv6: only invalidate previously tokenized addresses") Reported-by: Robin H. Johnson <robbat2@gentoo.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14netfilter: x_tables: introduce and use xt_copy_counters_from_userFlorian Westphal1-44/+5
The three variants use same copy&pasted code, condense this into a helper and use that. Make sure info.name is 0-terminated. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: remove obsolete checkFlorian Westphal1-7/+0
Since 'netfilter: x_tables: validate targets of jumps' change we validate that the target aligns exactly with beginning of a rule, so offset test is now redundant. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: remove obsolete overflow check for compat case tooFlorian Westphal1-2/+0
commit 9e67d5a739327c44885adebb4f3a538050be73e4 ("[NETFILTER]: x_tables: remove obsolete overflow check") left the compat parts alone, but we can kill it there as well. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: do compat validation via translate_tableFlorian Westphal1-125/+23
This looks like refactoring, but its also a bug fix. Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few sanity tests that are done in the normal path. For example, we do not check for underflows and the base chain policies. While its possible to also add such checks to the compat path, its more copy&pastry, for instance we cannot reuse check_underflow() helper as e->target_offset differs in the compat case. Other problem is that it makes auditing for validation errors harder; two places need to be checked and kept in sync. At a high level 32 bit compat works like this: 1- initial pass over blob: validate match/entry offsets, bounds checking lookup all matches and targets do bookkeeping wrt. size delta of 32/64bit structures assign match/target.u.kernel pointer (points at kernel implementation, needed to access ->compatsize etc.) 2- allocate memory according to the total bookkeeping size to contain the translated ruleset 3- second pass over original blob: for each entry, copy the 32bit representation to the newly allocated memory. This also does any special match translations (e.g. adjust 32bit to 64bit longs, etc). 4- check if ruleset is free of loops (chase all jumps) 5-first pass over translated blob: call the checkentry function of all matches and targets. The alternative implemented by this patch is to drop steps 3&4 from the compat process, the translation is changed into an intermediate step rather than a full 1:1 translate_table replacement. In the 2nd pass (step #3), change the 64bit ruleset back to a kernel representation, i.e. put() the kernel pointer and restore ->u.user.name . This gets us a 64bit ruleset that is in the format generated by a 64bit iptables userspace -- we can then use translate_table() to get the 'native' sanity checks. This has two drawbacks: 1. we re-validate all the match and target entry structure sizes even though compat translation is supposed to never generate bogus offsets. 2. we put and then re-lookup each match and target. THe upside is that we get all sanity tests and ruleset validations provided by the normal path and can remove some duplicated compat code. iptables-restore time of autogenerated ruleset with 300k chains of form -A CHAIN0001 -m limit --limit 1/s -j CHAIN0002 -A CHAIN0002 -m limit --limit 1/s -j CHAIN0003 shows no noticeable differences in restore times: old: 0m30.796s new: 0m31.521s 64bit: 0m25.674s Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: xt_compat_match_from_user doesn't need a retvalFlorian Westphal1-18/+9
Always returned 0. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: ip6_tables: simplify translate_compat_table argsFlorian Westphal1-35/+24
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: check for bogus target offsetFlorian Westphal1-2/+3
We're currently asserting that targetoff + targetsize <= nextoff. Extend it to also check that targetoff is >= sizeof(xt_entry). Since this is generic code, add an argument pointing to the start of the match/target, we can then derive the base structure size from the delta. We also need the e->elems pointer in a followup change to validate matches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: add compat version of xt_check_entry_offsetsFlorian Westphal1-1/+2
32bit rulesets have different layout and alignment requirements, so once more integrity checks get added to xt_check_entry_offsets it will reject well-formed 32bit rulesets. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: kill check_entry helperFlorian Westphal1-12/+8
Once we add more sanity testing to xt_check_entry_offsets it becomes relvant if we're expecting a 32bit 'config_compat' blob or a normal one. Since we already have a lot of similar-named functions (check_entry, compat_check_entry, find_and_check_entry, etc.) and the current incarnation is short just fold its contents into the callers. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: add and use xt_check_entry_offsetsFlorian Westphal1-11/+1
Currently arp/ip and ip6tables each implement a short helper to check that the target offset is large enough to hold one xt_entry_target struct and that t->u.target_size fits within the current rule. Unfortunately these checks are not sufficient. To avoid adding new tests to all of ip/ip6/arptables move the current checks into a helper, then extend this helper in followup patches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: validate targets of jumpsFlorian Westphal1-0/+16
When we see a jump also check that the offset gets us to beginning of a rule (an ipt_entry). The extra overhead is negible, even with absurd cases. 300k custom rules, 300k jumps to 'next' user chain: [ plus one jump from INPUT to first userchain ]: Before: real 0m24.874s user 0m7.532s sys 0m16.076s After: real 0m27.464s user 0m7.436s sys 0m18.840s Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: don't move to non-existent next ruleFlorian Westphal1-0/+4
Ben Hawkes says: In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it is possible for a user-supplied ipt_entry structure to have a large next_offset field. This field is not bounds checked prior to writing a counter value at the supplied offset. Base chains enforce absolute verdict. User defined chains are supposed to end with an unconditional return, xtables userspace adds them automatically. But if such return is missing we will move to non-existent next rule. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-27/+31
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains the first batch of Netfilter updates for your net-next tree. 1) Define pr_fmt() in nf_conntrack, from Weongyo Jeong. 2) Define and register netfilter's afinfo for the bridge family, this comes in preparation for native nfqueue's bridge for nft, from Stephane Bryant. 3) Add new attributes to store layer 2 and VLAN headers to nfqueue, also from Stephane Bryant. 4) Parse new NFQA_VLAN and NFQA_L2HDR nfqueue netlink attributes coming from userspace, from Stephane Bryant. 5) Use net->ipv6.devconf_all->hop_limit instead of hardcoded hop_limit in IPv6 SYNPROXY, from Liping Zhang. 6) Remove unnecessary check for dst == NULL in nf_reject_ipv6, from Haishuang Yan. 7) Deinline ctnetlink event report functions, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11net: vrf: Fix dev refcnt leak due to IPv6 prefix routeDavid Ahern1-0/+10
ifupdown2 found a kernel bug with IPv6 routes and movement from the main table to the VRF table. Sequence of events: Create the interface and add addresses: ip link add dev eth4.105 link eth4 type vlan id 105 ip addr add dev eth4.105 8.105.105.10/24 ip -6 addr add dev eth4.105 2008:105:105::10/64 At this point IPv6 has inserted a prefix route in the main table even though the interface is 'down'. From there the VRF device is created: ip link add dev vrf105 type vrf table 105 ip addr add dev vrf105 9.9.105.10/32 ip -6 addr add dev vrf105 2000:9:105::10/128 ip link set vrf105 up Then the interface is enslaved, while still in the 'down' state: ip link set dev eth4.105 master vrf105 Since the device is down the VRF driver cycling the device does not send the NETDEV_UP and NETDEV_DOWN but rather the NETDEV_CHANGE event which does not flush the routes inserted prior. When the link is brought up ip link set dev eth4.105 up the prefix route is added in the VRF table, but does not remove the route from the main table. Fix by handling the NETDEV_CHANGEUPPER event similar what was implemented for IPv4 in 7f49e7a38b77 ("net: Flush local routes when device changes vrf association") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11net: vrf: Fix dst reference countingDavid Ahern1-3/+4
Vivek reported a kernel exception deleting a VRF with an active connection through it. The root cause is that the socket has a cached reference to a dst that is destroyed. Converting the dst_destroy to dst_release and letting proper reference counting kick in does not work as the dst has a reference to the device which needs to be released as well. I talked to Hannes about this at netdev and he pointed out the ipv4 and ipv6 dst handling has dst_ifdown for just this scenario. Rather than continuing with the reinvented dst wheel in VRF just remove it and leverage the ipv4 and ipv6 versions. Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-5/+5
2016-04-09ipv6: fix inet6_lookup_listener()Eric Dumazet1-2/+2
A stupid refactoring bug in inet6_lookup_listener() needs to be fixed in order to get proper SO_REUSEPORT behavior. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>