aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv6/netfilter (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-09-28netfilter: masquerade: don't flush all conntracks if only one address deleted on deviceTan Hu1-3/+16
We configured iptables as below, which only allowed incoming data on established connections: iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT iptables -t mangle -P PREROUTING DROP When deleting a secondary address, current masquerade implements would flush all conntracks on this device. All the established connections on primary address also be deleted, then subsequent incoming data on the connections would be dropped wrongly because it was identified as NEW connection. So when an address was delete, it should only flush connections related with the address. Signed-off-by: Tan Hu <tan.hu@zte.com.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17netfilter: xtables: avoid BUG_ONFlorian Westphal2-3/+12
I see no reason for them, label or timer cannot be NULL, and if they were, we'll crash with null deref anyway. For skb_header_pointer failure, just set hotdrop to true and toss such packet. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+1
2018-09-10net: Add and use skb_mark_not_on_list().David S. Miller1-1/+1
An SKB is not on a list if skb->next is NULL. Codify this convention into a helper function and use it where we are dequeueing an SKB and need to mark it as such. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09ip: frags: fix crash in ip_do_fragment()Taehee Yoo1-0/+1
A kernel crash occurrs when defragmented packet is fragmented in ip_do_fragment(). In defragment routine, skb_orphan() is called and skb->ip_defrag_offset is set. but skb->sk and skb->ip_defrag_offset are same union member. so that frag->sk is not NULL. Hence crash occurrs in skb->sk check routine in ip_do_fragment() when defragmented packet is fragmented. test commands: %iptables -t nat -I POSTROUTING -j MASQUERADE %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000 splat looks like: [ 261.069429] kernel BUG at net/ipv4/ip_output.c:636! [ 261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3 [ 261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600 [ 261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c [ 261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202 [ 261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004 [ 261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8 [ 261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395 [ 261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4 [ 261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000 [ 261.174169] FS: 00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000 [ 261.183012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0 [ 261.198158] Call Trace: [ 261.199018] ? dst_output+0x180/0x180 [ 261.205011] ? save_trace+0x300/0x300 [ 261.209018] ? ip_copy_metadata+0xb00/0xb00 [ 261.213034] ? sched_clock_local+0xd4/0x140 [ 261.218158] ? kill_l4proto+0x120/0x120 [nf_conntrack] [ 261.223014] ? rt_cpu_seq_stop+0x10/0x10 [ 261.227014] ? find_held_lock+0x39/0x1c0 [ 261.233008] ip_finish_output+0x51d/0xb50 [ 261.237006] ? ip_fragment.constprop.56+0x220/0x220 [ 261.243011] ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack] [ 261.250152] ? rcu_is_watching+0x77/0x120 [ 261.255010] ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4] [ 261.261033] ? nf_hook_slow+0xb1/0x160 [ 261.265007] ip_output+0x1c7/0x710 [ 261.269005] ? ip_mc_output+0x13f0/0x13f0 [ 261.273002] ? __local_bh_enable_ip+0xe9/0x1b0 [ 261.278152] ? ip_fragment.constprop.56+0x220/0x220 [ 261.282996] ? nf_hook_slow+0xb1/0x160 [ 261.287007] raw_sendmsg+0x21f9/0x4420 [ 261.291008] ? dst_output+0x180/0x180 [ 261.297003] ? sched_clock_cpu+0x126/0x170 [ 261.301003] ? find_held_lock+0x39/0x1c0 [ 261.306155] ? stop_critical_timings+0x420/0x420 [ 261.311004] ? check_flags.part.36+0x450/0x450 [ 261.315005] ? _raw_spin_unlock_irq+0x29/0x40 [ 261.320995] ? _raw_spin_unlock_irq+0x29/0x40 [ 261.326142] ? cyc2ns_read_end+0x10/0x10 [ 261.330139] ? raw_bind+0x280/0x280 [ 261.334138] ? sched_clock_cpu+0x126/0x170 [ 261.338995] ? check_flags.part.36+0x450/0x450 [ 261.342991] ? __lock_acquire+0x4500/0x4500 [ 261.348994] ? inet_sendmsg+0x11c/0x500 [ 261.352989] ? dst_output+0x180/0x180 [ 261.357012] inet_sendmsg+0x11c/0x500 [ ... ] v2: - clear skb->sk at reassembly routine.(Eric Dumarzet) Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16netfilter: ip6t_rpfilter: set F_IFACE for linklocal addressesFlorian Westphal1-1/+11
Roman reports that DHCPv6 client no longer sees replies from server due to ip6tables -t raw -A PREROUTING -m rpfilter --invert -j DROP rule. We need to set the F_IFACE flag for linklocal addresses, they are scoped per-device. Fixes: 47b7e7f82802 ("netfilter: don't set F_IFACE on ipv6 fib lookups") Reported-by: Roman Mamedov <rm@romanrm.net> Tested-by: Roman Mamedov <rm@romanrm.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-05ipv6: defrag: drop non-last frags smaller than min mtuFlorian Westphal1-0/+4
don't bother with pathological cases, they only waste cycles. IPv6 requires a minimum MTU of 1280 so we should never see fragments smaller than this (except last frag). v3: don't use awkward "-offset + len" v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68). There were concerns that there could be even smaller frags generated by intermediate nodes, e.g. on radio networks. Cc: Peter Oskolkov <posk@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05ip: use rb trees for IP frag queue.Peter Oskolkov1-0/+1
Similar to TCP OOO RX queue, it makes sense to use rb trees to store IP fragments, so that OOO fragments are inserted faster. Tested: - a follow-up patch contains a rather comprehensive ip defrag self-test (functional) - ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`: netstat --statistics Ip: 282078937 total packets received 0 forwarded 0 incoming packets discarded 946760 incoming packets delivered 18743456 requests sent out 101 fragments dropped after timeout 282077129 reassemblies required 944952 packets reassembled ok 262734239 packet reassembles failed (The numbers/stats above are somewhat better re: reassemblies vs a kernel without this patchset. More comprehensive performance testing TBD). Reported-by: Jann Horn <jannh@google.com> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller6-879/+17
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree: 1) No need to set ttl from reject action for the bridge family, from Taehee Yoo. 2) Use a fixed timeout for flow that are passed up from the flowtable to conntrack, from Florian Westphal. 3) More preparation patches for tproxy support for nf_tables, from Mate Eckl. 4) Remove unnecessary indirection in core IPv6 checksum function, from Florian Westphal. 5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it. From Florian Westphal. 6) socket match now selects socket infrastructure, instead of depending on it. From Mate Eckl. 7) Patch series to simplify conntrack tuple building/parsing from packet path and ctnetlink, from Florian Westphal. 8) Fetch timeout policy from protocol helpers, instead of doing it from core, from Florian Westphal. 9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from Florian Westphal. 10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES respectively, instead of IPV6. Patch from Mate Eckl. 11) Add specific function for garbage collection in conncount, from Yi-Hung Wei. 12) Catch number of elements in the connlimit list, from Yi-Hung Wei. 13) Move locking to nf_conncount, from Yi-Hung Wei. 14) Series of patches to add lockless tree traversal in nf_conncount, from Yi-Hung Wei. 15) Resolve clash in matching conntracks when race happens, from Martynas Pumputis. 16) If connection entry times out, remove template entry from the ip_vs_conn_tab table to improve behaviour under flood, from Julian Anastasov. 17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng. 18) Call abort from 2-phase commit protocol before requesting modules, make sure this is done under the mutex, from Florian Westphal. 19) Grab module reference when starting transaction, also from Florian. 20) Dynamically allocate expression info array for pre-parsing, from Florian. 21) Add per netns mutex for nf_tables, from Florian Westphal. 22) A couple of patches to simplify and refactor nf_osf code to prepare for nft_osf support. 23) Break evaluation on missing socket, from Mate Eckl. 24) Allow to match socket mark from nft_socket, from Mate Eckl. 25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is built-in into nf_conntrack. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller3-6/+15
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18ipv6: remove dependency of nf_defrag_ipv6 on ipv6 moduleFlorian Westphal2-7/+13
IPV6=m DEFRAG_IPV6=m CONNTRACK=y yields: net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get': net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable' net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6' Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params ip6_frag_init and ip6_expire_frag_queue so it would be needed to force IPV6=y too. This patch gets rid of the 'followup linker error' by removing the dependency of ipv6.ko symbols from netfilter ipv6 defrag. Shared code is placed into a header, then used from both. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-17netfilter: conntrack: remove l3proto abstractionFlorian Westphal4-771/+4
This unifies ipv4 and ipv6 protocol trackers and removes the l3proto abstraction. This gets rid of all l3proto indirect calls and the need to do a lookup on the function to call for l3 demux. It increases module size by only a small amount (12kbyte), so this reduces size because nf_conntrack.ko is useless without either nf_conntrack_ipv4 or nf_conntrack_ipv6 module. before: text data bss dec hex filename 7357 1088 0 8445 20fd nf_conntrack_ipv4.ko 7405 1084 4 8493 212d nf_conntrack_ipv6.ko 72614 13689 236 86539 1520b nf_conntrack.ko 19K nf_conntrack_ipv4.ko 19K nf_conntrack_ipv6.ko 179K nf_conntrack.ko after: text data bss dec hex filename 79277 13937 236 93450 16d0a nf_conntrack.ko 191K nf_conntrack.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove get_timeout() indirectionFlorian Westphal1-4/+10
Not needed, we can have the l4trackers fetch it themselvs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove get_l4proto indirection from l3 protocol trackersFlorian Westphal1-29/+0
Handle it in the core instead. ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this doesn't create an ipv6 dependency. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove invert_tuple indirection from l3 protocol trackersFlorian Westphal2-12/+1
Its simpler to just handle it directly in nf_ct_invert_tuple(). Also gets rid of need to pass l3proto pointer to resolve_conntrack(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove pkt_to_tuple indirection from l3 protocol trackersFlorian Westphal1-18/+0
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove ctnetlink callbacks from l3 protocol trackersFlorian Westphal2-49/+0
handle everything from ctnetlink directly. After all these years we still only support ipv4 and ipv6, so it seems reasonable to remove l3 protocol tracker support and instead handle ipv4/ipv6 from a common, always builtin inet tracker. Step 1: Get rid of all the l3proto->func() calls. Start with ctnetlink, then move on to packet-path ones. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-09netfilter: ipv6: nf_defrag: drop skb dst before queueingFlorian Westphal1-0/+2
Eric Dumazet reports: Here is a reproducer of an annoying bug detected by syzkaller on our production kernel [..] ./b78305423 enable_conntrack Then : sleep 60 dmesg | tail -10 [ 171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2 [ 221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2 Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook. skb gets queued until frag timer expiry -- 1 minute. Normally nf_conntrack_reasm gets called during prerouting, so skb has no dst yet which might explain why this wasn't spotted earlier. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-06netfilter: nf_tproxy: fix possible non-linear access to transport headerMáté Eckl1-6/+12
This patch fixes a silent out-of-bound read possibility that was present because of the misuse of this function. Mostly it was called with a struct udphdr *hp which had only the udphdr part linearized by the skb_header_pointer, however nf_tproxy_get_sock_v{4,6} uses it as a tcphdr pointer, so some reads for tcp specific attributes may be invalid. Fixes: a583636a83ea ("inet: refactor inet[6]_lookup functions to take skb") Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-05netfilter: x_tables: set module owner for icmp(6) matchesFlorian Westphal1-0/+1
nft_compat relies on xt_request_find_match to increment refcount of the module that provides the match/target. The (builtin) icmp matches did't set the module owner so it was possible to rmmod ip(6)tables while icmp extensions were still in use. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-03Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+3
Simple overlapping changes in stmmac driver. Adjust skb_gro_flush_final_remcsum function signature to make GRO list changes in net-next, as per Stephen Rothwell's example merge resolution. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28netfilter: check if the socket netns is correct.Flavio Leitner1-4/+4
Netfilter assumes that if the socket is present in the skb, then it can be used because that reference is cleaned up while the skb is crossing netns. We want to change that to preserve the socket reference in a future patch, so this is a preparation updating netfilter to check if the socket netns matches before use it. Signed-off-by: Flavio Leitner <fbl@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-18netfilter: ipv6: nf_defrag: reduce struct net memory wasteEric Dumazet1-3/+3
It is a waste of memory to use a full "struct netns_sysctl_ipv6" while only one pointer is really used, considering netns_sysctl_ipv6 keeps growing. Also, since "struct netns_frags" has cache line alignment, it is better to move the frags_hdr pointer outside, otherwise we spend a full cache line for this pointer. This saves 192 bytes of memory per netns. Fixes: c038a767cd69 ("ipv6: add a new namespace for nf_conntrack_reasm") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-0/+1
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree: 1) Reject non-null terminated helper names from xt_CT, from Gao Feng. 2) Fix KASAN splat due to out-of-bound access from commit phase, from Alexey Kodanev. 3) Missing conntrack hook registration on IPVS FTP helper, from Julian Anastasov. 4) Incorrect skbuff allocation size in bridge nft_reject, from Taehee Yoo. 5) Fix inverted check on packet xmit to non-local addresses, also from Julian. 6) Fix ebtables alignment compat problems, from Alin Nastac. 7) Hook mask checks are not correct in xt_set, from Serhey Popovych. 8) Fix timeout listing of element in ipsets, from Jozsef. 9) Cap maximum timeout value in ipset, also from Jozsef. 10) Don't allow family option for hash:mac sets, from Florent Fourcot. 11) Restrict ebtables to work with NFPROTO_BRIDGE targets only, this Florian. 12) Another bug reported by KASAN in the rbtree set backend, from Taehee Yoo. 13) Missing __IPS_MAX_BIT update doesn't include IPS_OFFLOAD_BIT. From Gao Feng. 14) Missing initialization of match/target in ebtables, from Florian Westphal. 15) Remove useless nft_dup.h file in include path, from C. Labbe. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-08netfilter: x_tables: initialise match/target check parameter structFlorian Westphal1-0/+1
syzbot reports following splat: BUG: KMSAN: uninit-value in ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162 ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162 xt_check_match+0x1438/0x1650 net/netfilter/x_tables.c:506 ebt_check_match net/bridge/netfilter/ebtables.c:372 [inline] ebt_check_entry net/bridge/netfilter/ebtables.c:702 [inline] The uninitialised access is xt_mtchk_param->nft_compat ... which should be set to 0. Fix it by zeroing the struct beforehand, same for tgchk. ip(6)tables targetinfo uses c99-style initialiser, so no change needed there. Reported-by: syzbot+da4494182233c23a5fcf@syzkaller.appspotmail.com Fixes: 55917a21d0cc0 ("netfilter: x_tables: add context to know if extension runs from nft_compat") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-06-03netfilter: Libify xt_TPROXYMáté Eckl3-1/+151
The extracted functions will likely be usefull to implement tproxy support in nf_tables. Extrancted functions: - nf_tproxy_sk_is_transparent - nf_tproxy_laddr4 - nf_tproxy_handle_time_wait4 - nf_tproxy_get_sock_v4 - nf_tproxy_laddr6 - nf_tproxy_handle_time_wait6 - nf_tproxy_get_sock_v6 (nf_)tproxy_handle_time_wait6 also needed some refactor as its current implementation was xtables-specific. Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-29netfilter: nat: merge ipv4/ipv6 masquerade code into main nat moduleFlorian Westphal3-9/+2
Instead of using extra modules for these, turn the config options into an implicit dependency that adds masq feature to the protocol specific nf_nat module. before: text data bss dec hex filename 2001 860 4 2865 b31 net/ipv4/netfilter/nf_nat_masquerade_ipv4.ko 5579 780 2 6361 18d9 net/ipv4/netfilter/nf_nat_ipv4.ko 2860 836 8 3704 e78 net/ipv6/netfilter/nf_nat_masquerade_ipv6.ko 6648 780 2 7430 1d06 net/ipv6/netfilter/nf_nat_ipv6.ko after: text data bss dec hex filename 7245 872 8 8125 1fbd net/ipv4/netfilter/nf_nat_ipv4.ko 9165 848 12 10025 2729 net/ipv6/netfilter/nf_nat_ipv6.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller5-154/+114
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Remove obsolete nf_log tracing from nf_tables, from Florian Westphal. 2) Add support for map lookups to numgen, random and hash expressions, from Laura Garcia. 3) Allow to register nat hooks for iptables and nftables at the same time. Patchset from Florian Westpha. 4) Timeout support for rbtree sets. 5) ip6_rpfilter works needs interface for link-local addresses, from Vincent Bernat. 6) Add nf_ct_hook and nf_nat_hook structures and use them. 7) Do not drop packets on packets raceing to insert conntrack entries into hashes, this is particularly a problem in nfqueue setups. 8) Address fallout from xt_osf separation to nf_osf, patches from Florian Westphal and Fernando Mancera. 9) Remove reference to struct nft_af_info, which doesn't exist anymore. From Taehee Yoo. This batch comes with is a conflict between 25fd386e0bc0 ("netfilter: core: add missing __rcu annotation") in your tree and 2c205dd3981f ("netfilter: add struct nf_nat_hook and use it") coming in this batch. This conflict can be solved by leaving the __rcu tag on __netfilter_net_init() - added by 25fd386e0bc0 - and remove all code related to nf_nat_decode_session_hook - which is gone after 2c205dd3981f, as described by: diff --cc net/netfilter/core.c index e0ae4aae96f5,206fb2c4c319..168af54db975 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@@ -611,7 -580,13 +611,8 @@@ const struct nf_conntrack_zone nf_ct_zo EXPORT_SYMBOL_GPL(nf_ct_zone_dflt); #endif /* CONFIG_NF_CONNTRACK */ - static void __net_init __netfilter_net_init(struct nf_hook_entries **e, int max) -#ifdef CONFIG_NF_NAT_NEEDED -void (*nf_nat_decode_session_hook)(struct sk_buff *, struct flowi *); -EXPORT_SYMBOL(nf_nat_decode_session_hook); -#endif - + static void __net_init + __netfilter_net_init(struct nf_hook_entries __rcu **e, int max) { int h; I can also merge your net-next tree into nf-next, solve the conflict and resend the pull request if you prefer so. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23netfilter: ip6t_rpfilter: provide input interface for route lookupVincent Bernat1-0/+2
In commit 47b7e7f82802, this bit was removed at the same time the RT6_LOOKUP_F_IFACE flag was removed. However, it is needed when link-local addresses are used, which is a very common case: when packets are routed, neighbor solicitations are done using link-local addresses. For example, the following neighbor solicitation is not matched by "-m rpfilter": IP6 fe80::5254:33ff:fe00:1 > ff02::1:ff00:3: ICMP6, neighbor solicitation, who has 2001:db8::5254:33ff:fe00:3, length 32 Commit 47b7e7f82802 doesn't quite explain why we shouldn't use RT6_LOOKUP_F_IFACE in the rpfilter case. I suppose the interface check later in the function would make it redundant. However, the remaining of the routing code is using RT6_LOOKUP_F_IFACE when there is no source address (which matches rpfilter's case with a non-unicast destination, like with neighbor solicitation). Signed-off-by: Vincent Bernat <vincent@bernat.im> Fixes: 47b7e7f82802 ("netfilter: don't set F_IFACE on ipv6 fib lookups") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23netfilter: nf_nat: add nat type hooks to nat coreFlorian Westphal3-112/+103
Currently the packet rewrite and instantiation of nat NULL bindings happens from the protocol specific nat backend. Invocation occurs either via ip(6)table_nat or the nf_tables nat chain type. Invocation looks like this (simplified): NF_HOOK() | `---iptable_nat | `---> nf_nat_l3proto_ipv4 -> nf_nat_packet | new packet? pass skb though iptables nat chain | `---> iptable_nat: ipt_do_table In nft case, this looks the same (nft_chain_nat_ipv4 instead of iptable_nat). This is a problem for two reasons: 1. Can't use iptables nat and nf_tables nat at the same time, as the first user adds a nat binding (nf_nat_l3proto_ipv4 adds a NULL binding if do_table() did not find a matching nat rule so we can detect post-nat tuple collisions). 2. If you use e.g. nft_masq, snat, redir, etc. uses must also register an empty base chain so that the nat core gets called fro NF_HOOK() to do the reverse translation, which is neither obvious nor user friendly. After this change, the base hook gets registered not from iptable_nat or nftables nat hooks, but from the l3 nat core. iptables/nft nat base hooks get registered with the nat core instead: NF_HOOK() | `---> nf_nat_l3proto_ipv4 -> nf_nat_packet | new packet? pass skb through iptables/nftables nat chains | +-> iptables_nat: ipt_do_table +-> nft nat chain x `-> nft nat chain y The nat core deals with null bindings and reverse translation. When no mapping exists, it calls the registered nat lookup hooks until one creates a new mapping. If both iptables and nftables nat hooks exist, the first matching one is used (i.e., higher priority wins). Also, nft users do not need to create empty nat hooks anymore, nat core always registers the base hooks that take care of reverse/reply translation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23netfilter: nf_tables: allow chain type to override hook registerFlorian Westphal1-6/+14
Will be used in followup patch when nat types no longer use nf_register_net_hook() but will instead register with the nat core. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23netfilter: xtables: allow table definitions not backed by hook_opsFlorian Westphal1-1/+4
The ip(6)tables nat table is currently receiving skbs from the netfilter core, after a followup patch skbs will be coming from the netfilter nat core instead, so the table is no longer backed by normal hook_ops. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23netfilter: nf_nat: move common nat code to nat coreFlorian Westphal1-46/+2
Copy-pasted, both l3 helpers almost use same code here. Split out the common part into an 'inet' helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+1
S390 bpf_jit.S is removed in net-next and had changes in 'net', since that code isn't used any more take the removal. TLS data structures split the TX and RX components in 'net-next', put the new struct members from the bug fix in 'net' into the RX part. The 'net-next' tree had some reworking of how the ERSPAN code works in the GRE tunneling code, overlapping with a one-line headroom calculation fix in 'net'. Overlapping changes in __sock_map_ctx_update_elem(), keep the bits that read the prog members via READ_ONCE() into local variables before using them. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08netfilter: x_tables: add module alias for icmp matchesFlorian Westphal1-0/+1
The icmp matches are implemented in ip_tables and ip6_tables, respectively, so for normal iptables they are always available: those modules are loaded once iptables calls getsockopt() to fetch available module revisions. In iptables-over-nftables case probing occurs via nfnetlink, so these modules might not be loaded. Add aliases so modprobe can load these when icmp/icmp6 is requested. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller11-276/+180
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree, more relevant updates in this batch are: 1) Add Maglev support to IPVS. Moreover, store lastest server weight in IPVS since this is needed by maglev, patches from from Inju Song. 2) Preparation works to add iptables flowtable support, patches from Felix Fietkau. 3) Hand over flows back to conntrack slow path in case of TCP RST/FIN packet is seen via new teardown state, also from Felix. 4) Add support for extended netlink error reporting for nf_tables. 5) Support for larger timeouts that 23 days in nf_tables, patch from Florian Westphal. 6) Always set an upper limit to dynamic sets, also from Florian. 7) Allow number generator to make map lookups, from Laura Garcia. 8) Use hash_32() instead of opencode hashing in IPVS, from Vicent Bernat. 9) Extend ip6tables SRH match to support previous, next and last SID, from Ahmed Abdelsalam. 10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez. 11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot. 12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables, from Taehee Yoo. 13) Unify meta bridge with core nft_meta, then make nft_meta built-in. Make rt and exthdr built-in too, again from Florian. 14) Missing initialization of tbl->entries in IPVS, from Cong Wang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-06netfilter: nf_nat: remove unused ct arg from lookup functionsFlorian Westphal3-13/+7
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-06netfilter: ip6t_srh: extend SRH matching for previous, next and last SIDAhmed Abdelsalam1-9/+164
IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed by SR encapsulated packet. Each SID is encoded as an IPv6 prefix. When a Firewall receives an SR encapsulated packet, it should be able to identify which node previously processed the packet (previous SID), which node is going to process the packet next (next SID), and which node is the last to process the packet (last SID) which represent the final destination of the packet in case of inline SR mode. An example use-case of using these features could be SID list that includes two firewalls. When the second firewall receives a packet, it can check whether the packet has been processed by the first firewall or not. Based on that check, it decides to apply all rules, apply just subset of the rules, or totally skip all rules and forward the packet to the next SID. This patch extends SRH match to support matching previous SID, next SID, and last SID. Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24netfilter: x_tables: remove duplicate ip6t_get_target function callTaehee Yoo1-1/+0
In the check_target, ip6t_get_target is called twice. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24netfilter: add NAT support for shifted portmap rangesThierry Du Tre6-8/+8
This is a patch proposal to support shifted ranges in portmaps. (i.e. tcp/udp incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100) Currently DNAT only works for single port or identical port ranges. (i.e. ports 5000-5100 on WAN interface redirected to a LAN host while original destination port is not altered) When different port ranges are configured, either 'random' mode should be used, or else all incoming connections are mapped onto the first port in the redirect range. (in described example WAN:5000-5100 will all be mapped to 192.168.1.5:2000) This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET which uses a base port value to calculate an offset with the destination port present in the incoming stream. That offset is then applied as index within the redirect port range (index modulo rangewidth to handle range overflow). In described example the base port would be 5000. An incoming stream with destination port 5004 would result in an offset value 4 which means that the NAT'ed stream will be using destination port 2004. Other possibilities include deterministic mapping of larger or multiple ranges to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port 51xx) This patch does not change any current behavior. It just adds new NAT proto range functionality which must be selected via the specific flag when intended to use. A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed which makes this functionality immediately available. Signed-off-by: Thierry Du Tre <thierry@dtsystems.be> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24netfilter: nf_flow_table: move init code to nf_flow_table_core.cFelix Fietkau1-2/+1
Reduces duplication of .gc and .params in flowtable type definitions and makes the API clearer Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_tableFelix Fietkau1-232/+0
Useful as preparation for adding iptables support for offload. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-21netfilter: nf_flow_table: cache mtu in struct flow_offload_tupleFelix Fietkau1-14/+3
Reduces the number of cache lines touched in the offload forwarding path. This is safe because PMTU limits are bypassed for the forwarding path (see commit f87c10a8aa1e for more details). Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19netfilter: nf_tables: NAT chain and extensions require NF_TABLESPablo Neira Ayuso1-27/+28
Move these options inside the scope of the 'if' NF_TABLES and NF_TABLES_IPV6 dependencies. This patch fixes: net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain': >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain' net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit': >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type' net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init': >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type' that happens with: CONFIG_NF_TABLES=m CONFIG_NFT_CHAIN_NAT_IPV6=y Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-04inet: frags: fix ip6frag_low_thresh boundaryEric Dumazet1-2/+0
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches, since linker might place next to it a non zero value preventing a change to ip6frag_low_thresh. ip6frag_low_thresh is not used anymore in the kernel, but we do not want to prematuraly break user scripts wanting to change it. Since specifying a minimal value of 0 for proc_doulongvec_minmax() is moot, let's remove these zero values in all defrag units. Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+4
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c, we had some overlapping changes: 1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE --> MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE 2) In 'net-next' params->log_rq_size is renamed to be params->log_rq_mtu_frames. 3) In 'net-next' params->hard_mtu is added. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CBEric Dumazet1-18/+11
nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future. By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields in a single cache line, matching what we did for IPv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: break the 2GB limit for frags storageEric Dumazet1-5/+5
Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonnably well under pressure. Current memory tracking is using one atomic_t and integers. Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches. Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true : if (... || frag_mem_limit(nf) > nf->high_thresh) Tested: $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh <frag DDOS> $ grep FRAG /proc/net/sockstat FRAG: inuse 14705885 memory 16000002880 $ nstat -n ; sleep 1 ; nstat | grep Reas IpReasmReqds 3317150 0.0 IpReasmFails 3317112 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: remove inet_frag_maybe_warn_overflow()Eric Dumazet1-3/+2
This function is obsolete, after rhashtable addition to inet defrag. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31inet: frags: use rhashtables for reassembly unitsEric Dumazet1-38/+13
Some applications still rely on IP fragmentation, and to be fair linux reassembly unit is not working under any serious load. It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!) A work queue is supposed to garbage collect items when host is under memory pressure, and doing a hash rebuild, changing seed used in hash computations. This work queue blocks softirqs for up to 25 ms when doing a hash rebuild, occurring every 5 seconds if host is under fire. Then there is the problem of sharing this hash table for all netns. It is time to switch to rhashtables, and allocate one of them per netns to speedup netns dismantle, since this is a critical metric these days. Lookup is now using RCU. A followup patch will even remove the refcount hold/release left from prior implementation and save a couple of atomic operations. Before this patch, 16 cpus (16 RX queue NIC) could not handle more than 1 Mpps frags DDOS. After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB of storage for the fragments (exact number depends on frags being evicted after timeout) $ grep FRAG /proc/net/sockstat FRAG: inuse 1966916 memory 2140004608 A followup patch will change the limits for 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Florian Westphal <fw@strlen.de> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>