aboutsummaryrefslogtreecommitdiffstats
path: root/net/netfilter/nf_conntrack_core.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-02-27netfilter: conntrack: avoid same-timeout updateFlorian Westphal1-5/+4
No need to dirty a cache line if timeout is unchanged. Also, WARN() is useless here: we crash on 'skb->len' access if skb is NULL. Last, ct->timeout is u32, not 'unsigned long' so adapt the function prototype accordingly. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27netfilter: nat: remove nf_nat_l3proto.h and nf_nat_core.hFlorian Westphal1-1/+0
The l3proto name is gone, its header file is the last trace. While at it, also remove nf_nat_core.h, its very small and all users include nf_nat.h too. before: text data bss dec hex filename 22948 1612 4136 28696 7018 nf_nat.ko after removal of l3proto register/unregister functions: text data bss dec hex filename 22196 1516 4136 27848 6cc8 nf_nat.ko checkpatch complains about overly long lines, but line breaks do not make things more readable and the line length gets smaller here, not larger. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-3/+11
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for you net-next tree: 1) Missing NFTA_RULE_POSITION_ID netlink attribute validation, from Phil Sutter. 2) Restrict matching on tunnel metadata to rx/tx path, from wenxu. 3) Avoid indirect calls for IPV6=y, from Florian Westphal. 4) Add two indirections to prepare merger of IPV4 and IPV6 nat modules, from Florian Westphal. 5) Broken indentation in ctnetlink, from Colin Ian King. 6) Patches to use struct_size() from netfilter and IPVS, from Gustavo A. R. Silva. 7) Display kernel splat only once in case of racing to confirm conntrack from bridge plus nfqueue setups, from Chieh-Min Wang. 8) Skip checksum validation for layer 4 protocols that don't need it, patch from Alin Nastac. 9) Sparse warning due to symbol that should be static in CLUSTERIP, from Wei Yongjun. 10) Add new toggle to disable SDP payload translation when media endpoint is reachable though the same interface as the signalling peer, from Alin Nastac. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirmChieh-Min Wang1-3/+11
For bridge(br_flood) or broadcast/multicast packets, they could clone skb with unconfirmed conntrack which break the rule that unconfirmed skb->_nfct is never shared. With nfqueue running on my system, the race can be easily reproduced with following warning calltrace: [13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P W 4.4.60 #7744 [13257.707568] Hardware name: Qualcomm (Flattened Device Tree) [13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14) [13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8) [13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0) [13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24) [13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618) [13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc) [13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8) [13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0) [13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c) [13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8) [13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0) [13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4) [13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac) [13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170) [13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420) [13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248) [13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0) [13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c) [13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368) [13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44) [13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200) [13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64) [13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34) The original code just triggered the warning but do nothing. It will caused the shared conntrack moves to the dying list and the packet be droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack). - Reproduce steps: +----------------------------+ | br0(bridge) | | | +-+---------+---------+------+ | eth0| | eth1| | eth2| | | | | | | +--+--+ +--+--+ +---+-+ | | | | | | +--+-+ +-+--+ +--+-+ | PC1| | PC2| | PC3| +----+ +----+ +----+ iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass ps: Our nfq userspace program will set mark on packets whose connection has already been processed. PC1 sends broadcast packets simulated by hping3: hping3 --rand-source --udp 192.168.1.255 -i u100 - Broadcast racing flow chart is as follow: br_handle_frame BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish) // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage br_handle_frame_finish // check if this packet is broadcast br_flood_forward br_flood list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port maybe_deliver deliver_clone skb = skb_clone(skb) __br_forward BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...) // queue in our nfq and received by our userspace program // goto __nf_conntrack_confirm with process context on CPU 1 br_pass_frame_up BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...) // goto __nf_conntrack_confirm with softirq context on CPU 0 Because conntrack confirm can happen at both INPUT and POSTROUTING stage. So with NFQUEUE running, skb->_nfct with the same unconfirmed conntrack could race on different core. This patch fixes a repeating kernel splat, now it is only displayed once. Signed-off-by: Chieh-Min Wang <chiehminw@synology.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+16
An ipvlan bug fix in 'net' conflicted with the abstraction away of the IPV6 specific support in 'net-next'. Similarly, a bug fix for mlx5 in 'net' conflicted with the flow action conversion in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04netfilter: nf_nat: skip nat clash resolution for same-origin entriesMartynas Pumputis1-0/+16
It is possible that two concurrent packets originating from the same socket of a connection-less protocol (e.g. UDP) can end up having different IP_CT_DIR_REPLY tuples which results in one of the packets being dropped. To illustrate this, consider the following simplified scenario: 1. Packet A and B are sent at the same time from two different threads by same UDP socket. No matching conntrack entry exists yet. Both packets cause allocation of a new conntrack entry. 2. get_unique_tuple gets called for A. No clashing entry found. conntrack entry for A is added to main conntrack table. 3. get_unique_tuple is called for B and will find that the reply tuple of B is already taken by A. It will allocate a new UDP source port for B to resolve the clash. 4. conntrack entry for B cannot be added to main conntrack table because its ORIGINAL direction is clashing with A and the REPLY directions of A and B are not the same anymore due to UDP source port reallocation done in step 3. This patch modifies nf_conntrack_tuple_taken so it doesn't consider colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal. [ Florian: simplify patch to not use .allow_clash setting and always ignore identical flows ] Signed-off-by: Martynas Pumputis <martynas@weave.works> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-22netfilter: conntrack: fix bogus port values for other l4 protocolsFlorian Westphal1-11/+35
We must only extract l4 proto information if we can track the layer 4 protocol. Before removal of pkt_to_tuple callback, the code to extract port information was only reached for TCP/UDP/LITE/DCCP/SCTP. The other protocols were handled by the indirect call, and the 'generic' tracker took care of other protocols that have no notion of 'ports'. After removal of the callback we must be more strict here and only init port numbers for those protocols that have ports. Fixes: df5e1629087a ("netfilter: conntrack: remove pkt_to_tuple callback") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-22netfilter: conntrack: fix IPV6=n buildsFlorian Westphal1-0/+8
Stephen Rothwell reports: After merging the netfilter-next tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: ERROR: "nf_conntrack_invert_icmpv6_tuple" [nf_conntrack.ko] undefined! ERROR: "nf_conntrack_icmpv6_packet" [nf_conntrack.ko] undefined! ERROR: "nf_conntrack_icmpv6_init_net" [nf_conntrack.ko] undefined! ERROR: "icmpv6_pkt_to_tuple" [nf_conntrack.ko] undefined! ERROR: "nf_ct_gre_keymap_destroy" [nf_conntrack.ko] undefined! icmpv6 related errors are due to lack of IS_ENABLED(CONFIG_IPV6) (no icmpv6 support is builtin if kernel has CONFIG_IPV6=n), the nf_ct_gre_keymap_destroy error is due to lack of PROTO_GRE check. Fixes: a47c54048162 ("netfilter: conntrack: handle builtin l4proto packet functions via direct calls") Fixes: e2e48b471634 ("netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls") Fixes: 197c4300aec0 ("netfilter: conntrack: remove invert_tuple callback") Fixes: 2a389de86e4a ("netfilter: conntrack: remove l4proto init and get_net callbacks") Fixes: e56894356f60 ("netfilter: conntrack: remove l4proto destroy hook") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: remove nf_ct_l4proto_find_getFlorian Westphal1-8/+3
Its now same as __nf_ct_l4proto_find(), so rename that to nf_ct_l4proto_find and use it everywhere. It never returns NULL and doesn't need locks or reference counts. Before this series: 302824 net/netfilter/nf_conntrack.ko 21504 net/netfilter/nf_conntrack_proto_gre.ko text data bss dec hex filename 6281 1732 4 8017 1f51 nf_conntrack_proto_gre.ko 108356 20613 236 129205 1f8b5 nf_conntrack.ko After: 294864 net/netfilter/nf_conntrack.ko text data bss dec hex filename 106979 19557 240 126776 1ef38 nf_conntrack.ko so, even with builtin gre, total size got reduced. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: remove l4proto destroy hookFlorian Westphal1-4/+11
Only one user (gre), add a direct call and remove this facility. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: avoid unneeded nf_conntrack_l4proto lookupsFlorian Westphal1-44/+9
after removal of the packet and invert function pointers, several places do not need to lookup the l4proto structure anymore. Remove those lookups. The function nf_ct_invert_tuplepr becomes redundant, replace it with nf_ct_invert_tuple everywhere. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: remove remaining l4proto indirect packet callsFlorian Westphal1-7/+19
Now that all l4trackers are builtin, no need to use a mix of direct and indirect calls. This removes the last two users: gre and the generic l4 protocol tracker. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: remove invert_tuple callbackFlorian Westphal1-2/+6
Only used by icmp(v6). Prefer a direct call and remove this function from the l4proto struct. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: remove pkt_to_tuple callbackFlorian Westphal1-2/+4
GRE is now builtin, so we can handle it via direct call and remove the callback. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: handle icmp pkt_to_tuple helper via direct callsFlorian Westphal1-0/+6
rather than handling them via indirect call, use a direct one instead. This leaves GRE as the last user of this indirect call facility. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18netfilter: conntrack: handle builtin l4proto packet functions via direct callsFlorian Westphal1-1/+44
The l4 protocol trackers are invoked via indirect call: l4proto->packet(). With one exception (gre), all l4trackers are builtin, so we can make .packet optional and use a direct call for most protocols. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS1-1/+1
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: reference totalram_pages and managed_pages once per functionArun KS1-3/+4
Patch series "mm: convert totalram_pages, totalhigh_pages and managed pages to atomic", v5. This series converts totalram_pages, totalhigh_pages and zone->managed_pages to atomic variables. totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better to remove the lock and convert variables to atomic. With the change, preventing poteintial store-to-read tearing comes as a bonus. This patch (of 4): This is in preparation to a later patch which converts totalram_pages and zone->managed_pages to atomic variables. Please note that re-reading the value might lead to a different value and as such it could lead to unexpected behavior. There are no known bugs as a result of the current code but it is better to prevent from them in principle. Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-21netfilter: conntrack: remove empty pernet fini stubsFlorian Westphal1-22/+6
after moving sysctl handling into single place, the init functions can't fail anymore and some of the fini functions are empty. Remove them and change return type to void. This also simplifies error unwinding in conntrack module init path. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-03netfilter: conntrack: fix calculation of next bucket number in early_dropVasily Khoruzhick1-5/+8
If there's no entry to drop in bucket that corresponds to the hash, early_drop() should look for it in other buckets. But since it increments hash instead of bucket number, it actually looks in the same bucket 8 times: hsize is 16k by default (14 bits) and hash is 32-bit value, so reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in most cases. Fix it by increasing bucket number instead of hash and rename _hash to bucket to avoid future confusion. Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic") Cc: <stable@vger.kernel.org> # v4.7+ Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: remove l3->l4 mapping informationFlorian Westphal1-8/+7
l4 protocols are demuxed by l3num, l4num pair. However, almost all l4 trackers are l3 agnostic. Only exceptions are: - gre, icmp (ipv4 only) - icmpv6 (ipv6 only) This commit gets rid of the l3 mapping, l4 trackers can now be looked up by their IPPROTO_XXX value alone, which gets rid of the additional l3 indirection. For icmp, ipcmp6 and gre, add a check on state->pf and return -NF_ACCEPT in case we're asked to track e.g. icmpv6-in-ipv4, this seems more fitting than using the generic tracker. Additionally we can kill the 2nd l4proto definitions that were needed for v4/v6 split -- they are now the same so we can use single l4proto struct for each protocol, rather than two. The EXPORT_SYMBOLs can be removed as all these object files are part of nf_conntrack with no external references. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: remove error callback and handle icmp from coreFlorian Westphal1-7/+36
icmp(v6) are the only two layer four protocols that need the error() callback (to handle icmp errors that are related to an established connections, e.g. packet too big, port unreachable and the like). Remove the error callback and handle these two special cases from the core. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: remove the l4proto->new() functionFlorian Westphal1-6/+0
->new() gets invoked after ->error() and before ->packet() if a conntrack lookup has found no result for the tuple. We can fold it into ->packet() -- the packet() implementations can check if the conntrack is confirmed (new) or not (already in hash). If its unconfirmed, the conntrack isn't in the hash yet so current skb created a new conntrack entry. Only relevant side effect -- if packet() doesn't return NF_ACCEPT but -NF_ACCEPT (or drop), while the conntrack was just created, then the newly allocated conntrack is freed right away, rather than not created in the first place. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20netfilter: conntrack: pass nf_hook_state to packet and error handlersFlorian Westphal1-24/+25
nf_hook_state contains all the hook meta-information: netns, protocol family, hook location, and so on. Instead of only passing selected information, pass a pointer to entire structure. This will allow to merge the error and the packet handlers and remove the ->new() function in followup patches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03netfilter: use kvmalloc_array to allocate memory for hashtableLi RongQing1-23/+6
nf_ct_alloc_hashtable is used to allocate memory for conntrack, NAT bysrc and expectation hashtable. Assuming 64k bucket size, which means 7th order page allocation, __get_free_pages, called by nf_ct_alloc_hashtable, will trigger the direct memory reclaim and stall for a long time, when system has lots of memory stress so replace combination of __get_free_pages and vzalloc with kvmalloc_array, which provides a overflow check and a fallback if no high order memory is available, and do not retry to reclaim memory, reduce stall and remove nf_ct_free_hashtable, since it is just a kvfree Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Wang Li <wangli39@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller1-67/+185
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree: 1) No need to set ttl from reject action for the bridge family, from Taehee Yoo. 2) Use a fixed timeout for flow that are passed up from the flowtable to conntrack, from Florian Westphal. 3) More preparation patches for tproxy support for nf_tables, from Mate Eckl. 4) Remove unnecessary indirection in core IPv6 checksum function, from Florian Westphal. 5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it. From Florian Westphal. 6) socket match now selects socket infrastructure, instead of depending on it. From Mate Eckl. 7) Patch series to simplify conntrack tuple building/parsing from packet path and ctnetlink, from Florian Westphal. 8) Fetch timeout policy from protocol helpers, instead of doing it from core, from Florian Westphal. 9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from Florian Westphal. 10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES respectively, instead of IPV6. Patch from Mate Eckl. 11) Add specific function for garbage collection in conncount, from Yi-Hung Wei. 12) Catch number of elements in the connlimit list, from Yi-Hung Wei. 13) Move locking to nf_conncount, from Yi-Hung Wei. 14) Series of patches to add lockless tree traversal in nf_conncount, from Yi-Hung Wei. 15) Resolve clash in matching conntracks when race happens, from Martynas Pumputis. 16) If connection entry times out, remove template entry from the ip_vs_conn_tab table to improve behaviour under flood, from Julian Anastasov. 17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng. 18) Call abort from 2-phase commit protocol before requesting modules, make sure this is done under the mutex, from Florian Westphal. 19) Grab module reference when starting transaction, also from Florian. 20) Dynamically allocate expression info array for pre-parsing, from Florian. 21) Add per netns mutex for nf_tables, from Florian Westphal. 22) A couple of patches to simplify and refactor nf_osf code to prepare for nft_osf support. 23) Break evaluation on missing socket, from Mate Eckl. 24) Allow to match socket mark from nft_socket, from Mate Eckl. 25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is built-in into nf_conntrack. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linuxDavid S. Miller1-1/+1
All conflicts were trivial overlapping changes, so reasonably easy to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18netfilter: Remove useless param helper of nf_ct_helper_ext_addGao Feng1-2/+1
The param helper of nf_ct_helper_ext_add is useless now, then remove it now. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-18netfilter: nf_conntrack: resolve clash for matching conntracksMartynas Pumputis1-8/+22
This patch enables the clash resolution for NAT (disabled in "590b52e10d41") if clashing conntracks match (i.e. both tuples are equal) and a protocol allows it. The clash might happen for a connections-less protocol (e.g. UDP) when two threads in parallel writes to the same socket and consequent calls to "get_unique_tuple" return the same tuples (incl. reply tuples). In this case it is safe to perform the resolution, as the losing CT describes the same mangling as the winning CT, so no modifications to the packet are needed, and the result of rules traversal for the loser's packet stays valid. Signed-off-by: Martynas Pumputis <martynas@weave.works> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-17netfilter: conntrack: remove l3proto abstractionFlorian Westphal1-7/+4
This unifies ipv4 and ipv6 protocol trackers and removes the l3proto abstraction. This gets rid of all l3proto indirect calls and the need to do a lookup on the function to call for l3 demux. It increases module size by only a small amount (12kbyte), so this reduces size because nf_conntrack.ko is useless without either nf_conntrack_ipv4 or nf_conntrack_ipv6 module. before: text data bss dec hex filename 7357 1088 0 8445 20fd nf_conntrack_ipv4.ko 7405 1084 4 8493 212d nf_conntrack_ipv6.ko 72614 13689 236 86539 1520b nf_conntrack.ko 19K nf_conntrack_ipv4.ko 19K nf_conntrack_ipv6.ko 179K nf_conntrack.ko after: text data bss dec hex filename 79277 13937 236 93450 16d0a nf_conntrack.ko 191K nf_conntrack.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove get_timeout() indirectionFlorian Westphal1-14/+2
Not needed, we can have the l4trackers fetch it themselvs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: avoid l4proto pkt_to_tuple callsFlorian Westphal1-1/+15
Handle common protocols (udp, tcp, ..), in the core and only do the call if needed by the l4proto tracker. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: avoid calls to l4proto invert_tupleFlorian Westphal1-1/+7
Handle the common cases (tcp, udp, etc). in the core and only do the indirect call for the protocols that need it (GRE for instance). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove get_l4proto indirection from l3 protocol trackersFlorian Westphal1-20/+88
Handle it in the core instead. ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this doesn't create an ipv6 dependency. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove invert_tuple indirection from l3 protocol trackersFlorian Westphal1-10/+16
Its simpler to just handle it directly in nf_ct_invert_tuple(). Also gets rid of need to pass l3proto pointer to resolve_conntrack(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove pkt_to_tuple indirection from l3 protocol trackersFlorian Westphal1-6/+33
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16openvswitch: use nf_ct_get_tuplepr, invert_tupleprFlorian Westphal1-2/+1
These versions deal with the l3proto/l4proto details internally. It removes only caller of nf_ct_get_tuple, so make it static. After this, l3proto->get_l4proto() can be removed in a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-10netfilter: Add nf_ct_get_tuple_skb global lookup functionToke Høiland-Jørgensen1-0/+36
This adds a global netfilter function to extract a conntrack tuple from an skb. The function uses a new function added to nf_ct_hook, which will try to get the tuple from skb->_nfct, and do a full lookup if that fails. This makes it possible to use the lookup function before the skb has passed through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is copied to the caller to avoid issues with reference counting. The function returns false if conntrack is not loaded, allowing it to be used without incurring a module dependency on conntrack. This is used by the NAT mode in sch_cake. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-09netfilter: nf_conntrack: Fix possible possible crash on module loading.Andrey Ryabinin1-1/+1
Loading the nf_conntrack module with doubled hashsize parameter, i.e. modprobe nf_conntrack hashsize=12345 hashsize=12345 causes NULL-ptr deref. If 'hashsize' specified twice, the nf_conntrack_set_hashsize() function will be called also twice. The first nf_conntrack_set_hashsize() call will set the 'nf_conntrack_htable_size' variable: nf_conntrack_set_hashsize() ... /* On boot, we can set this without any fancy locking. */ if (!nf_conntrack_htable_size) return param_set_uint(val, kp); But on the second invocation, the nf_conntrack_htable_size is already set, so the nf_conntrack_set_hashsize() will take a different path and call the nf_conntrack_hash_resize() function. Which will crash on the attempt to dereference 'nf_conntrack_hash' pointer: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack] Call Trace: nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack] parse_args+0x1f9/0x5a0 load_module+0x1281/0x1a50 __se_sys_finit_module+0xbe/0xf0 do_syscall_64+0x7c/0x390 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fix this, by checking !nf_conntrack_hash instead of !nf_conntrack_htable_size. nf_conntrack_hash will be initialized only after the module loaded, so the second invocation of the nf_conntrack_set_hashsize() won't crash, it will just reinitialize nf_conntrack_htable_size again. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracksPablo Neira Ayuso1-0/+77
In nfqueue, two consecutive skbuffs may race to create the conntrack entry. Hence, the one that loses the race gets dropped due to clash in the insertion into the hashes from the nf_conntrack_confirm() path. This patch adds a new nf_conntrack_update() function which searches for possible clashes and resolve them. NAT mangling for the packet losing race is corrected by using the conntrack information that won race. In order to avoid direct module dependencies with conntrack and NAT, the nf_ct_hook and nf_nat_hook structures are used for this purpose. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23netfilter: add struct nf_nat_hook and use itPablo Neira Ayuso1-5/+0
Move decode_session() and parse_nat_setup_hook() indirections to struct nf_nat_hook structure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-23netfilter: add struct nf_ct_hook and use itPablo Neira Ayuso1-3/+6
Move the nf_ct_destroy indirection to the struct nf_ct_hook. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-07netfilter: ctnetlink: export nf_conntrack_maxFlorent Fourcot1-0/+1
IPCTNL_MSG_CT_GET_STATS netlink command allow to monitor current number of conntrack entries. However, if one wants to compare it with the maximum (and detect exhaustion), the only solution is currently to read sysctl value. This patch add nf_conntrack_max value in netlink message, and simplify monitoring for application built on netlink API. Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-29net: Remove rtnl_lock() in nf_ct_iterate_destroy()Kirill Tkhai1-2/+0
rtnl_lock() doesn't protect net::ct::count, and it's not needed for__nf_ct_unconfirmed_destroy() and for nf_queue_nf_hook_drop(). Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-29net: Introduce net_rwsem to protect net_namespace_listKirill Tkhai1-0/+2
rtnl_lock() is used everywhere, and contention is very high. When someone wants to iterate over alive net namespaces, he/she has no a possibility to do that without exclusive lock. But the exclusive rtnl_lock() in such places is overkill, and it just increases the contention. Yes, there is already for_each_net_rcu() in kernel, but it requires rcu_read_lock(), and this can't be sleepable. Also, sometimes it may be need really prevent net_namespace_list growth, so for_each_net_rcu() is not fit there. This patch introduces new rw_semaphore, which will be used instead of rtnl_mutex to protect net_namespace_list. It is sleepable and allows not-exclusive iterations over net namespaces list. It allows to stop using rtnl_lock() in several places (what is made in next patches) and makes less the time, we keep rtnl_mutex. Here we just add new lock, while the explanation of we can remove rtnl_lock() there are in next patches. Fine grained locks generally are better, then one big lock, so let's do that with net_namespace_list, while the situation allows that. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-6/+20
Pull networking updates from David Miller: 1) Significantly shrink the core networking routing structures. Result of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf 2) Add netdevsim driver for testing various offloads, from Jakub Kicinski. 3) Support cross-chip FDB operations in DSA, from Vivien Didelot. 4) Add a 2nd listener hash table for TCP, similar to what was done for UDP. From Martin KaFai Lau. 5) Add eBPF based queue selection to tun, from Jason Wang. 6) Lockless qdisc support, from John Fastabend. 7) SCTP stream interleave support, from Xin Long. 8) Smoother TCP receive autotuning, from Eric Dumazet. 9) Lots of erspan tunneling enhancements, from William Tu. 10) Add true function call support to BPF, from Alexei Starovoitov. 11) Add explicit support for GRO HW offloading, from Michael Chan. 12) Support extack generation in more netlink subsystems. From Alexander Aring, Quentin Monnet, and Jakub Kicinski. 13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From Russell King. 14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso. 15) Many improvements and simplifications to the NFP driver bpf JIT, from Jakub Kicinski. 16) Support for ipv6 non-equal cost multipath routing, from Ido Schimmel. 17) Add resource abstration to devlink, from Arkadi Sharshevsky. 18) Packet scheduler classifier shared filter block support, from Jiri Pirko. 19) Avoid locking in act_csum, from Davide Caratti. 20) devinet_ioctl() simplifications from Al viro. 21) More TCP bpf improvements from Lawrence Brakmo. 22) Add support for onlink ipv6 route flag, similar to ipv4, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits) tls: Add support for encryption using async offload accelerator ip6mr: fix stale iterator net/sched: kconfig: Remove blank help texts openvswitch: meter: Use 64-bit arithmetic instead of 32-bit tcp_nv: fix potential integer overflow in tcpnv_acked r8169: fix RTL8168EP take too long to complete driver initialization. qmi_wwan: Add support for Quectel EP06 rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK ipmr: Fix ptrdiff_t print formatting ibmvnic: Wait for device response when changing MAC qlcnic: fix deadlock bug tcp: release sk_frag.page in tcp_disconnect ipv4: Get the address of interface correctly. net_sched: gen_estimator: fix lockdep splat net: macb: Handle HRESP error net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring ipv6: addrconf: break critical section in addrconf_verify_rtnl() ipv6: change route cache aging logic i40e/i40evf: Update DESC_NEEDED value to reflect larger value bnxt_en: cleanup DIM work on device shutdown ...
2018-01-19netfilter: remove messages print and boot/module load timePablo Neira Ayuso1-6/+0
Several reasons for this: * Several modules maintain internal version numbers, that they print at boot/module load time, that are not exposed to userspace, as a primitive mechanism to make revision number control from the earlier days of Netfilter. * IPset shows the protocol version at boot/module load time, instead display this via module description, as Jozsef suggested. * Remove copyright notice at boot/module load time in two spots, the Netfilter codebase is a collective development effort, if we would have to display copyrights for each contributor at boot/module load time for each extensions we have, we would probably fill up logs with lots of useless information - from a technical standpoint. So let's be consistent and remove them all. Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_conntrack: add IPS_OFFLOAD status bitPablo Neira Ayuso1-0/+20
This new bit tells us that the conntrack entry is owned by the flow table offload infrastructure. # cat /proc/net/nf_conntrack ipv4 2 tcp 6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 zone=0 use=2 Note the [OFFLOAD] tag in the listing. The timer of such conntrack entries look like stopped from userspace. In practise, to make sure the conntrack entry does not go away, the conntrack timer is periodically set to an arbitrary large value that gets refreshed on every iteration from the garbage collector, so it never expires- and they display no internal state in the case of TCP flows. This allows us to save a bitcheck from the packet path via nf_ct_is_expired(). Conntrack entries that have been offloaded to the flow table infrastructure cannot be deleted/flushed via ctnetlink. The flow table infrastructure is also responsible for releasing this conntrack entry. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-28netfilter: Eliminate cond_resched_rcu_qs() in favor of cond_resched()Paul E. McKenney1-1/+1
Now that cond_resched() also provides RCU quiescent states when needed, it can be used in place of cond_resched_rcu_qs(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: <netfilter-devel@vger.kernel.org>
2017-11-15Merge tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linuxLinus Torvalds1-1/+1
Pull module updates from Jessica Yu: "Summary of modules changes for the 4.15 merge window: - treewide module_param_call() cleanup, fix up set/get function prototype mismatches, from Kees Cook - minor code cleanups" * tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: Do not paper over type mismatches in module_param_call() treewide: Fix function prototypes for module_param_call() module: Prepare to convert all module_param_call() prototypes kernel/module: Delete an error message for a failed memory allocation in add_module_usage()