aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-06-04ipv4: Plumb support for nexthop object in a fib_infoDavid Ahern3-36/+177
Add 'struct nexthop' and nh_list list_head to fib_info. nh_list is the fib_info side of the nexthop <-> fib_info relationship. Add fi_list list_head to 'struct nexthop' to track fib_info entries using a nexthop instance. Add __remove_nexthop_fib and add it to __remove_nexthop to walk the new list_head and mark those fib entries as dead when the nexthop is deleted. Add a few nexthop helpers for use when a nexthop is added to fib_info: - nexthop_cmp to determine if 2 nexthops are the same - nexthop_path_fib_result to select a path for a multipath 'struct nexthop' - nexthop_fib_nhc to select a specific fib_nh_common within a multipath 'struct nexthop' Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses a 'struct nexthop', and mark fib_info_nh as only used for the non-nexthop case. Update the fib_info functions to check for fi->nh and take a different path as needed: - free_fib_info_rcu - put the nexthop object reference - fib_release_info - remove the fib_info from the nexthop's fi_list - nh_comp - use nexthop_cmp when either fib_info references a nexthop object - fib_info_hashfn - use the nexthop id for the hashing vs the oif of each fib_nh in a fib_info - fib_nlmsg_size - add space for the RTA_NH_ID attribute - fib_create_info - verify nexthop reference can be taken, verify nexthop spec is valid for fib entry, and add fib_info to fi_list for a nexthop - fib_select_multipath - use the new nexthop_path_fib_result to select a path when nexthop objects are used - fib_table_lookup - if the 'struct nexthop' is a blackhole nexthop, treat it the same as a fib entry using 'blackhole' The bulk of the changes are in fib_semantics.c and most of that is moving the existing change_nexthops into an else branch. Update the nexthop code to walk fi_list on a nexthop deleted to remove fib entries referencing it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Prepare for fib6_nh from a nexthop objectDavid Ahern6-33/+58
Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes to use a fib6_nh based nexthop. In the end, only code not using a nexthop object in a fib_info should directly access fib_nh in a fib_info without checking the famiy and going through fib_nh_common. Those functions will be marked when it is not directly evident. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Use accessors for fib_info nexthop dataDavid Ahern7-47/+71
Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the fib_dev macro which is an alias for the first nexthop. Replacements: fi->fib_dev --> fib_info_nh(fi, 0)->fib_nh_dev fi->fib_nh --> fib_info_nh(fi, 0) fi->fib_nh[i] --> fib_info_nh(fi, i) fi->fib_nhs --> fib_info_num_path(fi) where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path returns fi->fib_nhs. Move the existing fib_info_nhc to nexthop.h and define the new ones there. A later patch adds a check if a fib_info uses a nexthop object, and defining the helpers in nexthop.h avoid circular header dependencies. After this all remaining open coded references to fi->fib_nhs and fi->fib_nh are in: - fib_create_info and helpers used to lookup an existing fib_info entry, and - the netdev event functions fib_sync_down_dev and fib_sync_up. The latter two will not be reused for nexthops, and the fib_create_info will be updated to handle a nexthop in a fib_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv6: Always allocate pcpu memory in a fib6_nhDavid Ahern1-6/+7
A recent commit had an unintended side effect with reject routes: rt6i_pcpu is expected to always be initialized for all fib6_info except the null entry. The commit mentioned below skips it for reject routes and ends up leaking references to the loopback device. For example, ip netns add foo ip -netns foo li set lo up ip -netns foo -6 ro add blackhole 2001:db8:1::1 ip netns exec foo ping6 2001:db8:1::1 ip netns del foo ends up spewing: unregister_netdevice: waiting for lo to become free. Usage count = 3 The fib_nh_common_init is not needed for reject routes (no ipv4 caching or encaps), so move the alloc_percpu_gfp after it and adjust the goto label. Fixes: f40b6ae2b612 ("ipv6: Move pcpu cached routes to fib6_nh") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net: vlan: Inherit MPLS features from parent deviceAriel Levkovich1-0/+1
During the creation of the VLAN interface net device, the various device features and offloads are being set based on the parent device's features. The code initiates the basic, vlan and encapsulation features but doesn't address the MPLS features set and they remain blank. As a result, all device offloads that have significant performance effect are disabled for MPLS traffic going via this VLAN device such as checksumming and TSO. This patch makes sure that MPLS features are also set for the VLAN device based on the parent which will allow HW offloads of checksumming and TSO to be performed on MPLS tagged packets. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net/tls: don't pass version to tls_advance_record_sn()Jakub Kicinski2-6/+5
All callers pass prot->version as the last parameter of tls_advance_record_sn(), yet tls_advance_record_sn() itself needs a pointer to prot. Pass prot from callers. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net/tls: use version from protJakub Kicinski1-2/+2
ctx->prot holds the same information as per-direction contexts. Almost all code gets TLS version from this structure, convert the last two stragglers, this way we can improve the cache utilization by moving the per-direction data into cold cache lines. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net/tls: don't re-check msg decrypted status in tls_device_decrypted()Jakub Kicinski1-4/+0
tls_device_decrypted() is only called from decrypt_skb_update(), when ctx->decrypted == false, there is no need to re-check the bit. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net/tls: don't look for decrypted frames on non-offloaded socketsJakub Kicinski1-3/+5
If the RX config of a TLS socket is SW, there is no point iterating over the fragments and checking if frame is decrypted. It will always be fully encrypted. Note that in fully encrypted case the function doesn't actually touch any offload-related state, so it's safe to call for TLS_SW, today. Soon we will introduce code which can only be called for offloaded contexts. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net/tls: remove false positive warningJakub Kicinski1-2/+4
It's possible that TCP stack will decide to retransmit a packet right when that packet's data gets acked, especially in presence of packet reordering. This means that packets may be in flight, even though tls_device code has already freed their record state. Make fill_sg_in() and in turn tls_sw_fallback() not generate a warning in that case, and quietly proceed to drop such frames. Make the exit path from tls_sw_fallback() drop monitor friendly, for users to be able to troubleshoot dropped retransmissions. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net/tls: check return values from skb_copy_bits() and skb_store_bits()Jakub Kicinski1-6/+14
In light of recent bugs, we should make a better effort of checking return values. In theory none of the functions should fail today. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net/tls: fully initialize the msg wrapper skbJakub Kicinski2-6/+27
If strparser gets cornered into starting a new message from an sk_buff which already has frags, it will allocate a new skb to become the "wrapper" around the fragments of the message. This new skb does not inherit any metadata fields. In case of TLS offload this may lead to unnecessarily re-encrypting the message, as skb->decrypted is not set for the wrapper skb. Try to be conservative and copy all fields of old skb strparser's user may reasonably need. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net: ipv4: fix rcu lockdep splat due to wrong annotationFlorian Westphal1-1/+1
syzbot triggered following splat when strict netlink validation is enabled: net/ipv4/devinet.c:1766 suspicious rcu_dereference_check() usage! This occurs because we hold RTNL mutex, but no rcu read lock. The second call site holds both, so just switch to the _rtnl variant. Reported-by: syzbot+bad6e32808a3a97b1515@syzkaller.appspotmail.com Fixes: 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04devlink: allow driver to update progress of flash updateJiri Pirko1-0/+102
Introduce a function to be called from drivers during flash. It sends notification to userspace about flash update progress. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03net: fix use-after-free in kfree_skb_listEric Dumazet3-7/+5
syzbot reported nasty use-after-free [1] Lets remove frag_list field from structs ip_fraglist_iter and ip6_fraglist_iter. This seens not needed anyway. [1] : BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947 CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x44add9 Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9 RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf Allocated by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc_node mm/slab.c:3269 [inline] kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199 alloc_skb include/linux/skbuff.h:1058 [inline] __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 kfree_skbmem net/core/skbuff.c:625 [inline] kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619 __kfree_skb net/core/skbuff.c:682 [inline] kfree_skb net/core/skbuff.c:699 [inline] kfree_skb+0xf0/0x390 net/core/skbuff.c:693 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708 __dev_xmit_skb net/core/dev.c:3551 [inline] __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888085a3cbc0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 0 bytes inside of 224-byte region [ffff888085a3cbc0, ffff888085a3cca0) The buggy address belongs to the page: page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0 raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter") Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03tcp: use this_cpu_read(*X) instead of *this_cpu_ptr(X)Eric Dumazet1-2/+2
this_cpu_read(*X) is slightly faster than *this_cpu_ptr(X) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03ipv4: icmp: use this_cpu_read() in icmp_sk()Eric Dumazet1-1/+1
this_cpu_read(*X) is faster than *this_cpu_ptr(X) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03ipv6: use this_cpu_read() in rt6_get_pcpu_route()Eric Dumazet1-3/+2
this_cpu_read(*X) is faster than *this_cpu_ptr(X) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03ipv6: icmp: use this_cpu_read() in icmpv6_sk()Eric Dumazet1-2/+2
In general, this_cpu_read(*X) is faster than *this_cpu_ptr(X) Also remove the inline attibute, totally useless. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03flow_dissector: remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flagStanislav Fomichev1-9/+1
This flag is not used by any caller, remove it. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: ethernet: improve eth_platform_get_mac_addressHeiner Kallweit1-10/+4
pci_device_to_OF_node(to_pci_dev(dev)) is the same as dev->of_node, so we can simplify the code. In addition add an empty line before the return statement. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: ipv4: provide __rcu annotation for ifa_listFlorian Westphal5-43/+79
ifa_list is protected by rcu, yet code doesn't reflect this. Add the __rcu annotations and fix up all places that are now reported by sparse. I've done this in the same commit to not add intermediate patches that result in new warnings. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: use new in_dev_ifa iteratorsFlorian Westphal5-17/+29
Use in_dev_for_each_ifa_rcu/rtnl instead. This prevents sparse warnings once proper __rcu annotations are added. Signed-off-by: Florian Westphal <fw@strlen.de> t di# Last commands done (6 commands done): Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02netfilter: use in_dev_for_each_ifa_rcuFlorian Westphal3-7/+16
Netfilter hooks are always running under rcu read lock, use the new iterator macro so sparse won't complain once we add proper __rcu annotations. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02devinet: use in_dev_for_each_ifa_rcu in more placesFlorian Westphal1-8/+19
This also replaces spots that used for_primary_ifa(). for_primary_ifa() aborts the loop on the first secondary address seen. Replace it with either the rcu or rtnl variant of in_dev_for_each_ifa(), but two places will now also consider secondary addresses too: inet_addr_onlink() and inet_ifa_byprefix(). I do not understand why they should ignore secondary addresses. Why would a secondary address not be considered 'on link'? When matching a prefix, why ignore a matching secondary address? Other places get converted as well, but gain "->flags & SECONDARY" check. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: inetdevice: provide replacement iterators for in_ifaddr walkFlorian Westphal1-15/+16
The ifa_list is protected either by rcu or rtnl lock, but the current iterators do not account for this. This adds two iterators as replacement, a later patch in the series will update them with the needed rcu/rtnl_dereference calls. Its not done in this patch yet to avoid sparse warnings -- the fields lack the proper __rcu annotation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller32-139/+370
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset container Netfilter/IPVS update for net-next: 1) Add UDP tunnel support for ICMP errors in IPVS. Julian Anastasov says: This patchset is a followup to the commit that adds UDP/GUE tunnel: "ipvs: allow tunneling with gue encapsulation". What we do is to put tunnel real servers in hash table (patch 1), add function to lookup tunnels (patch 2) and use it to strip the embedded tunnel headers from ICMP errors (patch 3). 2) Extend xt_owner to match for supplementary groups, from Lukasz Pawelczyk. 3) Remove unused oif field in flow_offload_tuple object, from Taehee Yoo. 4) Release basechain counters from workqueue to skip synchronize_rcu() call. From Florian Westphal. 5) Replace skb_make_writable() by skb_ensure_writable(). Patchset from Florian Westphal. 6) Checksum support for gue encapsulation in IPVS, from Jacky Hu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller4-29/+52
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-05-31 The following pull-request contains BPF updates for your *net-next* tree. Lots of exciting new features in the first PR of this developement cycle! The main changes are: 1) misc verifier improvements, from Alexei. 2) bpftool can now convert btf to valid C, from Andrii. 3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs. This feature greatly improves BPF speed on 32-bit architectures. From Jiong. 4) cgroups will now auto-detach bpf programs. This fixes issue of thousands bpf programs got stuck in dying cgroups. From Roman. 5) new bpf_send_signal() helper, from Yonghong. 6) cgroup inet skb programs can signal CN to the stack, from Lawrence. 7) miscellaneous cleanups, from many developers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31bpf: move memory size checks to bpf_map_charge_init()Roman Gushchin2-10/+2
Most bpf map types doing similar checks and bytes to pages conversion during memory allocation and charging. Let's unify these checks by moving them into bpf_map_charge_init(). Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: rework memlock-based memory accounting for mapsRoman Gushchin2-5/+8
In order to unify the existing memlock charging code with the memcg-based memory accounting, which will be added later, let's rework the current scheme. Currently the following design is used: 1) .alloc() callback optionally checks if the allocation will likely succeed using bpf_map_precharge_memlock() 2) .alloc() performs actual allocations 3) .alloc() callback calculates map cost and sets map.memory.pages 4) map_create() calls bpf_map_init_memlock() which sets map.memory.user and performs actual charging; in case of failure the map is destroyed <map is in use> 1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which performs uncharge and releases the user 2) .map_free() callback releases the memory The scheme can be simplified and made more robust: 1) .alloc() calculates map cost and calls bpf_map_charge_init() 2) bpf_map_charge_init() sets map.memory.user and performs actual charge 3) .alloc() performs actual allocations <map is in use> 1) .map_free() callback releases the memory 2) bpf_map_charge_finish() performs uncharge and releases the user The new scheme also allows to reuse bpf_map_charge_init()/finish() functions for memcg-based accounting. Because charges are performed before actual allocations and uncharges after freeing the memory, no bogus memory pressure can be created. In cases when the map structure is not available (e.g. it's not created yet, or is already destroyed), on-stack bpf_map_memory structure is used. The charge can be transferred with the bpf_map_charge_move() function. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: group memory related fields in struct bpf_map_memoryRoman Gushchin2-3/+3
Group "user" and "pages" fields of bpf_map into the bpf_map_memory structure. Later it can be extended with "memcg" and other related information. The main reason for a such change (beside cosmetics) is to pass bpf_map_memory structure to charging functions before the actual allocation of bpf_map. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: add memlock precharge for socket local storageRoman Gushchin1-2/+10
Socket local storage maps lack the memlock precharge check, which is performed before the memory allocation for most other bpf map types. Let's add it in order to unify all map types. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS callsbrakmo2-20/+40
Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning congestion notifications from the BPF programs. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31nexthop: remove redundant assignment to errColin Ian King1-1/+1
The variable err is initialized with a value that is never read and err is reassigned a few statements later. This initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller95-924/+313
The phylink conflict was between a bug fix by Russell King to make sure we have a consistent PHY interface mode, and a change in net-next to pull some code in phylink_resolve() into the helper functions phylink_mac_link_{up,down}() On the dp83867 side it's mostly overlapping changes, with the 'net' side removing a condition that was supposed to trigger for RGMII but because of how it was coded never actually could trigger. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31netfilter: nf_conntrack_bridge: fix CONFIG_IPV6=yPablo Neira Ayuso1-1/+1
This patch fixes a few problems with CONFIG_IPV6=y and CONFIG_NF_CONNTRACK_BRIDGE=m: In file included from net/netfilter/utils.c:5: include/linux/netfilter_ipv6.h: In function 'nf_ipv6_br_defrag': include/linux/netfilter_ipv6.h:110:9: error: implicit declaration of function 'nf_ct_frag6_gather'; did you mean 'nf_ct_attach'? [-Werror=implicit-function-declaration] And these too: net/ipv6/netfilter.c:242:2: error: unknown field 'br_defrag' specified in initializer net/ipv6/netfilter.c:243:2: error: unknown field 'br_fragment' specified in initializer This patch includes an original chunk from wenxu. Fixes: 764dd163ac92 ("netfilter: nf_conntrack_bridge: add support for IPv6") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Yuehaibing <yuehaibing@huawei.com> Reported-by: kbuild test robot <lkp@intel.com> Reported-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31ipvs: add checksum support for gue encapsulationJacky Hu2-17/+137
Add checksum support for gue encapsulation with the tun_flags parameter, which could be one of the values below: IP_VS_TUNNEL_ENCAP_FLAG_NOCSUM IP_VS_TUNNEL_ENCAP_FLAG_CSUM IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM Signed-off-by: Jacky Hu <hengqing.hu@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: replace skb_make_writable with skb_ensure_writableFlorian Westphal4-28/+6
This converts all remaining users and then removes skb_make_writable. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: tcpmss, optstrip: prefer skb_ensure_writableFlorian Westphal2-16/+14
This also changes optstrip to only make the tcp header writeable rather than the entire packet. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: xt_HL: prefer skb_ensure_writableFlorian Westphal1-2/+2
Also, make the argument to be only the needed size of the header we're altering, no need to pull in the full packet into linear area. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: nf_tables: prefer skb_ensure_writableFlorian Westphal2-4/+5
.. so skb_make_writable can be removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: ipv4: prefer skb_ensure_writableFlorian Westphal5-6/+6
.. so skb_make_writable can be removed soon. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: conntrack, nat: prefer skb_ensure_writableFlorian Westphal4-17/+17
like previous patches -- convert conntrack to use the core helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: ipvs: prefer skb_ensure_writableFlorian Westphal7-18/+18
It does the same thing, use it instead so we can remove skb_make_writable. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: bridge: convert skb_make_writable to skb_ensure_writableFlorian Westphal3-3/+3
Back in the day, skb_ensure_writable did not exist. By now, both functions have the same precondition: I. skb_make_writable will test in this order: 1. wlen > skb->len -> error 2. if not cloned and wlen <= headlen -> OK 3. If cloned and wlen bytes of clone writeable -> OK After those checks, skb is either not cloned but needs to pull from nonlinear area, or writing to head would also alter data of another clone. In both cases skb_make_writable will then call __pskb_pull_tail, which will kmalloc a new memory area to use for skb->head. IOW, after successful skb_make_writable call, the requested length is in linear area and can be modified, even if skb was cloned. II. skb_ensure_writable will do this instead: 1. call pskb_may_pull. This handles case 1 above. After this, wlen is in linear area, but skb might be cloned. 2. return if skb is not cloned 3. return if wlen byte of clone are writeable. 4. fully copy the skb. So post-conditions are the same: *len bytes are writeable in linear area without altering any payload data of a clone, all header pointers might have been changed. Only differences are that skb_ensure_writable is in the core, whereas skb_make_writable lives in netfilter core and the inverted return value. skb_make_writable returns 0 on error, whereas skb_ensure_writable returns negative value. For the normal cases performance is similar: A. skb is not cloned and in linear area: pskb_may_pull is inline helper, so neither function copies. B. skb is cloned, write is in linear area and clone is writeable: both funcions return with step 3. This series removes skb_make_writable from the kernel. While at it, pass the needed value instead, its less confusing that way: There is no special-handling of "0-length" argument in either skb_make_writable or skb_ensure_writable. bridge already makes sure ethernet header is in linear area, only purpose of the make_writable() is is to copy skb->head in case of cloned skbs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: nf_tables: free base chain counters from workerFlorian Westphal1-16/+10
No need to use synchronize_rcu() here, just swap the two pointers and have the release occur from work queue after commit has completed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: nf_flow_table: remove unnecessary variable in flow_offload_tupleTaehee Yoo1-1/+0
The oifidx in the struct flow_offload_tuple is not used anymore. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31netfilter: xt_owner: Add supplementary groups optionLukasz Pawelczyk1-3/+20
The XT_OWNER_SUPPL_GROUPS flag causes GIDs specified with XT_OWNER_GID to be also checked in the supplementary groups of a process. f_cred->group_info cannot be modified during its lifetime and f_cred holds a reference to it so it's safe to use. Signed-off-by: Lukasz Pawelczyk <l.pawelczyk@samsung.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31ipvs: strip udp tunnel headers from icmp errorsJulian Anastasov1-0/+60
Recognize UDP tunnels in received ICMP errors and properly strip the tunnel headers. GUE is what we have for now. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31ipvs: add function to find tunnelsJulian Anastasov2-0/+37
Add ip_vs_find_tunnel() to match tunnel headers by family, address and optional port. Use it to properly find the tunnel real server used in received ICMP errors. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>