aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv4 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-02-11vfs: do bulk POLL* -> EPOLL* replacementLinus Torvalds4-22/+22
This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-5/+54
Daniel Borkmann says: ==================== pull-request: bpf 2018-02-09 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Two fixes for BPF sockmap in order to break up circular map references from programs attached to sockmap, and detaching related sockets in case of socket close() event. For the latter we get rid of the smap_state_change() and plug into ULP infrastructure, which will later also be used for additional features anyway such as TX hooks. For the second issue, dependency chain is broken up via map release callback to free parse/verdict programs, all from John. 2) Fix a libbpf relocation issue that was found while implementing XDP support for Suricata project. Issue was that when clang was invoked with default target instead of bpf target, then various other e.g. debugging relevant sections are added to the ELF file that contained relocation entries pointing to non-BPF related sections which libbpf trips over instead of skipping them. Test cases for libbpf are added as well, from Jesper. 3) Various misc fixes for bpftool and one for libbpf: a small addition to libbpf to make sure it recognizes all standard section prefixes. Then, the Makefile in bpftool/Documentation is improved to explicitly check for rst2man being installed on the system as we otherwise risk installing empty man pages; the man page for bpftool-map is corrected and a set of missing bash completions added in order to avoid shipping bpftool where the completions are only partially working, from Quentin. 4) Fix applying the relocation to immediate load instructions in the nfp JIT which were missing a shift, from Jakub. 5) Two fixes for the BPF kernel selftests: handle CONFIG_BPF_JIT_ALWAYS_ON=y gracefully in test_bpf.ko module and mark them as FLAG_EXPECTED_FAIL in this case; and explicitly delete the veth devices in the two tests test_xdp_{meta,redirect}.sh before dismantling the netnses as when selftests are run in batch mode, then workqueue to handle destruction might not have finished yet and thus veth creation in next test under same dev name would fail, from Yonghong. 6) Fix test_kmod.sh to check the test_bpf.ko module path before performing an insmod, and fallback to modprobe. Especially the latter is useful when having a device under test that has the modules installed instead, from Naresh. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07tcp: tracepoint: only call trace_tcp_send_reset with full socketSong Liu1-1/+2
tracepoint tcp_send_reset requires a full socket to work. However, it may be called when in TCP_TIME_WAIT: case TCP_TW_RST: tcp_v6_send_reset(sk, skb); inet_twsk_deschedule_put(inet_twsk(sk)); goto discard_it; To avoid this problem, this patch checks the socket with sk_fullsock() before calling trace_tcp_send_reset(). Fixes: c24b14c46bb8 ("tcp: add tracepoint trace_tcp_send_reset") Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2-2/+2
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for you net tree, they are: 1) Restore __GFP_NORETRY in xt_table allocations to mitigate effects of large memory allocation requests, from Michal Hocko. 2) Release IPv6 fragment queue in case of error in fragmentation header, this is a follow up to amend patch 83f1999caeb1, from Subash Abhinov Kasiviswanathan. 3) Flowtable infrastructure depends on NETFILTER_INGRESS as it registers a hook for each flowtable, reported by John Crispin. 4) Missing initialization of info->priv in xt_cgroup version 1, from Cong Wang. 5) Give a chance to garbage collector to run after scheduling flowtable cleanup. 6) Releasing flowtable content on nft_flow_offload module removal is not required at all, there is not dependencies between this module and flowtables, remove it. 7) Fix missing xt_rateest_mutex grabbing for hash insertions, also from Cong Wang. 8) Move nf_flow_table_cleanup() routine to flowtable core, this patch is a dependency for the next patch in this list. 9) Flowtable resources are not properly released on removal from the control plane. Fix this resource leak by scheduling removal of all entries and explicit call to the garbage collector. 10) nf_ct_nat_offset() declaration is dead code, this function prototype is not used anywhere, remove it. From Taehee Yoo. 11) Fix another flowtable resource leak on entry insertion failures, this patch also fixes a possible use-after-free. Patch from Felix Fietkau. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07netfilter: nf_tables: fix flowtable freePablo Neira Ayuso1-0/+1
Every flow_offload entry is added into the table twice. Because of this, rhashtable_free_and_destroy can't be used, since it would call kfree for each flow_offload object twice. This patch cleans up the flowtable via nf_flow_table_iterate() to schedule removal of entries by setting on the dying bit, then there is an explicitly invocation of the garbage collector to release resources. Based on patch from Felix Fietkau. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-06net: erspan: fix erspan config overwriteWilliam Tu1-9/+0
When an erspan tunnel device receives an erpsan packet with different tunnel metadata (ex: version, index, hwid, direction), existing code overwrites the tunnel device's erspan configuration with the received packet's metadata. The patch fixes it. Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel") Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre") Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode") Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support") Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-06net: erspan: fix metadata extractionWilliam Tu1-1/+4
Commit d350a823020e ("net: erspan: create erspan metadata uapi header") moves the erspan 'version' in front of the 'struct erspan_md2' for later extensibility reason. This breaks the existing erspan metadata extraction code because the erspan_md2 then has a 4-byte offset to between the erspan_metadata and erspan_base_hdr. This patch fixes it. Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel") Fixes: ef7baf5e083c ("ip6_gre: add ip6 erspan collect_md mode") Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code") Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-06net: add a UID to use for ULP socket assignmentJohn Fastabend1-5/+54
Create a UID field and enum that can be used to assign ULPs to sockets. This saves a set of string comparisons if the ULP id is known. For sockmap, which is added in the next patches, a ULP is used to hook into TCP sockets close state. In this case the ULP being added is done at map insert time and the ULP is known and done on the kernel side. In this case the named lookup is not needed. Because we don't want to expose psock internals to user space socket options a user visible flag is also added. For TLS this is set for BPF it will be cleared. Alos remove pr_notice, user gets an error code back and should check that rather than rely on logs. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-03Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds1-0/+2
Pull hardened usercopy whitelisting from Kees Cook: "Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass all hardened usercopy checks since these sizes cannot change at runtime.) This new check is WARN-by-default, so any mistakes can be found over the next several releases without breaking anyone's system. The series has roughly the following sections: - remove %p and improve reporting with offset - prepare infrastructure and whitelist kmalloc - update VFS subsystem with whitelists - update SCSI subsystem with whitelists - update network subsystem with whitelists - update process memory with whitelists - update per-architecture thread_struct with whitelists - update KVM with whitelists and fix ioctl bug - mark all other allocations as not whitelisted - update lkdtm for more sensible test overage" * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits) lkdtm: Update usercopy tests for whitelisting usercopy: Restrict non-usercopy caches to size 0 kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl kvm: whitelist struct kvm_vcpu_arch arm: Implement thread_struct whitelist for hardened usercopy arm64: Implement thread_struct whitelist for hardened usercopy x86: Implement thread_struct whitelist for hardened usercopy fork: Provide usercopy whitelisting for task_struct fork: Define usercopy region in thread_stack slab caches fork: Define usercopy region in mm_struct slab caches net: Restrict unwhitelisted proto caches to size 0 sctp: Copy struct sctp_sock.autoclose to userspace using put_user() sctp: Define usercopy region in SCTP proto slab cache caif: Define usercopy region in caif proto slab cache ip: Define usercopy region in IP proto slab cache net: Define usercopy region in struct proto slab cache scsi: Define usercopy region in scsi_sense_cache slab cache cifs: Define usercopy region in cifs_request slab cache vxfs: Define usercopy region in vxfs_inode slab cache ufs: Define usercopy region in ufs_inode_cache slab cache ...
2018-02-02Revert "defer call to mem_cgroup_sk_alloc()"Roman Gushchin1-1/+0
This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()"). Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks memcg socket memory accounting, as packets received before memcg pointer initialization are not accounted and are causing refcounting underflow on socket release. Actually the free-after-use problem was fixed by commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in sk_clone_lock()") for the cgroup pointer. So, let's revert it and call mem_cgroup_sk_alloc() just before cgroup_sk_alloc(). This is safe, as we hold a reference to the socket we're cloning, and it holds a reference to the memcg. Also, let's drop BUG_ON(mem_cgroup_is_root()) check from mem_cgroup_sk_alloc(). I see no reasons why bumping the root memcg counter is a good reason to panic, and there are no realistic ways to hit it. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-02netfilter: flowtable infrastructure depends on NETFILTER_INGRESSPablo Neira Ayuso1-2/+1
config NF_FLOW_TABLE depends on NETFILTER_INGRESS. If users forget to enable this toggle, flowtable registration fails with EOPNOTSUPP. Moreover, turn 'select NF_FLOW_TABLE' in every flowtable family flavour into dependency instead, otherwise this new dependency on NETFILTER_INGRESS causes a warning. This also allows us to remove the explicit dependency between family flowtables <-> NF_TABLES and NF_CONNTRACK, given they depend on the NF_FLOW_TABLE core that already expresses the general dependencies for this new infrastructure. Moreover, NF_FLOW_TABLE_INET depends on NF_FLOW_TABLE_IPV4 and NF_FLOWTABLE_IPV6, which already depends on NF_FLOW_TABLE. So we can get rid of direct dependency with NF_FLOW_TABLE. In general, let's avoid 'select', it just makes things more complicated. Reported-by: John Crispin <john@phrozen.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-01net: igmp: add a missing rcu locking sectionEric Dumazet1-0/+4
Newly added igmpv3_get_srcaddr() needs to be called under rcu lock. Timer callbacks do not ensure this locking. ============================= WARNING: suspicious RCU usage 4.15.0+ #200 Not tainted ----------------------------- ./include/linux/inetdevice.h:216 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by syzkaller616973/4074: #0: (&mm->mmap_sem){++++}, at: [<00000000bfce669e>] __do_page_fault+0x32d/0xc90 arch/x86/mm/fault.c:1355 #1: ((&im->timer)){+.-.}, at: [<00000000619d2f71>] lockdep_copy_map include/linux/lockdep.h:178 [inline] #1: ((&im->timer)){+.-.}, at: [<00000000619d2f71>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1316 #2: (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] spin_lock_bh include/linux/spinlock.h:315 [inline] #2: (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] igmpv3_send_report+0x98/0x5b0 net/ipv4/igmp.c:600 stack backtrace: CPU: 0 PID: 4074 Comm: syzkaller616973 Not tainted 4.15.0+ #200 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592 __in_dev_get_rcu include/linux/inetdevice.h:216 [inline] igmpv3_get_srcaddr net/ipv4/igmp.c:329 [inline] igmpv3_newpack+0xeef/0x12e0 net/ipv4/igmp.c:389 add_grhead.isra.27+0x235/0x300 net/ipv4/igmp.c:432 add_grec+0xbd3/0x1170 net/ipv4/igmp.c:565 igmpv3_send_report+0xd5/0x5b0 net/ipv4/igmp.c:605 igmp_send_report+0xc43/0x1050 net/ipv4/igmp.c:722 igmp_timer_expire+0x322/0x5c0 net/ipv4/igmp.c:831 call_timer_fn+0x228/0x820 kernel/time/timer.c:1326 expire_timers kernel/time/timer.c:1363 [inline] __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285 invoke_softirq kernel/softirq.c:365 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:541 [inline] smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:938 Fixes: a46182b00290 ("net: igmp: Use correct source address on IGMPv3 reports") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller3-14/+22
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Fix OOM that syskaller triggers with ipt_replace.size = -1 and IPT_SO_SET_REPLACE socket option, from Dmitry Vyukov. 2) Check for too long extension name in xt_request_find_{match|target} that result in out-of-bound reads, from Eric Dumazet. 3) Fix memory exhaustion bug in ipset hash:*net* types when adding ranges that look like x.x.x.x-255.255.255.255, from Jozsef Kadlecsik. 4) Fix pointer leaks to userspace in x_tables, from Dmitry Vyukov. 5) Insufficient sanity checks in clusterip_tg_check(), also from Dmitry. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01inet: Avoid unitialized variable warning in inet_unhash()Geert Uytterhoeven1-4/+2
With gcc-4.1.2: net/ipv4/inet_hashtables.c: In function ‘inet_unhash’: net/ipv4/inet_hashtables.c:628: warning: ‘ilb’ may be used uninitialized in this function While this is a false positive, it can easily be avoided by using the pointer itself as the canary variable. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01tcp_bbr: fix pacing_gain to always be unity when using lt_bwNeal Cardwell1-3/+3
This commit fixes the pacing_gain to remain at BBR_UNIT (1.0) when using lt_bw and returning from the PROBE_RTT state to PROBE_BW. Previously, when using lt_bw, upon exiting PROBE_RTT and entering PROBE_BW the bbr_reset_probe_bw_mode() code could sometimes randomly end up with a cycle_idx of 0 and hence have bbr_advance_cycle_phase() set a pacing gain above 1.0. In such cases this would result in a pacing rate that is 1.25x higher than intended, potentially resulting in a high loss rate for a little while until we stop using the lt_bw a bit later. This commit is a stable candidate for kernels back as far as 4.9. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: Beyers Cronje <bcronje@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds59-2262/+1427
Pull networking updates from David Miller: 1) Significantly shrink the core networking routing structures. Result of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf 2) Add netdevsim driver for testing various offloads, from Jakub Kicinski. 3) Support cross-chip FDB operations in DSA, from Vivien Didelot. 4) Add a 2nd listener hash table for TCP, similar to what was done for UDP. From Martin KaFai Lau. 5) Add eBPF based queue selection to tun, from Jason Wang. 6) Lockless qdisc support, from John Fastabend. 7) SCTP stream interleave support, from Xin Long. 8) Smoother TCP receive autotuning, from Eric Dumazet. 9) Lots of erspan tunneling enhancements, from William Tu. 10) Add true function call support to BPF, from Alexei Starovoitov. 11) Add explicit support for GRO HW offloading, from Michael Chan. 12) Support extack generation in more netlink subsystems. From Alexander Aring, Quentin Monnet, and Jakub Kicinski. 13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From Russell King. 14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso. 15) Many improvements and simplifications to the NFP driver bpf JIT, from Jakub Kicinski. 16) Support for ipv6 non-equal cost multipath routing, from Ido Schimmel. 17) Add resource abstration to devlink, from Arkadi Sharshevsky. 18) Packet scheduler classifier shared filter block support, from Jiri Pirko. 19) Avoid locking in act_csum, from Davide Caratti. 20) devinet_ioctl() simplifications from Al viro. 21) More TCP bpf improvements from Lawrence Brakmo. 22) Add support for onlink ipv6 route flag, similar to ipv4, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits) tls: Add support for encryption using async offload accelerator ip6mr: fix stale iterator net/sched: kconfig: Remove blank help texts openvswitch: meter: Use 64-bit arithmetic instead of 32-bit tcp_nv: fix potential integer overflow in tcpnv_acked r8169: fix RTL8168EP take too long to complete driver initialization. qmi_wwan: Add support for Quectel EP06 rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK ipmr: Fix ptrdiff_t print formatting ibmvnic: Wait for device response when changing MAC qlcnic: fix deadlock bug tcp: release sk_frag.page in tcp_disconnect ipv4: Get the address of interface correctly. net_sched: gen_estimator: fix lockdep splat net: macb: Handle HRESP error net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring ipv6: addrconf: break critical section in addrconf_verify_rtnl() ipv6: change route cache aging logic i40e/i40evf: Update DESC_NEEDED value to reflect larger value bnxt_en: cleanup DIM work on device shutdown ...
2018-01-31netfilter: on sockopt() acquire sock lock only in the required scopePaolo Abeni2-11/+9
Syzbot reported several deadlocks in the netfilter area caused by rtnl lock and socket lock being acquired with a different order on different code paths, leading to backtraces like the following one: ====================================================== WARNING: possible circular locking dependency detected 4.15.0-rc9+ #212 Not tainted ------------------------------------------------------ syzkaller041579/3682 is trying to acquire lock: (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] lock_sock include/net/sock.h:1463 [inline] (sk_lock-AF_INET6){+.+.}, at: [<000000008775e4dd>] do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167 but task is already holding lock: (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (rtnl_mutex){+.+.}: __mutex_lock_common kernel/locking/mutex.c:756 [inline] __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607 tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106 xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845 check_target net/ipv6/netfilter/ip6_tables.c:538 [inline] find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:580 translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749 do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline] do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928 udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978 SYSC_setsockopt net/socket.c:1849 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1828 entry_SYSCALL_64_fastpath+0x29/0xa0 -> #0 (sk_lock-AF_INET6){+.+.}: lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914 lock_sock_nested+0xc2/0x110 net/core/sock.c:2780 lock_sock include/net/sock.h:1463 [inline] do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167 ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922 udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978 SYSC_setsockopt net/socket.c:1849 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1828 entry_SYSCALL_64_fastpath+0x29/0xa0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(rtnl_mutex); lock(sk_lock-AF_INET6); lock(rtnl_mutex); lock(sk_lock-AF_INET6); *** DEADLOCK *** 1 lock held by syzkaller041579/3682: #0: (rtnl_mutex){+.+.}, at: [<000000004342eaa9>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 The problem, as Florian noted, is that nf_setsockopt() is always called with the socket held, even if the lock itself is required only for very tight scopes and only for some operation. This patch addresses the issues moving the lock_sock() call only where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt() does not need anymore to acquire both locks. Fixes: 22265a5c3c10 ("netfilter: xt_TEE: resolve oif using netdevice notifiers") Reported-by: syzbot+a4c2dc980ac1af699b36@syzkaller.appspotmail.com Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-31tcp_nv: fix potential integer overflow in tcpnv_ackedGustavo A. R. Silva1-1/+1
Add suffix ULL to constant 80000 in order to avoid a potential integer overflow and give the compiler complete information about the proper arithmetic to use. Notice that this constant is used in a context that expects an expression of type u64. The current cast to u64 effectively applies to the whole expression as an argument of type u64 to be passed to div64_u64, but it does not prevent it from being evaluated using 32-bit arithmetic instead of 64-bit arithmetic. Also, once the expression is properly evaluated using 64-bit arithmentic, there is no need for the parentheses and the external cast to u64. Addresses-Coverity-ID: 1357588 ("Unintentional integer overflow") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-31netfilter: ipt_CLUSTERIP: fix out-of-bounds accesses in clusterip_tg_check()Dmitry Vyukov1-3/+13
Commit 136e92bbec0a switched local_nodes from an array to a bitmask but did not add proper bounds checks. As the result clusterip_config_init_nodelist() can both over-read ipt_clusterip_tgt_info.local_nodes and over-write clusterip_config.local_nodes. Add bounds checks for both. Fixes: 136e92bbec0a ("[NETFILTER] CLUSTERIP: use a bitmap to store node responsibility data") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-30Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-4/+4
Pull poll annotations from Al Viro: "This introduces a __bitwise type for POLL### bitmap, and propagates the annotations through the tree. Most of that stuff is as simple as 'make ->poll() instances return __poll_t and do the same to local variables used to hold the future return value'. Some of the obvious brainos found in process are fixed (e.g. POLLIN misspelled as POLL_IN). At that point the amount of sparse warnings is low and most of them are for genuine bugs - e.g. ->poll() instance deciding to return -EINVAL instead of a bitmap. I hadn't touched those in this series - it's large enough as it is. Another problem it has caught was eventpoll() ABI mess; select.c and eventpoll.c assumed that corresponding POLL### and EPOLL### were equal. That's true for some, but not all of them - EPOLL### are arch-independent, but POLL### are not. The last commit in this series separates userland POLL### values from the (now arch-independent) kernel-side ones, converting between them in the few places where they are copied to/from userland. AFAICS, this is the least disruptive fix preserving poll(2) ABI and making epoll() work on all architectures. As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and it will trigger only on what would've triggered EPOLLWRBAND on other architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered at all on sparc. With this patch they should work consistently on all architectures" * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) make kernel-side POLL... arch-independent eventpoll: no need to mask the result of epi_item_poll() again eventpoll: constify struct epoll_event pointers debugging printk in sg_poll() uses %x to print POLL... bitmap annotate poll(2) guts 9p: untangle ->poll() mess ->si_band gets POLL... bitmap stored into a user-visible long field ring_buffer_poll_wait() return value used as return value of ->poll() the rest of drivers/*: annotate ->poll() instances media: annotate ->poll() instances fs: annotate ->poll() instances ipc, kernel, mm: annotate ->poll() instances net: annotate ->poll() instances apparmor: annotate ->poll() instances tomoyo: annotate ->poll() instances sound: annotate ->poll() instances acpi: annotate ->poll() instances crypto: annotate ->poll() instances block: annotate ->poll() instances x86: annotate ->poll() instances ...
2018-01-30Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-12/+2
Pull RCU updates from Ingo Molnar: "The main RCU changes in this cycle were: - Updates to use cond_resched() instead of cond_resched_rcu_qs() where feasible (currently everywhere except in kernel/rcu and in kernel/torture.c). Also a couple of fixes to avoid sending IPIs to offline CPUs. - Updates to simplify RCU's dyntick-idle handling. - Updates to remove almost all uses of smp_read_barrier_depends() and read_barrier_depends(). - Torture-test updates. - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) torture: Save a line in stutter_wait(): while -> for torture: Eliminate torture_runnable and perf_runnable torture: Make stutter less vulnerable to compilers and races locking/locktorture: Fix num reader/writer corner cases locking/locktorture: Fix rwsem reader_delay torture: Place all torture-test modules in one MAINTAINERS group rcutorture/kvm-build.sh: Skip build directory check rcutorture: Simplify functions.sh include path rcutorture: Simplify logging rcutorture/kvm-recheck-*: Improve result directory readability check rcutorture/kvm.sh: Support execution from any directory rcutorture/kvm.sh: Use consistent help text for --qemu-args rcutorture/kvm.sh: Remove unused variable, `alldone` rcutorture: Remove unused script, config2frag.sh rcutorture/configinit: Fix build directory error message rcutorture: Preempt RCU-preempt readers more vigorously torture: Reduce #ifdefs for preempt_schedule() rcu: Remove have_rcu_nocb_mask from tree_plugin.h rcu: Add comment giving debug strategy for double call_rcu() tracing, rcu: Hide trace event rcu_nocb_wake when not used ...
2018-01-30ipmr: Fix ptrdiff_t print formattingJames Hogan1-1/+1
ipmr_vif_seq_show() prints the difference between two pointers with the format string %2zd (z for size_t), however the correct format string is %2td instead (t for ptrdiff_t). The same bug in ip6mr_vif_seq_show() was already fixed long ago by commit d430a227d272 ("bogus format in ip6mr"). Signed-off-by: James Hogan <jhogan@kernel.org> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "David S. Miller" <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29tcp: release sk_frag.page in tcp_disconnectLi RongQing1-0/+6
socket can be disconnected and gets transformed back to a listening socket, if sk_frag.page is not released, which will be cloned into a new socket by sk_clone_lock, but the reference count of this page is increased, lead to a use after free or double free issue Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29ipv4: Get the address of interface correctly.Tonghao Zhang1-0/+4
When using ioctl to get address of interface, we can't get it anymore. For example, the command is show as below. # ifconfig eth0 In the patch ("03aef17bb79b3"), the devinet_ioctl does not return a suitable value, even though we can find it in the kernel. Then fix it now. Fixes: 03aef17bb79b3 ("devinet_ioctl(): take copyin/copyout to caller") Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-3/+20
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller4-3/+38
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-01-26 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) A number of extensions to tcp-bpf, from Lawrence. - direct R or R/W access to many tcp_sock fields via bpf_sock_ops - passing up to 3 arguments to bpf_sock_ops functions - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks - optionally calling bpf_sock_ops program when RTO fires - optionally calling bpf_sock_ops program when packet is retransmitted - optionally calling bpf_sock_ops program when TCP state changes - access to tclass and sk_txhash - new selftest 2) div/mod exception handling, from Daniel. One of the ugly leftovers from the early eBPF days is that div/mod operations based on registers have a hard-coded src_reg == 0 test in the interpreter as well as in JIT code generators that would return from the BPF program with exit code 0. This was basically adopted from cBPF interpreter for historical reasons. There are multiple reasons why this is very suboptimal and prone to bugs. To name one: the return code mapping for such abnormal program exit of 0 does not always match with a suitable program type's exit code mapping. For example, '0' in tc means action 'ok' where the packet gets passed further up the stack, which is just undesirable for such cases (e.g. when implementing policy) and also does not match with other program types. After considering _four_ different ways to address the problem, we adapt the same behavior as on some major archs like ARMv8: X div 0 results in 0, and X mod 0 results in X. aarch64 and aarch32 ISA do not generate any traps or otherwise aborts of program execution for unsigned divides. Given the options, it seems the most suitable from all of them, also since major archs have similar schemes in place. Given this is all in the realm of undefined behavior, we still have the option to adapt if deemed necessary. 3) sockmap sample refactoring, from John. 4) lpm map get_next_key fixes, from Yonghong. 5) test cleanups, from Alexei and Prashant. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25net/ipv4: Allow send to local broadcast from a socket bound to a VRFDavid Ahern3-3/+33
Message sends to the local broadcast address (255.255.255.255) require uc_index or sk_bound_dev_if to be set to an egress device. However, responses or only received if the socket is bound to the device. This is overly constraining for processes running in an L3 domain. This patch allows a socket bound to the VRF device to send to the local broadcast address by using IP_UNICAST_IF to set the egress interface with packet receipt handled by the VRF binding. Similar to IP_MULTICAST_IF, relax the constraint on setting IP_UNICAST_IF if a socket is bound to an L3 master device. In this case allow uc_index to be set to an enslaved if sk_bound_dev_if is an L3 master device and is the master device for the ifindex. In udp and raw sendmsg, allow uc_index to override the oif if uc_index master device is oif (ie., the oif is an L3 master and the index is an L3 slave). Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25net: erspan: use bitfield instead of mask and offsetWilliam Tu1-24/+14
Originally the erspan fields are defined as a group into a __be16 field, and use mask and offset to access each field. This is more costly due to calling ntohs/htons. The patch changes it to use bitfields. Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25bpf: Add BPF_SOCK_OPS_STATE_CBLawrence Brakmo1-0/+24
Adds support for calling sock_ops BPF program when there is a TCP state change. Two arguments are used; one for the old state and another for the new state. There is a new enum in include/uapi/linux/bpf.h that exports the TCP states that prepends BPF_ to the current TCP state names. If it is ever necessary to change the internal TCP state values (other than adding more to the end), then it will become necessary to convert from the internal TCP state value to the BPF value before calling the BPF sock_ops function. There are a set of compile checks added in tcp.c to detect if the internal and BPF values differ so we can make the necessary fixes. New op: BPF_SOCK_OPS_STATE_CB. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25bpf: Add BPF_SOCK_OPS_RETRANS_CBLawrence Brakmo1-0/+4
Adds support for calling sock_ops BPF program when there is a retransmission. Three arguments are used; one for the sequence number, another for the number of segments retransmitted, and the last one for the return value of tcp_transmit_skb (0 => success). Does not include syn-ack retransmissions. New op: BPF_SOCK_OPS_RETRANS_CB. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25bpf: Add sock_ops RTO callbackLawrence Brakmo1-0/+7
Adds an optional call to sock_ops BPF program based on whether the BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags. The BPF program is passed 2 arguments: icsk_retransmits and whether the RTO has expired. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25bpf: Support passing args to sock_ops bpf functionLawrence Brakmo3-3/+3
Adds support for passing up to 4 arguments to sock_ops bpf functions. It reusues the reply union, so the bpf_sock_ops structures are not increased in size. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-25net: don't call update_pmtu unconditionallyNicolas Dichtel2-3/+2
Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to: "BUG: unable to handle kernel NULL pointer dereference at (null)" Let's add a helper to check if update_pmtu is available before calling it. Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path") Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") CC: Roman Kapl <code@rkapl.cz> CC: Xin Long <lucien.xin@gmail.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-25net: tcp: close sock if net namespace is exitingDan Streetman2-0/+18
When a tcp socket is closed, if it detects that its net namespace is exiting, close immediately and do not wait for FIN sequence. For normal sockets, a reference is taken to their net namespace, so it will never exit while the socket is open. However, kernel sockets do not take a reference to their net namespace, so it may begin exiting while the kernel socket is still open. In this case if the kernel socket is a tcp socket, it will stay open trying to complete its close sequence. The sock's dst(s) hold a reference to their interface, which are all transferred to the namespace's loopback interface when the real interfaces are taken down. When the namespace tries to take down its loopback interface, it hangs waiting for all references to the loopback interface to release, which results in messages like: unregister_netdevice: waiting for lo to become free. Usage count = 1 These messages continue until the socket finally times out and closes. Since the net namespace cleanup holds the net_mutex while calling its registered pernet callbacks, any new net namespace initialization is blocked until the current net namespace finishes exiting. After this change, the tcp socket notices the exiting net namespace, and closes immediately, releasing its dst(s) and their reference to the loopback interface, which lets the net namespace continue exiting. Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811 Signed-off-by: Dan Streetman <ddstreet@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24Merge branch 'rebased-net-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsDavid S. Miller4-85/+55
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+1
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24ipconfig: use dev_set_mtu()Al Viro1-14/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24ip_rt_ioctl(): take copyin to callerAl Viro3-19/+9
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24devinet_ioctl(): take copyin/copyout to callerAl Viro3-45/+34
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24net: separate SIOCGIFCONF handling from dev_ioctl()Al Viro1-7/+9
Only two of dev_ioctl() callers may pass SIOCGIFCONF to it. Separating that codepath from the rest of dev_ioctl() allows both to simplify dev_ioctl() itself (all other cases work with struct ifreq *) *and* seriously simplify the compat side of that beast: all it takes is passing to inet_gifconf() an extra argument - the size of individual records (sizeof(struct ifreq) or sizeof(struct compat_ifreq)). With dev_ifconf() called directly from sock_do_ioctl()/compat_dev_ifconf() that's easy to arrange. As the result, compat side of SIOCGIFCONF doesn't need any allocations, copy_in_user() back and forth, etc. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24ip_tunnel: Use mark in skb by defaultThomas Winter1-3/+10
This allows marks set by connmark in iptables to be used for route lookups. Signed-off-by: Thomas Winter <thomas.winter@alliedtelesis.co.nz> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-24Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsecDavid S. Miller1-0/+1
Steffen Klassert says: ==================== pull request (net): ipsec 2018-01-24 1) Only offloads SAs after they are fully initialized. Otherwise a NIC may receive packets on a SA we can not yet handle in the stack. From Yossi Kuperman. 2) Fix negative refcount in case of a failing offload. From Aviad Yehezkel. 3) Fix inner IP ptoro version when decapsulating from interaddress family tunnels. From Yossi Kuperman. 4) Use true or false for boolean variables instead of an integer value in xfrm_get_type_offload. From Gustavo A. R. Silva. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-1/+10
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in 'net'. The esp{4,6}_offload.c conflicts were overlapping changes. The 'out' label is removed so we just return ERR_PTR(-EINVAL) directly. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP versionYossi Kuperman1-0/+1
IPSec tunnel mode supports encapsulation of IPv4 over IPv6 and vice-versa. The outer IP header is stripped and the inner IP inherits the original Ethernet header. Tcpdump fails to properly decode the inner packet in case that h_proto is different than the inner IP version. Fix h_proto to reflect the inner IP version. Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-01-22net: igmp: fix source address check for IGMPv3 reportsFelix Fietkau1-1/+1
Commit "net: igmp: Use correct source address on IGMPv3 reports" introduced a check to validate the source address of locally generated IGMPv3 packets. Instead of checking the local interface address directly, it uses inet_ifa_match(fl4->saddr, ifa), which checks if the address is on the local subnet (or equal to the point-to-point address if used). This breaks for point-to-point interfaces, so check against ifa->ifa_local directly. Cc: Kevin Cernekee <cernekee@chromium.org> Fixes: a46182b00290 ("net: igmp: Use correct source address on IGMPv3 reports") Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22gso: validate gso_type in GSO handlersWillem de Bruijn3-0/+9
Validate gso_type during segmentation as SKB_GSO_DODGY sources may pass packets where the gso_type does not match the contents. Syzkaller was able to enter the SCTP gso handler with a packet of gso_type SKB_GSO_TCPV4. On entry of transport layer gso handlers, verify that the gso_type matches the transport protocol. Fixes: 90017accff61 ("sctp: Add GSO support") Link: http://lkml.kernel.org/r/<001a1137452496ffc305617e5fe0@google.com> Reported-by: syzbot+fee64147a25aecd48055@syzkaller.appspotmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller11-1386/+456
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. Basically, a new extension for ip6tables, simplification work of nf_tables that saves us 500 LoC, allow raw table registration before defragmentation, conversion of the SNMP helper to use the ASN.1 code generator, unique 64-bit handle for all nf_tables objects and fixes to address fallout from previous nf-next batch. More specifically, they are: 1) Seven patches to remove family abstraction layer (struct nft_af_info) in nf_tables, this simplifies our codebase and it saves us 64 bytes per net namespace. 2) Add IPv6 segment routing header matching for ip6tables, from Ahmed Abdelsalam. 3) Allow to register iptable_raw table before defragmentation, some people do not want to waste cycles on defragmenting traffic that is going to be dropped, hence add a new module parameter to enable this behaviour in iptables and ip6tables. From Subash Abhinov Kasiviswanathan. This patch needed a couple of follow up patches to get things tidy from Arnd Bergmann. 4) SNMP helper uses the ASN.1 code generator, from Taehee Yoo. Several patches for this helper to prepare this change are also part of this patch series. 5) Add 64-bit handles to uniquely objects in nf_tables, from Harsha Sharma. 6) Remove log message that several netfilter subsystems print at boot/load time. 7) Restore x_tables module autoloading, that got broken in a previous patch to allow singleton NAT hook callback registration per hook spot, from Florian Westphal. Moreover, return EBUSY to report that the singleton NAT hook slot is already in instead. 8) Several fixes for the new nf_tables flowtable representation, including incorrect error check after nf_tables_flowtable_lookup(), missing Kconfig dependencies that lead to build breakage and missing initialization of priority and hooknum in flowtable object. 9) Missing NETFILTER_FAMILY_ARP dependency in Kconfig for the clusterip target. This is due to recent updates in the core to shrink the hook array size and compile it out if no specific family is enabled via .config file. Patch from Florian Westphal. 10) Remove duplicated include header files, from Wei Yongjun. 11) Sparse warning fix for the NFPROTO_INET handling from the core due to missing static function definition, also from Wei Yongjun. 12) Restore ICMPv6 Parameter Problem error reporting when defragmentation fails, from Subash Abhinov Kasiviswanathan. 13) Remove obsolete owner field initialization from struct file_operations, patch from Alexey Dobriyan. 14) Use boolean datatype where needed in the Netfilter codebase, from Gustavo A. R. Silva. 15) Remove double semicolon in dynset nf_tables expression, from Luis de Bethencourt. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19tcp: avoid min RTT bloat by skipping RTT from delayed-ACK in BBRYuchung Cheng2-1/+3
A persistent connection may send tiny amount of data (e.g. health-check) for a long period of time. BBR's windowed min RTT filter may only see RTT samples from delayed ACKs causing BBR to grossly over-estimate the path delay depending how much the ACK was delayed at the receiver. This patch skips RTT samples that are likely coming from delayed ACKs. Note that it is possible the sender never obtains a valid measure to set the min RTT. In this case BBR will continue to set cwnd to initial window which seems fine because the connection is thin stream. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19tcp: avoid min-RTT overestimation from delayed ACKsYuchung Cheng1-2/+21
This patch avoids having TCP sender or congestion control overestimate the min RTT by orders of magnitude. This happens when all the samples in the windowed filter are one-packet transfer like small request and health-check like chit-chat, which is farily common for applications using persistent connections. This patch tries to conservatively labels and skip RTT samples obtained from this type of workload. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19netfilter: remove messages print and boot/module load timePablo Neira Ayuso2-2/+0
Several reasons for this: * Several modules maintain internal version numbers, that they print at boot/module load time, that are not exposed to userspace, as a primitive mechanism to make revision number control from the earlier days of Netfilter. * IPset shows the protocol version at boot/module load time, instead display this via module description, as Jozsef suggested. * Remove copyright notice at boot/module load time in two spots, the Netfilter codebase is a collective development effort, if we would have to display copyrights for each contributor at boot/module load time for each extensions we have, we would probably fill up logs with lots of useless information - from a technical standpoint. So let's be consistent and remove them all. Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>