aboutsummaryrefslogtreecommitdiffstats
path: root/net/netfilter/ipvs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2018-04-09ipvs: fix rtnl_lock lockups caused by start_sync_threadJulian Anastasov2-83/+80
syzkaller reports for wrong rtnl_lock usage in sync code [1] and [2] We have 2 problems in start_sync_thread if error path is taken, eg. on memory allocation error or failure to configure sockets for mcast group or addr/port binding: 1. recursive locking: holding rtnl_lock while calling sock_release which in turn calls again rtnl_lock in ip_mc_drop_socket to leave the mcast group, as noticed by Florian Westphal. Additionally, sock_release can not be called while holding sync_mutex (ABBA deadlock). 2. task hung: holding rtnl_lock while calling kthread_stop to stop the running kthreads. As the kthreads do the same to leave the mcast group (sock_release -> ip_mc_drop_socket -> rtnl_lock) they hang. Fix the problems by calling rtnl_unlock early in the error path, now sock_release is called after unlocking both mutexes. Problem 3 (task hung reported by syzkaller [2]) is variant of problem 2: use _trylock to prevent one user to call rtnl_lock and then while waiting for sync_mutex to block kthreads that execute sock_release when they are stopped by stop_sync_thread. [1] IPVS: stopping backup sync thread 4500 ... WARNING: possible recursive locking detected 4.16.0-rc7+ #3 Not tainted -------------------------------------------- syzkaller688027/4497 is trying to acquire lock: (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 but task is already holding lock: IPVS: stopping backup sync thread 4495 ... (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(rtnl_mutex); lock(rtnl_mutex); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by syzkaller688027/4497: #0: (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (ipvs->sync_mutex){+.+.}, at: [<00000000703f78e3>] do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388 stack backtrace: CPU: 1 PID: 4497 Comm: syzkaller688027 Not tainted 4.16.0-rc7+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 print_deadlock_bug kernel/locking/lockdep.c:1761 [inline] check_deadlock kernel/locking/lockdep.c:1805 [inline] validate_chain kernel/locking/lockdep.c:2401 [inline] __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __mutex_lock_common kernel/locking/mutex.c:756 [inline] __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643 inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413 sock_release+0x8d/0x1e0 net/socket.c:595 start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924 do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1261 udp_setsockopt+0x45/0x80 net/ipv4/udp.c:2406 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975 SYSC_setsockopt net/socket.c:1849 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1828 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x446a69 RSP: 002b:00007fa1c3a64da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000446a69 RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000006e29fc R08: 0000000000000018 R09: 0000000000000000 R10: 00000000200000c0 R11: 0000000000000246 R12: 00000000006e29f8 R13: 00676e697279656b R14: 00007fa1c3a659c0 R15: 00000000006e2b60 [2] IPVS: sync thread started: state = BACKUP, mcast_ifn = syz_tun, syncid = 4, id = 0 IPVS: stopping backup sync thread 25415 ... INFO: task syz-executor7:25421 blocked for more than 120 seconds. Not tainted 4.16.0-rc6+ #284 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor7 D23688 25421 4408 0x00000004 Call Trace: context_switch kernel/sched/core.c:2862 [inline] __schedule+0x8fb/0x1ec0 kernel/sched/core.c:3440 schedule+0xf5/0x430 kernel/sched/core.c:3499 schedule_timeout+0x1a3/0x230 kernel/time/timer.c:1777 do_wait_for_common kernel/sched/completion.c:86 [inline] __wait_for_common kernel/sched/completion.c:107 [inline] wait_for_common kernel/sched/completion.c:118 [inline] wait_for_completion+0x415/0x770 kernel/sched/completion.c:139 kthread_stop+0x14a/0x7a0 kernel/kthread.c:530 stop_sync_thread+0x3d9/0x740 net/netfilter/ipvs/ip_vs_sync.c:1996 do_ip_vs_set_ctl+0x2b1/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2394 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1253 sctp_setsockopt+0x2ca/0x63e0 net/sctp/socket.c:4154 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:3039 SYSC_setsockopt net/socket.c:1850 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1829 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x454889 RSP: 002b:00007fc927626c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007fc9276276d4 RCX: 0000000000454889 RDX: 000000000000048c RSI: 0000000000000000 RDI: 0000000000000017 RBP: 000000000072bf58 R08: 0000000000000018 R09: 0000000000000000 R10: 0000000020000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 000000000000051c R14: 00000000006f9b40 R15: 0000000000000001 Showing all locks held in the system: 2 locks held by khungtaskd/868: #0: (rcu_read_lock){....}, at: [<00000000a1a8f002>] check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline] #0: (rcu_read_lock){....}, at: [<00000000a1a8f002>] watchdog+0x1c5/0xd60 kernel/hung_task.c:249 #1: (tasklist_lock){.+.+}, at: [<0000000037c2f8f9>] debug_show_all_locks+0xd3/0x3d0 kernel/locking/lockdep.c:4470 1 lock held by rsyslogd/4247: #0: (&f->f_pos_lock){+.+.}, at: [<000000000d8d6983>] __fdget_pos+0x12b/0x190 fs/file.c:765 2 locks held by getty/4338: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4339: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4340: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4341: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4342: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4343: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4344: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 3 locks held by kworker/0:5/6494: #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] work_static include/linux/workqueue.h:198 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] set_work_data kernel/workqueue.c:619 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084 #1: ((addr_chk_work).work){+.+.}, at: [<00000000278427d5>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088 #2: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 1 lock held by syz-executor7/25421: #0: (ipvs->sync_mutex){+.+.}, at: [<00000000d414a689>] do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393 2 locks held by syz-executor7/25427: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (ipvs->sync_mutex){+.+.}, at: [<00000000e6d48489>] do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388 1 lock held by syz-executor7/25435: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 1 lock held by ipvs-b:2:0/25415: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 Reported-and-tested-by: syzbot+a46d6abf9d56b1365a72@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+5fe074c01b2032ce9618@syzkaller.appspotmail.com Fixes: e0b26cc997d5 ("ipvs: call rtnl_lock early") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-4/+4
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. This batch comes with more input sanitization for xtables to address bug reports from fuzzers, preparation works to the flowtable infrastructure and assorted updates. In no particular order, they are: 1) Make sure userspace provides a valid standard target verdict, from Florian Westphal. 2) Sanitize error target size, also from Florian. 3) Validate that last rule in basechain matches underflow/policy since userspace assumes this when decoding the ruleset blob that comes from the kernel, from Florian. 4) Consolidate hook entry checks through xt_check_table_hooks(), patch from Florian. 5) Cap ruleset allocations at 512 mbytes, 134217728 rules and reject very large compat offset arrays, so we have a reasonable upper limit and fuzzers don't exercise the oom-killer. Patches from Florian. 6) Several WARN_ON checks on xtables mutex helper, from Florian. 7) xt_rateest now has a hashtable per net, from Cong Wang. 8) Consolidate counter allocation in xt_counters_alloc(), from Florian. 9) Earlier xt_table_unlock() call in {ip,ip6,arp,eb}tables, patch from Xin Long. 10) Set FLOW_OFFLOAD_DIR_* to IP_CT_DIR_* definitions, patch from Felix Fietkau. 11) Consolidate code through flow_offload_fill_dir(), also from Felix. 12) Inline ip6_dst_mtu_forward() just like ip_dst_mtu_maybe_forward() to remove a dependency with flowtable and ipv6.ko, from Felix. 13) Cache mtu size in flow_offload_tuple object, this is safe for forwarding as f87c10a8aa1e describes, from Felix. 14) Rename nf_flow_table.c to nf_flow_table_core.o, to simplify too modular infrastructure, from Felix. 15) Add rt0, rt2 and rt4 IPv6 routing extension support, patch from Ahmed Abdelsalam. 16) Remove unused parameter in nf_conncount_count(), from Yi-Hung Wei. 17) Support for counting only to nf_conncount infrastructure, patch from Yi-Hung Wei. 18) Add strict NFT_CT_{SRC_IP,DST_IP,SRC_IP6,DST_IP6} key datatypes to nft_ct. 19) Use boolean as return value from ipt_ah and from IPVS too, patch from Gustavo A. R. Silva. 20) Remove useless parameters in nfnl_acct_overquota() and nf_conntrack_broadcast_help(), from Taehee Yoo. 21) Use ipv6_addr_is_multicast() from xt_cluster, also from Taehee Yoo. 22) Statify nf_tables_obj_lookup_byhandle, patch from Fengguang Wu. 23) Fix typo in xt_limit, from Geert Uytterhoeven. 24) Do no use VLAs in Netfilter code, again from Gustavo. 25) Use ADD_COUNTER from ebtables, from Taehee Yoo. 26) Bitshift support for CONNMARK and MARK targets, from Jack Ma. 27) Use pr_*() and add pr_fmt(), from Arushi Singhal. 28) Add synproxy support to ctnetlink. 29) ICMP type and IGMP matching support for ebtables, patches from Matthias Schiffer. 30) Support for the revision infrastructure to ebtables, from Bernie Harris. 31) String match support for ebtables, also from Bernie. 32) Documentation for the new flowtable infrastructure. 33) Use generic comparison functions in ebt_stp, from Joe Perches. 34) Demodularize filter chains in nftables. 35) Register conntrack hooks in case nftables NAT chain is added. 36) Merge assignments with return in a couple of spots in the Netfilter codebase, also from Arushi. 37) Document that xtables percpu counters are stored in the same memory area, from Ben Hutchings. 38) Revert mark_source_chains() sanity checks that break existing rulesets, from Florian Westphal. 39) Use is_zero_ether_addr() in the ipset codebase, from Joe Perches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27net: Drop pernet_operations::asyncKirill Tkhai4-5/+0
Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17net: Convert ip_vs_ftp_opsKirill Tkhai1-0/+1
These pernet_operations register and unregister ipvs app. register_ip_vs_app(), unregister_ip_vs_app() and register_ip_vs_app_inc() modify per-net structures, and there are no global structures touched. So, this looks safe to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17net: Convert ipvs_core_dev_opsKirill Tkhai1-0/+1
Exit method stops two per-net threads and cancels delayed work. Everything looks nicely per-net divided. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17net: Convert ipvs_core_opsKirill Tkhai1-0/+1
These pernet_operations register and unregister nf hooks, /proc entries, sysctl, percpu statistics. There are several global lists, and the only list modified without exclusive locks is ip_vs_conn_tab in ip_vs_conn_flush(). We iterate the list and force the timers expire at the moment. Since there were possible several timer expirations before this patch, and since they are safe, the patch does not invent new parallelism of their destruction. These pernet_operations look safe to be converted. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-13ipvs: use true and false for boolean valuesGustavo A. R. Silva2-4/+4
Assign true or false to boolean variables instead of an integer value. This issue was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
All of the conflicts were cases of overlapping changes. In net/core/devlink.c, we have to make care that the resouce size_params have become a struct member rather than a pointer to such an object. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28ipvs: remove IPS_NAT_MASK check to fix passive FTPJulian Anastasov1-1/+1
The IPS_NAT_MASK check in 4.12 replaced previous check for nfct_nat() which was needed to fix a crash in 2.6.36-rc, see commit 7bcbf81a2296 ("ipvs: avoid oops for passive FTP"). But as IPVS does not set the IPS_SRC_NAT and IPS_DST_NAT bits, checking for IPS_NAT_MASK prevents PASV response to be properly mangled and blocks the transfer. Remove the check as it is not needed after 3.12 commit 41d73ec053d2 ("netfilter: nf_conntrack: make sequence number adjustments usuable without NAT") which changes nfct_nat() with nfct_seqadj() and especially after 3.13 commit b25adce16064 ("ipvs: correct usage/allocation of seqadj ext in ipvs"). Thanks to Li Shuang and Florian Westphal for reporting the problem! Reported-by: Li Shuang <shuali@redhat.com> Fixes: be7be6e161a2 ("netfilter: ipvs: fix incorrect conflict resolution") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-27net: Convert sysctl creating and destroying pernet_operationsKirill Tkhai2-0/+2
These pernet_operations create and destroy sysctl tables, and they are able to be executed in parallel with any others: ip_vs_lblc_ops ip_vs_lblcr_ops Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds6-13/+9
Pull networking updates from David Miller: 1) Significantly shrink the core networking routing structures. Result of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf 2) Add netdevsim driver for testing various offloads, from Jakub Kicinski. 3) Support cross-chip FDB operations in DSA, from Vivien Didelot. 4) Add a 2nd listener hash table for TCP, similar to what was done for UDP. From Martin KaFai Lau. 5) Add eBPF based queue selection to tun, from Jason Wang. 6) Lockless qdisc support, from John Fastabend. 7) SCTP stream interleave support, from Xin Long. 8) Smoother TCP receive autotuning, from Eric Dumazet. 9) Lots of erspan tunneling enhancements, from William Tu. 10) Add true function call support to BPF, from Alexei Starovoitov. 11) Add explicit support for GRO HW offloading, from Michael Chan. 12) Support extack generation in more netlink subsystems. From Alexander Aring, Quentin Monnet, and Jakub Kicinski. 13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From Russell King. 14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso. 15) Many improvements and simplifications to the NFP driver bpf JIT, from Jakub Kicinski. 16) Support for ipv6 non-equal cost multipath routing, from Ido Schimmel. 17) Add resource abstration to devlink, from Arkadi Sharshevsky. 18) Packet scheduler classifier shared filter block support, from Jiri Pirko. 19) Avoid locking in act_csum, from Davide Caratti. 20) devinet_ioctl() simplifications from Al viro. 21) More TCP bpf improvements from Lawrence Brakmo. 22) Add support for onlink ipv6 route flag, similar to ipv4, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits) tls: Add support for encryption using async offload accelerator ip6mr: fix stale iterator net/sched: kconfig: Remove blank help texts openvswitch: meter: Use 64-bit arithmetic instead of 32-bit tcp_nv: fix potential integer overflow in tcpnv_acked r8169: fix RTL8168EP take too long to complete driver initialization. qmi_wwan: Add support for Quectel EP06 rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK ipmr: Fix ptrdiff_t print formatting ibmvnic: Wait for device response when changing MAC qlcnic: fix deadlock bug tcp: release sk_frag.page in tcp_disconnect ipv4: Get the address of interface correctly. net_sched: gen_estimator: fix lockdep splat net: macb: Handle HRESP error net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring ipv6: addrconf: break critical section in addrconf_verify_rtnl() ipv6: change route cache aging logic i40e/i40evf: Update DESC_NEEDED value to reflect larger value bnxt_en: cleanup DIM work on device shutdown ...
2018-01-19netfilter: delete /proc THIS_MODULE referencesAlexey Dobriyan3-6/+0
/proc has been ignoring struct file_operations::owner field for 10 years. Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where inode->i_fop is initialized with proxy struct file_operations for regular files: - if (de->proc_fops) - inode->i_fop = de->proc_fops; + if (de->proc_fops) { + if (S_ISREG(inode->i_mode)) + inode->i_fop = &proc_reg_file_ops; + else + inode->i_fop = de->proc_fops; + } VFS stopped pinning module at this point. # ipvs Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: ipvs: Remove useless ipvsh param of frag_safe_skb_hpGao Feng2-7/+7
The param of frag_safe_skb_hp, ipvsh, isn't used now. So remove it and update the callers' codes too. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: mark expected switch fall-throughsGustavo A. R. Silva2-0/+2
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-12-02ipvs: switch to sock_recvmsg()Al Viro1-6/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2-3/+7
Pull networking updates from David Miller: "Highlights: 1) Maintain the TCP retransmit queue using an rbtree, with 1GB windows at 100Gb this really has become necessary. From Eric Dumazet. 2) Multi-program support for cgroup+bpf, from Alexei Starovoitov. 3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew Lunn. 4) Add meter action support to openvswitch, from Andy Zhou. 5) Add a data meta pointer for BPF accessible packets, from Daniel Borkmann. 6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet. 7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli. 8) More work to move the RTNL mutex down, from Florian Westphal. 9) Add 'bpftool' utility, to help with bpf program introspection. From Jakub Kicinski. 10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper Dangaard Brouer. 11) Support 'blocks' of transformations in the packet scheduler which can span multiple network devices, from Jiri Pirko. 12) TC flower offload support in cxgb4, from Kumar Sanghvi. 13) Priority based stream scheduler for SCTP, from Marcelo Ricardo Leitner. 14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg. 15) Add RED qdisc offloadability, and use it in mlxsw driver. From Nogah Frankel. 16) eBPF based device controller for cgroup v2, from Roman Gushchin. 17) Add some fundamental tracepoints for TCP, from Song Liu. 18) Remove garbage collection from ipv6 route layer, this is a significant accomplishment. From Wei Wang. 19) Add multicast route offload support to mlxsw, from Yotam Gigi" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits) tcp: highest_sack fix geneve: fix fill_info when link down bpf: fix lockdep splat net: cdc_ncm: GetNtbFormat endian fix openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start netem: remove unnecessary 64 bit modulus netem: use 64 bit divide by rate tcp: Namespace-ify sysctl_tcp_default_congestion_control net: Protect iterations over net::fib_notifier_ops in fib_seq_sum() ipv6: set all.accept_dad to 0 by default uapi: fix linux/tls.h userspace compilation error usbnet: ipheth: prevent TX queue timeouts when device not ready vhost_net: conditionally enable tx polling uapi: fix linux/rxrpc.h userspace compilation errors net: stmmac: fix LPI transitioning for dwmac4 atm: horizon: Fix irq release error net-sysfs: trigger netlink notification on ifalias change via sysfs openvswitch: Using kfree_rcu() to simplify the code openvswitch: Make local function ovs_nsh_key_attr_size() static openvswitch: Fix return value check in ovs_meter_cmd_features() ...
2017-11-13Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds5-22/+23
Pull timer updates from Thomas Gleixner: "Yet another big pile of changes: - More year 2038 work from Arnd slowly reaching the point where we need to think about the syscalls themself. - A new timer function which allows to conditionally (re)arm a timer only when it's either not running or the new expiry time is sooner than the armed expiry time. This allows to use a single timer for multiple timeout requirements w/o caring about the first expiry time at the call site. - A new NMI safe accessor to clock real time for the printk timestamp work. Can be used by tracing, perf as well if required. - A large number of timer setup conversions from Kees which got collected here because either maintainers requested so or they simply got ignored. As Kees pointed out already there are a few trivial merge conflicts and some redundant commits which was unavoidable due to the size of this conversion effort. - Avoid a redundant iteration in the timer wheel softirq processing. - Provide a mechanism to treat RTC implementations depending on their hardware properties, i.e. don't inflict the write at the 0.5 seconds boundary which originates from the PC CMOS RTC to all RTCs. No functional change as drivers need to be updated separately. - The usual small updates to core code clocksource drivers. Nothing really exciting" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (111 commits) timers: Add a function to start/reduce a timer pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday() timer: Prepare to change all DEFINE_TIMER() callbacks netfilter: ipvs: Convert timers to use timer_setup() scsi: qla2xxx: Convert timers to use timer_setup() block/aoe: discover_timer: Convert timers to use timer_setup() ide: Convert timers to use timer_setup() drbd: Convert timers to use timer_setup() mailbox: Convert timers to use timer_setup() crypto: Convert timers to use timer_setup() drivers/pcmcia: omap1: Fix error in automated timer conversion ARM: footbridge: Fix typo in timer conversion drivers/sgi-xp: Convert timers to use timer_setup() drivers/pcmcia: Convert timers to use timer_setup() drivers/memstick: Convert timers to use timer_setup() drivers/macintosh: Convert timers to use timer_setup() hwrng/xgene-rng: Convert timers to use timer_setup() auxdisplay: Convert timers to use timer_setup() sparc/led: Convert timers to use timer_setup() mips: ip22/32: Convert timers to use timer_setup() ...
2017-11-08netfilter: ipvs: Convert timers to use timer_setup()Kees Cook5-22/+23
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: Wensong Zhang <wensong@linux-vs.org> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Cc: lvs-devel@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: coreteam@netfilter.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au>
2017-11-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2-3/+7
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree, they are: 1) Speed up table replacement on busy systems with large tables (and many cores) in x_tables. Now xt_replace_table() synchronizes by itself by waiting until all cpus had an even seqcount and we use no use seqlock when fetching old counters, from Florian Westphal. 2) Add nf_l4proto_log_invalid() and nf_ct_l4proto_log_invalid() to speed up packet processing in the fast path when logging is not enabled, from Florian Westphal. 3) Precompute masked address from configuration plane in xt_connlimit, from Florian. 4) Don't use explicit size for set selection if performance set policy is selected. 5) Allow to get elements from an existing set in nf_tables. 6) Fix incorrect check in nft_hash_deactivate(), from Florian. 7) Cache netlink attribute size result in l4proto->nla_size, from Florian. 8) Handle NFPROTO_INET in nf_ct_netns_get() from conntrack core. 9) Use power efficient workqueue in conntrack garbage collector, from Vincent Guittot. 10) Remove unnecessary parameter, in conntrack l4proto functions, also from Florian. 11) Constify struct nf_conntrack_l3proto definitions, from Florian. 12) Remove all typedefs in nf_conntrack_h323 via coccinelle semantic patch, from Harsha Sharma. 13) Don't store address in the rbtree nodes in xt_connlimit, they are never used, from Florian. 14) Fix out of bound access in the conntrack h323 helper, patch from Eric Sesterhenn. 15) Print symbols for the address returned with %pS in IPVS, from Helge Deller. 16) Proc output should only display its own netns in IPVS, from KUWAZAWA Takuya. 17) Small clean up in size_entry_mwt(), from Colin Ian King. 18) Use test_and_clear_bit from nf_nat_proto_clean() instead of separated non-atomic test and then clear bit, from Florian Westphal. 19) Consolidate prefix length maps in ipset, from Aaron Conole. 20) Fix sparse warnings in ipset, from Jozsef Kadlecsik. 21) Simplify list_set_memsize(), from simran singhal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-07Merge branch 'linus' into locking/core, to resolve conflictsIngo Molnar3-0/+3
Conflicts: include/linux/compiler-clang.h include/linux/compiler-gcc.h include/linux/compiler-intel.h include/uapi/linux/stddef.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-06netfilter: ipvs: Fix inappropriate output of procfsKUWAZAWA Takuya1-0/+4
Information about ipvs in different network namespace can be seen via procfs. How to reproduce: # ip netns add ns01 # ip netns add ns02 # ip netns exec ns01 ip a add dev lo 127.0.0.1/8 # ip netns exec ns02 ip a add dev lo 127.0.0.1/8 # ip netns exec ns01 ipvsadm -A -t 10.1.1.1:80 # ip netns exec ns02 ipvsadm -A -t 10.1.1.2:80 The ipvsadm displays information about its own network namespace only. # ip netns exec ns01 ipvsadm -Ln IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP 10.1.1.1:80 wlc # ip netns exec ns02 ipvsadm -Ln IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP 10.1.1.2:80 wlc But I can see information about other network namespace via procfs. # ip netns exec ns01 cat /proc/net/ip_vs IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP 0A010101:0050 wlc TCP 0A010102:0050 wlc # ip netns exec ns02 cat /proc/net/ip_vs IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP 0A010102:0050 wlc Signed-off-by: KUWAZAWA Takuya <albatross0@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06netfilter: ipvs: Use %pS printk format for direct addressesHelge Deller2-3/+3
The debug and error printk functions in ipvs uses wrongly the %pF instead of the %pS printk format specifier for printing symbols for the address returned by _builtin_return_address(0). Fix it for the ia64, ppc64 and parisc64 architectures. Signed-off-by: Helge Deller <deller@gmx.de> Cc: Wensong Zhang <wensong@linux-vs.org> Cc: netdev@vger.kernel.org Cc: lvs-devel@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman3-0/+3
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-25locking/atomics, net/netlink/netfilter: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()Mark Rutland1-1/+1
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't currently harmful. However, for some features it is necessary to instrument reads and writes separately, which is not possible with ACCESS_ONCE(). This distinction is critical to correct operation. It's possible to transform the bulk of kernel code using the Coccinelle script below. However, this doesn't handle comments, leaving references to ACCESS_ONCE() instances which have been removed. As a preparatory step, this patch converts netlink and netfilter code and comments to use {READ,WRITE}_ONCE() consistently. ---- virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Westphal <fw@strlen.de> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-7-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-26netfilter: ipvs: full-functionality option for ECN encapsulation in tunnelVadim Fedorenko1-2/+6
IPVS tunnel mode works as simple tunnel (see RFC 3168) copying ECN field to outer header. That's result in packet drops on egress tunnels in case the egress tunnel operates as ECN-capable with Full-functionality option (like ip_tunnel and ip6_tunnel kernel modules), according to RFC 3168 section 9.1.1 recommendation. This patch implements ECN full-functionality option into ipvs xmit code. Cc: netdev@vger.kernel.org Cc: lvs-devel@vger.kernel.org Signed-off-by: Vadim Fedorenko <vfedorenko@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08netfilter: ipvs: do not create conn for ABORT packet in sctp_conn_scheduleXin Long1-1/+2
There's no reason for ipvs to create a conn for an ABORT packet even if sysctl_sloppy_sctp is set. This patch is to accept it without creating a conn, just as ipvs does for tcp's RST packet. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-08netfilter: ipvs: fix the issue that sctp_conn_schedule drops non-INIT packetXin Long1-2/+5
Commit 5e26b1b3abce ("ipvs: support scheduling inverse and icmp SCTP packets") changed to check packet type early. It introduced a side effect: if it's not a INIT packet, ports will be set as NULL, and the packet will be dropped later. It caused that sctp couldn't create connection when ipvs module is loaded and any scheduler is registered on server. Li Shuang reproduced it by running the cmds on sctp server: # ipvsadm -A -t 1.1.1.1:80 -s rr # ipvsadm -D -t 1.1.1.1:80 then the server could't work any more. This patch is to return 1 when it's not an INIT packet. It means ipvs will accept it without creating a conn for it, just like what it does for tcp. Fixes: 5e26b1b3abce ("ipvs: support scheduling inverse and icmp SCTP packets") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31netfilter: nf_hook_ops structs can be constFlorian Westphal1-1/+1
We no longer place these on a list so they can be const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-24netfilter: Remove duplicated rcu_read_lock.Taehee Yoo7-82/+8
This patch removes duplicate rcu_read_lock(). 1. IPVS part: According to Julian Anastasov's mention, contexts of ipvs are described at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary: - packet RX/TX: does not need locks because packets come from hooks. - sync msg RX: backup server uses RCU locks while registering new connections. - ip_vs_ctl.c: configuration get/set, RCU locks needed. - xt_ipvs.c: It is a netfilter match, running from hook context. As result, rcu_read_lock and rcu_read_unlock can be removed from: - ip_vs_core.c: all - ip_vs_ctl.c: - only from ip_vs_has_real_service - ip_vs_ftp.c: all - ip_vs_proto_sctp.c: all - ip_vs_proto_tcp.c: all - ip_vs_proto_udp.c: all - ip_vs_xmit.c: all (contains only packet processing) 2. Netfilter part: There are three types of functions that are guaranteed the rcu_read_lock(). First, as result, functions are only called by nf_hook(): - nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp(). - tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6(). - match_lookup_rt6(), check_hlist(), hashlimit_mt_common(). - xt_osf_match_packet(). Second, functions that caller already held the rcu_read_lock(). - destroy_conntrack(), ctnetlink_conntrack_event(). - ctnl_timeout_find_get(), nfqnl_nf_hook_drop(). Third, functions that are mixed with type1 and type2. These functions are called by nf_hook() also these are called by ordinary functions that already held the rcu_read_lock(): - __ctnetlink_glue_build(), ctnetlink_expect_event(). - ctnetlink_proto_size(). Applied files are below: - nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c. - nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c. - nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c. - xt_connlimit.c, xt_hashlimit.c, xt_osf.c Detailed calltrace can be found at: http://marc.info/?l=netfilter-devel&m=149667610710350&w=2 Signed-off-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-01sctp: remove the typedef sctp_chunkhdr_tXin Long2-5/+5
This patch is to remove the typedef sctp_chunkhdr_t, and replace with struct sctp_chunkhdr in the places where it's using this typedef. It is also to fix some indents and use sizeof(variable) instead of sizeof(type)., especially in sctp_new. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01sctp: remove the typedef sctp_sctphdr_tXin Long2-11/+10
This patch is to remove the typedef sctp_sctphdr_t, and replace with struct sctphdr in the places where it's using this typedef. It is also to fix some indents and use sizeof(variable) instead of sizeof(type). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08ipvs: SNAT packet replies only for NATed connectionsJulian Anastasov1-5/+14
We do not check if packet from real server is for NAT connection before performing SNAT. This causes problems for setups that use DR/TUN and allow local clients to access the real server directly, for example: - local client in director creates IPVS-DR/TUN connection CIP->VIP and the request packets are routed to RIP. Talks are finished but IPVS connection is not expired yet. - second local client creates non-IPVS connection CIP->RIP with same reply tuple RIP->CIP and when replies are received on LOCAL_IN we wrongly assign them for the first client connection because RIP->CIP matches the reply direction. As result, IPVS SNATs replies for non-IPVS connections. The problem is more visible to local UDP clients but in rare cases it can happen also for TCP or remote clients when the real server sends the reply traffic via the director. So, better to be more precise for the reply traffic. As replies are not expected for DR/TUN connections, better to not touch them. Reported-by: Nick Moriarty <nick.moriarty@york.ac.uk> Tested-by: Nick Moriarty <nick.moriarty@york.ac.uk> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-05-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-5/+17
Pablo Neira Ayuso says: ==================== Netfilter/IPVS/OVS fixes for net The following patchset contains a rather large batch of Netfilter, IPVS and OVS fixes for your net tree. This includes fixes for ctnetlink, the userspace conntrack helper infrastructure, conntrack OVS support, ebtables DNAT target, several leaks in error path among other. More specifically, they are: 1) Fix reference count leak in the CT target error path, from Gao Feng. 2) Remove conntrack entry clashing with a matching expectation, patch from Jarno Rajahalme. 3) Fix bogus EEXIST when registering two different userspace helpers, from Liping Zhang. 4) Don't leak dummy elements in the new bitmap set type in nf_tables, from Liping Zhang. 5) Get rid of module autoload from conntrack update path in ctnetlink, we don't need autoload at this late stage and it is happening with rcu read lock held which is not good. From Liping Zhang. 6) Fix deadlock due to double-acquire of the expect_lock from conntrack update path, this fixes a bug that was introduced when the central spinlock got removed. Again from Liping Zhang. 7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock protection that was selected when the central spinlock was removed was not really protecting anything at all. 8) Protect sequence adjustment under ct->lock. 9) Missing socket match with IPv6, from Peter Tirsek. 10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from Linus Luessing. 11) Don't give up on evaluating the expression on new entries added via dynset expression in nf_tables, from Liping Zhang. 12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals with non-linear skbuffs. 13) Don't allow IPv6 service in IPVS if no IPv6 support is available, from Paolo Abeni. 14) Missing mutex release in error path of xt_find_table_lock(), from Dan Carpenter. 15) Update maintainers files, Netfilter section. Add Florian to the file, refer to nftables.org and change project status from Supported to Maintained. 16) Bail out on mismatching extensions in element updates in nf_tables. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller7-57/+34
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree. A large bunch of code cleanups, simplify the conntrack extension codebase, get rid of the fake conntrack object, speed up netns by selective synchronize_net() calls. More specifically, they are: 1) Check for ct->status bit instead of using nfct_nat() from IPVS and Netfilter codebase, patch from Florian Westphal. 2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao. 3) Simplify FTP IPVS helper module registration path, from Arushi Singhal. 4) Introduce nft_is_base_chain() helper function. 5) Enforce expectation limit from userspace conntrack helper, from Gao Feng. 6) Add nf_ct_remove_expect() helper function, from Gao Feng. 7) NAT mangle helper function return boolean, from Gao Feng. 8) ctnetlink_alloc_expect() should only work for conntrack with helpers, from Gao Feng. 9) Add nfnl_msg_type() helper function to nfnetlink to build the netlink message type. 10) Get rid of unnecessary cast on void, from simran singhal. 11) Use seq_puts()/seq_putc() instead of seq_printf() where possible, also from simran singhal. 12) Use list_prev_entry() from nf_tables, from simran signhal. 13) Remove unnecessary & on pointer function in the Netfilter and IPVS code. 14) Remove obsolete comment on set of rules per CPU in ip6_tables, no longer true. From Arushi Singhal. 15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng. 16) Remove unnecessary nested rcu_read_lock() in __nf_nat_decode_session(). Code running from hooks are already guaranteed to run under RCU read side. 17) Remove deadcode in nf_tables_getobj(), from Aaron Conole. 18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(), also from Aaron. 19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole. 20) Don't propagate NF_DROP error to userspace via ctnetlink in __nf_nat_alloc_null_binding() function, from Gao Feng. 21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks, from Gao Feng. 22) Kill the fake untracked conntrack objects, use ctinfo instead to annotate a conntrack object is untracked, from Florian Westphal. 23) Remove nf_ct_is_untracked(), now obsolete since we have no conntrack template anymore, from Florian. 24) Add event mask support to nft_ct, also from Florian. 25) Move nf_conn_help structure to include/net/netfilter/nf_conntrack_helper.h. 26) Add a fixed 32 bytes scratchpad area for conntrack helpers. Thus, we don't deal with variable conntrack extensions anymore. Make sure userspace conntrack helper doesn't go over that size. Remove variable size ct extension infrastructure now this code got no more clients. From Florian Westphal. 27) Restore offset and length of nf_ct_ext structure to 8 bytes now that wraparound is not possible any longer, also from Florian. 28) Allow to get rid of unassured flows under stress in conntrack, this applies to DCCP, SCTP and TCP protocols, from Florian. 29) Shrink size of nf_conntrack_ecache structure, from Florian. 30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker, from Gao Feng. 31) Register SYNPROXY hooks on demand, from Florian Westphal. 32) Use pernet hook whenever possible, instead of global hook registration, from Florian Westphal. 33) Pass hook structure to ebt_register_table() to consolidate some infrastructure code, from Florian Westphal. 34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the SYNPROXY code, to make sure device stats are not fooled, patch from Gao Feng. 35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we don't need anymore if we just select a fixed size instead of expensive runtime time calculation of this. From Florian. 36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(), from Florian. 37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from Florian. 38) Attach NAT extension on-demand from masquerade and pptp helper path, from Florian. 39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole. 40) Speed up netns by selective calls of synchronize_net(), from Florian Westphal. 41) Silence stack size warning gcc in 32-bit arch in snmp helper, from Florian. 42) Inconditionally call nf_ct_ext_destroy(), even if we have no extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabledPaolo Abeni1-5/+17
When creating a new ipvs service, ipv6 addresses are always accepted if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family is not explicitly checked. This allows the user-space to configure ipvs services even if the system is booted with ipv6.disable=1. On specific configuration, ipvs can try to call ipv6 routing code at setup time, causing the kernel to oops due to fib6_rules_ops being NULL. This change addresses the issue adding a check for the ipv6 module being enabled while validating ipv6 service operations and adding the same validation for dest operations. According to git history, this issue is apparently present since the introduction of ipv6 support, and the oops can be triggered since commit 09571c7ae30865ad ("IPVS: Add function to determine if IPv6 address is local") Fixes: 09571c7ae30865ad ("IPVS: Add function to determine if IPv6 address is local") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28ipvs: change comparison on sync_refresh_periodAaron Conole1-1/+1
The sync_refresh_period variable is unsigned, so it can never be < 0. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-28ipvs: remove unused function ip_vs_set_state_timeoutAaron Conole1-22/+0
There are no in-tree callers of this function and it isn't exported. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-04-26ipvs: convert to use pernet nf_hook apiFlorian Westphal1-10/+9
nf_(un)register_hooks has to maintain an internal hook list to add/remove those hooks from net namespaces as they are added/deleted. ipvs already uses pernet_ops, so we can switch to the (more recent) pernet hook api instead. Compile tested only. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-19netfilter: ipvs: fix incorrect conflict resolutionFlorian Westphal1-1/+2
The commit ab8bc7ed864b9c4f1fcb00a22bbe4e0f66ce8003 ("netfilter: remove nf_ct_is_untracked") changed the line if (ct && !nf_ct_is_untracked(ct) && nfct_nat(ct)) { to if (ct && nfct_nat(ct)) { meanwhile, the commit 41390895e50bc4f28abe384c6b35ac27464a20ec ("netfilter: ipvs: don't check for presence of nat extension") from ipvs-next had changed the same line to if (ct && !nf_ct_is_untracked(ct) && (ct->status & IPS_NAT_MASK)) { When ipvs-next got merged into nf-next, the merge resolution took the first version, dropping the conversion of nfct_nat(). While this doesn't cause a problem at the moment, it will once we stop adding the nat extension by default. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15netfilter: remove nf_ct_is_untrackedFlorian Westphal3-8/+7
This function is now obsolete and always returns false. This change has no effect on generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-15Merge tag 'ipvs2-for-v4.12' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-nextPablo Neira Ayuso2-7/+4
Simon Horman says: ==================== Second Round of IPVS Updates for v4.12 please consider these clean-ups and enhancements to IPVS for v4.12. * Removal unused variable * Use kzalloc where appropriate * More efficient detection of presence of NAT extension ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Conflicts: net/netfilter/ipvs/ip_vs_ftp.c
2017-04-13netlink: pass extended ACK struct where availableJohannes Berg1-1/+1
This is an add-on to the previous patch that passes the extended ACK structure where it's already available by existing genl_info or extack function arguments. This was done with this spatch (with some manual adjustment of indentation): @@ expression A, B, C, D, E; identifier fn, info; @@ fn(..., struct genl_info *info, ...) { ... -nlmsg_parse(A, B, C, D, E, NULL) +nlmsg_parse(A, B, C, D, E, info->extack) ... } @@ expression A, B, C, D, E; identifier fn, info; @@ fn(..., struct genl_info *info, ...) { <... -nla_parse_nested(A, B, C, D, NULL) +nla_parse_nested(A, B, C, D, info->extack) ...> } @@ expression A, B, C, D, E; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { <... -nlmsg_parse(A, B, C, D, E, NULL) +nlmsg_parse(A, B, C, D, E, extack) ...> } @@ expression A, B, C, D, E; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { <... -nla_parse(A, B, C, D, E, NULL) +nla_parse(A, B, C, D, E, extack) ...> } @@ expression A, B, C, D, E; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { ... -nlmsg_parse(A, B, C, D, E, NULL) +nlmsg_parse(A, B, C, D, E, extack) ... } @@ expression A, B, C, D; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { <... -nla_parse_nested(A, B, C, D, NULL) +nla_parse_nested(A, B, C, D, extack) ...> } @@ expression A, B, C, D; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { <... -nlmsg_validate(A, B, C, D, NULL) +nlmsg_validate(A, B, C, D, extack) ...> } @@ expression A, B, C, D; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { <... -nla_validate(A, B, C, D, NULL) +nla_validate(A, B, C, D, extack) ...> } @@ expression A, B, C; identifier fn, extack; @@ fn(..., struct netlink_ext_ack *extack, ...) { <... -nla_validate_nested(A, B, C, NULL) +nla_validate_nested(A, B, C, extack) ...> } Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13netlink: pass extended ACK struct to parsing functionsJohannes Berg1-5/+7
Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-07netfilter: Remove exceptional & on function nameArushi Singhal1-2/+2
Remove & from function pointers to conform to the style found elsewhere in the file. Done using the following semantic patch // <smpl> @r@ identifier f; @@ f(...) { ... } @@ identifier r.f; @@ - &f + f // </smpl> Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-07netfilter: Use seq_puts()/seq_putc() where possiblesimran singhal1-4/+4
For string without format specifiers, use seq_puts(). For seq_printf("\n"), use seq_putc('\n'). Signed-off-by: simran singhal <singhalsimran0@gmail.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-04-06netfilter: nat: nf_nat_mangle_{udp,tcp}_packet returns booleanGao Feng1-5/+8
nf_nat_mangle_{udp,tcp}_packet() returns int. However, it is used as bool type in many spots. Fix this by consistently handle this return value as a boolean. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-03-30ipvs: remove unused variableArushi Singhal1-4/+1
This patch uses the following coccinelle script to remove a variable that was simply used to store the return value of a function call before returning it: @@ identifier len,f; @@ -int len; ... when != len when strict -len = +return f(...); -return len; Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2017-03-30netfilter: ipvs: Replace kzalloc with kcalloc.Varsha Rao1-2/+2
Replace kzalloc with kcalloc. As kcalloc is preferred for allocating an array instead of kzalloc. This patch fixes the checkpatch issue. Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
2017-03-30netfilter: ipvs: don't check for presence of nat extensionFlorian Westphal1-1/+1
Check for the NAT status bits, they are set once conntrack needs NAT in source or reply direction, this is slightly faster than nfct_nat() as that has to check the extension area. Signed-off-by: Florian Westphal <fw@strlen.de>
2017-03-17netfilter: refcounter conversionsReshetova, Elena12-31/+31
refcount_t type and corresponding API (see include/linux/refcount.h) should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>