aboutsummaryrefslogtreecommitdiffstats
path: root/net (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-07-19ipv6: avoid overflow of offset in ip6_find_1stfragoptSabrina Dubroca1-2/+6
In some cases, offset can overflow and can cause an infinite loop in ip6_find_1stfragopt(). Make it unsigned int to prevent the overflow, and cap it at IPV6_MAXPLEN, since packets larger than that should be invalid. This problem has been here since before the beginning of git history. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-19Revert "rtnetlink: Do not generate notifications for CHANGEADDR event"David Ahern1-0/+1
This reverts commit cd8966e75ed3c6b41a37047a904617bc44fa481f. The duplicate CHANGEADDR event message is sent regardless of link status whereas the setlink changes only generate a notification when the link is up. Not sending a notification when the link is down breaks dhcpcd which only processes hwaddr changes when the link is down. Fixes reported regression: https://bugzilla.kernel.org/show_bug.cgi?id=196355 Reported-by: Yaroslav Isakov <yaroslav.isakov@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-19net: Zero terminate ifr_name in dev_ifname().David S. Miller1-0/+1
The ifr.ifr_name is passed around and assumed to be NULL terminated. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-19wireless: wext: terminate ifr name coming from userspaceLevin, Alexander1-0/+2
ifr name is assumed to be a valid string by the kernel, but nothing was forcing username to pass a valid string. In turn, this would cause panics as we tried to access the string past it's valid memory. Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-18netfilter: fix netfilter_net_init() returnDan Carpenter1-2/+2
We accidentally return an uninitialized variable. Fixes: cf56c2f892a8 ("netfilter: remove old pre-netns era hook api") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller5-157/+14
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Missing netlink message sanity check in nfnetlink, patch from Mateusz Jurczyk. 2) We now have netfilter per-netns hooks, so let's kill global hook infrastructure, this infrastructure is known to be racy with netns. We don't care about out of tree modules. Patch from Florian Westphal. 3) find_appropriate_src() is buggy when colissions happens after the conversion of the nat bysource to rhashtable. Also from Florian. 4) Remove forward chain in nf_tables arp family, it's useless and it is causing quite a bit of confusion, from Florian Westphal. 5) nf_ct_remove_expect() is called with the wrong parameter, causing kernel oops, patch from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-18udp: preserve skb->dst if required for IP options processingPaolo Abeni1-2/+11
Eric noticed that in udp_recvmsg() we still need to access skb->dst while processing the IP options. Since commit 0a463c78d25b ("udp: avoid a cache miss on dequeue") skb->dst is no more available at recvmsg() time and bad things will happen if we enter the relevant code path. This commit address the issue, avoid clearing skb->dst if any IP options are present into the relevant skb. Since the IP CB is contained in the first skb cacheline, we can test it to decide to leverage the consume_stateless_skb() optimization, without measurable additional cost in the faster path. v1 -> v2: updated commit message tags Fixes: 0a463c78d25b ("udp: avoid a cache miss on dequeue") Reported-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-18ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check()Alexander Potapenko2-0/+2
KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(), which originated from the TCP request socket created in cookie_v6_check(): ================================================================== BUG: KMSAN: use of uninitialized memory in tcp_transmit_skb+0xf77/0x3ec0 CPU: 1 PID: 2949 Comm: syz-execprog Not tainted 4.11.0-rc5+ #2931 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 TCP: request_sock_TCPv6: Possible SYN flooding on port 20028. Sending cookies. Check SNMP counters. Call Trace: <IRQ> __dump_stack lib/dump_stack.c:16 dump_stack+0x172/0x1c0 lib/dump_stack.c:52 kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927 __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469 skb_set_hash_from_sk ./include/net/sock.h:2011 tcp_transmit_skb+0xf77/0x3ec0 net/ipv4/tcp_output.c:983 tcp_send_ack+0x75b/0x830 net/ipv4/tcp_output.c:3493 tcp_delack_timer_handler+0x9a6/0xb90 net/ipv4/tcp_timer.c:284 tcp_delack_timer+0x1b0/0x310 net/ipv4/tcp_timer.c:309 call_timer_fn+0x240/0x520 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 __run_timers+0xc13/0xf10 kernel/time/timer.c:1601 run_timer_softirq+0x36/0xa0 kernel/time/timer.c:1614 __do_softirq+0x485/0x942 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 irq_exit+0x1fa/0x230 kernel/softirq.c:405 exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:657 smp_apic_timer_interrupt+0x5a/0x80 arch/x86/kernel/apic/apic.c:966 apic_timer_interrupt+0x86/0x90 arch/x86/entry/entry_64.S:489 RIP: 0010:native_restore_fl ./arch/x86/include/asm/irqflags.h:36 RIP: 0010:arch_local_irq_restore ./arch/x86/include/asm/irqflags.h:77 RIP: 0010:__msan_poison_alloca+0xed/0x120 mm/kmsan/kmsan_instr.c:440 RSP: 0018:ffff880024917cd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 RAX: 0000000000000246 RBX: ffff8800224c0000 RCX: 0000000000000005 RDX: 0000000000000004 RSI: ffff880000000000 RDI: ffffea0000b6d770 RBP: ffff880024917d58 R08: 0000000000000dd8 R09: 0000000000000004 R10: 0000160000000000 R11: 0000000000000000 R12: ffffffff85abf810 R13: ffff880024917dd8 R14: 0000000000000010 R15: ffffffff81cabde4 </IRQ> poll_select_copy_remaining+0xac/0x6b0 fs/select.c:293 SYSC_select+0x4b4/0x4e0 fs/select.c:653 SyS_select+0x76/0xa0 fs/select.c:634 entry_SYSCALL_64_fastpath+0x13/0x94 arch/x86/entry/entry_64.S:204 RIP: 0033:0x4597e7 RSP: 002b:000000c420037ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000017 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004597e7 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 000000c420037ef0 R08: 000000c420037ee0 R09: 0000000000000059 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000042dc20 R13: 00000000000000f3 R14: 0000000000000030 R15: 0000000000000003 chained origin: save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 kmsan_save_stack mm/kmsan/kmsan.c:317 kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547 __msan_store_shadow_origin_4+0xac/0x110 mm/kmsan/kmsan_instr.c:259 tcp_create_openreq_child+0x709/0x1ae0 net/ipv4/tcp_minisocks.c:472 tcp_v6_syn_recv_sock+0x7eb/0x2a30 net/ipv6/tcp_ipv6.c:1103 tcp_get_cookie_sock+0x136/0x5f0 net/ipv4/syncookies.c:212 cookie_v6_check+0x17a9/0x1b50 net/ipv6/syncookies.c:245 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989 tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298 tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487 ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279 NF_HOOK ./include/linux/netfilter.h:257 ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322 dst_input ./include/net/dst.h:492 ip6_rcv_finish net/ipv6/ip6_input.c:69 NF_HOOK ./include/linux/netfilter.h:257 ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203 __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208 __netif_receive_skb net/core/dev.c:4246 process_backlog+0x667/0xba0 net/core/dev.c:4866 napi_poll net/core/dev.c:5268 net_rx_action+0xc95/0x1590 net/core/dev.c:5333 __do_softirq+0x485/0x942 kernel/softirq.c:284 origin: save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198 kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337 kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766 reqsk_alloc ./include/net/request_sock.h:87 inet_reqsk_alloc+0xa4/0x5b0 net/ipv4/tcp_input.c:6200 cookie_v6_check+0x4f4/0x1b50 net/ipv6/syncookies.c:169 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989 tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298 tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487 ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279 NF_HOOK ./include/linux/netfilter.h:257 ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322 dst_input ./include/net/dst.h:492 ip6_rcv_finish net/ipv6/ip6_input.c:69 NF_HOOK ./include/linux/netfilter.h:257 ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203 __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208 __netif_receive_skb net/core/dev.c:4246 process_backlog+0x667/0xba0 net/core/dev.c:4866 napi_poll net/core/dev.c:5268 net_rx_action+0xc95/0x1590 net/core/dev.c:5333 __do_softirq+0x485/0x942 kernel/softirq.c:284 ================================================================== Similar error is reported for cookie_v4_check(). Fixes: 58d607d3e52f ("tcp: provide skb->hash to synack packets") Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17netfilter: expect: fix crash when putting uninited expectationFlorian Westphal1-1/+1
We crash in __nf_ct_expect_check, it calls nf_ct_remove_expect on the uninitialised expectation instead of existing one, so del_timer chokes on random memory address. Fixes: ec0e3f01114ad32711243 ("netfilter: nf_ct_expect: Add nf_ct_remove_expect()") Reported-by: Sergey Kvachonok <ravenexp@gmail.com> Tested-by: Sergey Kvachonok <ravenexp@gmail.com> Cc: Gao Feng <fgao@ikuai8.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17netfilter: nf_tables: only allow in/output for arp packetsFlorian Westphal1-2/+1
arp packets cannot be forwarded. They can be bridged, but then they can be filtered using either ebtables or nftables bridge family. The bridge netfilter exposes a "call-arptables" switch which pushes packets into arptables, but lets not expose this for nftables, so better close this asap. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17netfilter: nat: fix src map lookupFlorian Westphal1-8/+9
When doing initial conversion to rhashtable I replaced the bucket walk with a single rhashtable_lookup_fast(). When moving to rhlist I failed to properly walk the list of identical tuples, but that is what is needed for this to work correctly. The table contains the original tuples, so the reply tuples are all distinct. We currently decide that mapping is (not) in range only based on the first entry, but in case its not we need to try the reply tuple of the next entry until we either find an in-range mapping or we checked all the entries. This bug makes nat core attempt collision resolution while it might be able to use the mapping as-is. Fixes: 870190a9ec90 ("netfilter: nat: convert nat bysrc hash to rhashtable") Reported-by: Jaco Kroon <jaco@uls.co.za> Tested-by: Jaco Kroon <jaco@uls.co.za> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17netfilter: remove old pre-netns era hook apiFlorian Westphal1-143/+0
no more users in the tree, remove this. The old api is racy wrt. module removal, all users have been converted to the netns-aware api. The old api pretended we still have global hooks but that has not been true for a long time. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcvMateusz Jurczyk1-3/+3
Verify that the length of the socket buffer is sufficient to cover the nlmsghdr structure before accessing the nlh->nlmsg_len field for further input sanitization. If the client only supplies 1-3 bytes of data in sk_buff, then nlh->nlmsg_len remains partially uninitialized and contains leftover memory from the corresponding kernel allocation. Operating on such data may result in indeterminate evaluation of the nlmsg_len < NLMSG_HDRLEN expression. The bug was discovered by a runtime instrumentation designed to detect use of uninitialized memory in the kernel. The patch prevents this and other similar tools (e.g. KMSAN) from flagging this behavior in the future. Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-15tcp_bbr: init pacing rate on first RTT sampleNeal Cardwell1-1/+9
Fixes the following behavior: for connections that had no RTT sample at the time of initializing congestion control, BBR was initializing the pacing rate to a high nominal rate (based an a guess of RTT=1ms, in case this is LAN traffic). Then BBR never adjusted the pacing rate downward upon obtaining an actual RTT sample, if the connection never filled the pipe (e.g. all sends were small app-limited writes()). This fix adjusts the pacing rate upon obtaining the first RTT sample. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15tcp_bbr: remove sk_pacing_rate=0 transient during initNeal Cardwell1-1/+0
Fix a corner case noticed by Eric Dumazet, where BBR's setting sk->sk_pacing_rate to 0 during initialization could theoretically cause packets in the sending host to hang if there were packets "in flight" in the pacing infrastructure at the time the BBR congestion control state is initialized. This could occur if the pacing infrastructure happened to race with bbr_init() in a way such that the pacer read the 0 rather than the immediately following non-zero pacing rate. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15tcp_bbr: introduce bbr_init_pacing_rate_from_rtt() helperNeal Cardwell1-5/+18
Introduce a helper to initialize the BBR pacing rate unconditionally, based on the current cwnd and RTT estimate. This is a pure refactor, but is needed for two following fixes. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15tcp_bbr: introduce bbr_bw_to_pacing_rate() helperNeal Cardwell1-3/+11
Introduce a helper to convert a BBR bandwidth and gain factor to a pacing rate in bytes per second. This is a pure refactor, but is needed for two following fixes. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15tcp_bbr: cut pacing rate only if filled pipeNeal Cardwell1-2/+1
In bbr_set_pacing_rate(), which decides whether to cut the pacing rate, there was some code that considered exiting STARTUP to be equivalent to the notion of filling the pipe (i.e., bbr_full_bw_reached()). Specifically, as the code was structured, exiting STARTUP and going into PROBE_RTT could cause us to cut the pacing rate down to something silly and low, based on whatever bandwidth samples we've had so far, when it's possible that all of them have been small app-limited bandwidth samples that are not representative of the bandwidth available in the path. (The code was correct at the time it was written, but the state machine changed without this spot being adjusted correspondingly.) Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15openvswitch: Fix for force/commit action failuresGreg Rose1-15/+36
When there is an established connection in direction A->B, it is possible to receive a packet on port B which then executes ct(commit,force) without first performing ct() - ie, a lookup. In this case, we would expect that this packet can delete the existing entry so that we can commit a connection with direction B->A. However, currently we only perform a check in skb_nfct_cached() for whether OVS_CS_F_TRACKED is set and OVS_CS_F_INVALID is not set, ie that a lookup previously occurred. In the above scenario, a lookup has not occurred but we should still be able to statelessly look up the existing entry and potentially delete the entry if it is in the opposite direction. This patch extends the check to also hint that if the action has the force flag set, then we will lookup the existing entry so that the force check at the end of skb_nfct_cached has the ability to delete the connection. Fixes: dd41d330b03 ("openvswitch: Add force commit.") CC: Pravin Shelar <pshelar@nicira.com> CC: dev@openvswitch.org Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15ipv4: ip_do_fragment: fix headroom testsVasily Averin1-4/+4
Some time ago David Woodhouse reported skb_under_panic when we try to push ethernet header to fragmented ipv6 skbs. It was fixed for ipv6 by Florian Westphal in commit 1d325d217c7f ("ipv6: ip6_fragment: fix headroom tests and skb leak") However similar problem still exist in ipv4. It does not trigger skb_under_panic due paranoid check in ip_finish_output2, however according to Alexey Kuznetsov current state is abnormal and ip_fragment should be fixed too. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14sctp: fix an array overflow when all ext chunks are setXin Long1-2/+2
Marcelo noticed an array overflow caused by commit c28445c3cb07 ("sctp: add reconf_enable in asoc ep and netns"), in which sctp would add SCTP_CID_RECONF into extensions when reconf_enable is set in sctp_make_init and sctp_make_init_ack. Then now when all ext chunks are set, 4 ext chunk ids can be put into extensions array while extensions array size is 3. It would cause a kernel panic because of this overflow. This patch is to fix it by defining extensions array size is 4 in both sctp_make_init and sctp_make_init_ack. Fixes: c28445c3cb07 ("sctp: add reconf_enable in asoc ep and netns") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14net sched actions: rename act_get_notify() to tcf_get_notify()Roman Mashak1-2/+2
Make name consistent with other TC event notification routines, such as tcf_add_notify() and tcf_del_notify() Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14net/packet: Fix Tx queue selection for AF_PACKETIván Briano1-4/+2
When PACKET_QDISC_BYPASS is not used, Tx queue selection will be done before the packet is enqueued, taking into account any mappings set by a queuing discipline such as mqprio without hardware offloading. This selection may be affected by a previously saved queue_mapping, either on the Rx path, or done before the packet reaches the device, as it's currently the case for AF_PACKET. In order for queue selection to work as expected when using traffic control, there can't be another selection done before that point is reached, so move the call to packet_pick_tx_queue to packet_direct_xmit, leaving the default xmit path as it was before PACKET_QDISC_BYPASS was introduced. A forward declaration of packet_pick_tx_queue() is introduced to avoid the need to reorder the functions within the file. Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: Iván Briano <ivan.briano@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14net: bridge: fix dest lookup when vlan proto doesn't matchNikolay Aleksandrov2-2/+4
With 802.1ad support the vlan_ingress code started checking for vlan protocol mismatch which causes the current tag to be inserted and the bridge vlan protocol & pvid to be set. The vlan tag insertion changes the skb mac_header and thus the lookup mac dest pointer which was loaded prior to calling br_allowed_ingress in br_handle_frame_finish is VLAN_HLEN bytes off now, pointing to the last two bytes of the destination mac and the first four of the source mac causing lookups to always fail and broadcasting all such packets to all ports. Same thing happens for locally originated packets when passing via br_dev_xmit. So load the dest pointer after the vlan checks and possible skb change. Fixes: 8580e2117c06 ("bridge: Prepare for 802.1ad vlan filtering support") Reported-by: Anitha Narasimha Murthy <anitha@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14netpoll: shut up a kernel warning on refcountWANG Cong1-1/+1
When we convert atomic_t to refcount_t, a new kernel warning on "increment on 0" is introduced in the netpoll code, zap_completion_queue(). In fact for this special case, we know the refcount is 0 and we just have to set it to 1 to satisfy the following dev_kfree_skb_any(), so we can just use refcount_set(..., 1) instead. Fixes: 633547973ffc ("net: convert sk_buff.users from atomic_t to refcount_t") Reported-by: Dave Jones <davej@codemonkey.org.uk> Cc: Reshetova, Elena <elena.reshetova@intel.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-13net: set fib rule refcount after mallocDavid Ahern1-2/+1
The configure callback of fib_rules_ops can change the refcnt of a fib rule. For instance, mlxsw takes a refcnt when adding the processing of the rule to a work queue. Thus the rule refcnt can not be reset to to 1 afterwards. Move the refcnt setting to after the allocation. Fixes: 5361e209dd30 ("net: avoid one splat in fib_nl_delrule()") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-13dccp: make const array error_code staticColin Ian King1-1/+1
Don't populate array error_code on the stack but make it static. Makes the object code smaller by almost 250 bytes: Before: text data bss dec hex filename 10366 983 0 11349 2c55 net/dccp/input.o After: text data bss dec hex filename 10161 1039 0 11200 2bc0 net/dccp/input.o Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-13bpf: fix return in bpf_skb_adjust_netKefeng Wang1-1/+1
The bpf_skb_adjust_net() ignores the return value of bpf_skb_net_shrink/grow, and always return 0, fix it by return 'ret'. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-6/+7
Pull networking fixes from David Miller: 1) Fix 64-bit division in mlx5 IPSEC offload support, from Ilan Tayari and Arnd Bergmann. 2) Fix race in statistics gathering in bnxt_en driver, from Michael Chan. 3) Can't use a mutex in RCU reader protected section on tap driver, from Cong WANG. 4) Fix mdb leak in bridging code, from Eduardo Valentin. 5) Fix free of wrong pointer variable in nfp driver, from Dan Carpenter. 6) Buffer overflow in brcmfmac driver, from Arend van SPriel. 7) ioremap_nocache() return value needs to be checked in smsc911x driver, from Alexey Khoroshilov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits) net: stmmac: revert "support future possible different internal phy mode" sfc: don't read beyond unicast address list datagram: fix kernel-doc comments socket: add documentation for missing elements smsc911x: Add check for ioremap_nocache() return code brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx() net: hns: Bugfix for Tx timeout handling in hns driver net: ipmr: ipmr_get_table() returns NULL nfp: freeing the wrong variable mlxsw: spectrum_switchdev: Check status of memory allocation mlxsw: spectrum_switchdev: Remove unused variable mlxsw: spectrum_router: Fix use-after-free in route replace mlxsw: spectrum_router: Add missing rollback samples/bpf: fix a build issue bridge: mdb: fix leak on complete_info ptr on fail path tap: convert a mutex to a spinlock cxgb4: fix BUG() on interrupt deallocating path of ULD qed: Fix printk option passed when printing ipv6 addresses net: Fix minor code bug in timestamping.txt net: stmmac: Make 'alloc_dma_[rt]x_desc_resources()' look even closer ...
2017-07-12datagram: fix kernel-doc commentsstephen hemminger1-3/+3
An underscore in the kernel-doc comment section has special meaning and mis-use generates an errors. ./net/core/datagram.c:207: ERROR: Unknown target name: "msg". ./net/core/datagram.c:379: ERROR: Unknown target name: "msg". ./net/core/datagram.c:816: ERROR: Unknown target name: "t". Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-12net: ipmr: ipmr_get_table() returns NULLDan Carpenter1-2/+2
The ipmr_get_table() function doesn't return error pointers it returns NULL on error. Fixes: 4f75ba6982bc ("net: ipmr: Add ipmr_rtm_getroute") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-11bridge: mdb: fix leak on complete_info ptr on fail pathEduardo Valentin1-1/+2
We currently get the following kmemleak report: unreferenced object 0xffff8800039d9820 (size 32): comm "softirq", pid 0, jiffies 4295212383 (age 792.416s) hex dump (first 32 bytes): 00 0c e0 03 00 88 ff ff ff 02 00 00 00 00 00 00 ................ 00 00 00 01 ff 11 00 02 86 dd 00 00 ff ff ff ff ................ backtrace: [<ffffffff8152b4aa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff811d8ec8>] kmem_cache_alloc_trace+0xb8/0x1c0 [<ffffffffa0389683>] __br_mdb_notify+0x2a3/0x300 [bridge] [<ffffffffa038a0ce>] br_mdb_notify+0x6e/0x70 [bridge] [<ffffffffa0386479>] br_multicast_add_group+0x109/0x150 [bridge] [<ffffffffa0386518>] br_ip6_multicast_add_group+0x58/0x60 [bridge] [<ffffffffa0387fb5>] br_multicast_rcv+0x1d5/0xdb0 [bridge] [<ffffffffa037d7cf>] br_handle_frame_finish+0xcf/0x510 [bridge] [<ffffffffa03a236b>] br_nf_hook_thresh.part.27+0xb/0x10 [br_netfilter] [<ffffffffa03a3738>] br_nf_hook_thresh+0x48/0xb0 [br_netfilter] [<ffffffffa03a3fb9>] br_nf_pre_routing_finish_ipv6+0x109/0x1d0 [br_netfilter] [<ffffffffa03a4400>] br_nf_pre_routing_ipv6+0xd0/0x14c [br_netfilter] [<ffffffffa03a3c27>] br_nf_pre_routing+0x197/0x3d0 [br_netfilter] [<ffffffff814a2952>] nf_iterate+0x52/0x60 [<ffffffff814a29bc>] nf_hook_slow+0x5c/0xb0 [<ffffffffa037ddf4>] br_handle_frame+0x1a4/0x2c0 [bridge] This happens when switchdev_port_obj_add() fails. This patch frees complete_info object in the fail path. Reviewed-by: Vallish Vaidyeshwara <vallish@amazon.com> Signed-off-by: Eduardo Valentin <eduval@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-11Merge tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds8-303/+1657
Pull ceph updates from Ilya Dryomov: "The main item here is support for v12.y.z ("Luminous") clusters: RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS feature bits, and various other changes in the RADOS client protocol. On top of that we have a new fsc mount option to allow supplying fscache uniquifier (similar to NFS) and the usual pile of filesystem fixes from Zheng" * tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits) libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS libceph: osd_state is 32 bits wide in luminous crush: remove an obsolete comment crush: crush_init_workspace starts with struct crush_work libceph, crush: per-pool crush_choose_arg_map for crush_do_rule() crush: implement weight and id overrides for straw2 libceph: apply_upmap() libceph: compute actual pgid in ceph_pg_to_up_acting_osds() libceph: pg_upmap[_items] infrastructure libceph: ceph_decode_skip_* helpers libceph: kill __{insert,lookup,remove}_pg_mapping() libceph: introduce and switch to decode_pg_mapping() libceph: don't pass pgid by value libceph: respect RADOS_BACKOFF backoffs libceph: make DEFINE_RB_* helpers more general libceph: avoid unnecessary pi lookups in calc_target() libceph: use target pi for calc_target() calculations libceph: always populate t->target_{oid,oloc} in calc_target() libceph: make sure need_resend targets reflect latest map libceph: delete from need_resend_linger before check_linger_pool_dne() ...
2017-07-08mpls: fix uninitialized in_label var warning in mpls_getrouteRoopa Prabhu1-4/+8
Fix the below warning generated by static checker: net/mpls/af_mpls.c:2111 mpls_getroute() error: uninitialized symbol 'in_label'." Fixes: 397fc9e5cefe ("mpls: route get support") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08bonding: avoid NETDEV_CHANGEMTU event when unregistering slaveWANG Cong1-1/+2
As Hongjun/Nicolas summarized in their original patch: " When a device changes from one netns to another, it's first unregistered, then the netns reference is updated and the dev is registered in the new netns. Thus, when a slave moves to another netns, it is first unregistered. This triggers a NETDEV_UNREGISTER event which is caught by the bonding driver. The driver calls bond_release(), which calls dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in the old netns). " This is a very special case, because the device is being unregistered no one should still care about the NETDEV_CHANGEMTU event triggered at this point, we can avoid broadcasting this event on this path, and avoid touching inetdev_event()/addrconf_notify() path. It requires to export __dev_set_mtu() to bonding driver. Reported-by: Hongjun Li <hongjun.li@6wind.com> Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-08rds: tcp: use sock_create_lite() to create the accept socketSowmini Varadhan1-1/+1
There are two problems with calling sock_create_kern() from rds_tcp_accept_one() 1. it sets up a new_sock->sk that is wasteful, because this ->sk is going to get replaced by inet_accept() in the subsequent ->accept() 2. The new_sock->sk is a leaked reference in sock_graft() which expects to find a null parent->sk Avoid these problems by calling sock_create_lite(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-07libceph: osd_state is 32 bits wide in luminousIlya Dryomov2-10/+21
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07crush: remove an obsolete commentIlya Dryomov1-5/+0
Reflects ceph.git commit dca1ae1e0a6b02029c3a7f9dec4114972be26d50. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07crush: crush_init_workspace starts with struct crush_workIlya Dryomov1-1/+1
It is not just a pointer to crush_work, it is the whole structure. That is not a problem since it only contains a pointer. But it will be a problem if new data members are added to crush_work. Reflects ceph.git commit ee957dd431bfbeb6dadaf77764db8e0757417328. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()Ilya Dryomov2-3/+200
If there is no crush_choose_arg_map for a given pool, a NULL pointer is passed to preserve existing crush_do_rule() behavior. Reflects ceph.git commits 55fb91d64071552ea1bc65ab4ea84d3c8b73ab4b, dbe36e08be00c6519a8c89718dd47b0219c20516. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07crush: implement weight and id overrides for straw2Ilya Dryomov2-19/+57
bucket_straw2_choose needs to use weights that may be different from weight_items. For instance to compensate for an uneven distribution caused by a low number of values. Or to fix the probability biais introduced by conditional probabilities (see http://tracker.ceph.com/issues/15653 for more information). We introduce a weight_set for each straw2 bucket to set the desired weight for a given item at a given position. The weight of a given item when picking the first replica (first position) may be different from the weight the second replica (second position). For instance the weight matrix for a given bucket containing items 3, 7 and 13 could be as follows: position 0 position 1 item 3 0x10000 0x100000 item 7 0x40000 0x10000 item 13 0x40000 0x10000 When crush_do_rule picks the first of two replicas (position 0), item 7, 3 are four times more likely to be choosen by bucket_straw2_choose than item 13. When choosing the second replica (position 1), item 3 is ten times more likely to be choosen than item 7, 13. By default the weight_set of each bucket exactly matches the content of item_weights for each position to ensure backward compatibility. bucket_straw2_choose compares items by using their id. The same ids are also used to index buckets and they must be unique. For each item in a bucket an array of ids can be provided for placement purposes and they are used instead of the ids. If no replacement ids are provided, the legacy behavior is preserved. Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: apply_upmap()Ilya Dryomov1-2/+95
Previously, pg_to_raw_osds() didn't filter for existent OSDs because raw_to_up_osds() would filter for "up" ("up" is predicated on "exists") and raw_to_up_osds() was called directly after pg_to_raw_osds(). Now, with apply_upmap() call in there, nonexistent OSDs in pg_to_raw_osds() output can affect apply_upmap(). Introduce remove_nonexistent_osds() to deal with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: compute actual pgid in ceph_pg_to_up_acting_osds()Ilya Dryomov1-6/+6
Move raw_pg_to_pg() call out of get_temp_osds() and into ceph_pg_to_up_acting_osds(), for upcoming apply_upmap(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: pg_upmap[_items] infrastructureIlya Dryomov2-3/+155
pg_temp and pg_upmap encodings are the same (PG -> array of osds), except for the incremental remove: it's an empty mapping in new_pg_temp for pg_temp and a separate old_pg_upmap set for pg_upmap. (This isn't to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't looked at as an example for pg_upmap encoding.) Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap. __decode_pg_temp() stores into pg_temp union member, but since pg_upmap union member is identical, reading through pg_upmap later is OK. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: ceph_decode_skip_* helpersIlya Dryomov1-22/+3
Some of these won't be as efficient as they could be (e.g. ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32) once instead of advancing by sizeof(u32) len times), but that's fine and not worth a bunch of extra macro code. Replace skip_name_map() with ceph_decode_skip_map as an example. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: kill __{insert,lookup,remove}_pg_mapping()Ilya Dryomov1-72/+15
Switch to DEFINE_RB_FUNCS2-generated {insert,lookup,erase}_pg_mapping(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: introduce and switch to decode_pg_mapping()Ilya Dryomov1-67/+83
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: don't pass pgid by valueIlya Dryomov1-10/+10
Make __{lookup,remove}_pg_mapping() look like their ceph_spg_mapping counterparts: take const struct ceph_pg *. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: respect RADOS_BACKOFF backoffsIlya Dryomov4-0/+684
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07libceph: avoid unnecessary pi lookups in calc_target()Ilya Dryomov2-28/+34
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>