aboutsummaryrefslogtreecommitdiffstatshomepage
AgeCommit message (Collapse)AuthorFilesLines
2020-10-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski18-37/+76
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Incorrect netlink report logic in flowtable and genID. 2) Add a selftest to check that wireguard passes the right sk to ip_route_me_harder, from Jason A. Donenfeld. 3) Pass the actual sk to ip_route_me_harder(), also from Jason. 4) Missing expression validation of updates via nft --check. 5) Update byte and packet counters regardless of whether they match, from Stefano Brivio. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31ip_tunnel: fix over-mtu packet send fail without TUNNEL_DONT_FRAGMENT flagswenxu1-3/+0
The tunnel device such as vxlan, bareudp and geneve in the lwt mode set the outer df only based TUNNEL_DONT_FRAGMENT. And this was also the behavior for gre device before switching to use ip_md_tunnel_xmit in commit 962924fa2b7a ("ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit") When the ip_gre in lwt mode xmit with ip_md_tunnel_xmi changed the rule and make the discrepancy between handling of DF by different tunnels. So in the ip_md_tunnel_xmit should follow the same rule like other tunnels. Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") Signed-off-by: wenxu <wenxu@ucloud.cn> Link: https://lore.kernel.org/r/1604028728-31100-1-git-send-email-wenxu@ucloud.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31cadence: force nonlinear buffers to be clonedMark Deneen1-1/+2
In my test setup, I had a SAMA5D27 device configured with ip forwarding, and second device with usb ethernet (r8152) sending ICMP packets.  If the packet was larger than about 220 bytes, the SAMA5 device would "oops" with the following trace: kernel BUG at net/core/skbuff.c:1863! Internal error: Oops - BUG: 0 [#1] ARM Modules linked in: xt_MASQUERADE ppp_async ppp_generic slhc iptable_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 can_raw can bridge stp llc ipt_REJECT nf_reject_ipv4 sd_mod cdc_ether usbnet usb_storage r8152 scsi_mod mii o ption usb_wwan usbserial micrel macb at91_sama5d2_adc phylink gpio_sama5d2_piobu m_can_platform m_can industrialio_triggered_buffer kfifo_buf of_mdio can_dev fixed_phy sdhci_of_at91 sdhci_pltfm libphy sdhci mmc_core ohci_at91 ehci_atmel o hci_hcd iio_rescale industrialio sch_fq_codel spidev prox2_hal(O) CPU: 0 PID: 0 Comm: swapper Tainted: G           O      5.9.1-prox2+ #1 Hardware name: Atmel SAMA5 PC is at skb_put+0x3c/0x50 LR is at macb_start_xmit+0x134/0xad0 [macb] pc : [<c05258cc>]    lr : [<bf0ea5b8>]    psr: 20070113 sp : c0d01a60  ip : c07232c0  fp : c4250000 r10: c0d03cc8  r9 : 00000000  r8 : c0d038c0 r7 : 00000000  r6 : 00000008  r5 : c59b66c0  r4 : 0000002a r3 : 8f659eff  r2 : c59e9eea  r1 : 00000001  r0 : c59b66c0 Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none Control: 10c53c7d  Table: 2640c059  DAC: 00000051 Process swapper (pid: 0, stack limit = 0x75002d81) <snipped stack> [<c05258cc>] (skb_put) from [<bf0ea5b8>] (macb_start_xmit+0x134/0xad0 [macb]) [<bf0ea5b8>] (macb_start_xmit [macb]) from [<c053e504>] (dev_hard_start_xmit+0x90/0x11c) [<c053e504>] (dev_hard_start_xmit) from [<c0571180>] (sch_direct_xmit+0x124/0x260) [<c0571180>] (sch_direct_xmit) from [<c053eae4>] (__dev_queue_xmit+0x4b0/0x6d0) [<c053eae4>] (__dev_queue_xmit) from [<c05a5650>] (ip_finish_output2+0x350/0x580) [<c05a5650>] (ip_finish_output2) from [<c05a7e24>] (ip_output+0xb4/0x13c) [<c05a7e24>] (ip_output) from [<c05a39d0>] (ip_forward+0x474/0x500) [<c05a39d0>] (ip_forward) from [<c05a13d8>] (ip_sublist_rcv_finish+0x3c/0x50) [<c05a13d8>] (ip_sublist_rcv_finish) from [<c05a19b8>] (ip_sublist_rcv+0x11c/0x188) [<c05a19b8>] (ip_sublist_rcv) from [<c05a2494>] (ip_list_rcv+0xf8/0x124) [<c05a2494>] (ip_list_rcv) from [<c05403c4>] (__netif_receive_skb_list_core+0x1a0/0x20c) [<c05403c4>] (__netif_receive_skb_list_core) from [<c05405c4>] (netif_receive_skb_list_internal+0x194/0x230) [<c05405c4>] (netif_receive_skb_list_internal) from [<c0540684>] (gro_normal_list.part.0+0x14/0x28) [<c0540684>] (gro_normal_list.part.0) from [<c0541280>] (napi_complete_done+0x16c/0x210) [<c0541280>] (napi_complete_done) from [<bf14c1c0>] (r8152_poll+0x684/0x708 [r8152]) [<bf14c1c0>] (r8152_poll [r8152]) from [<c0541424>] (net_rx_action+0x100/0x328) [<c0541424>] (net_rx_action) from [<c01012ec>] (__do_softirq+0xec/0x274) [<c01012ec>] (__do_softirq) from [<c012d6d4>] (irq_exit+0xcc/0xd0) [<c012d6d4>] (irq_exit) from [<c0160960>] (__handle_domain_irq+0x58/0xa4) [<c0160960>] (__handle_domain_irq) from [<c0100b0c>] (__irq_svc+0x6c/0x90) Exception stack(0xc0d01ef0 to 0xc0d01f38) 1ee0:                                     00000000 0000003d 0c31f383 c0d0fa00 1f00: c0d2eb80 00000000 c0d2e630 4dad8c49 4da967b0 0000003d 0000003d 00000000 1f20: fffffff5 c0d01f40 c04e0f88 c04e0f8c 30070013 ffffffff [<c0100b0c>] (__irq_svc) from [<c04e0f8c>] (cpuidle_enter_state+0x7c/0x378) [<c04e0f8c>] (cpuidle_enter_state) from [<c04e12c4>] (cpuidle_enter+0x28/0x38) [<c04e12c4>] (cpuidle_enter) from [<c014f710>] (do_idle+0x194/0x214) [<c014f710>] (do_idle) from [<c014fa50>] (cpu_startup_entry+0xc/0x14) [<c014fa50>] (cpu_startup_entry) from [<c0a00dc8>] (start_kernel+0x46c/0x4a0) Code: e580c054 8a000002 e1a00002 e8bd8070 (e7f001f2) ---[ end trace 146c8a334115490c ]--- The solution was to force nonlinear buffers to be cloned.  This was previously reported by Klaus Doth (https://www.spinics.net/lists/netdev/msg556937.html) but never formally submitted as a patch. This is the third revision, hopefully the formatting is correct this time! Suggested-by: Klaus Doth <krnl@doth.eu> Fixes: 653e92a9175e ("net: macb: add support for padding and fcs computation") Signed-off-by: Mark Deneen <mdeneen@saucontech.com> Link: https://lore.kernel.org/r/20201030155814.622831-1-mdeneen@saucontech.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31Merge branch 'ipv6-reply-icmp-error-if-fragment-doesn-t-contain-all-headers'Jakub Kicinski3-2/+40
Hangbin Liu says: ==================== IPv6: reply ICMP error if fragment doesn't contain all headers When our Engineer run latest IPv6 Core Conformance test, test v6LC.1.3.6: First Fragment Doesn’t Contain All Headers[1] failed. The test purpose is to verify that the node (Linux for example) should properly process IPv6 packets that don’t include all the headers through the Upper-Layer header. Based on RFC 8200, Section 4.5 Fragment Header - If the first fragment does not include all headers through an Upper-Layer header, then that fragment should be discarded and an ICMP Parameter Problem, Code 3, message should be sent to the source of the fragment, with the Pointer field set to zero. The first patch add a definition for ICMPv6 Parameter Problem, code 3. The second patch add a check for the 1st fragment packet to make sure Upper-Layer header exist. [1] Page 68, v6LC.1.3.6: First Fragment Doesn’t Contain All Headers part A, B, C and D at https://ipv6ready.org/docs/Core_Conformance_5_0_0.pdf [2] My reproducer: import sys, os from scapy.all import * def send_frag_dst_opt(src_ip6, dst_ip6): ip6 = IPv6(src = src_ip6, dst = dst_ip6, nh = 44) frag_1 = IPv6ExtHdrFragment(nh = 60, m = 1) dst_opt = IPv6ExtHdrDestOpt(nh = 58) frag_2 = IPv6ExtHdrFragment(nh = 58, offset = 4, m = 1) icmp_echo = ICMPv6EchoRequest(seq = 1) pkt_1 = ip6/frag_1/dst_opt pkt_2 = ip6/frag_2/icmp_echo send(pkt_1) send(pkt_2) def send_frag_route_opt(src_ip6, dst_ip6): ip6 = IPv6(src = src_ip6, dst = dst_ip6, nh = 44) frag_1 = IPv6ExtHdrFragment(nh = 43, m = 1) route_opt = IPv6ExtHdrRouting(nh = 58) frag_2 = IPv6ExtHdrFragment(nh = 58, offset = 4, m = 1) icmp_echo = ICMPv6EchoRequest(seq = 2) pkt_1 = ip6/frag_1/route_opt pkt_2 = ip6/frag_2/icmp_echo send(pkt_1) send(pkt_2) if __name__ == '__main__': src = sys.argv[1] dst = sys.argv[2] conf.iface = sys.argv[3] send_frag_dst_opt(src, dst) send_frag_route_opt(src, dst) ==================== Link: https://lore.kernel.org/r/20201027123313.3717941-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31IPv6: reply ICMP error if the first fragment don't include all headersHangbin Liu2-2/+39
Based on RFC 8200, Section 4.5 Fragment Header: - If the first fragment does not include all headers through an Upper-Layer header, then that fragment should be discarded and an ICMP Parameter Problem, Code 3, message should be sent to the source of the fragment, with the Pointer field set to zero. Checking each packet header in IPv6 fast path will have performance impact, so I put the checking in ipv6_frag_rcv(). As the packet may be any kind of L4 protocol, I only checked some common protocols' header length and handle others by (offset + 1) > skb->len. Also use !(frag_off & htons(IP6_OFFSET)) to catch atomic fragments (fragmented packet with only one fragment). When send ICMP error message, if the 1st truncated fragment is ICMP message, icmp6_send() will break as is_ineligible() return true. So I added a check in is_ineligible() to let fragment packet with nexthdr ICMP but no ICMP header return false. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31ICMPv6: Add ICMPv6 Parameter Problem, code 3 definitionHangbin Liu1-0/+1
Based on RFC7112, Section 6: IANA has added the following "Type 4 - Parameter Problem" message to the "Internet Control Message Protocol version 6 (ICMPv6) Parameters" registry: CODE NAME/DESCRIPTION 3 IPv6 First Fragment has incomplete IPv6 Header Chain Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31net: atm: fix update of position index in lec_seq_nextColin Ian King1-3/+2
The position index in leq_seq_next is not updated when the next entry is fetched an no more entries are available. This causes seq_file to report the following error: "seq_file: buggy .next function lec_seq_next [lec] did not update position index" Fix this by always updating the position index. [ Note: this is an ancient 2002 bug, the sha is from the tglx/history repo ] Fixes 4aea2cbff417 ("[ATM]: Move lan seq_file ops to lec.c [1/3]") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20201027114925.21843-1-colin.king@canonical.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31netfilter: ipset: Update byte and packet counters regardless of whether they matchStefano Brivio1-1/+2
In ip_set_match_extensions(), for sets with counters, we take care of updating counters themselves by calling ip_set_update_counter(), and of checking if the given comparison and values match, by calling ip_set_match_counter() if needed. However, if a given comparison on counters doesn't match the configured values, that doesn't mean the set entry itself isn't matching. This fix restores the behaviour we had before commit 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching"), without reintroducing the issue fixed there: back then, mtype_data_match() first updated counters in any case, and then took care of matching on counters. Now, if the IPSET_FLAG_SKIP_COUNTER_UPDATE flag is set, ip_set_update_counter() will anyway skip counter updates if desired. The issue observed is illustrated by this reproducer: ipset create c hash:ip counters ipset add c 192.0.2.1 iptables -I INPUT -m set --match-set c src --bytes-gt 800 -j DROP if we now send packets from 192.0.2.1, bytes and packets counters for the entry as shown by 'ipset list' are always zero, and, no matter how many bytes we send, the rule will never match, because counters themselves are not updated. Reported-by: Mithil Mhatre <mmhatre@redhat.com> Fixes: 4750005a85f7 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30net: stmmac: Fix channel lock initializationMarek Szyprowski1-0/+1
Commit 0366f7e06a6b ("net: stmmac: add ethtool support for get/set channels") refactored channel initialization, but during that operation, the spinlock initialization got lost. Fix this. This fixes the following lockdep warning: meson8b-dwmac ff3f0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 331 Comm: kworker/1:2H Not tainted 5.9.0-rc3+ #1858 Hardware name: Hardkernel ODROID-N2 (DT) Workqueue: kblockd blk_mq_run_work_fn Call trace: dump_backtrace+0x0/0x1d0 show_stack+0x14/0x20 dump_stack+0xe8/0x154 register_lock_class+0x58c/0x590 __lock_acquire+0x7c/0x1790 lock_acquire+0xf4/0x440 _raw_spin_lock_irqsave+0x80/0xb0 stmmac_tx_timer+0x4c/0xb0 [stmmac] call_timer_fn+0xc4/0x3e8 run_timer_softirq+0x2b8/0x6c0 efi_header_end+0x114/0x5f8 irq_exit+0x104/0x110 __handle_domain_irq+0x60/0xb8 gic_handle_irq+0x58/0xb0 el1_irq+0xbc/0x180 _raw_spin_unlock_irqrestore+0x48/0x90 mmc_blk_rw_wait+0x70/0x160 mmc_blk_mq_issue_rq+0x510/0x830 mmc_mq_queue_rq+0x13c/0x278 blk_mq_dispatch_rq_list+0x2a0/0x698 __blk_mq_do_dispatch_sched+0x254/0x288 __blk_mq_sched_dispatch_requests+0x190/0x1d8 blk_mq_sched_dispatch_requests+0x34/0x70 __blk_mq_run_hw_queue+0xcc/0x148 blk_mq_run_work_fn+0x20/0x28 process_one_work+0x2a8/0x718 worker_thread+0x48/0x460 kthread+0x134/0x160 ret_from_fork+0x10/0x1c Fixes: 0366f7e06a6b ("net: stmmac: add ethtool support for get/set channels") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20201029185011.4749-1-m.szyprowski@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30stmmac: intel: Fix kernel panic on pci probeWong Vee Khee1-7/+7
The commit "stmmac: intel: Adding ref clock 1us tic for LPI cntr" introduced a regression which leads to the kernel panic duing loading of the dwmac_intel module. Move the code block after pci resources is obtained. Fixes: b4c5f83ae3f3 ("stmmac: intel: Adding ref clock 1us tic for LPI cntr") Cc: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Wong Vee Khee <vee.khee.wong@intel.com> Link: https://lore.kernel.org/r/20201029093228.1741-1-vee.khee.wong@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30gianfar: Account for Tx PTP timestamp in the skb headroomClaudiu Manoil1-1/+1
When PTP timestamping is enabled on Tx, the controller inserts the Tx timestamp at the beginning of the frame buffer, between SFD and the L2 frame header. This means that the skb provided by the stack is required to have enough headroom otherwise a new skb needs to be created by the driver to accommodate the timestamp inserted by h/w. Up until now the driver was relying on the second option, using skb_realloc_headroom() to create a new skb to accommodate PTP frames. Turns out that this method is not reliable, as reallocation of skbs for PTP frames along with the required overhead (skb_set_owner_w, consume_skb) is causing random crashes in subsequent skb_*() calls, when multiple concurrent TCP streams are run at the same time on the same device (as seen in James' report). Note that these crashes don't occur with a single TCP stream, nor with multiple concurrent UDP streams, but only when multiple TCP streams are run concurrently with the PTP packet flow (doing skb reallocation). This patch enforces the first method, by requesting enough headroom from the stack to accommodate PTP frames, and so avoiding skb_realloc_headroom() & co, and the crashes no longer occur. There's no reason not to set needed_headroom to a large enough value to accommodate PTP frames, so in this regard this patch is a fix. Reported-by: James Jurack <james.jurack@ametek.com> Fixes: bee9e58c9e98 ("gianfar:don't add FCB length to hard_header_len") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201020173605.1173-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30gianfar: Replace skb_realloc_headroom with skb_cow_head for PTPClaudiu Manoil1-10/+2
When PTP timestamping is enabled on Tx, the controller inserts the Tx timestamp at the beginning of the frame buffer, between SFD and the L2 frame header. This means that the skb provided by the stack is required to have enough headroom otherwise a new skb needs to be created by the driver to accommodate the timestamp inserted by h/w. Up until now the driver was relying on skb_realloc_headroom() to create new skbs to accommodate PTP frames. Turns out that this method is not reliable in this context at least, as skb_realloc_headroom() for PTP frames can cause random crashes, mostly in subsequent skb_*() calls, when multiple concurrent TCP streams are run at the same time with the PTP flow on the same device (as seen in James' report). I also noticed that when the system is loaded by sending multiple TCP streams, the driver receives cloned skbs in large numbers. skb_cow_head() instead proves to be stable in this scenario, and not only handles cloned skbs too but it's also more efficient and widely used in other drivers. The commit introducing skb_realloc_headroom in the driver goes back to 2009, commit 93c1285c5d92 ("gianfar: reallocate skb when headroom is not enough for fcb"). For practical purposes I'm referencing a newer commit (from 2012) that brings the code to its current structure (and fixes the PTP case). Fixes: 9c4886e5e63b ("gianfar: Fix invalid TX frames returned on error queue when time stamping") Reported-by: James Jurack <james.jurack@ametek.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20201029081057.8506-1-claudiu.manoil@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30net: fec: fix MDIO probing for some FEC hardware blocksGreg Ungerer2-13/+22
Some (apparently older) versions of the FEC hardware block do not like the MMFR register being cleared to avoid generation of MII events at initialization time. The action of clearing this register results in no future MII events being generated at all on the problem block. This means the probing of the MDIO bus will find no PHYs. Create a quirk that can be checked at the FECs MII init time so that the right thing is done. The quirk is set as appropriate for the FEC hardware blocks that are known to need this. Fixes: f166f890c8f0 ("net: ethernet: fec: Replace interrupt driven MDIO with polled IO") Signed-off-by: Greg Ungerer <gerg@linux-m68k.org> Acked-by: Fugang Duan <fugand.duan@nxp.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Clemens Gruber <clemens.gruber@pqgruber.com> Link: https://lore.kernel.org/r/20201028052232.1315167-1-gerg@linux-m68k.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30ip6_tunnel: set inner ipproto before ip6_tnl_encapAlexander Ovechkin1-2/+2
ip6_tnl_encap assigns to proto transport protocol which encapsulates inner packet, but we must pass to set_inner_ipproto protocol of that inner packet. Calling set_inner_ipproto after ip6_tnl_encap might break gso. For example, in case of encapsulating ipv6 packet in fou6 packet, inner_ipproto would be set to IPPROTO_UDP instead of IPPROTO_IPV6. This would lead to incorrect calling sequence of gso functions: ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> udp6_ufo_fragment instead of: ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> ip6ip6_gso_segment Fixes: 6c11fbf97e69 ("ip6_tunnel: add MPLS transmit support") Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru> Link: https://lore.kernel.org/r/20201029171012.20904-1-ovov@yandex-team.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30netfilter: nf_tables: missing validation from the abort pathPablo Neira Ayuso3-10/+36
If userspace does not include the trailing end of batch message, then nfnetlink aborts the transaction. This allows to check that ruleset updates trigger no errors. After this patch, invoking this command from the prerouting chain: # nft -c add rule x y fib saddr . oif type local fails since oif is not supported there. This patch fixes the lack of rule validation from the abort/check path to catch configuration errors such as the one above. Fixes: a654de8fdc18 ("netfilter: nf_tables: fix chain dependency validation") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30netfilter: use actual socket sk rather than skb sk when routing harderJason A. Donenfeld12-24/+26
If netfilter changes the packet mark when mangling, the packet is rerouted using the route_me_harder set of functions. Prior to this commit, there's one big difference between route_me_harder and the ordinary initial routing functions, described in the comment above __ip_queue_xmit(): /* Note: skb->sk can be different from sk, in case of tunnels */ int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl, That function goes on to correctly make use of sk->sk_bound_dev_if, rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a tunnel will receive a packet in ndo_start_xmit with an initial skb->sk. It will make some transformations to that packet, and then it will send the encapsulated packet out of a *new* socket. That new socket will basically always have a different sk_bound_dev_if (otherwise there'd be a routing loop). So for the purposes of routing the encapsulated packet, the routing information as it pertains to the socket should come from that socket's sk, rather than the packet's original skb->sk. For that reason __ip_queue_xmit() and related functions all do the right thing. One might argue that all tunnels should just call skb_orphan(skb) before transmitting the encapsulated packet into the new socket. But tunnels do *not* do this -- and this is wisely avoided in skb_scrub_packet() too -- because features like TSQ rely on skb->destructor() being called when that buffer space is truely available again. Calling skb_orphan(skb) too early would result in buffers filling up unnecessarily and accounting info being all wrong. Instead, additional routing must take into account the new sk, just as __ip_queue_xmit() notes. So, this commit addresses the problem by fishing the correct sk out of state->sk -- it's already set properly in the call to nf_hook() in __ip_local_out(), which receives the sk as part of its normal functionality. So we make sure to plumb state->sk through the various route_me_harder functions, and then make correct use of it following the example of __ip_queue_xmit(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30wireguard: selftests: check that route_me_harder packets use the right skJason A. Donenfeld2-0/+10
If netfilter changes the packet mark, the packet is rerouted. The ip_route_me_harder family of functions fails to use the right sk, opting to instead use skb->sk, resulting in a routing loop when used with tunnels. With the next change fixing this issue in netfilter, test for the relevant condition inside our test suite, since wireguard was where the bug was discovered. Reported-by: Chen Minqiang <ptpt52@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30netfilter: nftables: fix netlink report logic in flowtable and genidPablo Neira Ayuso1-2/+2
The netlink report should be sent regardless the available listeners. Fixes: 84d7fce69388 ("netfilter: nf_tables: export rule-set generation ID") Fixes: 3b49e2e94e6e ("netfilter: nf_tables: add flow table netlink frontend") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-29Merge tag 'fallthrough-fixes-clang-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linuxLinus Torvalds2-0/+4
Pull fallthrough fix from Gustavo A. R. Silva: "This fixes a ton of fall-through warnings when building with Clang 12.0.0 and -Wimplicit-fallthrough" * tag 'fallthrough-fixes-clang-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: include: jhash/signal: Fix fall-through warnings for Clang
2020-10-29Merge tag 'net-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds39-178/+261
Pull networking fixes from Jakub Kicinski: "Current release regressions: - r8169: fix forced threading conflicting with other shared interrupts; we tried to fix the use of raise_softirq_irqoff from an IRQ handler on RT by forcing hard irqs, but this driver shares legacy PCI IRQs so drop the _irqoff() instead - tipc: fix memory leak caused by a recent syzbot report fix to tipc_buf_append() Current release - bugs in new features: - devlink: Unlock on error in dumpit() and fix some error codes - net/smc: fix null pointer dereference in smc_listen_decline() Previous release - regressions: - tcp: Prevent low rmem stalls with SO_RCVLOWAT. - net: protect tcf_block_unbind with block lock - ibmveth: Fix use of ibmveth in a bridge; the self-imposed filtering to only send legal frames to the hypervisor was too strict - net: hns3: Clear the CMDQ registers before unmapping BAR region; incorrect cleanup order was leading to a crash - bnxt_en - handful of fixes to fixes: - Send HWRM_FUNC_RESET fw command unconditionally, even if there are PCIe errors being reported - Check abort error state in bnxt_open_nic(). - Invoke cancel_delayed_work_sync() for PFs also. - Fix regression in workqueue cleanup logic in bnxt_remove_one(). - mlxsw: Only advertise link modes supported by both driver and device, after removal of 56G support from the driver 56G was not cleared from advertised modes - net/smc: fix suppressed return code Previous release - always broken: - netem: fix zero division in tabledist, caused by integer overflow - bnxt_en: Re-write PCI BARs after PCI fatal error. - cxgb4: set up filter action after rewrites - net: ipa: command payloads already mapped Misc: - s390/ism: fix incorrect system EID, it's okay to change since it was added in current release - vsock: use ns_capable_noaudit() on socket create to suppress false positive audit messages" * tag 'net-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits) r8169: fix issue with forced threading in combination with shared interrupts netem: fix zero division in tabledist ibmvnic: fix ibmvnic_set_mac mptcp: add missing memory scheduling in the rx path tipc: fix memory leak caused by tipc_buf_append() gtp: fix an use-before-init in gtp_newlink() net: protect tcf_block_unbind with block lock ibmveth: Fix use of ibmveth in a bridge. net/sched: act_mpls: Add softdep on mpls_gso.ko ravb: Fix bit fields checking in ravb_hwtstamp_get() devlink: Unlock on error in dumpit() devlink: Fix some error codes chelsio/chtls: fix memory leaks in CPL handlers chelsio/chtls: fix deadlock issue net: hns3: Clear the CMDQ registers before unmapping BAR region bnxt_en: Send HWRM_FUNC_RESET fw command unconditionally. bnxt_en: Check abort error state in bnxt_open_nic(). bnxt_en: Re-write PCI BARs after PCI fatal error. bnxt_en: Invoke cancel_delayed_work_sync() for PFs also. bnxt_en: Fix regression in workqueue cleanup logic in bnxt_remove_one(). ...
2020-10-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds13-44/+103
Pull rdma fixes from Jason Gunthorpe: "The good news is people are testing rc1 in the RDMA world - the bad news is testing of the for-next area is not as good as I had hoped, as we really should have caught at least the rdma_connect_locked() issue before now. Notable merge window regressions that didn't get caught/fixed in time for rc1: - Fix in kernel users of rxe, they were broken by the rapid fix to undo the uABI breakage in rxe from another patch - EFA userspace needs to read the GID table but was broken with the new GID table logic - Fix user triggerable deadlock in mlx5 using devlink reload - Fix deadlock in several ULPs using rdma_connect from the CM handler callbacks - Memory leak in qedr" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/qedr: Fix memory leak in iWARP CM RDMA: Add rdma_connect_locked() RDMA/uverbs: Fix false error in query gid IOCTL RDMA/mlx5: Fix devlink deadlock on net namespace deletion RDMA/rxe: Fix small problem in network_type patch
2020-10-29r8169: fix issue with forced threading in combination with shared interruptsHeiner Kallweit1-2/+2
As reported by Serge flag IRQF_NO_THREAD causes an error if the interrupt is actually shared and the other driver(s) don't have this flag set. This situation can occur if a PCI(e) legacy interrupt is used in combination with forced threading. There's no good way to deal with this properly, therefore we have to remove flag IRQF_NO_THREAD. For fixing the original forced threading issue switch to napi_schedule(). Fixes: 424a646e072a ("r8169: fix operation under forced interrupt threading") Link: https://www.spinics.net/lists/netdev/msg694960.html Reported-by: Serge Belyshev <belyshev@depni.sinp.msu.ru> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Serge Belyshev <belyshev@depni.sinp.msu.ru> Link: https://lore.kernel.org/r/b5b53bfe-35ac-3768-85bf-74d1290cf394@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29netem: fix zero division in tabledistAleksandr Nogikh1-1/+8
Currently it is possible to craft a special netlink RTM_NEWQDISC command that can result in jitter being equal to 0x80000000. It is enough to set the 32 bit jitter to 0x02000000 (it will later be multiplied by 2^6) or just set the 64 bit jitter via TCA_NETEM_JITTER64. This causes an overflow during the generation of uniformly distributed numbers in tabledist(), which in turn leads to division by zero (sigma != 0, but sigma * 2 is 0). The related fragment of code needs 32-bit division - see commit 9b0ed89 ("netem: remove unnecessary 64 bit modulus"), so switching to 64 bit is not an option. Fix the issue by keeping the value of jitter within the range that can be adequately handled by tabledist() - [0;INT_MAX]. As negative std deviation makes no sense, take the absolute value of the passed value and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32 bit arithmetic in order to prevent overflows. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Reported-by: syzbot+ec762a6342ad0d3c0d8f@syzkaller.appspotmail.com Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20201028170731.1383332-1-aleksandrnogikh@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29ibmvnic: fix ibmvnic_set_macLijun Pan1-2/+6
Jakub Kicinski brought up a concern in ibmvnic_set_mac(). ibmvnic_set_mac() does this: ether_addr_copy(adapter->mac_addr, addr->sa_data); if (adapter->state != VNIC_PROBED) rc = __ibmvnic_set_mac(netdev, addr->sa_data); So if state == VNIC_PROBED, the user can assign an invalid address to adapter->mac_addr, and ibmvnic_set_mac() will still return 0. The fix is to validate ethernet address at the beginning of ibmvnic_set_mac(), and move the ether_addr_copy to the case of "adapter->state != VNIC_PROBED". Fixes: c26eba03e407 ("ibmvnic: Update reset infrastructure to support tunable parameters") Signed-off-by: Lijun Pan <ljp@linux.ibm.com> Link: https://lore.kernel.org/r/20201027220456.71450-1-ljp@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29mptcp: add missing memory scheduling in the rx pathPaolo Abeni1-0/+10
When moving the skbs from the subflow into the msk receive queue, we must schedule there the required amount of memory. Try to borrow the required memory from the subflow, if needed, so that we leverage the existing TCP heuristic. Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/f6143a6193a083574f11b00dbf7b5ad151bc4ff4.1603810630.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29include: jhash/signal: Fix fall-through warnings for ClangGustavo A. R. Silva2-0/+4
In preparation to enable -Wimplicit-fallthrough for Clang, explicitly add break statements instead of letting the code fall through to the next case. This patch adds four break statements that, together, fix almost 40,000 warnings when building Linux 5.10-rc1 with Clang 12.0.0 and this[1] change reverted. Notice that in order to enable -Wimplicit-fallthrough for Clang, such change[1] is meant to be reverted at some point. So, this patch helps to move in that direction. Something important to mention is that there is currently a discrepancy between GCC and Clang when dealing with switch fall-through to empty case statements or to cases that only contain a break/continue/return statement[2][3][4]. Now that the -Wimplicit-fallthrough option has been globally enabled[5], any compiler should really warn on missing either a fallthrough annotation or any of the other case-terminating statements (break/continue/return/ goto) when falling through to the next case statement. Making exceptions to this introduces variation in case handling which may continue to lead to bugs, misunderstandings, and a general lack of robustness. The point of enabling options like -Wimplicit-fallthrough is to prevent human error and aid developers in spotting bugs before their code is even built/ submitted/committed, therefore eliminating classes of bugs. So, in order to really accomplish this, we should, and can, move in the direction of addressing any error-prone scenarios and get rid of the unintentional fallthrough bug-class in the kernel, entirely, even if there is some minor redundancy. Better to have explicit case-ending statements than continue to have exceptions where one must guess as to the right result. The compiler will eliminate any actual redundancy. [1] commit e2079e93f562c ("kbuild: Do not enable -Wimplicit-fallthrough for clang for now") [2] https://github.com/ClangBuiltLinux/linux/issues/636 [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91432 [4] https://godbolt.org/z/xgkvIh [5] commit a035d552a93b ("Makefile: Globally enable fall-through warning") Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29Merge tag 'afs-fixes-20201029' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsLinus Torvalds8-95/+188
Pull AFS fixes from David Howells: - Fix copy_file_range() to an afs file now returning EINVAL if the splice_write file op isn't supplied. - Fix a deref-before-check in afs_unuse_cell(). - Fix a use-after-free in afs_xattr_get_acl(). - Fix afs to not try to clear PG_writeback when laundering a page. - Fix afs to take a ref on a page that it sets PG_private on and to drop that ref when clearing PG_private. This is done through recently added helpers. - Fix a page leak if write_begin() fails. - Fix afs_write_begin() to not alter the dirty region info stored in page->private, but rather do this in afs_write_end() instead when we know what we actually changed. - Fix afs_invalidatepage() to alter the dirty region info on a page when partial page invalidation occurs so that we don't inadvertantly include a span of zeros that will get written back if a page gets laundered due to a remote 3rd-party induced invalidation. We mustn't, however, reduce the dirty region if the page has been seen to be mapped (ie. we got called through the page_mkwrite vector) as the page might still be mapped and we might lose data if the file is extended again. - Fix the dirty region info to have a lower resolution if the size of the page is too large for this to be encoded (e.g. powerpc32 with 64K pages). Note that this might not be the ideal way to handle this, since it may allow some leakage of undirtied zero bytes to the server's copy in the case of a 3rd-party conflict. To aid the last two fixes, two additional changes: - Wrap the manipulations of the dirty region info stored in page->private into helper functions. - Alter the encoding of the dirty region so that the region bounds can be stored with one fewer bit, making a bit available for the indication of mappedness. * tag 'afs-fixes-20201029' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix dirty-region encoding on ppc32 with 64K pages afs: Fix afs_invalidatepage to adjust the dirty region afs: Alter dirty range encoding in page->private afs: Wrap page->private manipulations in inline functions afs: Fix where page->private is set during write afs: Fix page leak on afs_write_begin() failure afs: Fix to take ref on page when PG_private is set afs: Fix afs_launder_page to not clear PG_writeback afs: Fix a use after free in afs_xattr_get_acl() afs: Fix tracing deref-before-check afs: Fix copy_file_range()
2020-10-29tipc: fix memory leak caused by tipc_buf_append()Tung Nguyen1-3/+2
Commit ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") replaced skb_unshare() with skb_copy() to not reduce the data reference counter of the original skb intentionally. This is not the correct way to handle the cloned skb because it causes memory leak in 2 following cases: 1/ Sending multicast messages via broadcast link The original skb list is cloned to the local skb list for local destination. After that, the data reference counter of each skb in the original list has the value of 2. This causes each skb not to be freed after receiving ACK: tipc_link_advance_transmq() { ... /* release skb */ __skb_unlink(skb, &l->transmq); kfree_skb(skb); <-- memory exists after being freed } 2/ Sending multicast messages via replicast link Similar to the above case, each skb cannot be freed after purging the skb list: tipc_mcast_xmit() { ... __skb_queue_purge(pkts); <-- memory exists after being freed } This commit fixes this issue by using skb_unshare() instead. Besides, to avoid use-after-free error reported by KASAN, the pointer to the fragment is set to NULL before calling skb_unshare() to make sure that the original skb is not freed after freeing the fragment 2 times in case skb_unshare() returns NULL. Fixes: ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()") Acked-by: Jon Maloy <jmaloy@redhat.com> Reported-by: Thang Hoang Ngo <thang.h.ngo@dektech.com.au> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://lore.kernel.org/r/20201027032403.1823-1-tung.q.nguyen@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29gtp: fix an use-before-init in gtp_newlink()Masahiro Fujiwara1-8/+8
*_pdp_find() from gtp_encap_recv() would trigger a crash when a peer sends GTP packets while creating new GTP device. RIP: 0010:gtp1_pdp_find.isra.0+0x68/0x90 [gtp] <SNIP> Call Trace: <IRQ> gtp_encap_recv+0xc2/0x2e0 [gtp] ? gtp1_pdp_find.isra.0+0x90/0x90 [gtp] udp_queue_rcv_one_skb+0x1fe/0x530 udp_queue_rcv_skb+0x40/0x1b0 udp_unicast_rcv_skb.isra.0+0x78/0x90 __udp4_lib_rcv+0x5af/0xc70 udp_rcv+0x1a/0x20 ip_protocol_deliver_rcu+0xc5/0x1b0 ip_local_deliver_finish+0x48/0x50 ip_local_deliver+0xe5/0xf0 ? ip_protocol_deliver_rcu+0x1b0/0x1b0 gtp_encap_enable() should be called after gtp_hastable_new() otherwise *_pdp_find() will access the uninitialized hash table. Fixes: 1e3a3abd8b28 ("gtp: make GTP sockets in gtp_newlink optional") Signed-off-by: Masahiro Fujiwara <fujiwara.masahiro@gmail.com> Link: https://lore.kernel.org/r/20201027114846.3924-1-fujiwara.masahiro@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29Merge tag 'ext4_for_linus_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds10-135/+78
Pull ext4 fixes from Ted Ts'o: "Bug fixes for the new ext4 fast commit feature, plus a fix for the 'data=journal' bug fix. Also use the generic casefolding support which has now landed in fs/libfs.c for 5.10" * tag 'ext4_for_linus_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/... ext4: use generic casefolding support ext4: do not use extent after put_bh ext4: use IS_ERR() for error checking of path ext4: fix mmap write protection for data=journal mode jbd2: fix a kernel-doc markup ext4: use s_mount_flags instead of s_mount_state for fast commit state ext4: make num of fast commit blocks configurable ext4: properly check for dirty state in ext4_inode_datasync_dirty() ext4: fix double locking in ext4_fc_commit_dentry_updates()
2020-10-29afs: Fix dirty-region encoding on ppc32 with 64K pagesDavid Howells2-9/+20
The dirty region bounds stored in page->private on an afs page are 15 bits on a 32-bit box and can, at most, represent a range of up to 32K within a 32K page with a resolution of 1 byte. This is a problem for powerpc32 with 64K pages enabled. Further, transparent huge pages may get up to 2M, which will be a problem for the afs filesystem on all 32-bit arches in the future. Fix this by decreasing the resolution. For the moment, a 64K page will have a resolution determined from PAGE_SIZE. In the future, the page will need to be passed in to the helper functions so that the page size can be assessed and the resolution determined dynamically. Note that this might not be the ideal way to handle this, since it may allow some leakage of undirtied zero bytes to the server's copy in the case of a 3rd-party conflict. Fixing that would require a separately allocated record and is a more complicated fix. Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-10-29afs: Fix afs_invalidatepage to adjust the dirty regionDavid Howells4-14/+79
Fix afs_invalidatepage() to adjust the dirty region recorded in page->private when truncating a page. If the dirty region is entirely removed, then the private data is cleared and the page dirty state is cleared. Without this, if the page is truncated and then expanded again by truncate, zeros from the expanded, but no-longer dirty region may get written back to the server if the page gets laundered due to a conflicting 3rd-party write. It mustn't, however, shorten the dirty region of the page if that page is still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is stored in page->private to record this. Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29afs: Alter dirty range encoding in page->privateDavid Howells2-4/+4
Currently, page->private on an afs page is used to store the range of dirtied data within the page, where the range includes the lower bound, but excludes the upper bound (e.g. 0-1 is a range covering a single byte). This, however, requires a superfluous bit for the last-byte bound so that on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea being that having both numbers the same would indicate an empty range. This is unnecessary as the PG_private bit is clear if it's an empty range (as is PG_dirty). Alter the way the dirty range is encoded in page->private such that the upper bound is reduced by 1 (e.g. 0-0 is then specified the same single byte range mentioned above). Applying this to both bounds frees up two bits, one of which can be used in a future commit. This allows the afs filesystem to be compiled on ppc32 with 64K pages; without this, the following warnings are seen: ../fs/afs/internal.h: In function 'afs_page_dirty_to': ../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow] 881 | return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK; | ^~ ../fs/afs/internal.h: In function 'afs_page_dirty': ../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow] 886 | return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from; | ^~ Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29afs: Wrap page->private manipulations in inline functionsDavid Howells3-34/+44
The afs filesystem uses page->private to store the dirty range within a page such that in the event of a conflicting 3rd-party write to the server, we write back just the bits that got changed locally. However, there are a couple of problems with this: (1) I need a bit to note if the page might be mapped so that partial invalidation doesn't shrink the range. (2) There aren't necessarily sufficient bits to store the entire range of data altered (say it's a 32-bit system with 64KiB pages or transparent huge pages are in use). So wrap the accesses in inline functions so that future commits can change how this works. Also move them out of the tracing header into the in-directory header. There's not really any need for them to be in the tracing header. Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29afs: Fix where page->private is set during writeDavid Howells1-15/+26
In afs, page->private is set to indicate the dirty region of a page. This is done in afs_write_begin(), but that can't take account of whether the copy into the page actually worked. Fix this by moving the change of page->private into afs_write_end(). Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29afs: Fix page leak on afs_write_begin() failureDavid Howells1-12/+11
Fix the leak of the target page in afs_write_begin() when it fails. Fixes: 15b4650e55e0 ("afs: convert to new aops") Signed-off-by: David Howells <dhowells@redhat.com> cc: Nick Piggin <npiggin@gmail.com>
2020-10-29afs: Fix to take ref on page when PG_private is setDavid Howells4-26/+18
Fix afs to take a ref on a page when it sets PG_private on it and to drop the ref when removing the flag. Note that in afs_write_begin(), a lot of the time, PG_private is already set on a page to which we're going to add some data. In such a case, we leave the bit set and mustn't increment the page count. As suggested by Matthew Wilcox, use attach/detach_page_private() where possible. Fixes: 31143d5d515e ("AFS: implement basic file write support") Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-10-28Merge tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds1-14/+22
Pull tracing fix from Steven Rostedt: "Fix synthetic event "strcat" overrun New synthetic event code used strcat() and miscalculated the ending, causing the concatenation to write beyond the allocated memory. Instead of using strncat(), the code is switched over to seq_buf which has all the mechanisms in place to protect against writing more than what is allocated, and cleans up the code a bit" * tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing, synthetic events: Replace buggy strcat() with seq_buf operations
2020-10-28ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...Theodore Ts'o1-0/+2
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: use generic casefolding supportDaniel Rosenberg5-93/+17
This switches ext4 over to the generic support provided in libfs. Since casefolded dentries behave the same in ext4 and f2fs, we decrease the maintenance burden by unifying them, and any optimizations will immediately apply to both. Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20201028050820.1636571-1-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: do not use extent after put_bhyangerkun1-15/+15
ext4_ext_search_right() will read more extent blocks and call put_bh after we get the information we need. However, ret_ex will break this and may cause use-after-free once pagecache has been freed. Fix it by copying the extent structure if needed. Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20201028055617.2569255-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2020-10-28ext4: use IS_ERR() for error checking of pathHarshad Shirwadkar1-2/+4
With this fix, fast commit recovery code uses IS_ERR() for path returned by ext4_find_extent. Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201027204342.2794949-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: fix mmap write protection for data=journal modeJan Kara1-2/+3
Commit afb585a97f81 "ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write() to track inode ranges whose mappings need to get write-protected during transaction commits. However the added calls use wrong start of a range (0 instead of page offset) and so write protection is not necessarily effective. Use correct range start to fix the problem. Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201027132751.29858-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28jbd2: fix a kernel-doc markupMauro Carvalho Chehab1-1/+1
The kernel-doc markup that documents _fc_replay_callback is missing an asterisk, causing this warning: ../include/linux/jbd2.h:1271: warning: Function parameter or member 'j_fc_replay_callback' not described in 'journal_s' When building the docs. Fixes: 609f928af48f ("jbd2: fast commit recovery path") Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/6055927ada2015b55b413cdd2670533bdc9a8da2.1603791716.git.mchehab+huawei@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: use s_mount_flags instead of s_mount_state for fast commit stateHarshad Shirwadkar3-15/+15
Ext4's fast commit related transient states should use sb->s_mount_flags instead of persistent sb->s_mount_state. Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201027044915.2553163-3-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: make num of fast commit blocks configurableHarshad Shirwadkar2-2/+15
This patch reserves a field in the jbd2 superblock for number of fast commit blocks. When this value is non-zero, Ext4 uses this field to set the number of fast commit blocks. Fixes: 6866d7b3f2bb ("ext4/jbd2: add fast commit initialization") Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201027044915.2553163-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: properly check for dirty state in ext4_inode_datasync_dirty()Andrea Righi1-4/+6
ext4_inode_datasync_dirty() needs to return 'true' if the inode is dirty, 'false' otherwise, but the logic seems to be incorrectly changed by commit aa75f4d3daae ("ext4: main fast-commit commit path"). This introduces a problem with swap files that are always failing to be activated, showing this error in dmesg: [ 34.406479] swapon: file is not committed Simple test case to reproduce the problem: # fallocate -l 8G swapfile # chmod 0600 swapfile # mkswap swapfile # swapon swapfile Fix the logic to return the proper state of the inode. Link: https://lore.kernel.org/lkml/20201024131333.GA32124@xps-13-7390 Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201027044915.2553163-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: fix double locking in ext4_fc_commit_dentry_updates()Harshad Shirwadkar1-1/+0
Fixed double locking of sbi->s_fc_lock in the above function as reported by kernel-test-robot. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201023161339.1449437-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28RDMA/qedr: Fix memory leak in iWARP CMAlok Prasad1-0/+1
Fixes memory leak in iWARP CM Fixes: e411e0587e0d ("RDMA/qedr: Add iWARP connection management functions") Link: https://lore.kernel.org/r/20201021115008.28138-1-palok@marvell.com Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Alok Prasad <palok@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28RDMA: Add rdma_connect_locked()Jason Gunthorpe6-29/+48
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the handler triggers a completion and another thread does rdma_connect() or the handler directly calls rdma_connect(). In all cases rdma_connect() needs to hold the handler_mutex, but when handler's are invoked this is already held by the core code. This causes ULPs using the 2nd method to deadlock. Provide a rdma_connect_locked() and have all ULPs call it from their handlers. Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state") Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>