2019-05-07  gfs2: read journal in large chunks  [Abhi Das]  (8 files, -139/+219)
Use bios to read the journal into the address space of the journal inode (jd_inode), sequentially and in large chunks. This is faster for locating the journal head than the previous binary search approach. When performing recovery, we keep the journal in the address space until recovery is done, which further speeds things up. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  gfs2: Fix iomap write page reclaim deadlock  [Andreas Gruenbacher]  (2 files, -44/+58)
Since commit 64bc06bb32ee ("gfs2: iomap buffered write support"), gfs2 is doing buffered writes by starting a transaction in iomap_begin, writing a range of pages, and ending that transaction in iomap_end. This approach suffers from two problems: (1) Any allocations necessary for the write are done in iomap_begin, so when the data aren't journaled, there is no need for keeping the transaction open until iomap_end. (2) Transactions keep the gfs2 log flush lock held. When iomap_file_buffered_write calls balance_dirty_pages, this can end up calling gfs2_write_inode, which will try to flush the log. This requires taking the log flush lock which is already held, resulting in a deadlock. Fix both of these issues by not keeping transactions open from iomap_begin to iomap_end. Instead, start a small transaction in page_prepare and end it in page_done when necessary. Reported-by: Edwin Török <edvin.torok@citrix.com> Fixes: 64bc06bb32ee ("gfs2: iomap buffered write support") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2019-05-07  gfs2: fix race between gfs2_freeze_func and unmount  [Abhi Das]  (2 files, -3/+6)
As part of the freeze operation, gfs2_freeze_func() is left blocking on a request to hold the sd_freeze_gl in SH. This glock is held in EX by the gfs2_freeze() code. A subsequent call to gfs2_unfreeze() releases the EXclusively held sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and resume its operation. gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete. If a umount is issued right after unfreeze, it could result in an inconsistent filesystem because some journal data (statfs update) isn't written out. Refer to commit 24972557b12c for a more detailed explanation of how freeze/unfreeze work. This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to complete before returning to the user. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  gfs2: Rename gfs2_trans_{add_unrevoke => remove_revoke}  [Andreas Gruenbacher]  (6 files, -9/+9)
Rename gfs2_trans_add_unrevoke to gfs2_trans_remove_revoke: there is no such thing as an "unrevoke" object; all this function does is remove existing revoke objects plus some bookkeeping. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  gfs2: Rename sd_log_le_{revoke,ordered}  [Andreas Gruenbacher]  (6 files, -15/+15)
Rename sd_log_le_revoke to sd_log_revokes and sd_log_le_ordered to sd_log_ordered: not sure what le stands for here, but it doesn't add clarity, and if it stands for list entry, it's actually confusing as those are both list heads but not list entries. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  gfs2: Remove unnecessary extern declarations  [Andreas Gruenbacher]  (2 files, -8/+3)
Make log operations static; they are only used locally. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  gfs2: Remove misleading comments in gfs2_evict_inode  [Andreas Gruenbacher]  (1 file, -5/+0)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  gfs2: Replace gl_revokes with a GLF flag  [Bob Peterson]  (6 files, -15/+31)
The gl_revokes value determines how many outstanding revokes a glock has on the superblock revokes list; this is used to avoid unnecessary log flushes. However, gl_revokes is only ever tested for being zero, and it's only decremented in revoke_lo_after_commit, which removes all revokes from the list, so we know that the gl_revoke values of all the glocks on the list will reach zero. Therefore, we can replace gl_revokes with a bit flag. This saves an atomic counter in struct gfs2_glock. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
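The replacement described here follows a common kernel pattern: an atomic_t that is only ever compared against zero can become a single bit in an existing flags word. A rough sketch of the idea (the flag name below is illustrative, not necessarily the one gfs2 uses):

    /* Before: an atomic counter that is only ever tested for zero. */
    if (atomic_read(&gl->gl_revokes) == 0)
            return;         /* no outstanding revokes, nothing to flush */

    /* After: a bit in the glock's existing flags word. */
    set_bit(GLF_HAS_REVOKES, &gl->gl_flags);        /* revokes queued */
    clear_bit(GLF_HAS_REVOKES, &gl->gl_flags);      /* revoke list emptied */
    if (!test_bit(GLF_HAS_REVOKES, &gl->gl_flags))
            return;         /* same zero test, without the atomic_t */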
2019-05-07  gfs2: Fix occasional glock use-after-free  [Andreas Gruenbacher]  (3 files, -3/+7)
This patch has to do with the life cycle of glocks and buffers. When gfs2 metadata or journaled data is queued to be written, a gfs2_bufdata object is assigned to track the buffer, and that is queued to various lists, including the glock's gl_ail_list to indicate it's on the active items list. Once the page associated with the buffer has been written, it is removed from the ail list, but its life isn't over until a revoke has been successfully written. So after the block is written, its bufdata object is moved from the glock's gl_ail_list to a file-system-wide list of pending revokes, sd_log_le_revoke. At that point the glock still needs to track how many revokes it contributed to that list (in gl_revokes) so that things like glock go_sync can ensure all the metadata has been not only written, but also revoked before the glock is granted to a different node. This is to guarantee journal replay doesn't replay the block once the glock has been granted to another node. Ross Lagerwall recently discovered a race in which an inode could be evicted, and its glock freed after its ail list had been synced, but while it still had unwritten revokes on the sd_log_le_revoke list. The evict decremented the glock reference count to zero, which allowed the glock to be freed. After the revoke was written, function revoke_lo_after_commit tried to adjust the glock's gl_revokes counter and clear its GLF_LFLUSH flag, at which time it referenced the freed glock. This patch fixes the problem by incrementing the glock reference count in gfs2_add_revoke when the glock's first bufdata object is moved from the glock to the global revokes list. Later, when the glock's last such bufdata object is freed, the reference count is decremented. This guarantees that whichever process finishes last (the revoke writing or the evict) will properly free the glock, and neither will reference the glock after it has been freed. Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2019-05-07  gfs2: clean_journal improperly set sd_log_flush_head  [Bob Peterson]  (9 files, -19/+57)
This patch fixes regressions in 588bff95c94efc05f9e1a0b19015c9408ed7c0ef. Due to that patch, function clean_journal was setting the value of sd_log_flush_head, but that's only valid if it is replaying the node's own journal. If it's replaying another node's journal, that's completely wrong and will lead to multiple problems. This patch tries to clean up the mess by passing the value of the logical journal block number into gfs2_write_log_header so the function can treat non-owned journals generically. For the local journal, the journal extent map is used for best performance. For other nodes from other journals, new function gfs2_lblk_to_dblk is called to figure it out using gfs2_iomap_get. This patch also tries to establish more consistency when passing journal block parameters by changing several unsigned int types to a consistent u32. Fixes: 588bff95c94e ("GFS2: Reduce code redundancy writing log headers") Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  ALSA: hda/intel: add CometLake PCI IDs  [Pierre-Louis Bossart]  (1 file, -0/+6)
Add PCI IDs for the CometLake LP and H SKUs. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2019-05-07  gfs2: Fix lru_count going negative  [Ross Lagerwall]  (1 file, -9/+13)
Under certain conditions, lru_count may drop below zero resulting in a large amount of log spam like this: vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \ negative objects to delete nr=-1 This happens as follows: 1) A glock is moved from lru_list to the dispose list and lru_count is decremented. 2) The dispose function calls cond_resched() and drops the lru lock. 3) Another thread takes the lru lock and tries to add the same glock to lru_list, checking if the glock is on an lru list. 4) It is on a list (actually the dispose list) and so it avoids incrementing lru_count. 5) The glock is moved to lru_list. 6) The original thread doesn't dispose it because it has been re-added to the lru list but the lru_count has still decreased by one. Fix by checking if the LRU flag is set on the glock rather than checking if the glock is on some list and rearrange the code so that the LRU flag is added/removed precisely when the glock is added/removed from lru_list. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
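A minimal sketch of the fix described above, assuming gfs2-style names (gl_lru list head, GLF_LRU flag, lru_lock/lru_list/lru_count globals); the point is that the counter only changes together with the flag, under the lru lock:

    /* Racy: the glock may sit on the dispose list, not the lru list. */
    if (!list_empty(&gl->gl_lru))
            return;         /* skipped, yet lru_count was already decremented */

    /* Fixed idea: the LRU flag and the counter change together. */
    spin_lock(&lru_lock);
    if (!test_bit(GLF_LRU, &gl->gl_flags)) {
            set_bit(GLF_LRU, &gl->gl_flags);
            list_add_tail(&gl->gl_lru, &lru_list);
            lru_count++;
    }
    spin_unlock(&lru_lock);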
2019-05-07  gfs2: Fix loop in gfs2_rbm_find (v2)  [Andreas Gruenbacher]  (1 file, -29/+25)
Fix the resource group wrap-around logic in gfs2_rbm_find that commit e579ed4f44 broke. The bug can lead to unnecessary repeated scanning of the same bitmaps; there is a risk that future changes will turn this into an endless loop. This is an updated version of commit 2d29f6b96d ("gfs2: Fix loop in gfs2_rbm_find") which ended up being reverted because it introduced a performance regression in iozone (see commit e74c98ca2d). Changes since v1: - Simplify the wrap-around logic. - Handle the case where each resource group only has a single bitmap block (small filesystem). - Update rd_extfail_pt whenever we scan the entire bitmap, even when we don't start the scan at the very beginning of the bitmap. Fixes: e579ed4f446e ("GFS2: Introduce rbm field bii") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-05-07  cxgb4: Fix error path in cxgb4_init_module  [YueHaibing]  (1 file, -3/+12)
BUG: unable to handle kernel paging request at ffffffffa016a270 PGD 3270067 P4D 3270067 PUD 3271063 PMD 230bbd067 PTE 0 Oops: 0000 [#1 CPU: 0 PID: 6134 Comm: modprobe Not tainted 5.1.0+ #33 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:atomic_notifier_chain_register+0x24/0x60 Code: 1f 80 00 00 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb e8 ae b4 38 01 48 8b 53 38 48 8d 4b 38 48 85 d2 74 20 45 8b 44 24 10 <44> 3b 42 10 7e 08 eb 13 44 39 42 10 7c 0d 48 8d 4a 08 48 8b 52 08 RSP: 0018:ffffc90000e2bc60 EFLAGS: 00010086 RAX: 0000000000000292 RBX: ffffffff83467240 RCX: ffffffff83467278 RDX: ffffffffa016a260 RSI: ffffffff83752140 RDI: ffffffff83467240 RBP: ffffc90000e2bc70 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 00000000014fa61f R12: ffffffffa01c8260 R13: ffff888231091e00 R14: 0000000000000000 R15: ffffc90000e2be78 FS: 00007fbd8d7cd540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa016a270 CR3: 000000022c7e3000 CR4: 00000000000006f0 Call Trace: register_inet6addr_notifier+0x13/0x20 cxgb4_init_module+0x6c/0x1000 [cxgb4 ? 0xffffffffa01d7000 do_one_initcall+0x6c/0x3cc ? do_init_module+0x22/0x1f1 ? rcu_read_lock_sched_held+0x97/0xb0 ? kmem_cache_alloc_trace+0x325/0x3b0 do_init_module+0x5b/0x1f1 load_module+0x1db1/0x2690 ? m_show+0x1d0/0x1d0 __do_sys_finit_module+0xc5/0xd0 __x64_sys_finit_module+0x15/0x20 do_syscall_64+0x6b/0x1d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe If pci_register_driver fails, register inet6addr_notifier is pointless. This patch fix the error path in cxgb4_init_module. Fixes: b5a02f503caa ("cxgb4 : Update ipv6 address handling api") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
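The unwinding pattern this fix applies is the usual one for module init: undo every earlier registration when a later step fails. A hedged sketch of that pattern (the structures are placeholders and the real cxgb4_init_module registers more than is shown here, so the exact shape of the fix may differ):

    static int __init example_init_module(void)
    {
            int ret;

            register_inet6addr_notifier(&example_inet6addr_notifier);

            ret = pci_register_driver(&example_pci_driver);
            if (ret < 0)
                    goto err_unregister_notifier;   /* don't leave the notifier behind */

            return 0;

    err_unregister_notifier:
            unregister_inet6addr_notifier(&example_inet6addr_notifier);
            return ret;
    }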
2019-05-07  net: phy: improve pause mode reporting in phy_print_status  [Heiner Kallweit]  (1 file, -1/+27)
So far we report symmetric pause only, and we don't consider the local pause capabilities. Let's properly consider local and remote capabilities, and report also asymmetric pause. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings  [Maxime Chevallier]  (1 file, -1/+1)
The phy_mode "2000base-x" is actually supposed to be "1000base-x", even though the commit title of the original patch says otherwise. Fixes: 55601a880690 ("net: phy: Add 2000base-x, 2500base-x and rxaui modes") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: macb: Change interrupt and napi enable order in open  [Harini Katakam]  (1 file, -3/+3)
Current order in open: -> Enable interrupts (macb_init_hw) -> Enable NAPI -> Start PHY Sequence of RX handling: -> RX interrupt occurs -> Interrupt is cleared and interrupt bits disabled in handler -> NAPI is scheduled -> In NAPI, RX budget is processed and RX interrupts are re-enabled With the above, on QEMU or fixed link setups (where PHY state doesn't matter), there's a chance macb RX interrupt occurs before NAPI is enabled. This will result in NAPI being scheduled before it is enabled. Fix this macb open by changing the order. Fixes: ae1f2a56d273 ("net: macb: Added support for many RX queues") Signed-off-by: Harini Katakam <harini.katakam@xilinx.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
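The reordering boils down to making NAPI ready before the hardware can raise an interrupt. Roughly (illustrative only; the real driver iterates over its queues):

    /* Before: interrupts can fire while NAPI is still disabled. */
    macb_init_hw(bp);               /* enables RX/TX interrupts */
    napi_enable(&queue->napi);
    phy_start(dev->phydev);

    /* After: enable NAPI first, then the interrupts, then the PHY. */
    napi_enable(&queue->napi);
    macb_init_hw(bp);
    phy_start(dev->phydev);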
2019-05-07  net: ll_temac: Improve error message on error IRQ  [Esben Haabendal]  (1 file, -2/+8)
The channel status register value can be very helpful when debugging SDMA problems. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net/sched: remove block pointer from common offload structure  [Pieter Jansen van Vuuren]  (8 files, -36/+25)
Based on feedback from Jiri avoid carrying a pointer to the tcf_block structure in the tc_cls_common_offload structure. Instead store a flag in driver private data which indicates if offloads apply to a shared block at block binding time. Suggested-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: ethernet: support of_get_mac_address new ERR_PTR error  [Petr Štetiar]  (45 files, -45/+45)
There was NVMEM support added to of_get_mac_address, so it could now return ERR_PTR encoded error values, so we need to adjust all current users of of_get_mac_address to this new fact. While at it, remove superfluous is_valid_ether_addr as the MAC address returned from of_get_mac_address is always valid and checked by is_valid_ether_addr anyway. Fixes: d01f449c008a ("of_net: add NVMEM support to of_get_mac_address") Signed-off-by: Petr Štetiar <ynezz@true.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
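A sketch of the adjusted consumer pattern (the fallback shown is illustrative; individual drivers handle the error case differently):

    const void *addr = of_get_mac_address(np);

    if (!IS_ERR(addr))
            ether_addr_copy(ndev->dev_addr, addr);  /* already validated by of_get_mac_address */
    else
            eth_hw_addr_random(ndev);               /* e.g. fall back to a random MAC */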
2019-05-07  net: usb: smsc: fix warning reported by kbuild test robot  [Petr Štetiar]  (2 files, -2/+2)
This patch fixes following warning reported by kbuild test robot: In function ‘memcpy’, inlined from ‘smsc75xx_init_mac_address’ at drivers/net/usb/smsc75xx.c:778:3, inlined from ‘smsc75xx_bind’ at drivers/net/usb/smsc75xx.c:1501:2: ./include/linux/string.h:355:9: warning: argument 2 null where non-null expected [-Wnonnull] return __builtin_memcpy(p, q, size); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/net/usb/smsc75xx.c: In function ‘smsc75xx_bind’: ./include/linux/string.h:355:9: note: in a call to built-in function ‘__builtin_memcpy’ I've replaced the offending memcpy with ether_addr_copy, because I'm 100% sure, that of_get_mac_address can't return NULL as it returns valid pointer or ERR_PTR encoded value, nothing else. I'm hesitant to just change IS_ERR into IS_ERR_OR_NULL check, as this would make the warning disappear also, but it would be confusing to check for impossible return value just to make a compiler happy. Fixes: adfb3cb2c52e ("net: usb: support of_get_mac_address new ERR_PTR error") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Petr Štetiar <ynezz@true.cz> Reviewed-by: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check  [Petr Štetiar]  (1 file, -1/+1)
Commit 284eb160681c ("staging: octeon-ethernet: support of_get_mac_address new ERR_PTR error") introduced checking for the ERR_PTR encoded error value from of_get_mac_address with the IS_ERR macro, which is not sufficient in this case, as the mac variable is set to NULL initially and if the kernel is compiled without DT support this NULL would get passed to IS_ERR, which would lead to the wrong decision and would pass that NULL pointer and invalid MAC address further. Fixes: 284eb160681c ("staging: octeon-ethernet: support of_get_mac_address new ERR_PTR error") Signed-off-by: Petr Štetiar <ynezz@true.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: support of_get_mac_address new ERR_PTR error  [Petr Štetiar]  (1 file, -1/+1)
There was NVMEM support added to of_get_mac_address, so it could now return ERR_PTR encoded error values, so we need to adjust all current users of of_get_mac_address to this new fact. While at it, remove superfluous is_valid_ether_addr as the MAC address returned from of_get_mac_address is always valid and checked by is_valid_ether_addr anyway. Fixes: d01f449c008a ("of_net: add NVMEM support to of_get_mac_address") Signed-off-by: Petr Štetiar <ynezz@true.cz> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats  [Nathan Chancellor]  (1 file, -1/+3)
Clang warns: drivers/net/dsa/sja1105/sja1105_ethtool.c:316:39: warning: suggest braces around initialization of subobject [-Wmissing-braces] struct sja1105_port_status status = {0}; ^ {} 1 warning generated. One way to fix these warnings is to add additional braces like Clang suggests; however, there has been a bit of push back from some maintainers[1][2], who just prefer memset as it is unambiguous, doesn't depend on a particular compiler version[3], and properly initializes all subobjects. Do that here so there are no more warnings. [1]: https://lore.kernel.org/lkml/022e41c0-8465-dc7a-a45c-64187ecd9684@amd.com/ [2]: https://lore.kernel.org/lkml/20181128.215241.702406654469517539.davem@davemloft.net/ [3]: https://lore.kernel.org/lkml/20181116150432.2408a075@redhat.com/ Fixes: 52c34e6e125c ("net: dsa: sja1105: Add support for ethtool port counters") Link: https://github.com/ClangBuiltLinux/linux/issues/471 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
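The memset form the patch switches to, using the struct name from the warning above, is simply:

    struct sja1105_port_status status;

    memset(&status, 0, sizeof(status));     /* zeroes every member and padding, no subobject ambiguity */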
2019-05-07  vrf: sit mtu should not be updated when vrf netdev is the link  [Stephen Suryaputra]  (1 file, -1/+1)
VRF netdev mtu isn't typically set, so it has an mtu of 65536. When the link of a tunnel is set, the tunnel mtu is changed from 1480 to the link mtu minus the tunnel header. In the case where the VRF netdev is the link, the tunnel mtu then becomes 65516. So, fix it by not setting the tunnel mtu in this case. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: Fix error cleanup path in dsa_init_module  [YueHaibing]  (1 file, -2/+9)
BUG: unable to handle kernel paging request at ffffffffa01c5430 PGD 3270067 P4D 3270067 PUD 3271063 PMD 230bc5067 PTE 0 Oops: 0000 [#1 CPU: 0 PID: 6159 Comm: modprobe Not tainted 5.1.0+ #33 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:raw_notifier_chain_register+0x16/0x40 Code: 63 f8 66 90 e9 5d ff ff ff 90 90 90 90 90 90 90 90 90 90 90 55 48 8b 07 48 89 e5 48 85 c0 74 1c 8b 56 10 3b 50 10 7e 07 eb 12 <39> 50 10 7c 0d 48 8d 78 08 48 8b 40 08 48 85 c0 75 ee 48 89 46 08 RSP: 0018:ffffc90001c33c08 EFLAGS: 00010282 RAX: ffffffffa01c5420 RBX: ffffffffa01db420 RCX: 4fcef45928070a8b RDX: 0000000000000000 RSI: ffffffffa01db420 RDI: ffffffffa01b0068 RBP: ffffc90001c33c08 R08: 000000003e0a33d0 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000094443661 R12: ffff88822c320700 R13: ffff88823109be80 R14: 0000000000000000 R15: ffffc90001c33e78 FS: 00007fab8bd08540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa01c5430 CR3: 00000002297ea000 CR4: 00000000000006f0 Call Trace: register_netdevice_notifier+0x43/0x250 ? 0xffffffffa01e0000 dsa_slave_register_notifier+0x13/0x70 [dsa_core ? 0xffffffffa01e0000 dsa_init_module+0x2e/0x1000 [dsa_core do_one_initcall+0x6c/0x3cc ? do_init_module+0x22/0x1f1 ? rcu_read_lock_sched_held+0x97/0xb0 ? kmem_cache_alloc_trace+0x325/0x3b0 do_init_module+0x5b/0x1f1 load_module+0x1db1/0x2690 ? m_show+0x1d0/0x1d0 __do_sys_finit_module+0xc5/0xd0 __x64_sys_finit_module+0x15/0x20 do_syscall_64+0x6b/0x1d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Cleanup allocated resourses if there are errors, otherwise it will trgger memleak. Fixes: c9eb3e0f8701 ("net: dsa: Add support for learning FDB through notification") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  l2tp: Fix possible NULL pointer dereference  [YueHaibing]  (1 file, -1/+2)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000128 PGD 0 P4D 0 Oops: 0000 [#1 CPU: 0 PID: 5697 Comm: modprobe Tainted: G W 5.1.0-rc7+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:__lock_acquire+0x53/0x10b0 Code: 8b 1c 25 40 5e 01 00 4c 8b 6d 10 45 85 e4 0f 84 bd 06 00 00 44 8b 1d 7c d2 09 02 49 89 fe 41 89 d2 45 85 db 0f 84 47 02 00 00 <48> 81 3f a0 05 70 83 b8 00 00 00 00 44 0f 44 c0 83 fe 01 0f 86 3a RSP: 0018:ffffc90001c07a28 EFLAGS: 00010002 RAX: 0000000000000000 RBX: ffff88822f038440 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000128 RBP: ffffc90001c07a88 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001 R13: 0000000000000000 R14: 0000000000000128 R15: 0000000000000000 FS: 00007fead0811540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000128 CR3: 00000002310da000 CR4: 00000000000006f0 Call Trace: ? __lock_acquire+0x24e/0x10b0 lock_acquire+0xdf/0x230 ? flush_workqueue+0x71/0x530 flush_workqueue+0x97/0x530 ? flush_workqueue+0x71/0x530 l2tp_exit_net+0x170/0x2b0 [l2tp_core ? l2tp_exit_net+0x93/0x2b0 [l2tp_core ops_exit_list.isra.6+0x36/0x60 unregister_pernet_operations+0xb8/0x110 unregister_pernet_device+0x25/0x40 l2tp_init+0x55/0x1000 [l2tp_core ? 0xffffffffa018d000 do_one_initcall+0x6c/0x3cc ? do_init_module+0x22/0x1f1 ? rcu_read_lock_sched_held+0x97/0xb0 ? kmem_cache_alloc_trace+0x325/0x3b0 do_init_module+0x5b/0x1f1 load_module+0x1db1/0x2690 ? m_show+0x1d0/0x1d0 __do_sys_finit_module+0xc5/0xd0 __x64_sys_finit_module+0x15/0x20 do_syscall_64+0x6b/0x1d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fead031a839 Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe8d9acca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 0000560078398b80 RCX: 00007fead031a839 RDX: 0000000000000000 RSI: 000056007659dc2e RDI: 0000000000000003 RBP: 000056007659dc2e R08: 0000000000000000 R09: 0000560078398b80 R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000 R13: 00005600783a04a0 R14: 0000000000040000 R15: 0000560078398b80 Modules linked in: l2tp_core(+) e1000 ip_tables ipv6 [last unloaded: l2tp_core CR2: 0000000000000128 ---[ end trace 8322b2b8bf83f8e1 If alloc_workqueue fails in l2tp_init, l2tp_net_ops is unregistered on failure path. Then l2tp_exit_net is called which will flush NULL workqueue, this patch add a NULL check to fix it. Fixes: 67e04c29ec0d ("l2tp: unregister l2tp_net_ops on failure path") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
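The described fix amounts to a guard around the flush (a sketch; l2tp_wq is the module-wide workqueue named in the trace above):

    if (l2tp_wq)
            flush_workqueue(l2tp_wq);   /* may still be NULL if alloc_workqueue failed in l2tp_init */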
2019-05-07  taprio: add null check on sched_nest to avoid potential null pointer dereference  [Colin Ian King]  (1 file, -0/+2)
The call to nla_nest_start_noflag can return a null pointer and currently this is not being checked and this can lead to a null pointer dereference when the null pointer sched_nest is passed to function nla_nest_end. Fix this by adding in a null pointer check. Addresses-Coverity: ("Dereference null return value") Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
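The shape of the fix (a sketch; the attribute constant and error label are taken as plausible names, not copied from the patch):

    sched_nest = nla_nest_start_noflag(msg, TCA_TAPRIO_ATTR_ADMIN_SCHED);
    if (!sched_nest)
            goto options_error;         /* don't hand a NULL nest to nla_nest_end() */
    /* ... fill in the admin schedule attributes ... */
    nla_nest_end(msg, sched_nest);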
2019-05-07  net: mvpp2: cls: fix less than zero check on a u32 variable  [Colin Ian King]  (1 file, -2/+4)
The signed return from the call to mvpp2_cls_c2_port_flow_index is being assigned to the u32 variable c2.index and then checked for a negative error condition which is always going to be false. Fix this by assigning the return to the int variable index and checking this instead. Addresses-Coverity: ("Unsigned compared against 0") Fixes: 90b509b39ac9 ("net: mvpp2: cls: Add Classification offload support") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
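A sketch of the before/after (function arguments are illustrative):

    /* Before: the signed return is stored in a u32, so the check never fires. */
    c2.index = mvpp2_cls_c2_port_flow_index(port, loop);
    if (c2.index < 0)           /* always false for an unsigned type */
            return;

    /* After: keep the return value in an int until it has been checked. */
    index = mvpp2_cls_c2_port_flow_index(port, loop);
    if (index < 0)
            return;
    c2.index = index;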
2019-05-07  RDMA/ipoib: Allow user space differentiate between valid dev_port  [Leon Romanovsky]  (1 file, -1/+12)
Systemd triggers the following warning during IPoIB device load: mlx5_core 0000:00:0c.0 ib0: "systemd-udevd" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info. This is caused due to user space attempt to differentiate old systems without dev_port and new systems with dev_port. In case dev_port will be zero, the systemd will try to read dev_id instead. There is no need to print a warning in such case, because it is valid situation and it is needed to ensure systemd compatibility with old kernels. Link: https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-net_id.c#L358 Cc: <stable@vger.kernel.org> # 4.19 Fixes: f6350da41dc7 ("IB/ipoib: Log sysfs 'dev_id' accesses from userspace") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07  net_sched: sch_fq: handle non connected flows  [Eric Dumazet]  (1 file, -2/+13)
FQ packet scheduler assumed that packets could be classified based on their owning socket. This means that if a UDP server uses one UDP socket to send packets to different destinations, packets all land in one FQ flow. This is unfair, since each TCP flow has a unique bucket, meaning that in case of pressure (fully utilised uplink), TCP flows have more share of the bandwidth. If we instead detect unconnected sockets, we can use a stochastic hash based on the 4-tuple hash. This also means a QUIC server using one UDP socket will properly spread the outgoing packets to different buckets, and in-kernel pacing based on EDT model will no longer risk having big rb-tree on one flow. Note that UDP application might provide the skb->hash in an ancillary message at sendmsg() time to avoid the cost of a dissection in fq packet scheduler. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
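One way to express the bucket-selection idea (a sketch only, not the actual sch_fq classification code): connected sockets keep their per-socket flow, while unconnected ones fall back to a stochastic key derived from the packet's flow hash:

    u32 key;

    if (sk_fullsock(sk) && inet_sk(sk)->inet_dport)     /* connected socket */
            key = (u32)(unsigned long)sk;               /* one flow per socket */
    else
            key = skb_get_hash(skb);                    /* stochastic, per 4-tuple */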
2019-05-07  net_sched: sch_fq: do not assume EDT packets are ordered  [Eric Dumazet]  (1 file, -12/+83)
The TCP stack makes sure packets for a given flow have monotonically increasing timestamps, but we want to allow UDP packets to use EDT as well, so that QUIC servers can use in-kernel pacing. This patch adds a per-flow rb-tree on which packets might be stored. We still try to use the linear list for the typical cases where packets are queued with monotonically increasing skb->tstamp, since queueing/dequeueing packets on a standard list is O(1). Note that the ability to store packets in arbitrary EDT order will later allow us to implement a per-TCP-socket mechanism adding delays (with jitter eventually) and reorders, to implement convenient network emulators. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  IB/core, ipoib: Do not overreact to SM LID change event  [Dennis Dalessandro]  (3 files, -4/+1)
When IPoIB receives an SM LID change event, it reacts by flushing its path record cache and rejoining multicast groups. This is the same behavior it performs when it receives a reregistration event. This behavior is unnecessary as an SM may have database backup or synchronization mechanisms which permit the SM location or LID to change without loss of multicast membership and without impact to path records. Both opensm and the OPA FM issue reregistration events if a new SM is started (or restarted with a new config) or an SM event occurs which results in loss of multicast membership records by the SM (such as opensm failover) or the SM encounters new nodes with Active ports (such as after joining 2 fabrics by connecting switches via ISLs). Hence this event can be depended on as the trigger for IPoIB cache and multicast flushing. It appears that some drivers, such as qib, and hfi1 issue the IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not. Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed that Mellanox EDR HCAs do not generate SM change events and that opensm does generate reregistration. An SM LID change event is generated by the mentioned drivers to reflect that sm_lid and/or sm_sl in the local port info has changed. The intent of this event is to permit applications and ULPs which have a local copy of this information (or an address handle using it) to update their information. The intent is that the reregistration event (caused by the SM via a bit in Set(PortInfo)) be used to inform nodes that they need to rejoin multicast groups, resubscribe for notices and potentially update path records. When an SM migrates or fails over, a SM LID change event can occur. In response IPoIB discards path records and multicast membership and loses connectivity until these records are restored via SA requests. In very large fabrics, it may take minutes for the SM to be ready and for the SA responses to be supplied. This can result in undesirable and unnecessary IPoIB connectivity impacts. It also can result in an unnecessary storm of SA queries from all nodes in a cluster potentially followed by yet another storm if the SM issues the reregistration request. The fact the Mellanox HCAs do not even generate this event, is further evidence that on modern IB fabrics there will be no ill side effects from the proposed changes below to reduce the reaction by 3 kernel components to this event. So these changes should be benign for Mellanox IB fabrics and will benefit OPA fabrics while also making ib_core and ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event APIs. Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib. IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do anything on SM LID change. IPoIB makes use of other ib_core components to issue SA requests for it and those components correctly track SM LID and SM LID changes. Also in ib_core multicast handling, remove the test for IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the error state, which will trigger rejoins. This code is used by IPoIB as well as the connection manager and other clients of multicast groups. This kernel module centralizes group membership status and joins since a node can only join a given group once but multiple ULPs or applications may want to join the same group. It makes use of the sa_query.c component in ib_core, which correctly trackes SM LID and SL. 
This component does not track SM LID nor SL itself and hence need not react to their changes. Similarly, in the ib_core cache code, remove the handling for IB_EVENT_SM_CHANGE. In this function, the ib_cache_update function, which is ultimately called, updates local copies of the pkey table, gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As such it does not need to be called on an SM LID change. It technically also does not need to be called on a reregistration. The LID_CHANGE, PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR, PORT_ACTIVE) should be sufficient triggers. It is worth noting that the alternative of simply having the hfi1 and qib drivers not generate the SM LID change event was explored. While this would duplicate what Mellanox drivers do now, it is not the correct behavior and removes the ability for an SM to migrate without requiring reregistration. Since both opensm and the OPA SM have mechanisms to back up or synchronize registration information, it is desirable to let them perform SM migrations (with LID or SL changes) without requiring reregistration when they deem it appropriate. Suggested-by: Todd Rimmer <todd.rimmer@intel.com> Tested-by: Michael Brooks <michael.brooks@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Todd Rimmer <todd.rimmer@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07  net: hns3: use devm_kcalloc when allocating desc_cb  [Yunsheng Lin]  (1 file, -4/+4)
This patch uses devm_kcalloc instead of kcalloc when allocating ring->desc_cb, because devm_kcalloc not only ensures the memory is freed when the dev is deallocated, but also allocates the memory from the device's memory node. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
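The swap is a one-liner per allocation; the devm_ variant ties the allocation's lifetime to the device and allocates on the device's NUMA node (a sketch; ring_to_dev() is the hns3 helper mentioned in the next entry):

    /* Before */
    ring->desc_cb = kcalloc(ring->desc_num, sizeof(ring->desc_cb[0]), GFP_KERNEL);

    /* After: freed automatically when the device goes away */
    ring->desc_cb = devm_kcalloc(ring_to_dev(ring), ring->desc_num,
                                 sizeof(ring->desc_cb[0]), GFP_KERNEL);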
2019-05-07  net: hns3: some cleanup for struct hns3_enet_ring  [Yunsheng Lin]  (2 files, -9/+2)
This patch removes some unused fields in struct hns3_enet_ring, uses ring->dev for the ring_to_dev macro, and uses dev consistently in hns3_fill_desc. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: unify the page reusing for page size 4K and 64K  [Yunsheng Lin]  (1 file, -32/+13)
When the page size is 64K, the RX buffer is currently not reused when page_offset has moved to the last buffer. This patch adds checking to decide whether the buffer page can be reused when last_offset has moved beyond the last offset. If the driver is the only user of the page when page_offset moves beyond the last offset, then the buffer can be reused and page_offset is set to zero. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: optimize the barrier using when cleaning TX BD  [Yunsheng Lin]  (1 file, -14/+15)
Currently, a barrier is used when cleaning each TX BD, which may cause performance degradation. This patch optimizes it to use one barrier when cleaning TX BD each round. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: fix error handling for desc filling  [Yunsheng Lin]  (1 file, -11/+7)
When desc filling fails in hns3_nic_net_xmit, it will call hns3_clear_desc to unmap the dma mapping. But currently the ring->next_to_use points to the desc where the desc filling or dma mapping return error, which means the desc that ring->next_to_use points to has not done the dma mapping, the desc that need unmapping is before the ring->next_to_use. This patch fixes it by calling ring_ptr_move_bw(next_to_use) before doing unmapping operation, and set desc_cb->dma to zero to avoid freeing it again when unloading. Also, when filling skb head or frag fails, both need to unmap all the way back to next_to_use_head, so remove one desc filling error handling. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: combine len and checksum handling for inner and outer header  [Yunsheng Lin]  (1 file, -111/+81)
When filling len and checksum info into the descriptor, there is some similar checking or calculation. So this patch adds hns3_set_l2l3l4 to fill the inner (or normal) header's len and checksum info. If it is an encapsulation skb, it calls hns3_set_outer_l2l3l4 to handle the outer header's len and checksum info, in order to avoid some similar checking or calculation. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: refactor BD filling for l2l3l4 info  [Yunsheng Lin]  (1 file, -39/+23)
This patch separates the inner and outer l2l3l4 len handling in hns3_set_l2l3l4_len, this is a preparation to combine the l2l3l4 len and checksum handling for inner and outer header. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: fix for tunnel type handling in hns3_rx_checksum  [Yunsheng Lin]  (1 file, -6/+8)
According to hardware user manual, the tunnel packet type is available in the rx.ol_info field of struct hns3_desc. Currently the tunnel packet type is decided by the rx.l234_info, which may cause RX checksum handling error. This patch fixes it by using the correct field in struct hns3_desc to decide the tunnel packet type. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: add linearizing checking for TSO case  [Yunsheng Lin]  (1 file, -0/+45)
HW requires every continuous run of 8 buffers of data to be larger than the MSS; we simplify this by ensuring that skb_headlen plus the first continuous 7 frags is larger than the GSO header len + mss, and that the remaining continuous 7 frags are larger than the MSS, except for the last 7 frags. This patch adds hns3_skb_need_linearized to handle this for the TSO case. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
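A simplified sketch of the windowed check described above (illustrative only; the real helper slides the window across all frags, this shows just the first window of head + 7 frags against header length + MSS):

    static bool example_tso_first_window_too_small(struct sk_buff *skb,
                                                   unsigned int mss)
    {
            unsigned int hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
            unsigned int win = skb_headlen(skb);
            int i;

            for (i = 0; i < 7 && i < skb_shinfo(skb)->nr_frags; i++)
                    win += skb_frag_size(&skb_shinfo(skb)->frags[i]);

            return win < hdr_len + mss;     /* true => skb needs linearizing */
    }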
2019-05-07  net: hns3: add counter for times RX pages gets allocated  [Yunsheng Lin]  (3 files, -0/+6)
Currently, using "ethtool --statistics" can show how many times an RX page has been reused, but there is no counter for RX pages not being reused. This patch adds a non_reuse_pg counter to better debug the performance issue, because it is hard to determine whether the RX page is being reused or not if there is no such counter. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: hns3: use napi_schedule_irqoff in hard interrupts handlers  [Yunsheng Lin]  (1 file, -1/+1)
napi_schedule_irqoff is introduced to be used from hard interrupts handlers or when irqs are already masked, see: https://lists.openwall.net/netdev/2014/10/29/2 So this patch replaces napi_schedule with napi_schedule_irqoff. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
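A typical hard-irq handler using the _irqoff variant (names are illustrative):

    static irqreturn_t example_ring_irq_handler(int irq, void *data)
    {
            struct example_ring *ring = data;

            /* Hard irq context: interrupts are already off, so skip the
             * redundant local_irq_save/restore that napi_schedule() does. */
            napi_schedule_irqoff(&ring->napi);
            return IRQ_HANDLED;
    }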
2019-05-07  net: hns3: unify maybe_stop_tx for TSO and non-TSO case  [Yunsheng Lin]  (3 files, -86/+50)
Currently, the maybe_stop_tx ops for the TSO and non-TSO cases share some BD calculation code, so this patch unifies maybe_stop_tx by removing the maybe_stop_tx ops. skb_is_gso() can be used to differentiate between the TSO and non-TSO cases if there is a need to handle a special case for TSO. This patch also adds a tx_copy field to "ethtool --statistics" to help better debug the performance issue caused by calling skb_copy. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: lantiq: Add Forwarding Database access  [Hauke Mehrtens]  (1 file, -0/+98)
This adds functions to add and remove static entries to and from the forwarding database and dump the full forwarding database. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: lantiq: Add fast age function  [Hauke Mehrtens]  (1 file, -0/+40)
Fast aging per port is not supported directly by the hardware, it is only possible to configure a global aging time. Do the fast aging by iterating over the MAC forwarding table and remove all dynamic entries for a given port. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: lantiq: Add VLAN aware bridge offloading  [Hauke Mehrtens]  (1 file, -4/+197)
The VLAN aware bridge offloading is similar to the VLAN unaware offloading, this makes it possible to offload the VLAN bridge functionalities. The hardware supports up to 64 VLAN bridge entries, we already use one entry for each LAN port to prevent forwarding of packets between the ports when the ports are not in a bridge, so in the end we have 57 possible VLANs. The VLAN filtering is currently only active when the ports are in a bridge, VLAN filtering for ports not in a bridge is not implemented. It is currently not possible to change between VLAN filtering and not filtering while the port is already in a bridge, this would make the driver more complicated. The VLANs are only defined on bridge entries, so we will not add anything into the hardware when the port joins a bridge if it is doing VLAN filtering, but only when an allowed VLAN is added. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: lantiq: Add VLAN unaware bridge offloading  [Hauke Mehrtens]  (1 file, -7/+468)
This allows offloading bridges with DSA to the switch hardware and doing the packet forwarding in hardware. It implements generic functions to access the switch hardware tables, which are used to control many features of the switch. This patch activates MAC learning by removing the MAC address table lock. To prevent uncontrolled forwarding of packets between all the LAN ports, they are added into individual bridge table entries with individual flow IDs, and the switch will do the MAC learning for each port separately before they are added to a real bridge. Each bridge consists of an entry in the active VLAN table and the VLAN mapping table; table entries with the same index match. In the VLAN unaware mode we configure everything with VLAN ID 0 but use different flow IDs; the switch should handle all VLANs as normal payload and ignore them. When the hardware looks for the port of the destination MAC address, it only takes the entries which have the same flow ID as the ingress packet. The bridges are configured with 64 possible entries, each carrying this information:
Table index, 0...63
VLAN ID, 0...4095: VLAN ID 0 is untagged
Flow ID, 0...63: same flow IDs share entries in the MAC learning table
Port map, one bit for each port number
Tagged port map, one bit for each port number
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07  net: dsa: lantiq: Allow special tags only on CPU port  [Hauke Mehrtens]  (1 file, -2/+4)
Allow the special tag on ingress only on the CPU port and not on all ports. A packet with a special tag could circumvent the hardware forwarding and should only be allowed on the CPU port, where Linux controls the port. Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200") Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>