path: root/net/sched
Age  Commit message  Author  Files  Lines
2019-09-16  net: sched: take reference to psample group in flow_action infra  (Vlad Buslov, 2 files, -2/+38)
With the recent patch set that removed the rtnl lock dependency from the cls hardware offload API, the rtnl lock is only taken when reading action data and can be released after the action-specific data is parsed into the intermediate representation. However, the sample action's psample group is passed by pointer without first obtaining a reference to it, which makes it possible to concurrently overwrite the action and deallocate the object pointed to by the psample_group pointer after the rtnl lock is released but before the driver has finished using the pointer. To prevent such a race condition, obtain a reference to the psample group while it is used by the flow_action infra. Extend the psample API with a function psample_group_take() that increments the psample group reference counter. Extend struct tc_action_ops with a new get_psample_group() API. Implement the API for action sample using psample_group_take() and the already existing psample_group_put() as a destructor. Use it in tc_setup_flow_action() to take a reference to the psample group pointed to by entry->sample.psample_group and release it in tc_cleanup_flow_action(). Disable bh when taking psample_groups_lock: the lock is now taken while holding the action tcf_lock that is used by the data path and requires bh to be disabled, so doing the same for psample_groups_lock is necessary to preserve SOFTIRQ-irq-safety. Fixes: 918190f50eb6 ("net: sched: flower: don't take rtnl lock for cls hw offloads API") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
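A minimal sketch of the take side, assuming the group keeps a plain integer refcount under psample_groups_lock (the function, field, and lock names follow the commit text; the body is illustrative, not the verbatim upstream code):

    void psample_group_take(struct psample_group *group)
    {
            /* bh-disabled: callers may hold the data-path tcf_lock */
            spin_lock_bh(&psample_groups_lock);
            group->refcount++;
            spin_unlock_bh(&psample_groups_lock);
    }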
2019-09-16  net: sched: extend flow_action_entry with destructor  (Vlad Buslov, 1 file, -32/+45)
Generalize flow_action_entry cleanup by extending the structure with a pointer to a destructor function. Set the destructor in tc_setup_flow_action(). Refactor tc_cleanup_flow_action() to call entry->destructor() instead of using a switch that dispatches by entry->id and manually executes the cleanup. This refactoring is necessary for the following patches in this series, which require the destructor to use tc_action->ops callbacks that can't be easily obtained in tc_cleanup_flow_action(). Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
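Sketched shape of the change (field names per the commit text; surrounding fields abridged):

    struct flow_action_entry {
            enum flow_action_id     id;
            /* ... action-specific fields ... */
            void                    (*destructor)(void *priv);
            void                    *destructor_priv;
    };

    static void tc_cleanup_flow_action(struct flow_action *flow_action)
    {
            struct flow_action_entry *entry;
            int i;

            /* one indirect call replaces the old switch on entry->id */
            flow_action_for_each(i, entry, flow_action)
                    if (entry->destructor)
                            entry->destructor(entry->destructor_priv);
    }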
2019-09-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller, 3 files, -3/+10)
Minor overlapping changes in the btusb and ixgbe drivers. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13  netfilter: update include directives.  (Jeremy Sowden, 1 file, -1/+1)
Include some headers in files which require them, and remove others which are not required. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-10  sch_hhf: ensure quantum and hhf_non_hh_weight are non-zero  (Cong Wang, 1 file, -1/+1)
If TCA_HHF_NON_HH_WEIGHT or TCA_HHF_QUANTUM is zero, the loop in hhf_dequeue() would make no progress and the kernel would get stuck. Fix this by checking for this corner case in hhf_change(). Fixes: 10239edf86f1 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc") Reported-by: syzbot+bc6297c11f19ee807dc2@syzkaller.appspotmail.com Reported-by: syzbot+041483004a7f45f1f20a@syzkaller.appspotmail.com Reported-by: syzbot+55be5f513bed37fc4367@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Terry Lam <vtlam@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
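A sketch of the guard in hhf_change(); the local variable names are assumptions, but the point is to reject a zero product before hhf_dequeue() can ever spin on it:

    non_hh_quantum = (u64)new_quantum * new_hhf_non_hh_weight;
    if (non_hh_quantum == 0 || non_hh_quantum > INT_MAX)
            return -EINVAL;     /* zero would stall the dequeue loop */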
2019-09-10  net_sched: check cops->tcf_block in tc_bind_tclass()  (Cong Wang, 1 file, -0/+2)
At least sch_red and sch_tbf don't implement ->tcf_block() while still having a non-zero tc "class". Instead of adding nop implementations to each such qdisc, we can just relax the check of cops->tcf_block() in tc_bind_tclass(). They don't support TC filters anyway. Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-06  net: sched: fix reordering issues  (Eric Dumazet, 1 file, -2/+7)
Whenever MQ is not used on a multiqueue device, we experience serious reordering problems. Bisection found the cited commit. The issue can be described this way: - A single qdisc hierarchy is shared by all transmit queues. (e.g.: tc qdisc replace dev eth0 root fq_codel) - When/if try_bulk_dequeue_skb_slow() dequeues a packet targeting a different transmit queue than the one used to build a packet train, we stop building the current list and save the 'bad' skb (P1) in a special queue. (bad_txq) - When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this skb (P1), it checks if the associated transmit queue is still in frozen state. If the queue is still blocked (by BQL or NIC tx ring full), we leave the skb in bad_txq and return NULL. - dequeue_skb() calls q->dequeue() to get another packet (P2). The other packet can target the problematic queue (that we found in frozen state for the bad_txq packet), but another cpu just ran TX completion and made room in the txq that is now ready to accept new packets. - Packet P2 is sent while P1 is still held in bad_txq; P1 might be sent at the next round. In practice P2 is the lead of a big packet train (P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/ To solve this problem, we have to block the dequeue process as long as the first packet in bad_txq cannot be sent. Reordering issues disappear and no side effects have been seen. Fixes: a53851e2c321 ("net: sched: explicit locking in gso_cpu fallback") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
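The intent of the fix, sketched (the SKB_XOFF_MAGIC sentinel name follows the upstream patch; treat the surrounding code as illustrative):

    /* qdisc_dequeue_skb_bad_txq(): txq still frozen -> report it */
    if (netif_xmit_frozen_or_stopped(txq))
            return SKB_XOFF_MAGIC;

    /* dequeue_skb(): never dequeue past a still-blocked head packet */
    skb = qdisc_dequeue_skb_bad_txq(q);
    if (unlikely(skb)) {
            if (skb == SKB_XOFF_MAGIC)
                    return NULL;    /* P1 keeps its place; P2 cannot overtake */
            return skb;             /* txq thawed: P1 goes out first */
    }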
2019-09-06  net_sched: act_police: add 2 new attributes to support police 64bit rate and peakrate  (David Dai, 1 file, -4/+23)
For a high speed adapter like the Mellanox CX-5 card, bandwidth can reach up to 100 Gbits per second. Currently htb already supports 64bit rates in the tc utility. However, police action rate and peakrate are still limited to 32bit values (up to 32 Gbits per second). Add 2 new attributes, TCA_POLICE_RATE64 and TCA_POLICE_PEAKRATE64, in the kernel for 64bit support, so that the tc utility can use them for 64bit rate and peakrate values to break the 32bit limit while keeping backward binary compatibility. Tested-by: David Dai <zdai@linux.vnet.ibm.com> Signed-off-by: David Dai <zdai@linux.vnet.ibm.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
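A sketch of how tcf_police_init() could pick up the wider values while staying binary compatible (local names are assumptions):

    u64 rate64 = 0, peakrate64 = 0;

    if (tb[TCA_POLICE_RATE64])
            rate64 = nla_get_u64(tb[TCA_POLICE_RATE64]);
    if (tb[TCA_POLICE_PEAKRATE64])
            peakrate64 = nla_get_u64(tb[TCA_POLICE_PEAKRATE64]);

    /* a non-zero 64bit value overrides the legacy 32bit rate */
    psched_ratecfg_precompute(&police->rate, &parm->rate, rate64);
    psched_ratecfg_precompute(&police->peak, &parm->peakrate, peakrate64);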
2019-09-06  net: openvswitch: Set OvS recirc_id from tc chain index  (Paul Blakey, 2 files, -0/+25)
Offloaded OvS datapath rules are translated one to one to tc rules, for example the following simplified OvS rule: recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2) will be translated to the following tc rule: $ tc filter add dev dev1 ingress \ prio 1 chain 0 proto ip \ flower tcp ct_state -trk \ action ct pipe \ action goto chain 2 Received packets will first travel through tc, and if they aren't stolen by it, like in the above rule, they will continue to the OvS datapath. Since we already did some actions (action ct in this case) which might modify the packets, and updated action stats, we would like to continue the processing with the correct recirc_id in OvS (here recirc_id(2)) where we left off. To support this, introduce a new skb extension for tc, which will be used for translating the tc chain to the ovs recirc_id to handle these miss cases. The last tc chain index will be set by the tc goto chain action and read by the OvS datapath. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
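A sketch of the two halves, with the TC_SKB_EXT id from the commit and the rest illustrative (error handling abridged):

    /* tc side, on a miss after a goto chain: remember where we stopped */
    struct tc_skb_ext *ext = skb_ext_add(skb, TC_SKB_EXT);
    if (!ext)
            return TC_ACT_SHOT;
    ext->chain = chain_index;

    /* OvS datapath side: resume with the saved chain as recirc_id */
    ext = skb_ext_find(skb, TC_SKB_EXT);
    if (ext)
            key->recirc_id = ext->chain;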
2019-09-05  net/sched: cbs: remove redundant assignment to variable port_rate  (Colin Ian King, 1 file, -1/+1)
Variable port_rate is being initialized with a value that is never read and is being re-assigned a little later on. The assignment is redundant and hence can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-02  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller, 22 files, -52/+70)
r8152 conflicts are the NAPI fixes in 'net' overlapping with some tasklet stuff in net-next Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31  net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate  (Vladimir Oltean, 1 file, -8/+11)
The discussion to be made is absolutely the same as in the case of the previous patch ("taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte"). Nothing is lost when setting a default. Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Fixes: e0a7683d30e9 ("net/sched: cbs: fix port_rate miscalculation") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31  taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte  (Vladimir Oltean, 1 file, -10/+13)
The taprio budget needs to be adapted at runtime according to interface link speed. But that handling is problematic. For one thing, installing a qdisc on an interface that doesn't have carrier is not illegal. But taprio prints the following stack trace: [ 31.851373] ------------[ cut here ]------------ [ 31.856024] WARNING: CPU: 1 PID: 207 at net/sched/sch_taprio.c:481 taprio_dequeue+0x1a8/0x2d4 [ 31.864566] taprio: dequeue() called with unknown picos per byte. [ 31.864570] Modules linked in: [ 31.873701] CPU: 1 PID: 207 Comm: tc Not tainted 5.3.0-rc5-01199-g8838fe023cd6 #1689 [ 31.881398] Hardware name: Freescale LS1021A [ 31.885661] [<c03133a4>] (unwind_backtrace) from [<c030d8cc>] (show_stack+0x10/0x14) [ 31.893368] [<c030d8cc>] (show_stack) from [<c10ac958>] (dump_stack+0xb4/0xc8) [ 31.900555] [<c10ac958>] (dump_stack) from [<c0349d04>] (__warn+0xe0/0xf8) [ 31.907395] [<c0349d04>] (__warn) from [<c0349d64>] (warn_slowpath_fmt+0x48/0x6c) [ 31.914841] [<c0349d64>] (warn_slowpath_fmt) from [<c0f38db4>] (taprio_dequeue+0x1a8/0x2d4) [ 31.923150] [<c0f38db4>] (taprio_dequeue) from [<c0f227b0>] (__qdisc_run+0x90/0x61c) [ 31.930856] [<c0f227b0>] (__qdisc_run) from [<c0ec82ac>] (net_tx_action+0x12c/0x2bc) [ 31.938560] [<c0ec82ac>] (net_tx_action) from [<c0302298>] (__do_softirq+0x130/0x3c8) [ 31.946350] [<c0302298>] (__do_softirq) from [<c03502a0>] (irq_exit+0xbc/0xd8) [ 31.953536] [<c03502a0>] (irq_exit) from [<c03a4808>] (__handle_domain_irq+0x60/0xb4) [ 31.961328] [<c03a4808>] (__handle_domain_irq) from [<c0754478>] (gic_handle_irq+0x58/0x9c) [ 31.969638] [<c0754478>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90) [ 31.977076] Exception stack(0xe8167b20 to 0xe8167b68) [ 31.982100] 7b20: e9d4bd80 00000cc0 000000cf 00000000 e9d4bd80 c1f38958 00000cc0 c1f38960 [ 31.990234] 7b40: 00000001 000000cf 00000004 e9dc0800 00000000 e8167b70 c0f478ec c0f46d94 [ 31.998363] 7b60: 60070013 ffffffff [ 32.001833] [<c0301a8c>] (__irq_svc) from [<c0f46d94>] (netlink_trim+0x18/0xd8) [ 32.009104] [<c0f46d94>] (netlink_trim) from [<c0f478ec>] (netlink_broadcast_filtered+0x34/0x414) [ 32.017930] [<c0f478ec>] (netlink_broadcast_filtered) from [<c0f47cec>] (netlink_broadcast+0x20/0x28) [ 32.027102] [<c0f47cec>] (netlink_broadcast) from [<c0eea378>] (rtnetlink_send+0x34/0x88) [ 32.035238] [<c0eea378>] (rtnetlink_send) from [<c0f25890>] (notify_and_destroy+0x2c/0x44) [ 32.043461] [<c0f25890>] (notify_and_destroy) from [<c0f25e08>] (qdisc_graft+0x398/0x470) [ 32.051595] [<c0f25e08>] (qdisc_graft) from [<c0f27a00>] (tc_modify_qdisc+0x3a4/0x724) [ 32.059470] [<c0f27a00>] (tc_modify_qdisc) from [<c0ee4c84>] (rtnetlink_rcv_msg+0x260/0x2ec) [ 32.067864] [<c0ee4c84>] (rtnetlink_rcv_msg) from [<c0f4a988>] (netlink_rcv_skb+0xb8/0x110) [ 32.076172] [<c0f4a988>] (netlink_rcv_skb) from [<c0f4a170>] (netlink_unicast+0x1b4/0x22c) [ 32.084392] [<c0f4a170>] (netlink_unicast) from [<c0f4a5e4>] (netlink_sendmsg+0x33c/0x380) [ 32.092614] [<c0f4a5e4>] (netlink_sendmsg) from [<c0ea9f40>] (sock_sendmsg+0x14/0x24) [ 32.100403] [<c0ea9f40>] (sock_sendmsg) from [<c0eaa780>] (___sys_sendmsg+0x214/0x228) [ 32.108279] [<c0eaa780>] (___sys_sendmsg) from [<c0eabad0>] (__sys_sendmsg+0x50/0x8c) [ 32.116068] [<c0eabad0>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x54) [ 32.123938] Exception stack(0xe8167fa8 to 0xe8167ff0) [ 32.128960] 7fa0: b6fa68c8 000000f8 00000003 bea142d0 00000000 00000000 [ 32.137093] 7fc0: b6fa68c8 000000f8 0052154c 00000128 5d6468a2 00000000 00000028 00558c9c [ 32.145224] 7fe0: 00000070 bea14278 00530d64 b6e17e64 [ 32.150659] ---[ end trace 2139c9827c3e5177 ]--- This happens because the qdisc ->dequeue callback gets called. This again is not illegal: the qdisc will dequeue even when the interface is up but doesn't have carrier (and hence SPEED_UNKNOWN), and the frames will be dropped further down the stack in dev_direct_xmit(). And, at the end of the day, for what? For calculating the initial budget of an interface which is non-operational at the moment and where frames will get dropped anyway. So if we can't figure out the link speed, default to SPEED_10 and move along. We can also remove the runtime check now. Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
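The resulting logic in taprio_set_picos_per_byte(), sketched (the arithmetic here is a simplification; speed is in Mbps, so SPEED_10 yields 800,000 ps per byte):

    struct ethtool_link_ksettings ecmd;
    int speed = SPEED_10;   /* default when speed cannot be determined */

    if (!__ethtool_get_link_ksettings(dev, &ecmd) &&
        ecmd.base.speed != SPEED_UNKNOWN)
            speed = ecmd.base.speed;

    /* 8 * 10^6 / speed(Mbps) = picoseconds per byte; never "unknown" */
    picos_per_byte = (USEC_PER_SEC * 8) / speed;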
2019-08-31  taprio: Fix kernel panic in taprio_destroy  (Vladimir Oltean, 1 file, -4/+4)
taprio_init may fail earlier than this line: list_add(&q->taprio_list, &taprio_list); i.e. due to the net device not being multi queue. Attempting to remove q from the global taprio_list when it is not part of it will result in a kernel panic. Fix it by matching list_add and list_del better to one another in the order of operations. This way we can keep the deletion unconditional and with lower complexity - O(1). Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30  net: sched: cls_matchall: cleanup flow_action before deallocating  (Vlad Buslov, 1 file, -0/+2)
Recent rtnl lock removal patch changed flow_action infra to require proper cleanup besides simple memory deallocation. However, matchall classifier was not updated to call tc_cleanup_flow_action(). Add proper cleanup to mall_replace_hw_filter() and mall_reoffload(). Fixes: 5a6ff4b13d59 ("net: sched: take reference to action dev before calling offloads") Reported-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30  sched: act_vlan: implement stats_update callback  (Jiri Pirko, 1 file, -0/+14)
Implement this callback in order to get the offloaded stats added to the kernel stats. Reported-by: Pengfei Liu <pengfeil@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-28  net/sched: pfifo_fast: fix wrong dereference in pfifo_fast_enqueue  (Davide Caratti, 1 file, -2/+6)
Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of 'TCQ_F_NOLOCK' bit in the parent qdisc, we can't assume anymore that per-cpu counters are there in the error path of skb_array_produce(). Otherwise, the following splat can be seen: Unable to handle kernel paging request at virtual address 0000600dea430008 Mem abort info: ESR = 0x96000005 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000005 CM = 0, WnR = 0 user pgtable: 64k pages, 48-bit VAs, pgdp = 000000007b97530e [0000600dea430008] pgd=0000000000000000, pud=0000000000000000 Internal error: Oops: 96000005 [#1] SMP [...] pstate: 10000005 (nzcV daif -PAN -UAO) pc : pfifo_fast_enqueue+0x524/0x6e8 lr : pfifo_fast_enqueue+0x46c/0x6e8 sp : ffff800d39376fe0 x29: ffff800d39376fe0 x28: 1ffff001a07d1e40 x27: ffff800d03e8f188 x26: ffff800d03e8f200 x25: 0000000000000062 x24: ffff800d393772f0 x23: 0000000000000000 x22: 0000000000000403 x21: ffff800cca569a00 x20: ffff800d03e8ee00 x19: ffff800cca569a10 x18: 00000000000000bf x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: ffff1001a726edd0 x13: 1fffe4000276a9a4 x12: 0000000000000000 x11: dfff200000000000 x10: ffff800d03e8f1a0 x9 : 0000000000000003 x8 : 0000000000000000 x7 : 00000000f1f1f1f1 x6 : ffff1001a726edea x5 : ffff800cca56a53c x4 : 1ffff001bf9a8003 x3 : 1ffff001bf9a8003 x2 : 1ffff001a07d1dcb x1 : 0000600dea430000 x0 : 0000600dea430008 Process ping (pid: 6067, stack limit = 0x00000000dc0aa557) Call trace: pfifo_fast_enqueue+0x524/0x6e8 htb_enqueue+0x660/0x10e0 [sch_htb] __dev_queue_xmit+0x123c/0x2de0 dev_queue_xmit+0x24/0x30 ip_finish_output2+0xc48/0x1720 ip_finish_output+0x548/0x9d8 ip_output+0x334/0x788 ip_local_out+0x90/0x138 ip_send_skb+0x44/0x1d0 ip_push_pending_frames+0x5c/0x78 raw_sendmsg+0xed8/0x28d0 inet_sendmsg+0xc4/0x5c0 sock_sendmsg+0xac/0x108 __sys_sendto+0x1ac/0x2a0 __arm64_sys_sendto+0xc4/0x138 el0_svc_handler+0x13c/0x298 el0_svc+0x8/0xc Code: f9402e80 d538d081 91002000 8b010000 (885f7c03) Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags', before dereferencing 'qdisc->cpu_qstats'. Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too") CC: Paolo Abeni <pabeni@redhat.com> CC: Stefano Brivio <sbrivio@redhat.com> Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
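The shape of the fix in pfifo_fast_enqueue(), sketched: test the flag before touching per-cpu state (qdisc_is_percpu_stats() checks TCQ_F_CPUSTATS; helper names per the commit context):

    if (unlikely(err)) {
            if (qdisc_is_percpu_stats(qdisc))
                    return qdisc_drop_cpu(skb, qdisc, to_free);
            else
                    return qdisc_drop(skb, qdisc, to_free);
    }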
2019-08-28  net: sched: act_sample: fix psample group handling on overwrite  (Vlad Buslov, 1 file, -1/+5)
Action sample doesn't properly handle the psample_group pointer in the overwrite case. The following issues need to be fixed: - In the tcf_sample_init() function RCU_INIT_POINTER() is used to set s->psample_group, even though we are neither setting the pointer to NULL nor preventing concurrent readers from accessing the pointer in some way. Use rcu_swap_protected() instead to safely reset the pointer. - The old value of s->psample_group is not released or deallocated in any way, which results in a resource leak. Use psample_group_put() on the non-NULL value obtained with rcu_swap_protected(). - The function psample_group_put() that releases the reference to the struct psample_group pointed to by the rcu-pointer s->psample_group doesn't respect the rcu grace period when deallocating it. Extend struct psample_group with an rcu head and use kfree_rcu when freeing it. Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
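A sketch of the overwrite path after the fix (locking per the commit text; psample_group_put() now defers the actual free with kfree_rcu):

    spin_lock_bh(&s->tcf_lock);
    /* publish the new group; safe against concurrent RCU readers */
    rcu_swap_protected(s->psample_group, psample_group,
                       lockdep_is_held(&s->tcf_lock));
    spin_unlock_bh(&s->tcf_lock);

    if (psample_group)
            psample_group_put(psample_group);   /* release the old group */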
2019-08-28  net/sched: pfifo_fast: fix wrong dereference when qdisc is reset  (Davide Caratti, 1 file, -4/+7)
Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of 'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu counters are present when 'reset()' is called for pfifo_fast qdiscs. Otherwise, the following script: # tc q a dev lo handle 1: root htb default 100 # tc c a dev lo parent 1: classid 1:100 htb \ > rate 95Mbit ceil 100Mbit burst 64k [...] # tc f a dev lo parent 1: protocol arp basic classid 1:100 [...] # tc q a dev lo parent 1:100 handle 100: pfifo_fast [...] # tc q d dev lo root can generate the following splat: Unable to handle kernel paging request at virtual address dfff2c01bd148000 Mem abort info: ESR = 0x96000004 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [dfff2c01bd148000] address between user and kernel address ranges Internal error: Oops: 96000004 [#1] SMP [...] pstate: 80000005 (Nzcv daif -PAN -UAO) pc : pfifo_fast_reset+0x280/0x4d8 lr : pfifo_fast_reset+0x21c/0x4d8 sp : ffff800d09676fa0 x29: ffff800d09676fa0 x28: ffff200012ee22e4 x27: dfff200000000000 x26: 0000000000000000 x25: ffff800ca0799958 x24: ffff1001940f332b x23: 0000000000000007 x22: ffff200012ee1ab8 x21: 0000600de8a40000 x20: 0000000000000000 x19: ffff800ca0799900 x18: 0000000000000000 x17: 0000000000000002 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: ffff1001b922e6e2 x11: 1ffff001b922e6e1 x10: 0000000000000000 x9 : 1ffff001b922e6e1 x8 : dfff200000000000 x7 : 0000000000000000 x6 : 0000000000000000 x5 : 1fffe400025dc45c x4 : 1fffe400025dc357 x3 : 00000c01bd148000 x2 : 0000600de8a40000 x1 : 0000000000000007 x0 : 0000600de8a40004 Call trace: pfifo_fast_reset+0x280/0x4d8 qdisc_reset+0x6c/0x370 htb_reset+0x150/0x3b8 [sch_htb] qdisc_reset+0x6c/0x370 dev_deactivate_queue.constprop.5+0xe0/0x1a8 dev_deactivate_many+0xd8/0x908 dev_deactivate+0xe4/0x190 qdisc_graft+0x88c/0xbd0 tc_get_qdisc+0x418/0x8a8 rtnetlink_rcv_msg+0x3a8/0xa78 netlink_rcv_skb+0x18c/0x328 rtnetlink_rcv+0x28/0x38 netlink_unicast+0x3c4/0x538 netlink_sendmsg+0x538/0x9a0 sock_sendmsg+0xac/0xf8 ___sys_sendmsg+0x53c/0x658 __sys_sendmsg+0xc8/0x140 __arm64_sys_sendmsg+0x74/0xa8 el0_svc_handler+0x164/0x468 el0_svc+0x10/0x14 Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863) Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags', before dereferencing 'qdisc->cpu_qstats'. Changes since v1: - coding style improvements, thanks to Stefano Brivio Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too") CC: Paolo Abeni <pabeni@redhat.com> Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27  net_sched: fix a NULL pointer deref in ipt action  (Cong Wang, 19 files, -23/+24)
The net pointer in struct xt_tgdtor_param is not explicitly initialized and is therefore still NULL when we dereference it. So we have to find a way to pass the correct net pointer to ipt_destroy_target(). The best way I found is to just save the net pointer inside the per-netns struct tcf_idrinfo, which keeps this patch smaller. Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset") Reported-and-tested-by: itugrok@yahoo.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
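A sketch of the approach (struct layout abridged; the xt_tgdtor_param fields are standard netfilter API):

    struct tcf_idrinfo {
            struct mutex    lock;
            struct idr      action_idr;
            struct net      *net;   /* new: lets actions reach their netns */
    };

    static void ipt_destroy_target(struct xt_entry_target *t, struct net *net)
    {
            struct xt_tgdtor_param par = {
                    .net      = net,        /* no longer implicitly NULL */
                    .target   = t->u.kernel.target,
                    .targinfo = t->data,
                    .family   = NFPROTO_IPV4,
            };

            if (par.target->destroy)
                    par.target->destroy(&par);
            module_put(par.target->me);
    }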
2019-08-26  net: sched: flower: don't take rtnl lock for cls hw offloads API  (Vlad Buslov, 1 file, -37/+16)
Don't manually take rtnl lock in flower classifier before calling cls hardware offloads API. Instead, pass rtnl lock status via 'rtnl_held' parameter. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26  net: sched: copy tunnel info when setting flow_action entry->tunnel  (Vlad Buslov, 1 file, -1/+8)
In order to remove dependency on rtnl lock, modify tc_setup_flow_action() to copy tunnel info, instead of just saving pointer to tunnel_key action tunnel info. This is necessary to prevent concurrent action overwrite from releasing tunnel info while it is being used by rtnl-unlocked driver. Implement helper tcf_tunnel_info_copy() that is used to copy tunnel info with all its options to dynamically allocated memory block. Modify tc_cleanup_flow_action() to free dynamically allocated tunnel info. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
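A sketch of the copy helper: one allocation covering the tunnel info plus its trailing options, so the driver's copy outlives any concurrent action overwrite:

    static struct ip_tunnel_info *
    tcf_tunnel_info_copy(const struct tc_action *act)
    {
            struct ip_tunnel_info *info = tcf_tunnel_info(act);

            if (!info)
                    return NULL;
            /* options are stored inline after the struct */
            return kmemdup(info, sizeof(*info) + info->options_len,
                           GFP_ATOMIC);
    }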
2019-08-26  net: sched: take reference to action dev before calling offloads  (Vlad Buslov, 2 files, -0/+34)
In order to remove dependency on rtnl lock when calling hardware offload API, take reference to action mirred dev when initializing flow_action structure in tc_setup_flow_action(). Implement function tc_cleanup_flow_action(), use it to release the device after hardware offload API is done using it. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26  net: sched: take rtnl lock in tc_setup_flow_action()  (Vlad Buslov, 3 files, -8/+19)
In order to allow using new flow_action infrastructure from unlocked classifiers, modify tc_setup_flow_action() to accept new 'rtnl_held' argument. Take rtnl lock before accessing tc_action data. This is necessary to protect from concurrent action replace. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26  net: sched: conditionally obtain rtnl lock in cls hw offloads API  (Vlad Buslov, 1 file, -0/+65)
In order to remove dependency on rtnl lock from offloads code of classifiers, take rtnl lock conditionally before executing driver callbacks. Only obtain rtnl lock if block is bound to devices that require it. Block bind/unbind code is rtnl-locked and obtains block->cb_lock while holding rtnl lock. Obtain locks in same order in tc_setup_cb_*() functions to prevent deadlock. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
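The locking order, sketched: rtnl (only when needed) first, then cb_lock, matching bind/unbind; the recheck under cb_lock handles a lockeddevcnt that changed in between (details illustrative):

    bool take_rtnl = false;

    retry:
            if (take_rtnl)
                    rtnl_lock();
            down_read(&block->cb_lock);
            if (!rtnl_held && !take_rtnl && block->lockeddevcnt) {
                    up_read(&block->cb_lock);
                    take_rtnl = true;
                    goto retry;
            }
            /* ... run driver callbacks ... */
            up_read(&block->cb_lock);
            if (take_rtnl)
                    rtnl_unlock();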
2019-08-26  net: sched: add API for registering unlocked offload block callbacks  (Vlad Buslov, 1 file, -0/+6)
Extend struct flow_block_offload with "unlocked_driver_cb" flag to allow registering and unregistering block hardware offload callbacks that do not require caller to hold rtnl lock. Extend tcf_block with additional lockeddevcnt counter that is incremented for each non-unlocked driver callback attached to device. This counter is necessary to conditionally obtain rtnl lock before calling hardware callbacks in following patches. Register mlx5 tc block offload callbacks as "unlocked". Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26  net: sched: notify classifier on successful offload add/delete  (Vlad Buslov, 2 files, -9/+43)
To remove dependency on rtnl lock, extend classifier ops with new ops->hw_add() and ops->hw_del() callbacks. Call them from the cls API while holding cb_lock every time a filter is successfully added to or deleted from hardware. Implement the new API in the flower classifier. Use it to manage the hw_filters list under cb_lock protection, instead of relying on the rtnl lock to synchronize with a concurrent fl_reoffload() call. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26  net: sched: refactor block offloads counter usage  (Vlad Buslov, 5 files, -91/+217)
Without rtnl lock protection, filters can no longer safely manage the block offloads counter themselves. Refactor the cls API to protect the block offloadcnt with tcf_block->cb_lock, which is already used to protect the driver callback list and the nooffloaddevcnt counter. The counter can be modified by concurrent tasks through the new functions that execute block callbacks (which is safe with the previous patch that changed its type to atomic_t); however, the block bind/unbind code that checks the counter value takes cb_lock in write mode to exclude any concurrent modifications. This approach prevents race conditions between bind/unbind and callback execution code but allows for concurrency on the tc rule update path. Move block offload counter, filter-in-hardware counter, and filter flags management from the classifiers into the cls hardware offloads API. Make the functions tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() private to the cls API. Implement the following new cls API to be used instead: tc_setup_cb_add() - non-destructive filter add. If a filter that wasn't already in hardware is successfully offloaded, increment the block offloads counter and set the filter's in-hardware counter and flag. On failure, the previously offloaded filter is considered intact and the offloads counter is not decremented. tc_setup_cb_replace() - destructive filter replace. Release the existing filter's block offload counter and reset its in-hardware counter and flag. Set the new filter's in-hardware counter and flag. On failure, the previously offloaded filter is considered destroyed and the offload counter is decremented. tc_setup_cb_destroy() - filter destroy. Unconditionally decrement the block offloads counter. tc_setup_cb_reoffload() - reoffload a filter to a single cb. Execute cb() and call tc_cls_offload_cnt_update() if cb() didn't return an error. Refactor all offload-capable classifiers to atomically offload filters to hardware, change the block offload counter, and set the filter's in-hardware counter and flag by means of the new cls API functions. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
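From a classifier's perspective the add path, sketched for flower (signature per the commit text; error handling abridged):

    /* non-destructive add: on success the helper bumps the block
     * offloads counter and sets the filter's in-hw count and flag */
    err = tc_setup_cb_add(block, tp, TC_SETUP_CLSFLOWER, &cls_flower,
                          skip_sw, &f->flags, &f->in_hw_count, rtnl_held);
    if (err < 0)
            goto errout;    /* previously offloaded state left intact */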
2019-08-26  net: sched: change tcf block offload counter type to atomic_t  (Vlad Buslov, 1 file, -1/+1)
As a preparation for running proto ops functions without rtnl lock, change offload counter type to atomic. This is necessary to allow updating the counter by multiple concurrent users when offloading filters to hardware from unlocked classifiers. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26  net: sched: protect block offload-related fields with rw_semaphore  (Vlad Buslov, 1 file, -9/+36)
In order to remove dependency on rtnl lock, extend tcf_block with a 'cb_lock' rwsem and use it to protect flow_block->cb_list and related counters from concurrent modification. The lock is taken in read mode for read-only traversal of cb_list in tc_setup_cb_call() and in write mode in all other cases. This approach ensures that: - cb_list is not changed concurrently while filters are being offloaded on the block. - block->nooffloaddevcnt is checked while holding the lock in read mode, but is only changed by bind/unbind code when holding the cb_lock in write mode to prevent concurrent modification. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller, 2 files, -1/+14)
Merge conflict of mlx5 resolved using instructions in merge commit 9566e650bf7fdf58384bb06df634f7531ca3a97e. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09  taprio: remove unused variable 'entry_list_policy'  (YueHaibing, 1 file, -4/+0)
net/sched/sch_taprio.c:680:32: warning: entry_list_policy defined but not used [-Wunused-const-variable=] One of the points of commit a3d43c0d56f1 ("taprio: Add support adding an admin schedule") is that it removes support (it now returns "not supported") for schedules using the TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY attribute (which were never used); the parsing of those types of schedules was the only user of this policy. So removing this policy should be fine. Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08  net sched: update skbedit action for batched events operations  (Roman Mashak, 1 file, -0/+12)
Add get_fill_size() routine used to calculate the action size when building a batch of events. Fixes: ca9b0e27e ("pkt_action: add new action skbedit") Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
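A sketch of what such a callback computes: the worst-case netlink size of one dumped action, so the batch buffer can be sized up front (attribute list abridged and illustrative):

    static size_t tcf_skbedit_get_fill_size(const struct tc_action *act)
    {
            return nla_total_size(sizeof(struct tc_skbedit))
                    + nla_total_size(sizeof(u32))   /* TCA_SKBEDIT_PRIORITY */
                    + nla_total_size(sizeof(u16))   /* TCA_SKBEDIT_QUEUE_MAPPING */
                    + nla_total_size(sizeof(u32))   /* TCA_SKBEDIT_MARK */
                    + nla_total_size(sizeof(u16))   /* TCA_SKBEDIT_PTYPE */
                    + nla_total_size(sizeof(u32));  /* TCA_SKBEDIT_MASK */
    }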
2019-08-08  fq_codel: remove set but not used variables 'prev_ecn_mark' and 'prev_drop_count'  (YueHaibing, 1 file, -4/+0)
Fixes gcc '-Wunused-but-set-variable' warning: net/sched/sch_fq_codel.c: In function fq_codel_dequeue: net/sched/sch_fq_codel.c:288:23: warning: variable prev_ecn_mark set but not used [-Wunused-but-set-variable] net/sched/sch_fq_codel.c:288:6: warning: variable prev_drop_count set but not used [-Wunused-but-set-variable] They are not used since commit 77ddaff218fc ("fq_codel: Kill useless per-flow dropped statistic") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08  net: sched: sch_taprio: fix memleak in error path for sched list parse  (Ivan Khoronzhuk, 1 file, -1/+2)
In the error case, all entries should be freed from the sched list before deleting it. For simplicity, use the RCU way. Fixes: 5a781ccbd19e46 ("tc: Add support for configuring the taprio scheduler") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08  flow_offload: support get multi-subsystem block  (wenxu, 1 file, -1/+8)
It provides a callback list to find the blocks of the tc and nft subsystems. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08  flow_offload: move tc indirect block to flow offload  (wenxu, 1 file, -215/+25)
Move the tc indirect block to flow_offload and rename it to flow indirect block. The nf_tables subsystem can then use the indr block architecture. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08  cls_api: add flow_indr_block_call function  (wenxu, 1 file, -10/+17)
This patch makes indr_block_call not access struct tc_indr_block_cb and struct tc_indr_block_dev directly. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08  cls_api: remove the tcf_block cache  (wenxu, 1 file, -8/+8)
Remove the tcf_block in the tc_indr_block_dev for multi-subsystem support. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08  cls_api: modify the tc_indr_block_ing_cmd parameters.  (wenxu, 1 file, -11/+15)
This patch makes tc_indr_block_ing_cmd not access struct tc_indr_block_dev and struct tc_indr_block_cb directly. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller, 19 files, -78/+117)
Just minor overlapping changes in the conflicts here. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06  net: sched: add ingress mirred action to hardware IR  (John Hurley, 1 file, -0/+6)
TC mirred actions (redirect and mirror) can send to the egress or ingress of a device. Currently only egress is used for hw offload rules. Modify the intermediate representation for hw offload to include mirred actions that go to ingress. This gives drivers access to such rules, so they can decide whether or not to offload them. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06  net: sched: add skbedit of ptype action to hardware IR  (John Hurley, 1 file, -0/+3)
TC rules can implement skbedit actions. Currently actions that modify the skb mark are passed to offloading drivers via the hardware intermediate representation in the flow_offload API. Extend this to include skbedit actions that modify the packet type of the skb. Such actions may be used to set the ptype to HOST when redirecting a packet to ingress. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06  fq_codel: Kill useless per-flow dropped statistic  (Dave Taht, 1 file, -7/+1)
It is almost impossible to get anything other than a 0 out of flow->dropped statistic with a tc class dump, as it resets to 0 on every round. It also conflates ecn marks with drops. It would have been useful had it kept a cumulative drop count, but it doesn't. This patch doesn't change the API, it just stops tracking a stat and state that is impossible to measure and nobody uses. Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06  Increase fq_codel count in the bulk dropper  (Dave Taht, 1 file, -0/+2)
In the field fq_codel is often used with a smaller memory or packet limit than the default, and when the bulk dropper is hit, the drop pattern bifurcates into one that more slowly increases the codel drop rate and hits the bulk dropper more than it should. The scan through the 1024 queues happens more often than it needs to. This patch increases the codel count in the bulk dropper, but does not change the drop rate there, relying on the next codel round to deliver the next packet at the original drop rate (after that burst of loss), then escalate to a higher signaling rate. Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
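The change itself is tiny, sketched here (i is the number of packets the bulk dropper just dropped from the fattest flow; names follow fq_codel_drop(), treat the context as illustrative):

    /* tell codel to escalate its signaling rate after the burst drop */
    flow->cvars.count += i;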
2019-08-06  net sched: update vlan action for batched events operations  (Roman Mashak, 1 file, -0/+9)
Add get_fill_size() routine used to calculate the action size when building a batch of events. Fixes: c7e2b9689 ("sched: introduce vlan action") Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-05  net: sched: use temporary variable for actions indexes  (Dmytro Linkin, 18 files, -75/+100)
Currently, the init calls of all actions (except ipt) initialize their 'parm' structure as a direct pointer to the nla data in the skb. This leads to a race condition when some of the filter actions were initialized successfully (and were assigned an idr action index that was written directly into the nla data), but then were deleted and retried (due to a missing following-action module or a classifier-initiated retry), in which case the action init code tries to insert the action into the idr with the index that was assigned on the previous iteration. During the retry the index can be reused by another action that was inserted concurrently, which causes unintended action sharing between filters. To fix the described race condition, save the action idr index to a temporary stack-allocated variable instead of in the nla data. Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
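The pattern applied across the touched actions, sketched for one init function (act_gact shown; names per that action):

    struct tc_gact *parm = nla_data(tb[TCA_GACT_PARMS]);
    u32 index = parm->index;        /* copy out of the shared nla data */

    /* a retry can no longer observe an index written by a prior pass */
    err = tcf_idr_check_alloc(tn, &index, a, bind);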
2019-07-29  net: sched: Fix a possible null-pointer dereference in dequeue_func()  (Jia-Ju Bai, 1 file, -3/+3)
In dequeue_func(), there is an if statement on line 74 to check whether skb is NULL: if (skb) But even when skb is NULL, it is dereferenced on line 77: prefetch(&skb->end); Thus, a possible null-pointer dereference may occur. To fix this bug, skb->end is prefetched only when skb is not NULL. This bug was found by STCheck, a static analysis tool written by us. Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM") Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-25  ife: error out when nla attributes are empty  (Cong Wang, 1 file, -0/+5)
act_ife at least requires TCA_IFE_PARMS, so we have to bail out when there is no attribute passed in. Reported-by: syzbot+fbb5b288c9cb6a2eeac4@syzkaller.appspotmail.com Fixes: ef6980b6becb ("introduce IFE action") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
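The fix itself is a short early check in tcf_ife_init(), sketched (extack wording per the upstream patch; treat the context as illustrative):

    if (!nla) {
            NL_SET_ERR_MSG_MOD(extack, "IFE requires attributes to be passed");
            return -EINVAL; /* nothing to parse: no TCA_IFE_PARMS present */
    }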
2019-07-23  net: sched: include mpls actions in hardware intermediate representation  (John Hurley, 1 file, -0/+25)
A recent addition to TC actions is the ability to manipulate the MPLS headers on packets. In preparation for offloading such actions to hardware, update the IR code to accept and prepare the new actions. Note that no driver currently implements the MPLS dec_ttl action, so this is not included. Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>