From 3d9a0d2f8212879407e58d67f460d8920eb6543d Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 25 Sep 2014 23:04:56 -0700 Subject: dql: dql_queued() should write first to reduce bus transactions MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit While doing high throughput test on a BQL enabled NIC, I found a very high cost in ndo_start_xmit() when accessing BQL data. It turned out the problem was caused by compiler trying to be smart, but involving a bad MESI transaction : 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count 58.23 │ add %edx,%edi 0.58 │ cmp %edi,0xc4(%rax) 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count 0.72 │ js bd8 I got an incredible 10 % gain [1] by making sure cpu do not attempt to get the cache line in Shared mode, but directly requests for ownership. New code : mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count add %edx,0xc0(%rax) // RMW dql->num_queued += count mov 0xc4(%rax),%ecx // LOAD dql->adj_limit mov 0xc0(%rax),%edx // LOAD dql->num_queued cmp %edx,%ecx The TX completion was running from another cpu, with high interrupts rate. Note that I am using barrier() as a soft hint, as mb() here could be too heavy cost. [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled. Signed-off-by: Eric Dumazet Acked-by: Jesper Dangaard Brouer Signed-off-by: David S. Miller --- include/linux/dynamic_queue_limits.h | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'include/linux/dynamic_queue_limits.h') diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h index 5621547d631b..a4be70398ce1 100644 --- a/include/linux/dynamic_queue_limits.h +++ b/include/linux/dynamic_queue_limits.h @@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, unsigned int count) { BUG_ON(count > DQL_MAX_OBJECT); - dql->num_queued += count; dql->last_obj_cnt = count; + + /* We want to force a write first, so that cpu do not attempt + * to get cache line containing last_obj_cnt, num_queued, adj_limit + * in Shared state, but directly does a Request For Ownership + * It is only a hint, we use barrier() only. + */ + barrier(); + + dql->num_queued += count; } /* Returns how many objects can be queued, < 0 indicates over limit. */ static inline int dql_avail(const struct dql *dql) { - return dql->adj_limit - dql->num_queued; + return ACCESS_ONCE(dql->adj_limit) - ACCESS_ONCE(dql->num_queued); } /* Record number of completed objects and recalculate the limit. */ -- cgit v1.2.3-59-g8ed1b