From 5f1a485d5905aa641f33009019b3699076666a4c Mon Sep 17 00:00:00 2001
From: Peter P Waskiewicz Jr
Date: Tue, 13 Nov 2007 20:40:55 -0800
Subject: [PKT_SCHED]: Check subqueue status before calling hard_start_xmit

The only qdiscs that check subqueue state before dequeue'ing are PRIO
and RR.  The other qdiscs, including the default pfifo_fast qdisc,
will allow traffic bound for subqueue 0 through to hard_start_xmit.
The check for netif_queue_stopped() is done above in pkt_sched.h, so
it is unnecessary in qdisc_restart().  However, if the underlying
driver is multiqueue capable and only sets queue states on subqueues,
this will allow packets to enter the driver when it is currently
unable to process them, resulting in expensive requeues and driver
entries.

This patch re-adds the check of the subqueue status before calling
hard_start_xmit, so we can try to avoid the driver entry when the
queues are stopped.

Signed-off-by: Peter P Waskiewicz Jr
Signed-off-by: David S. Miller
---
 net/sched/sch_generic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index fa1a6f45dc41..e595e6570ce0 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,7 +134,7 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	int ret;
+	int ret = NETDEV_TX_BUSY;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
@@ -145,7 +145,8 @@ static inline int qdisc_restart(struct net_device *dev)
 	spin_unlock(&dev->queue_lock);
 
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev);
+	if (!netif_subqueue_stopped(dev, skb))
+		ret = dev_hard_start_xmit(skb, dev);
 	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
--
cgit v1.2.3-59-g8ed1b

From 746aa32d280084dbd520249170852e4616799928 Mon Sep 17 00:00:00 2001
From: Ilpo Järvinen
Date: Tue, 13 Nov 2007 21:01:23 -0800
Subject: [TCP] FRTO: Limit snd_cwnd if TCP was application limited
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Otherwise TCP might violate the packet ordering principles that FRTO
is based on.  If the conventional recovery path is chosen, this won't
be significant at all.  In practice, any small enough value will be
sufficient to provide proper operation for FRTO, yet other users of
snd_cwnd might benefit from a "close enough" value.

FRTO's formula is now equal to what tcp_enter_cwr() uses.

FRTO used to check application limitedness a bit differently, but I
changed that in commit 575ee7140dabe9b9c4f66f4f867039b97e548867, and
as a result the check for application limitedness became completely
non-existent.

Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 20c9440ab85e..b59da5308ac0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1669,6 +1669,9 @@ void tcp_enter_frto(struct sock *sk)
 	}
 	tcp_verify_left_out(tp);
 
+	/* Too bad if TCP was application limited */
+	tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1);
+
 	/* Earlier loss recovery underway (see RFC4138; Appendix B).
 	 * The last condition is necessary at least in tp->frto_counter case.
 	 */
--
cgit v1.2.3-59-g8ed1b
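
[Note: both patches above rest on small invariants — the first on the
rule that a stopped driver (sub)queue should never be entered, since a
NETDEV_TX_BUSY return forces an expensive requeue; the second on
clamping snd_cwnd to packets_in_flight + 1.  The self-contained
userspace sketch below models the first of these, the qdisc_restart()
control flow.  Every type and function in it is a simplified stand-in
invented for illustration, not the kernel API.]

    #include <stdio.h>
    #include <stdbool.h>

    enum { NETDEV_TX_OK = 0, NETDEV_TX_BUSY = 1 };

    struct device {
        bool subqueue_stopped;   /* models netif_subqueue_stopped(dev, skb) */
    };

    static int hard_start_xmit(struct device *dev)
    {
        /* A stopped driver queue would reject the packet anyway. */
        return dev->subqueue_stopped ? NETDEV_TX_BUSY : NETDEV_TX_OK;
    }

    static int qdisc_restart_model(struct device *dev)
    {
        int ret = NETDEV_TX_BUSY;   /* default, as in the patch */

        /* The patched check: skip the driver entirely when the
         * subqueue is stopped, avoiding a pointless lock/call/requeue. */
        if (!dev->subqueue_stopped)
            ret = hard_start_xmit(dev);

        return ret;   /* NETDEV_TX_BUSY leads to an (expensive) requeue */
    }

    int main(void)
    {
        struct device dev = { .subqueue_stopped = true };
        printf("stopped: ret=%d (busy, requeued without driver entry)\n",
               qdisc_restart_model(&dev));
        dev.subqueue_stopped = false;
        printf("running: ret=%d (ok)\n", qdisc_restart_model(&dev));
        return 0;
    }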
From 23aeeec365dcf8bc87fae44c533e50d0bb4f23cc Mon Sep 17 00:00:00 2001
From: Ilpo Järvinen
Date: Tue, 13 Nov 2007 21:03:13 -0800
Subject: [TCP] FRTO: Plug potential LOST-bit leak
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

It might be possible, in some extreme scenario that I just cannot now
construct in my mind, for the end_seq <= frto_highmark check not to
match, causing lost_out and the LOST bits to become out-of-sync due to
the clearing and recounting in the loop.

This may fix the LOST-bit leak reported by Chazarain Guillaume.

Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller
---
 net/ipv4/tcp_input.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b59da5308ac0..12ae9a68cdaa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1704,6 +1704,8 @@ static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag)
 	tcp_for_write_queue(skb, sk) {
 		if (skb == tcp_send_head(sk))
 			break;
+
+		TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
 		/*
 		 * Count the retransmission made on RTO correctly (only when
 		 * waiting for the first ACK and did not get it)...
@@ -1717,7 +1719,7 @@ static void tcp_enter_frto_loss(struct sock *sk, int allowed_segments, int flag)
 		} else {
 			if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
 				tp->undo_marker = 0;
-			TCP_SKB_CB(skb)->sacked &= ~(TCPCB_LOST|TCPCB_SACKED_RETRANS);
+			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
 		}
 
 		/* Don't lost mark skbs that were fwd transmitted after RTO */
--
cgit v1.2.3-59-g8ed1b

From 96a2d41a3e495734b63bff4e5dd0112741b93b38 Mon Sep 17 00:00:00 2001
From: Ilpo Järvinen
Date: Wed, 14 Nov 2007 15:47:18 -0800
Subject: [TCP]: Make sure write_queue_from does not begin with NULL ptr
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A NULL ptr can be returned from tcp_write_queue_head to cached_skb and
then assigned to skb if packets_out was zero.  Without this, the
system is vulnerable to carefully crafted ACKs, which obviously is
remotely triggerable.

Besides, there's very little that needs to be done in sacktag if there
weren't any packets outstanding; just skipping the rest doesn't hurt.

Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller
---
 net/ipv4/tcp_input.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12ae9a68cdaa..3f126ece8eb5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1269,6 +1269,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 	if (before(TCP_SKB_CB(ack_skb)->ack_seq, prior_snd_una - tp->max_window))
 		return 0;
 
+	if (!tp->packets_out)
+		goto out;
+
 	/* SACK fastpath:
 	 * if the only SACK change is the increase of the end_seq of
 	 * the first block then only apply that SACK block
@@ -1515,6 +1518,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 	    (!tp->frto_highmark || after(tp->snd_una, tp->frto_highmark)))
 		tcp_update_reordering(sk, tp->fackets_out - reord, 0);
 
+out:
+
 #if FASTRETRANS_DEBUG > 0
 	BUG_TRAP((int)tp->sacked_out >= 0);
 	BUG_TRAP((int)tp->lost_out >= 0);
--
cgit v1.2.3-59-g8ed1b
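
[Note: the NULL-head fix above is a common "validate, else jump past
the work" shape: when packets_out is zero, the write-queue walk must
not even start, but the trailing debug invariants should still run on
every path.  A minimal self-contained userspace sketch of that shape
follows; the struct and function names are invented stand-ins, not the
kernel's.]

    #include <assert.h>
    #include <stdio.h>

    struct tcp_state {
        int packets_out;
        int sacked_out;
        int lost_out;
    };

    static int sacktag_model(struct tcp_state *tp)
    {
        int flag = 0;

        if (!tp->packets_out)
            goto out;      /* nothing queued: a walk would start from NULL */

        /* ... the SACK tagging walk over the write queue would run here ... */
        flag = 1;

    out:
        /* Invariant checks still run on every path, as in the patch. */
        assert(tp->sacked_out >= 0);
        assert(tp->lost_out >= 0);
        return flag;
    }

    int main(void)
    {
        struct tcp_state tp = { .packets_out = 0 };
        printf("flag=%d\n", sacktag_model(&tp));   /* 0: walk skipped safely */
        return 0;
    }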
From c67625a1ecd7caf4c0490fc5278d6bb736a5297f Mon Sep 17 00:00:00 2001
From: Pavel Emelyanov
Date: Wed, 14 Nov 2007 15:53:16 -0800
Subject: [NET]: Remove notifier block from chain when register_netdevice_notifier fails

Commit fcc5a03ac42564e9e255c1134dda47442289e466 ("[NET]: Allow netdev
REGISTER/CHANGENAME events to fail") makes register_netdevice_notifier()
handle the error from the NETDEV_REGISTER event sent to the registering
block.

The bad news is that in this case the notifier block is not removed
from the list, but the error is returned to the caller.  If the caller
is in a module init function and handles this error, the module load
can be aborted.  The notifier block will then be removed from the
kernel, but will be left in the list.  Oops :(

I think that the notifier block should be removed from the chain in
case of error, regardless of whether this error is handled by the
caller or not.  In the worst case (the error is _not_ handled) the
module will simply not receive the events any longer.

Signed-off-by: Pavel Emelyanov
Acked-by: Herbert Xu
Signed-off-by: David S. Miller
---
 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index dd40b35bb006..86d62611f2fc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1171,6 +1171,8 @@ rollback:
 			nb->notifier_call(nb, NETDEV_UNREGISTER, dev);
 		}
 	}
+
+	raw_notifier_chain_unregister(&netdev_chain, nb);
 	goto unlock;
 }
--
cgit v1.2.3-59-g8ed1b

From e1cd8f78f8cbfa314a095dbf704707217c8ee197 Mon Sep 17 00:00:00 2001
From: Ilpo Järvinen
Date: Wed, 14 Nov 2007 15:55:09 -0800
Subject: [TCP] FRTO: Clear frto_highmark only after process_frto that uses it
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

I broke this in commit 3de96471bd7fb76406e975ef6387abe3a0698149
("[TCP]: Wrap-safed reordering detection FRTO check").

tcp_process_frto should always see a valid frto_highmark.  An invalid
frto_highmark (zero) is very likely what ultimately caused a seqno
compare in tcp_frto_enter_loss to do the wrong thing, leading to the
LOST-bit leak.

Having LOST-bit integrity ensured, as done after commit
23aeeec365dcf8bc87fae44c533e50d0bb4f23cc ("[TCP] FRTO: Plug potential
LOST-bit leak"), won't hurt.  It may still be useful in some other,
possibly legitimate, scenario.

Reported by Chazarain Guillaume.

Signed-off-by: Ilpo Järvinen
Signed-off-by: David S. Miller
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3f126ece8eb5..0f0c1c9829a1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3113,11 +3113,11 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, &seq_rtt, prior_fackets);
 
+	if (tp->frto_counter)
+		frto_cwnd = tcp_process_frto(sk, flag);
 	/* Guarantee sacktag reordering detection against wrap-arounds */
 	if (before(tp->frto_highmark, tp->snd_una))
 		tp->frto_highmark = 0;
-	if (tp->frto_counter)
-		frto_cwnd = tcp_process_frto(sk, flag);
 
 	if (tcp_ack_is_dubious(sk, flag)) {
 		/* Advance CWND, if state allows this. */
--
cgit v1.2.3-59-g8ed1b
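
[Note: the notifier fix above is really about rollback symmetry: a
failed register must undo the chain insertion it already performed,
otherwise the chain keeps a pointer into a module that may unload.
Below is a minimal self-contained userspace sketch of that rule using
a plain singly linked list; it deliberately does not model the real
raw_notifier_chain API, the locking, or the per-device event replay.]

    #include <stdio.h>

    struct block {
        int (*call)(int event);
        struct block *next;
    };

    static struct block *chain;

    static void chain_unregister(struct block *nb)
    {
        struct block **p = &chain;

        while (*p && *p != nb)
            p = &(*p)->next;
        if (*p)
            *p = nb->next;
    }

    static int chain_register(struct block *nb)
    {
        nb->next = chain;
        chain = nb;

        /* Replay a REGISTER event; on error, roll back fully. */
        if (nb->call(/* NETDEV_REGISTER */ 1) < 0) {
            /* ... the kernel also sends UNREGISTER events here ... */
            chain_unregister(nb);       /* the line the patch adds */
            return -1;
        }
        return 0;
    }

    static int failing_cb(int event) { (void)event; return -1; }

    int main(void)
    {
        struct block nb = { .call = failing_cb };
        int err = chain_register(&nb);

        printf("err=%d, chain=%p (must be NULL)\n", err, (void *)chain);
        return 0;
    }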
From d90bf5a976793edfa88d3bb2393f0231eb8ce1e5 Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Wed, 14 Nov 2007 16:14:05 -0800
Subject: [NET]: rt_check_expire() can take a long time, add a cond_resched()

In commit 39c90ece7565f5c47110c2fa77409d7a9478bd5b ("[IPV4]: Convert
rt_check_expire() from softirq processing to workqueue."), we
converted rt_check_expire() from softirq to workqueue, allowing the
function to perform all the work it was supposed to do.

When the IP route cache is big, rt_check_expire() can take a long time
to run (with the default settings, 20% of the hash table is scanned at
each invocation).  Adding cond_resched() helps give the CPU to
higher-priority tasks if necessary.

Using an "if (need_resched())" test before calling cond_resched() is
necessary to avoid spending too much time doing the resched check.
(My tests gave a time reduction from 88 ms to 25 ms per
rt_check_expire() run on my i686 test machine.)

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
---
 net/ipv4/route.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 45651834e1e2..1bff9ed349ff 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -578,6 +578,9 @@ static void rt_check_expire(struct work_struct *work)
 		i = (i + 1) & rt_hash_mask;
 		rthp = &rt_hash_table[i].chain;
 
+		if (need_resched())
+			cond_resched();
+
 		if (*rthp == NULL)
 			continue;
 		spin_lock_bh(rt_hash_lock_addr(i));
--
cgit v1.2.3-59-g8ed1b
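
[Note: the cond_resched() patch above is the classic "cheap test,
expensive yield" pattern for long scans in process context.  The
self-contained userspace sketch below models it with a counter and
sched_yield(); the real need_resched()/cond_resched() primitives are
kernel-internal and behave differently, so treat this purely as an
illustration of the shape.]

    #include <sched.h>
    #include <stdio.h>

    static int work_since_yield;

    static int model_need_resched(void)
    {
        /* Stand-in: pretend the scheduler wants the CPU every 1024 steps. */
        return ++work_since_yield >= 1024;
    }

    static void model_cond_resched(void)
    {
        work_since_yield = 0;
        sched_yield();          /* give up the CPU voluntarily */
    }

    int main(void)
    {
        long yields = 0;
        long i;

        for (i = 0; i < 1L << 20; i++) {
            /* Cheap test first, exactly the shape the patch adds: */
            if (model_need_resched()) {
                model_cond_resched();
                yields++;
            }
            /* ... scan one hash bucket here ... */
        }
        printf("yielded %ld times during the scan\n", yields);
        return 0;
    }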