In every odd-numbered round, instead of operating over the state
x00 x01 x02 x03
x05 x06 x07 x04
x10 x11 x08 x09
x15 x12 x13 x14
we operate over the rotated state
x03 x00 x01 x02
x04 x05 x06 x07
x09 x10 x11 x08
x14 x15 x12 x13
The advantage here is that this requires no changes to the
'x04 x05 x06 x07' row, which is in the critical path. This
results in a noticeable latency improvement of roughly R
cycles, for R diagonal rounds in the primitive.
In the case of BLAKE2s, which I also moved from requiring AVX
to only requiring SSSE3, we save approximately 30 cycles per
compression function call on Haswell and Skylake. In other
words, this is an improvement of ~0.6 cpb.
This idea was pointed out to me by Shunsuke Shimizu, though
it appears to have been around for longer.
Signed-off-by: Samuel Neves <firstname.lastname@example.org>