What did I do?
A plaintext first consists of an array of characters (unsigned char[16]), then I pass it as an __m128i type to the SHA-1 function. With SSE2 I do 4 plaintexts at the same time, so actually 4 of these variables are passed to this function.
The variables get split into 4 32-bit integers (I called these types UINT4). The SHA-1 function then puts these variables in an array of 80 UINT4's, this array is called W. This array is used to add a value in every one of the 80 steps of the actual hashing part. The first 16 UINT4's in this array W are just the parts of the plaintext, and as I only support plaintexts with length < 16, only the first 4 UINT4's are filled. Then W[4]...W[14] contain zero's in my case, W[15] contains the length of the plaintext in bits.
So what happens with W[16]...W[79]? The previous values (the actual plaintext) are 'expanded' into these values, so every step has additional input and is dependent on the plaintext. This expanding isn't that special:
W[t] = W[t-3] XOR W[t-8] XOR W[t-14] XOR W[t-16]
And then ROTATE this result by one.
As W[4]...W[14] contains nothing but zero's, I didn't actually want to set them. But because W[18]...W[30] depend on these zero's, they should be set. Unless we change the EXPAND function for W[18]...W[30]. But as that needs more unrolling of the loops, my code gets bigger and maybe slower (had that before).
So now I unrolled certain parts in a strange way, but somehow it works :)
I now have some strange code like:
for(t = 20; t < 21; t++){
SSE_EXPAND_3(t); ROTATE(t)
SSE_EXPAND_3(t+1); ROTATE(t+1)
SSE_EXPAND_3(t+2); ROTATE(t+2)
SSE_EXPAND_3_8(t+3); ROTATE(t+3)
SSE_EXPAND_3_8(t+4); ROTATE(t+4)
}
This code block within the for loop gets executed only once. So you'd say that I could just write out SSE_EXPAND3(20) - (24) and such, but somehow that makes things slower...
Anyway, instead of 58.5 Mhashes/s I now get:
Length 5 - 55% in 8.39 s (60.29 Mhashes/s)
Here are the downloads:
shabr_0.2_win_32bit_binary.zip
shabr_0.2_src.zip