Saturday, January 17, 2009

SHAbr update, I passed 60 Mhashes/s

I couldn't help optimizing SHAbr just a little more, I really wanted to get the 60 Mhashes/s on my system :)

What did I do?
A plaintext first consists of an array of characters (unsigned char[16]), then I pass it as an __m128i type to the SHA-1 function. With SSE2 I do 4 plaintexts at the same time, so actually 4 of these variables are passed to this function.

The variables get split into 4 32-bit integers (I called these types UINT4). The SHA-1 function then puts these variables in an array of 80 UINT4's, this array is called W. This array is used to add a value in every one of the 80 steps of the actual hashing part. The first 16 UINT4's in this array W are just the parts of the plaintext, and as I only support plaintexts with length < 16, only the first 4 UINT4's are filled. Then W[4]...W[14] contain zero's in my case, W[15] contains the length of the plaintext in bits.

So what happens with W[16]...W[79]? The previous values (the actual plaintext) are 'expanded' into these values, so every step has additional input and is dependent on the plaintext. This expanding isn't that special:

W[t] = W[t-3] XOR W[t-8] XOR W[t-14] XOR W[t-16]
And then ROTATE this result by one.

As W[4]...W[14] contains nothing but zero's, I didn't actually want to set them. But because W[18]...W[30] depend on these zero's, they should be set. Unless we change the EXPAND function for W[18]...W[30]. But as that needs more unrolling of the loops, my code gets bigger and maybe slower (had that before).

So now I unrolled certain parts in a strange way, but somehow it works :)
I now have some strange code like:

for(t = 20; t < 21; t++){
SSE_EXPAND_3(t); ROTATE(t)
SSE_EXPAND_3(t+1); ROTATE(t+1)
SSE_EXPAND_3(t+2); ROTATE(t+2)
SSE_EXPAND_3_8(t+3); ROTATE(t+3)
SSE_EXPAND_3_8(t+4); ROTATE(t+4)
}

This code block within the for loop gets executed only once. So you'd say that I could just write out SSE_EXPAND3(20) - (24) and such, but somehow that makes things slower...

Anyway, instead of 58.5 Mhashes/s I now get:
Length 5 - 55% in 8.39 s (60.29 Mhashes/s)

Here are the downloads:
shabr_0.2_win_32bit_binary.zip
shabr_0.2_src.zip

Friday, January 16, 2009

SHAbr, the SHA-1 password brute forcer

Today I finished my SHA-1 password brute forcer, SHAbr. It is optimized for passwords with a length < 16 characters. It supports multiple cores/processors and uses SSE2 to hash 4 plaintexts simultaneously, per thread.

Yesterday I got a little above 40 Mhashes/s with 4 cores (Q9450@3.2Ghz), tonight just before going to sleep I thought of some new attempts for optimizations. That got me close to 50 Mhashes/s. This morning when I woke up another idea popped up, and just before I thought I was done, another one... now I am close to 60 Mhashes/s :)

The main issue is that I can't fully unroll all the loops, this actually slows things down. This might have something to do with cache misses. The last changes mainly comprised of reordering some instructions and partially unrolling the loops.

This was my first attempt at using SSE2, and my first attempt at actually implementing a cryptographic hashing algorithm on my own. I used several sources of information and used some ideas from other SHA-1 implementations, such as the one in OpenSSL and RFC3174.

It is very well possible that my code can be optimized even more, or even a lot more, but as far as I know, SHAbr is currently the fastest SHA-1 password brute forcer around. You might also see some strange coding habits, I'm not a professional coder... feel free to drop me some remarks, I'd be happy to learn from it :)

Currently it only takes one hash for cracking and I haven't tested it with Linux. I expect it not to work straight away under Linux, but it shouldn't be that much work. I have it on my ToDo list, but I won't fix it in the upcoming month. I'd also like to implement support for multiple hashes later on.

Download:
shabr_win_32bit_binary.zip
shabr_src.zip

You might also need to install the Microsoft Visual C++ 2008 Redistributable Package.

Feel free to leave a comment, and I'd love to hear some of the speeds you gain.

tip: playing around with -B and -Q might render different speeds... also I noticed that somehow speed sometimes is like 2 Mhashes/s slower or faster after restarting the program.

Tuesday, January 13, 2009

Welcome

Hi, and welcome at yet another blog :)

I needed some space to share some thoughts, some things I learn and some tools. Last months I have been playing around with rainbow tables, hashing algorithms, brute forcers and other related topics. This has resulted in a couple of topics that I plan to blog about:

* A new set of rainbow tables that are currently generated by the Free Rainbow Tables project (link at the right). This set covers passwords from a special character set, specifically for LM hashes.

* rcracki_mt, a multi-threaded version of the tool rcracki, used for 'cracking' password hashes with the rainbow tables generated by the Free Rainbow Tables project.

* A tool for doing not only case correction with a cracked LM hash and the accompanying NTLM hash, but also do something I call unicode correction. I still need to release this tool, but I already implemented the code in rcracki_mt.

* A first attempt at optimizing the MD4 hashing algorithm that is used for NTLM hashes, using code provided by other people. My version is not that useful, but I learned some nice things.

* A SHA-1 implementation optimized for cracking passwords with a length < 16. First meant for use in rcracki_mt, but then I started liking optimizations. So by now I have sort of a SHA-1 brute forcer, using SSE2 and with the last 3 steps reversed, the first 4 steps unrolled and precalculated as much as possible. I got some tips, hints and directions from Svarychevski Michail Aleksandrovich regarding the use of SSE2, tnx for that. Btw, he has an awfully fast MD5 brute forcer, for both CPU (SSE2 optimized) and for GPU (both CUDA (nVidia) and Brook (AMD/ATI).

I plan to blog about these topics in the coming [days... eh, no, weeks... nooo,] months. I'll also post the tools I made along with the code, so maybe other people might profit from the things I learned as well. Or just so others that actually do have some l33t programming skills might optimize things even more. So, check back later and feel free to leave a note :)