Distracted

Daniel Niggebrugge

Here it is! I also added a nice 'smart' charset feature. So you only have to specify a hash or a files with hashes and let it try a logical charset per length. This means that it will use a smaller charset when length increases, default:

* length 1-4: all
* length 5: mixalpha-numeric
* length 6: loweralpha-numeric
* length 7: loweralpha
* length 8-10: numeric

You can use the '-s' option to set the offset for these charsets. Default is 4, set it to 5 to have it run charsets on larger lengths. If you want to run a very quick crackjob on your hashes, set '-s 3'. Estimated running time (2009) for -s:

* 3: seconds - couple of minutes
* 4: minutes - few hours (default)
* 5: half an hour - several hours
* 6: days - weeks

You can also tune the bitmap, setting different sizes. Read my previous posting for details about the bitmaps and the other techniques I used in this version.

I release this version only as source, as a lame attempt to stop script kiddies to use this software for illegal purposes. I expect those that want to use it for educational purposes or for legal penetration testing are able to compile this software themselves.

Download: EmDebr_0.5_src.zip

Daniel Niggebrugge

I really didn't think I'd care to actually write multihash brute forcers. I'd need to do a lot more comparing, sorting, a lot more hash reversals, etc... But I did it!

It's still work in progress, but EmDebr currently finds hashes quite fast on huge hash lists. Let's tell you some about my progress, the things I ran into and the ideas I borrowed from others :)

So I started out with some functions to compare calculated hashes to all the input hashes. Every round EmDebr generates 12 hashes, which then need to be compared to, for example, 1000 hashes. You don't want to compare all these one by one as that would really slow down the cracker for every additional hash you are searching for. The cracker should still be decently fast cracking an input list of over 100k hashes.

A very common way to search a list for a possible match is to do a binary search on a sorted list. This works fairly nice, but still gets fairly slow on large hash lists.

An idea I borrowed from the author of 'CUDA-Multiforcer' from cryptohaze.com (probably used by others as well) is to make a shared bitmap. The idea is to map a certain part of all the input hashes to a specific location in a bitmap. For example you take the first 32 bits of the hash and map that to a certain bit and set it to '1'. You do that with every input hash and end up with a bunch of memory filled with 0's and 1's. When you calculated a hash during the cracking process you can do the same mapping and see if there is a '1' on that location. If so, there must be a hash with the same first 32 bits in the hash list. This strategy gets you to only do one lookup for every calculated hash and only start doing a binary search on the full hash list if you have a partial match.

You can specify the size of the bitmap. The larger the bitmap, the more unique bits you'll have in your bitmap, so the less matches you have and thus the less binary searches you'll do. I did some benchmarking on my own system and came up with 1MB as a very reasonable size for both small and large hash lists. I actually tried some variations as well and ended up using 2 bitmaps. The first bitmap is created from the first 32 bits (not using all these bits), the second bitmap is created from the second 32 bits. This was faster then using a single bitmap with double the size.

The next problem was 'reversing', as with my single hash EmDebr a fair amount of hash reversing was used. This reversal takes the input hash and reversely calculates the last round of MD5 (and a few steps more) using a specific part of the current plaintext keyspace. With multihash support we'd have to do this process for every hash, quite often. That's like cracking salted hashes! And it's getting worse, every time after re-reversing we have to sort the full list again, and create the bitmaps!

So that sucks a lot in terms of speed, say 1 Mhashes/s. So I started out with just doing a full MD5 calculation again, not using any reversal at all. So when I got things to work and actually find plaintexts for the input hash list, I started playing with reversal again. The least you can do is do a one time reversal for the input hashes, up to the step where you need some plaintext input. This saves you from doing a single MD5 step and 4 additions. The next step to reverse already needs a part of the plaintext input. Luckily it's just 'W2', the part that contains characters 9-12 of the plaintext. This part doesn't change that often with a regular sized character set. The next (or previous ;)) steps to reverse don't take input on our (<16 sized) plaintexts, so we can just continue up to around half of the last round of MD5. There we run into 'W1' (characters 5-8), we'll have to stop there. In the end this actually allows us to still reverse 8 MD5 steps! And with a normal charset, this still pays off, gaining some extra Mhashes/s. For a small charset it doesn't work very well, but that's just 'numeric' in EmDebr. 'loweralpha' with 26 characters is running fine.

For the sorting I started out with 'default' quicksort, but ended up borrowing the Radix sort implementation from vampyr's Mysql/Sha1/Mssql GPU cracker.

My last improvement was to speed up the binary searches. Especially with larger hash lists, you'll still have to do these binary searches quite often. I now use a small index to limit the range of the binary search. I use the first 8 bits to cut the hash list in 256 parts. This even improved search speed on smaller hash lists!

To finish off, a nice little overview of my progress in terms of speed (using a 'normal' sized hash lists of like 30-40 hashes):

* 52 Mhashes/s using a 512MB bitmap
* 99 Mhashes/s using a 1MB bitmap (seemed bigger ain't always better)
* 107 Mhashes/s using two 1MB bitmaps
* 115 Mhashes/s using partial reversal
* 125 Mhashes/s using an index

With a hash list of 200 000 hashes EmDebr still keeps up, just falling back to around 119 Mhashes/s!

To compare, single hash EmDebr did around 175 Mhashes/s. I'm actually quite satisfied with current results on multihash EmDebr, using less reversing and doing all the comparing stuff! Just have to clean things up, make it display output in a nice way and let it write out results to a file. Stay tuned!

Daniel Niggebrugge

So in my very first blog posting I mentioned I would write about:

* A tool for doing not only case correction with a cracked LM hash and the accompanying NTLM hash, but also do something I call unicode correction. I still need to release this tool, but I already implemented the code in rcracki_mt.

I kind of forgot about this tool. It was actually my first attempt at making something like a brute forcer. I'm not releasing source code (yet), I need to clean it sometime.

You can use this tool when you cracked an LM hash and want to know the correctly cased password that results in the accompanying NTLM hash. Usually you'll be done with trying all different uppercase/lowercase variations, but sometimes there is a strange character in the password. For example, this could be a character with an accent that maps to a regular uppercase password in the LM hash.

Some time ago I've been working out a lot of these mappings, per 'Windows language version' (or actually per OEM codepage). LM2NTLM tries all the known mappings per character. If it doesn't find the correct password it will start again, replacing 2 characters with 'strange' ones. Again, if it doesn't find the password, more characters are tried. At this time you might start doubting the validity of the hashes and/or the password you are providing.

How to actually use this tool? Just run it without arguments and you'll be provided with the information you need. If the corrected password is found, the tool shows 'some' unicode output, you might actually want to inspect the "Password in hex" and spot the 'odd' unicode character (2 bytes). You can use this site for finding out which character it really is.

Download: lm2ntlm.zip (Windows 32bit .exe)

p.s. You can use LM2NTLM for regular case correction as well ;)

Daniel Niggebrugge

Let's keep it short, new plaintext generation, caching:

Old speed: ~200 Mhashes/s
New speed: ~315 Mhashes/s

Just shows the algorithm should not always be the main target of optimization.

Files:
EnTibr_0.3_win32.zip
EnTibr_0.3_src.zip

Daniel Niggebrugge

A little later then I planned (NTLM tables fixed and currently uploading), but here we go with version 0.4 of EmDebr, my SSE2 enabled MD5 password cracker. The main improvement is the plaintext generation on passwords > 4 characters. There are now two types of cracking threads where one, the old one version, is used for lengths <= 4. So what exactly is different now?

EmDebr generates MD5 hashes for 12 plaintexts simultaneously. In the old plaintext generation I used to continue to generate 12 plaintexts by just moving on to the next plaintext 12 times every round. With the new generation I split up the generation in 2 parts, 'first 4 characters' and 'character 5 and beyond'. The second part is now different for all 12 plaintexts, so the first part can be equal for all plaintexts. This way we only have to change 1 plaintext every round until we ran through all combinations on the first part (26*26*26*26=456976 rounds). At that point we move the second part to the next plaintext 12 times and start over again.

I added another speed up, using some sort of 'character couple' cache. I got the initial idea by looking at brute forcing code by Sascha Pfaller. His implementation generates a fairly large bunch of plaintexts at once before starting to hash them. This implementation didn't really work well with me, but it did give me an idea. So now I'm using an array of shorts (2 bytes) and store all combinations of 2 characters of the charset in it. So with loweralpha it contains aa-zz. So instead of 'generating' a plaintext every round, I just 'set' a plaintext every round.

EmDebr now went from like 135 Mhashes/s to like 176 Mhashes/s on my system!
I might clean up and release a new version of EnTibr as well soon, that one had an even better speed up.

Files:
EmDebr_0.4_win32.zip
EmDebr_0.4_src.zip

p.s. Haven't tested this code on Linux anymore.

Daniel Niggebrugge

After using my PlayStation 3 for some nice brute forcing, I was impressed by the power it has. The Cell processor on it performs faster then the CPU in my quadcore system. Consider the fact that the PS3 was released in like 2005 and my CPU is just one year old!

After writing a brute forcer on it I decided to try and enjoy the power of the machine in a completely different way... I actually used it the right way and the power was overwhelming! Try and play Assassins Creed on it for example :) The graphics, the physics, the smooth feel of the gameplay, just awesome! Check out some ingame footage, I can't wait for part 2 of this title.

So lately I've mainly been using my PS3 for playing several new bought games... and trust me, a PlayStation 3 is a powerful machine! :D

Daniel Niggebrugge

I was supposed to come up with new version of EnTibr and EmDebr with my next blog post, but something came in between. FreeRainbowTables.com kind of shutdown because of PowerBlade stopping his work/hosting on the project. The project has been a great way of generating and distributing rainbow tables for free, but it takes a fair amount of time to keep it running and especially if you want to keep improving. The news got out and was picked up on Slashdot, creating a nice flood of reactions from people willing to take over the site. I think FreeRainbowTables won't die and will continue in the near future!

Anyway, apparently there were some problems with the last few generated NTLM tables (ntlm_mixalpha-numeric#1-8_0 and ntlm_mixalpha-numeric#1-8_1), causing the indexing process to fail. As PowerBlade didn't have time to fix the tables, he suggested me to just release the original tables on tbhost.eu in standard .rt format. I don't really like that format, especially because one table is around 160GB, instead of like 110GB. So I started investigating the issue. While by now I thought I'd knew about every technical detail about Rainbow Tables, again I found myself learning quite a lot about them again! This time I had to master the binary files and especially the indexing part. I'm currently trying to fix the tables so they can be released on tbhost.eu and the GARR mirror. Just need a little more time for that.

So after fixing those tables I hope to finish off the new versions of my brute forcers! At least some good news, after fixing the bug I had in Entibr, speed didn't dramatically drop like I've been having with other bugs before :P It's currently doing around 310 Mhashes/s AND actually finding plaintexts!

Daniel Niggebrugge

Well... I'm still alive, just been having some vacation and a few weeks of no-coding.

It just happens that after a while things start itching again... especially if you already have some optimized code lying around that has to be finished for release. I already wrote about the new plaintext generation that sped up EmDebr (MD5 brute forcer) on my PC from around 138 Mhashes/s to around 177 Mhashes/s, but I just didn't finish it yet.

So I almost finished it now, and was wondering about the speed increase for EnTibr (NTLM brute forcer)... not really done yet (need to fix some bugs and test a little more), but the figures look promising alright! It seems like EnTibr will be doing over 300 Mhashes/s on my system, instead of the 200 Mhashes/s before!

Check back for the new releases and the final speeds!

Daniel Niggebrugge

So I finally got to cleaning up my mess and release the code! I don't feel like writing much text right now, so I hope to produce some explanations about PS3 coding in the future (could be in one day, could be in a few weeks ;)).

At least to share some links I came across while developing:

* Black Hat presentation slides by Nick Breese
* MD5 code by Nick Breese
* Someone improving part of Nick Breese's code
* G924789's improvements for MD5 ROUND1 and ROUND2
* Some examples on communication between PPU and SPU's
* Someone who implemented AES on PS3
* IBM documentation on Cell
* C/C++ Language Extensions for Cell/B.E. Architecture (somehow IBM killed the link at the moment)

You can get the source code, including binaries, here!

Feel free to comment, improve, share!

Daniel Niggebrugge

My work on Psebr-md5 is almost done. I'm currently cleaning up some code and adding some (hopefully useful) comments.

Speed is now close to 300 Mhashes/s, usually around 290 Mhashes/s average. This speed is composed of 6 SPU's going at like 43 Mhashes/ and the PPU going at 2x 16 Mhashes/s. Somehow just running 1 PPU thread has a faster speed/thread, something like 20 Mhashes/s.

Some keywords when coding for PS3:

* Vectors / SIMD
* Branching (__builtin_expect)
* Interlacing / pipeline 0 and 1
* Finding/using the right intrinsics/instructions
* Big Endian

I'll try and write about these when I release my code :) (I hope in one or two days!)

Daniel Niggebrugge

Check out G924789's blog!

Improved my code from around 200 Mhashes/s to 250 Mhashes/s!!

Tnx for sharing G924789 (do you live there?) :)

Daniel Niggebrugge

I knew my plaintext generation was probably not very efficient anymore (generating 12 plaintexts per round). A nice mail exchange with Sascha Pfaller who made an even faster MD5 brute forcer, pointed me at this piece of code again. Also some other issues got clear, which I should have seen myself before (but didn't, so tnx :P).

I actually didn't want to bother with implementing a new plaintext generation and was satisfied with current speeds. But then I decided to do something completely different... I decided to go and play around with my Playstation 3!

Previous research was done by Nick Breese and presented at Black Hat. He released some benchmarking code that was supposed to be superceeded with new code. I couldn't find that code though, but I'm really curious about his new results.

Anyway, back to my story. In his presentation slides Nick Breese wrote that he easily ported his PS3 code to SSE2 code, so I thought "then it must be easy to port it the other way around as well". So I started out and indeed, I was able to produce some working code in about 2-3 days work (not fulltime). I was however fairly disappointed by the results in terms of speed. I wasn't expecting to hit very high, as I already knew Nick's setback, but it was still quite low (at first 8 Mhashes/s per core, the next day 13 Mhashes/s). As the working cores of the PS3 (the so called SPE's, you've got 6 of them available) are designed to work with vectors (SIMD), my plaintext generation might be kicking in even worse this time.

I didn't really want to do it the way Sascha is doing it, as that would take more thinking and modifications for me. And then suddenly this new idea came to mind. Instead of updating the first 4 bytes of the plaintext 12 times per round, I now do this once, and have the 5th (and beyond) bytes differ from each other. These don't have to change every round, only when updating reversal values.

To keep it short, I went from like 13 Mhashes/s to like 28 Mhashes/s with this new plaintext generation... per core! So for 6 cores that is from 78 to 168 Mhashes/s! After some more changes (3x interlacing was faster again, instead of 2, which strikes me as odd, because the cores only have 2 pipelines...) I hit around 33.3 Mhashes/s per core, which comes to 200 Mhashes/s in total.

This will be sped up some more, as the PS3 also has 2 'generic' PowerPC CPU cores that can be used. Maybe I'm not even close to revealing the full potential of the PS3, but when I'm done with the brute forcer, I'll release code again (as usual) and hope that others go and give it a try.

Now I almost forgot to bring the SSE2 news... I implemented this new plaintext generation in my SSE2 MD5 brute forcer, and this got me from around 138 Mhashes/s to 177 Mhashes/s!! I'll try and implement these new optimizations into my other brute forcers as well and hopefully release them soon.

Daniel Niggebrugge

I don't really like releasing information about security bugs this way. But I'd rather have people know about a bug in the software they use if it's not fixed. The information has been publicly available in the SourceForge tracker anyway.

AWStats contains a Cross Site Scripting vulnerability (XSS) in the output parameter:

http://[domain]/awstats/awstats.pl?config=[example.com]&framename=mainright&output="%20style="width:%20expression(alert('XSS'));"

This one doesn't work in FF, it does in IE7 though.

(I tried contacting the author in several ways, no response.)

Daniel Niggebrugge

I got a response to my blog from Sascha Pfaller who's writing another MD5 brute forcer. As it seems there is a faster open source alternative coming up!

After skipping through this new code I realized that I forgot to interlace my ROTATE_LEFT functions. As this is actually 3 operations each, VC compiler failed to interlace this for me. I now went from 117 Mhashes/s to 135 Mhashes/s on my system with VC compiler. Apparently Intel compiler already interlaced my ROTATEs before, no speed increase with my new code with Intel compiler.

I also tried to fix my source code to work with Linux. I still have some stupid problem with pthreads and signaling conditions or something. I give up for now and just release a version that at least works. If you plan on using it, mind the README:

Linux support is in a state of 'it works and should find your plaintexts'. There are some bugs; Speed doesn't always show correctly or sometimes not at all. If you specify more threads then you have cores in your system, using -t, you might see high numbers for the speed. Your system is not actually running that fast, sorry :)

New files:

EmDebr:
EmDebr_0.3_win32.zip
EmDebr_0.3_src.zip

New version of EnTibr, Cacheebr and shabr might be coming up if anyone actually wants to use those in Linux.

Please feel free to leave comments or feedback about how fast you are going with the various versions. Or if you notice any bugs, please let me know!

Daniel Niggebrugge

As requested, I built an MS Cache brute forcer. The MS Cache hashes are a little harder to optimize. They are salted and need 2x MD4. This is how you built an MS Cache hash:

* Built NTLM hash for the password: MD4(Unicode(password))
* Append Unicode&lowercase username to the NTLM hash
* MD4 that

So in short: MD4( MD4(Unicode(password)) + Unicode(tolower(username)) )

Because of this, you need the calculate the full MD4 hash for every plaintext. Because of the unknown first 16 bytes of the input for the final MD4 (the NTLM hash), you cannot really reverse steps. I only reversed partial last steps.

I've been a little lazy, this version only supports usernames with a maximum length of 19 characters. You would need to do an additional MD4 for longer usernames.

I interlaced SSE2 three times, getting to something like 72 Mhashes/s on my system.

The download links:

Cacheebr_0.1_win32.zip
Cacheebr_0.1_src.zip

You might also need to install the Microsoft Visual C++ 2008 Redistributable Package.

Daniel Niggebrugge

I have created a project at SourceForge for rcracki_mt. For those who don't know what it is:

rcracki_mt can be used to perform a rainbow table attack on password hashes. It is intended for indexed&perfected rainbow tables, mainly generated by the distributed project www.freerainbowtables.com (which has been down for a few days, no idea why).

The new home is at: https://sourceforge.net/projects/rcracki/

At SourceForge you can also find the new version, version 0.6. This version should be the latest beta version and should become 1.0 after some (possible) bug fixes. Check out the README and ChangeLog (at the bottom).

Please use the trackers for bug reports and feature request.

Daniel Niggebrugge

Hi there, as I realized just after posting EmDebr and EnTibr, I could interlace more SSE2 instructions. From what I've understood so far, my processor (Core2) should be able to execute 3 SSE2 instructions simultaneously. With the 0.1 releases of my brute forcers, I only try to do 2. That would make up for a nice speed increase :)

But I noticed something strange when I was playing around with this... my MD4 brute forcer (EnTibr) has the following speeds with the different interlacings:

1x SSE2 : 110 Mhashes/s
2x SSE2 : 150 Mhashes/s
3x SSE2 : 175 Mhashes/s
4x SSE2 : 200 Mhashes/s
(5x gets slower, like 170)

My MD5 brute forcer (EmDebr) has the following speeds with the different interlacings:

1x SSE2 : 77 Mhashes/s
2x SSE2 : 100 Mhashes/s
3x SSE2 : 116 Mhashes/s
4x SSE2 : 105 Mhashes/s

Now for MD5 this seems more logical, but it strikes me as odd that MD4 still gains speed with 4x SSE2. BarsMonster (from BarsWF) suggested to try again with Intel compiler. So I downloaded evaluation versions for VC and Intel compiler and tried... this one at least had logical results :)

So 'final' results with Intel compiler:

EnTibr: 3x SSE2 -> 200 Mhashes/s
EmDebr: 3x SSE2 -> 144 Mhashes/s

I probably had some luck with 4x SSE2 and the VC compiler. It probably arranges instructions well enough to actually perform 3 SSE2 instructions simultaneously. MD4 code is also smaller then MD5, maybe allowing it to just fit in the cache.

Both brute forcers can gain some more speed by tweaking the Intel compiler some more, but I can't really care at the moment. I will only release binaries compiled with VC Express, but feel free to compile your own faster version with the Intel compiler :)

I release 2 versions of the NTLM/MD4 code&binary, one with 3x SSE2, the other with 4x SSE2. I also fixed a bug where the brute forcers sometimes kept running after finding the plaintext. Uhm, I might have changed some other pieces of the code as well... don't remember.

New files:

EmDebr:
EmDebr_0.2_win32.zip
EmDebr_0.2_src.zip

EnTibr 3x SSE2:
EnTibr_0.2_3xSSE2_win32.zip
EnTibr_0.2_3xSSE2_src.zip

EnTibr 4x SSE2:
EnTibr_0.2_4xSSE2_win32.zip
EnTibr_0.2_4xSSE2_src.zip

Please feel free to leave comments or feedback about how fast you are going with the various versions. Or if you notice any bugs, please let me know!

Daniel Niggebrugge

And hereby I also present my open source, SSE2 optimized, NTLM password brute forcer, called EnTibr. This is almost the same code as the one used in EmDebr. I also used reversing of the MD4 algorithm, skipping the full 3rd round. With my own system I get around 150 Mhashes/s using 4 cores @ 3.2Ghz.

Again, I hope this helps others to understand reversing and maybe you can use my code to write a better/faster open source cracker. Feel free to leave a comment or to post some improvements!

The download links:

EnTibr_0.1_win32.zip
EnTibr_0.1_src.zip

You might also need to install the Microsoft Visual C++ 2008 Redistributable Package.

Daniel Niggebrugge

As promised I hereby present my MD5 password brute forcer, called EmDebr. I used reversing to speed things up, see my previous blog for more information about reversing.

I also gave interlacing SSE2 a try (did that after I posted my previous blog ;)) and came up with a speed improvement of about 40%. So on my own system (3.2Ghz quadcore) I got like 77 Mhashes/s before interlacing, with the current version I get around 100 Mhashes/s. Interlacing SSE2 is not my own idea and can be done (a lot) better then what I came up with. EmDebr is in this way still by far not the fastest SSE2 optimized MD5 password brute forcer around, as far as I know that's still BarsWF. As far as I know EmDebr is now the fastest open source one though :)

I hope this helps others to understand reversing and maybe you can use my code to write a better/faster open source cracker. Feel free to leave a comment or to post some improvements!

Oh right, the download links:

EmDebr_0.1_win32.zip
EmDebr_0.1_src.zip

You might also need to install the Microsoft Visual C++ 2008 Redistributable Package.

*Again, credits to Sc00bz for the public explanation about reversing*

Daniel Niggebrugge

Well... it's been a while :) But I want to share something with you:

The idea of a cryptographic hash of a plaintext password is that you cannot 'reverse' or 'decrypt' the hash and in this way recover the original plaintext. One can only attempt to recover the plaintext password by trying many plaintexts and compare the resulting hashes to this hash (for example by regular brute forcing, dictionary attacks or by using rainbow tables). It is however possible to reverse part of the hash in such a way that a smaller part of the cryptographic algorithm has to be performed. The amount of steps that can be reversed, depends on the algorithm used. At least for MD5 an unlikely large amount of steps can be reversed. Several people have been working on reversing this algorithm and implementing these techniques into brute forcers. For example the GPU powered brute forcer BarsWF gains much speed because of the reversed steps.

There is this one guy, Sc00bz, who actually wrote about reversing MD5 in detail, providing enough information for one to go and implement the reversing steps into for example a brute forcer. Later on he even provided some code snippets to work with.

I already got the idea of reversing and reversed a few steps of SHA-1 in shabr. With SHA-1 however you can't reverse many steps:

When you normally hash a plaintext password with SHA-1, the last part is to add initial values:


   context->Intermediate_Hash[0] += A;
   context->Intermediate_Hash[1] += B;
   context->Intermediate_Hash[2] += C;
   context->Intermediate_Hash[3] += D;
   context->Intermediate_Hash[4] += E;

A, B, C, D, E are defined as:


    A   = 0x67452301;
    B   = 0xEFCDAB89;
    C   = 0x98BADCFE;
    D   = 0x10325476;
    E   = 0xC3D2E1F0;

As you already know the last values of the 'Intermediate_Hash' you are looking for, you can just as well subtract these values before starting with brute forcing all those combinations. In this way, you don't have to calculate this last part for every plaintext you hash.

Previous to this part the actual SHA-1 steps are calculated. Let's look at the final step:


   SHA1_STEP_ROUND4( B, C, D, E, A, W[79])

with:


   SHA1_STEP_ROUND4(a, b, c, d, e, x)
     e = e + ROTATE_LEFT(a,5) + F_60_79(b,c,d) + x + K;
     b = SSE_ROTATE(b,30);

If you want to know the values of A-E before this step, you do the reverse of what you see here. So you subtract when there was an add, you rotate right when there was a rotate left:


   REVERSE_SHA1_STEP_ROUND4(a, b, c, d, e, x)
     e = e - ROTATE_RIGHT(a,5) - F_60_79(b,c,d) - x - K;
     b = ROTATE_RIGHT(b,30);

There are some unknowns here, we don't know 'x', so we can't even fully reverse this last step. We can however pre-calculate most of this step, which saves time. You just need to do something like 'e = e + x' in the last step of the hashing code. You can do something similar with the 2 previous steps, but you'll have more unknown values, so you can reverse a smaller part of the step. Also don't forget that you only need to check 1 of 5 parts of the hash most of the time, so you only calculate the last steps when you have an early match. You can look at my code in shabr to see it implemented (the reversing slightly different then what you see here, but the idea is the same).

Now the fun starts when you think about reversing MD4 or MD5... where with SHA-1 you start with unknown values in the last steps straight away, with MD4 and MD5 you can go back many steps when you keep part of your plaintext equal. For a detailed explanation of how this is done for MD5 you can read the writings of Sc00bz I mentioned at the beginning:

Sc00bz @ freerainbowtables forum
More explanation from Bars and from Sc00bz, including code snippets

I implemented these ideas in an MD5 brute forcer, and fairly easily I could apply the same ideas to MD4 and make an NTLM brute forcer. I'll release these tools including source code in the upcoming days. While BarsWF beats my code by FAR, as far as I know mine is now the fastest open source MD5 SSE2 brute forcer ;) And the NTLM brute forcer might be the fastest around for CPU's (just because Bars or x4d3 didn't make a CPU version yet I guess) :)
(I should really try and seriously learn to interlace SSE2 one day...)

Daniel Niggebrugge

I couldn't help optimizing SHAbr just a little more, I really wanted to get the 60 Mhashes/s on my system :)

What did I do?
A plaintext first consists of an array of characters (unsigned char[16]), then I pass it as an __m128i type to the SHA-1 function. With SSE2 I do 4 plaintexts at the same time, so actually 4 of these variables are passed to this function.

The variables get split into 4 32-bit integers (I called these types UINT4). The SHA-1 function then puts these variables in an array of 80 UINT4's, this array is called W. This array is used to add a value in every one of the 80 steps of the actual hashing part. The first 16 UINT4's in this array W are just the parts of the plaintext, and as I only support plaintexts with length < 16, only the first 4 UINT4's are filled. Then W[4]...W[14] contain zero's in my case, W[15] contains the length of the plaintext in bits.

So what happens with W[16]...W[79]? The previous values (the actual plaintext) are 'expanded' into these values, so every step has additional input and is dependent on the plaintext. This expanding isn't that special:


W[t] = W[t-3] XOR W[t-8] XOR W[t-14] XOR W[t-16]
And then ROTATE this result by one.

As W[4]...W[14] contains nothing but zero's, I didn't actually want to set them. But because W[18]...W[30] depend on these zero's, they should be set. Unless we change the EXPAND function for W[18]...W[30]. But as that needs more unrolling of the loops, my code gets bigger and maybe slower (had that before).

So now I unrolled certain parts in a strange way, but somehow it works :)
I now have some strange code like:


for(t = 20; t < 21; t++){
  SSE_EXPAND_3(t); ROTATE(t)
  SSE_EXPAND_3(t+1); ROTATE(t+1)
  SSE_EXPAND_3(t+2); ROTATE(t+2)
  SSE_EXPAND_3_8(t+3); ROTATE(t+3)
  SSE_EXPAND_3_8(t+4); ROTATE(t+4)
}

This code block within the for loop gets executed only once. So you'd say that I could just write out SSE_EXPAND3(20) - (24) and such, but somehow that makes things slower...

Anyway, instead of 58.5 Mhashes/s I now get:
Length 5 - 55% in 8.39 s (60.29 Mhashes/s)

Here are the downloads:
shabr_0.2_win_32bit_binary.zip
shabr_0.2_src.zip

Daniel Niggebrugge

Today I finished my SHA-1 password brute forcer, SHAbr. It is optimized for passwords with a length < 16 characters. It supports multiple cores/processors and uses SSE2 to hash 4 plaintexts simultaneously, per thread.

Yesterday I got a little above 40 Mhashes/s with 4 cores (Q9450@3.2Ghz), tonight just before going to sleep I thought of some new attempts for optimizations. That got me close to 50 Mhashes/s. This morning when I woke up another idea popped up, and just before I thought I was done, another one... now I am close to 60 Mhashes/s :)

The main issue is that I can't fully unroll all the loops, this actually slows things down. This might have something to do with cache misses. The last changes mainly comprised of reordering some instructions and partially unrolling the loops.

This was my first attempt at using SSE2, and my first attempt at actually implementing a cryptographic hashing algorithm on my own. I used several sources of information and used some ideas from other SHA-1 implementations, such as the one in OpenSSL and RFC3174.

It is very well possible that my code can be optimized even more, or even a lot more, but as far as I know, SHAbr is currently the fastest SHA-1 password brute forcer around. You might also see some strange coding habits, I'm not a professional coder... feel free to drop me some remarks, I'd be happy to learn from it :)

Currently it only takes one hash for cracking and I haven't tested it with Linux. I expect it not to work straight away under Linux, but it shouldn't be that much work. I have it on my ToDo list, but I won't fix it in the upcoming month. I'd also like to implement support for multiple hashes later on.

Download:
shabr_win_32bit_binary.zip
shabr_src.zip

You might also need to install the Microsoft Visual C++ 2008 Redistributable Package.

Feel free to leave a comment, and I'd love to hear some of the speeds you gain.

tip: playing around with -B and -Q might render different speeds... also I noticed that somehow speed sometimes is like 2 Mhashes/s slower or faster after restarting the program.

Daniel Niggebrugge

Hi, and welcome at yet another blog :)

I needed some space to share some thoughts, some things I learn and some tools. Last months I have been playing around with rainbow tables, hashing algorithms, brute forcers and other related topics. This has resulted in a couple of topics that I plan to blog about:

* A new set of rainbow tables that are currently generated by the Free Rainbow Tables project (link at the right). This set covers passwords from a special character set, specifically for LM hashes.

* rcracki_mt, a multi-threaded version of the tool rcracki, used for 'cracking' password hashes with the rainbow tables generated by the Free Rainbow Tables project.

* A tool for doing not only case correction with a cracked LM hash and the accompanying NTLM hash, but also do something I call unicode correction. I still need to release this tool, but I already implemented the code in rcracki_mt.

* A first attempt at optimizing the MD4 hashing algorithm that is used for NTLM hashes, using code provided by other people. My version is not that useful, but I learned some nice things.

* A SHA-1 implementation optimized for cracking passwords with a length < 16. First meant for use in rcracki_mt, but then I started liking optimizations. So by now I have sort of a SHA-1 brute forcer, using SSE2 and with the last 3 steps reversed, the first 4 steps unrolled and precalculated as much as possible. I got some tips, hints and directions from Svarychevski Michail Aleksandrovich regarding the use of SSE2, tnx for that. Btw, he has an awfully fast MD5 brute forcer, for both CPU (SSE2 optimized) and for GPU (both CUDA (nVidia) and Brook (AMD/ATI).

I plan to blog about these topics in the coming [days... eh, no, weeks... nooo,] months. I'll also post the tools I made along with the code, so maybe other people might profit from the things I learned as well. Or just so others that actually do have some l33t programming skills might optimize things even more. So, check back later and feel free to leave a note :)

Saturday, October 3, 2009

Wednesday, September 30, 2009

Monday, September 7, 2009

Wednesday, August 5, 2009

Tuesday, July 28, 2009

Tuesday, July 21, 2009

Monday, July 20, 2009

Wednesday, July 15, 2009

Monday, June 8, 2009

Sunday, June 7, 2009

Friday, May 29, 2009

Wednesday, May 27, 2009

Monday, May 25, 2009

Sunday, May 17, 2009

Friday, May 15, 2009

Thursday, May 14, 2009

Monday, May 11, 2009

Sunday, May 3, 2009

Thursday, April 30, 2009

Saturday, January 17, 2009

Friday, January 16, 2009

Tuesday, January 13, 2009

About Me

Tools

Blog Archive

Links