What does it mean to hash data and do I really care?
What is Hashing?
Hashing is simply passing some data through a formula that produces a result, called a hash. That hash is usually a string of characters, and the hashes generated by a given formula are always the same length, regardless of how much data you feed into it. For example, the MD5 formula always produces hashes that are 32 characters long (32 hexadecimal digits, representing 128 bits). Whether you feed in the entire text of MOBY DICK or just the letter C, you’ll always get 32 characters back.
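If you'd like to see the fixed-length behavior for yourself, here is a minimal Python sketch using the standard hashlib module. The helper name md5_hex and the repeated sentence standing in for a long book are my own illustrative choices.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# One letter versus a long text (a repeated sentence stands in for a whole book).
short_hash = md5_hex("C")
long_hash = md5_hex("Call me Ishmael. " * 10_000)

print(len(short_hash), len(long_hash))  # 32 32
```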
Finally (and this is important) each time you run that data through the formula, you get the exact same hash out of it. So, for example, the MD5 hash of the string Dataspace is e2d48e7bc4413d04a4dcb1fe32c877f6. It will return that same value every time. Try it yourself with any online MD5 calculator.
Changing even one character will produce an entirely different result. For example, the MD5 hash of dataspace with a lowercase d is 8e8ff9250223973ebcd4d74cd7df26a7.
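Both properties (same input, same hash; slightly different input, very different hash) are easy to check in a few lines. This is just a sketch, and md5_hex is a helper name I made up for it.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


first = md5_hex("Dataspace")
second = md5_hex("Dataspace")
lowercase = md5_hex("dataspace")

print(first == second)     # True: same input, same hash
print(first == lowercase)  # False: one changed character, different hash

# Count how many of the 32 hex characters differ between the two hashes.
differing = sum(a != b for a, b in zip(first, lowercase))
print(differing)
```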
Hashing is One-Way
Hashing works in one direction only: for a given piece of data, you’ll always get the same hash, BUT you can’t turn a hash back into its original data. If you need to go in both directions, you need encryption rather than hashing.
With encryption, you pass some data through an encryption formula and get a result that looks something like a hash. The big difference is that you can take the encrypted result, run it through a decryption formula (with the right key), and get your original data back.
Remember, hashing is different – you can’t get your original data back simply by running a formula on your hash (a bit about how to hack these, though, in a moment).
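To make the contrast concrete, here is a small sketch. Python's standard library has no real encryption routines, so I'm assuming a toy XOR cipher (xor_cipher, with a made-up key) purely for illustration; it is not secure and not something to use in practice. The point is only that the encrypted value can be reversed with the key, while the hash has no matching "unhash" step.

```python
import hashlib
from itertools import cycle


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy cipher: XOR each byte with a repeating key. NOT secure."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


message = b"Dataspace"
key = b"secret"  # hypothetical key, for illustration only

# Two directions: encrypt, then decrypt with the same key.
encrypted = xor_cipher(message, key)
decrypted = xor_cipher(encrypted, key)
print(decrypted == message)  # True

# One direction: hashlib can produce the hash, but offers no inverse function.
digest = hashlib.md5(message).hexdigest()
print(digest)
```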
What Hash Formulae are Available?
There are a huge number of widely accepted hashing algorithms available for general use. For example, MD5, SHA1, SHA224, SHA256, Snefru… Over time, newer formulae have become more complex and produce longer hashes, which are harder to hack.
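You can see the "longer hashes" point by running the same string through several of these formulae. The sketch below uses hashlib.new from Python's standard library; Snefru is left out because, as far as I know, hashlib does not include it.

```python
import hashlib

text = "Dataspace".encode("utf-8")

# Hash the same input with several algorithms and record each hash length.
lengths = {}
for name in ("md5", "sha1", "sha224", "sha256"):
    digest = hashlib.new(name, text).hexdigest()
    lengths[name] = len(digest)
    print(f"{name:>7}: {len(digest)} hex characters")

# md5 gives 32, sha1 gives 40, sha224 gives 56, sha256 gives 64
```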
Hashing capability is available in standard libraries in common programming languages. Here’s a quick example coded in Python (call me if you’d like to walk through this code – I’d love to chat!):
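import hashlib  # Python's standard-library hashing module
print(hashlib.md5("Dataspace".encode("utf-8")).hexdigest())  # print the MD5 hash as a hex string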
The result comes back as: e2d48e7bc4413d04a4dcb1fe32c877f6
Notice that it’s the same as the hash value we created earlier! In the words of Bernadette Peters in THE JERK, “This s__t really works!”
Hashing and Passwords
When an online system stores your credentials, it usually stores both your username and password in a database. There’s a problem here, though: any employee who accesses the database, or any hacker who breaks into the system, can see everyone’s username and password. They can then go out to the logon screen for that system, type in that username and password, and get access to anything that you are allowed to do on that system.
However, if the system stores your password as a hash, then seeing it won’t do a hacker any good. He can see that the hash is, for example, 5f4dcc3b5aa765d61d8327deb882cf99, but he can’t use that to get into the system and look like you. He has no way of knowing that your password (i.e. the value you type into a logon screen) is actually the word password. On the system’s side, whenever you log in, it takes the password you give it, runs it through its hash formula and compares the result to what’s in its database. If they match, you’re in!
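Here is roughly what that login check could look like in Python. It's a sketch under assumptions: the stored_hashes dictionary stands in for the database, and the register and check_login names are invented for this example. It uses MD5 only to match the article; a real system would pick a formula designed for passwords.

```python
import hashlib
import hmac


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the credentials table: username -> password hash.
stored_hashes = {}


def register(username: str, password: str) -> None:
    """Store only the hash of the password, never the password itself."""
    stored_hashes[username] = md5_hex(password)


def check_login(username: str, password: str) -> bool:
    """Hash the attempt and compare it to the stored hash."""
    stored = stored_hashes.get(username)
    if stored is None:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(md5_hex(password), stored)


register("ben", "password")
print(check_login("ben", "password"))  # True
print(check_login("ben", "Password"))  # False
```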
Can I Break a Hash? Can I Keep Someone Else From Breaking it?
Can hashes be hacked? Absolutely. One of the easiest ways is to access a list of words and the hash that each results in. For example, there are websites that publish millions of words and their related hash values. Anyone (usually a hacker, actually) can go to these sites, search for a hash value and instantly find what the value was before it was hashed.
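The idea behind those lookup sites fits in a few lines of Python. This is only a sketch: the four-word list is a made-up stand-in for the millions of entries a real site would hold, and crack is a name I chose for the lookup function.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical tiny word list; real lookup sites hold millions of entries.
word_list = ["123456", "letmein", "password", "qwerty"]

# Precompute hash -> word once, so every later lookup is instant.
lookup = {md5_hex(word): word for word in word_list}


def crack(hash_value: str):
    """Return the original word if its hash is in the table, else None."""
    return lookup.get(hash_value)


stolen_hash = md5_hex("password")
print(crack(stolen_hash))                               # password
print(crack(md5_hex("correct horse battery staple")))  # None
```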
To protect against this, security professionals use a technique known as salting. To salt a hash, simply append a known value to the string before you hash it. For example, if before it’s stored in a database every password is salted with the string ‘dog’, it will likely not be found in online databases. So, password salted with dog (i.e. passworddog) and then run through the MD5 calculator becomes 854007583be4c246efc2ee58bf3060e6. (In practice, systems typically use a different random salt for each password and store it alongside the hash, but a single fixed salt keeps the example simple.)
To use these passwords when you log in, the system takes the password that you enter, appends the word ‘dog’ to it, runs that string through the hashing algorithm, and finally looks up the result in its database to see if you’re really authorized and if you’ve typed in the right password.
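The salted login flow can be sketched like this. The first part follows the ‘dog’ example above; the per-user random salt at the end is my own variation (an assumption beyond the article), and hash_password and verify are invented names.

```python
import hashlib
import hmac
import secrets


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


SALT = "dog"  # the article's example salt


def hash_password(password: str, salt: str = SALT) -> str:
    """Append the salt to the password, then hash the combined string."""
    return md5_hex(password + salt)


def verify(attempt: str, stored_hash: str, salt: str = SALT) -> bool:
    """Salt and hash the attempt the same way, then compare to the stored hash."""
    return hmac.compare_digest(hash_password(attempt, salt), stored_hash)


stored = hash_password("password")
print(stored != md5_hex("password"))  # True: not the hash a lookup site would list
print(verify("password", stored))     # True
print(verify("passw0rd", stored))     # False

# Variation (my assumption): a random salt per user, stored next to the hash.
user_salt = secrets.token_hex(16)
user_hash = hash_password("password", user_salt)
print(verify("password", user_hash, user_salt))  # True
```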
Hey Ben, Do You Know of Other Cool Uses for Hashing?
Why, yes, there are some other great uses for hashing beyond storing passwords. Here are three:
- Fighting computer viruses: When a computer virus ‘infects’ a program it does so by changing some of the code in that program, making it do something malicious. One way to protect against viruses, therefore, is to create a hash value for a program when it’s distributed to users (i.e. run the computer code through a hashing algorithm and get a hash). Then, whenever that program is run, create a new hash value for the file you’re about to run. Compare the new hash to the original hash. If the two values match then you’re fine. If they don’t match, someone has fiddled with your copy of the program.
- Change data capture: When reading data into a data warehouse we frequently want to know if any records in our source system changed. To do this we sometimes read every field in every source record and compare it to every field in the related record in our data warehouse – a complex process that requires a lot of computer cycles. However, we can speed it up as follows:
  - Read all the fields in the source record, concatenate them together, and create a hash of the result
  - Compare that hash to a hash value that was stored on the related record in the data warehouse when it was last updated
  - If the two don’t match, you know that the source record has changed and the changes should be migrated to the warehouse
- Creating smart keys: Dataspace recently released a software as a service (SaaS) product called Golden Record. Golden Record helps data professionals identify and link records together across databases. For example, it can tell you when the same person appears in a database and in a separate spreadsheet. Internally, the product uses hashes extensively. For example, each match is assigned a ‘key’. That key is actually a hash! This is different from traditional mechanisms where records, in this case matches, are assigned the next available sequential number as a key. Here’s why this is useful: because Golden Record knows the formula it used to create that hash, and also knows the data that was used to create that key, it can easily find any record or match. If, instead, the traditional, sequential number were used, the software would have to read through every record in its list of matches until it came to the one it needs.
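For the virus-fighting idea in the first bullet, a file check might look something like the sketch below. I'm assuming SHA256 and a throwaway temporary file standing in for the distributed program; file_hash is a name I made up.

```python
import hashlib
import os
import tempfile


def file_hash(path: str) -> str:
    """Hash a file's contents in chunks so large files need little memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# A throwaway file stands in for the program that was distributed to users.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"print('hello')\n")
    path = tmp.name

published_hash = file_hash(path)  # computed when the program is distributed
unchanged = file_hash(path) == published_hash

# Simulate an infection by appending code to the file.
with open(path, "ab") as f:
    f.write(b"do_something_malicious()\n")
tampered = file_hash(path) != published_hash

os.remove(path)
print(unchanged, tampered)  # True True
```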
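The change data capture steps in the second bullet can be sketched as follows. The field names, sample records, and the record_hash helper are all hypothetical. One detail I've added: the fields are joined with a separator character so that values like ("ab", "c") and ("a", "bc") don't produce the same hash.

```python
import hashlib

FIELDS = ["id", "name", "city"]  # hypothetical record layout


def record_hash(record: dict) -> str:
    """Concatenate every field (with a separator) and hash the result."""
    joined = "\x1f".join(str(record[field]) for field in FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hashes stored in the warehouse when each record was last loaded.
warehouse_rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Grace", "city": "New York"},
]
warehouse_hashes = {row["id"]: record_hash(row) for row in warehouse_rows}

# Current state of the source system: record 2 has moved.
source_rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Grace", "city": "Arlington"},
]

# One hash comparison per record instead of one comparison per field.
changed_ids = [
    row["id"]
    for row in source_rows
    if warehouse_hashes.get(row["id"]) != record_hash(row)
]
print(changed_ids)  # [2]
```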
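To illustrate the smart-key idea in general terms, here is a sketch of hash-derived keys. To be clear, this is not Golden Record's actual implementation; the match_key function, the choice of SHA256, the lowercase-and-trim normalization, and the sample data are all assumptions of mine. It shows how knowing the formula and the data lets you jump straight to a record rather than scanning a list.

```python
import hashlib


def match_key(*values: str) -> str:
    """Derive a key from the data itself: normalize, join, and hash."""
    normalized = "|".join(value.strip().lower() for value in values)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical store of matches, keyed by hash instead of a sequence number.
matches = {}


def store_match(name: str, email: str, details: dict) -> str:
    """Save a match under its hash-derived key and return that key."""
    key = match_key(name, email)
    matches[key] = details
    return key


store_match("Ada Lovelace", "ada@example.com", {"sources": ["crm", "spreadsheet"]})

# Later: recompute the key from the same data and look it up directly,
# with no need to read through every stored match.
found = matches.get(match_key("ada lovelace ", "ADA@example.com"))
print(found)  # {'sources': ['crm', 'spreadsheet']}
```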
OK, this one got a little out of hand. I was asked to write a short paragraph for our monthly email and ended up with four pages of text. Thanks for hearing me out. I just think the concept of and uses for hashes are way cooler than most people realize.
If you’d like to talk about hashes, Python, data science, big data, or World War II aviation, please get in touch – I’d love to chat!