What does it mean to hash data and do I really care?
What is Hashing?
Hashing is simply passing some data through a formula that produces a result, called a hash. That hash is usually a string of characters, and the hashes generated by a given formula are always the same length, regardless of how much data you feed into it. For example, the MD5 formula always produces hashes that are 32 characters long. Regardless of whether you feed in the entire text of MOBY DICK or just the letter C, you’ll always get 32 characters back.
Finally (and this is important) each time you run that data through the formula, you get the exact same hash out of it. So, for example, the MD5 formula for the string Dataspace returns the value e2d48e7bc4413d04a4dcb1fe32c877f6. Every time it will return that same value. Here, try it yourself.
Changing even one character will produce an entirely different result. For example, the MD5 for dataspace with a small d yields 8e8ff9250223973ebcd4d74cd7df26a7.
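The three properties described so far (fixed length, repeatability, and sensitivity to small changes) can be checked with a few lines of Python. This is a minimal sketch using the standard hashlib module; the helper name md5_hex and the repeated sentence standing in for a long text are assumptions made for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a string of hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


short_hash = md5_hex("C")
# A long repeated sentence stands in for the full text of a novel.
long_hash = md5_hex("Call me Ishmael. " * 10_000)

# Both hashes have the same length, whatever the size of the input.
print(len(short_hash), len(long_hash))

# The same input always gives the same hash.
same_input_same_hash = md5_hex("Dataspace") == md5_hex("Dataspace")
print(same_input_same_hash)

# Changing one character (here, the case of the first letter) changes the hash.
case_changes_hash = md5_hex("Dataspace") != md5_hex("dataspace")
print(case_changes_hash)
```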
Hashing is One-Way
Hashing works in one direction only – for a given piece of data, you’ll always get the same hash BUT you can’t turn a hash back into its original data. If you need to go in two directions, you need encrypting, rather than hashing.
With encrypting, you pass some data through an encryption formula and get a result that looks something like a hash. The biggest difference is that you can take the encrypted result, run it through a decryption formula, and get your original data back.
Remember, hashing is different – you can’t get your original data back simply by running a formula on your hash (a bit about how to hack these, though, in a moment).
What Hash Formulae are Available?
There are a huge number of widely accepted hashing algorithms available for general use. For example, MD5, SHA1, SHA224, SHA256, Snefru… Over time these formulae have become more complex and produce longer hashes which are harder to hack.
Hashing capability is available in standard libraries in common programming languages. Here’s a quick example coded in Python (call me if you’d like to walk through this code – I’d love to chat!):
import hashlib
print(hashlib.md5("Dataspace".encode("utf-8")).hexdigest())  # hexdigest() returns the hash as text
The result comes back as: e2d48e7bc4413d04a4dcb1fe32c877f6
Notice that it’s the same as the hash value we created earlier! In the words of Bernadette Peters in THE JERK, “This s__t really works!”
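The same library covers several of the other formulae mentioned above. The sketch below compares the length of the hash each one produces; it assumes Python's hashlib, which includes MD5 and the SHA family but not Snefru, so Snefru is left out.

```python
import hashlib

data = "Dataspace".encode("utf-8")

# hashlib.new() looks up an algorithm by name.
hex_lengths = {
    name: len(hashlib.new(name, data).hexdigest())
    for name in ("md5", "sha1", "sha224", "sha256")
}

for name, length in hex_lengths.items():
    print(f"{name}: {length} hexadecimal characters")
```

The newer formulae in the list produce the longer hashes: 32 characters for MD5, 40 for SHA1, 56 for SHA224 and 64 for SHA256.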
Hashing and Passwords
When an online system stores your credentials, it usually stores both your username and password in a database. There’s a problem here, though: any employee who accesses the database, or any hacker who breaks into the system, can see everyone’s username and password. They can then go out to the logon screen for that system, type in that username and password, and get access to anything that you are allowed to do on that system.
However, if the system stores your password as a hash, then seeing it won’t do a hacker any good. He can see that the hash is, for example, 5f4dcc3b5aa765d61d8327deb882cf99, but he can’t use that to get into the system and look like you. He has no way of knowing that your password (i.e. the value you type into a logon screen) is actually the word password. On the system’s side, whenever you log in, it takes the password you give it, runs it through its hash formula and compares the result to what’s in its database. If they match, you’re in!
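Here is a rough sketch of that login check in Python. The user table, the username alice and the function names are hypothetical; the stored value is the hash from the paragraph above. A real system would use a database rather than a dictionary.

```python
import hashlib
import hmac

# Hypothetical user table: username -> MD5 hash of that user's password.
# The system never stores the password itself.
stored_hashes = {"alice": "5f4dcc3b5aa765d61d8327deb882cf99"}


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a string of hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def check_login(username: str, password: str) -> bool:
    """Hash the password that was typed in and compare it to the stored hash."""
    stored = stored_hashes.get(username)
    if stored is None:
        return False
    # compare_digest compares the two strings in constant time.
    return hmac.compare_digest(md5_hex(password), stored)


print(check_login("alice", "password"))
print(check_login("alice", "Password"))
```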
Can I Break a Hash? Can I Keep Someone Else From Breaking it?
Can hashes be hacked? Absolutely. One of the easiest ways is to access a list of words and the hash that each results in. For example, there are websites that publish millions of words and their related hash values. Anyone (usually a hacker, actually) can go to these sites, search for a hash value and instantly find what the value was before it was hashed.
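The idea behind those lookup sites can be sketched in a few lines of Python. The four-word list below is a tiny, made-up stand-in for the millions of entries such sites hold.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hash of text as a string of hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Tiny illustrative word list; real lists are vastly larger.
common_words = ["123456", "password", "qwerty", "letmein"]

# Hash every word once, up front, and index the words by their hash.
lookup = {md5_hex(word): word for word in common_words}

# A hash copied from a stolen database can now be looked up directly.
stolen_hash = "5f4dcc3b5aa765d61d8327deb882cf99"
recovered = lookup.get(stolen_hash)
print(recovered)
```

Nothing is being reversed here. The attacker hashes likely passwords in advance and waits for a match.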
To protect against this, security professionals use a technique known as salting. To salt a hash, simply append a known value to the string before you hash it. For example, if before it’s stored in a database every password is salted with the string ‘dog’, it will likely not be found in online databases. So, password salted with dog (i.e. passworddog) and then run through the md5 calculator becomes 854007583be4c246efc2ee58bf3060e6.
To use these passwords when you log in, the system takes the password that you enter, appends the word ‘dog’ to it, runs that string through the hashing algorithm, and finally looks up the result in its database to see if you’re really authorized and if you’ve typed in the right password.
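The salting steps above might look like this in Python. The first part follows the ‘dog’ example directly. The second part is a variation that is not from the article: a random salt generated for each user and stored next to the hash, which is the more common arrangement in practice. All function and variable names are assumptions.

```python
import hashlib
import hmac
import secrets


def salted_md5(password: str, salt: str) -> str:
    """Append the salt to the password, then hash the combined string."""
    return hashlib.md5((password + salt).encode("utf-8")).hexdigest()


# Fixed salt, as in the 'dog' example: the result no longer matches
# the hash of the bare word that a lookup site would hold.
unsalted = salted_md5("password", "")
fixed_salt_hash = salted_md5("password", "dog")
print(unsalted == fixed_salt_hash)


def new_record(password: str) -> dict:
    """Variation (an assumption): create a random salt for each user."""
    salt = secrets.token_hex(16)
    return {"salt": salt, "hash": salted_md5(password, salt)}


def verify(record: dict, password: str) -> bool:
    """Repeat the salted hash at login and compare it to the stored value."""
    candidate = salted_md5(password, record["salt"])
    return hmac.compare_digest(candidate, record["hash"])


record = new_record("password")
print(verify(record, "password"), verify(record, "passw0rd"))
```

MD5 is used here only to stay consistent with the examples above. For real password storage, Python's hashlib also provides deliberately slow functions such as pbkdf2_hmac and scrypt.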
Hey Ben, Do You Know of Other Cool Uses for Hashing?
Why, yes, there are some other great uses for hashing beyond storing passwords. Here are three:
- Fighting computer viruses: When a computer virus ‘infects’ a program it does so by changing some of the code in that program, making it do something malicious. One way to protect against viruses, therefore, is to create a hash value for a program when it’s distributed to users (i.e. run the computer code through a hashing algorithm and get a hash). Then, whenever that program is run, create a new hash value for the file you’re about to run. Compare the new hash to the original hash. If the two values match then you’re fine. If they don’t match, someone has fiddled with your copy of the program.
- Change data capture: When reading data into a data warehouse we frequently want to know if any records in our source system changed. To do this we sometimes read every field in every source record and compare it to every field in the related record in our data warehouse – a complex process that requires a lot of computer cycles. However, we can speed it up as follows:
  - Read all the fields in the source record, concatenate them together, and create a hash of the result
  - Compare that hash to a hash value that was stored on the related record in the data warehouse when it was last updated
  - If the two don’t match, you know that the source record has changed and the changes should be migrated to the warehouse
- Creating smart keys: Dataspace recently released a software as a service (SaaS) product called Golden Record. Golden Record helps data professionals identify and link records together across databases. For example, it can tell you when the same person appears in a database and in a separate spreadsheet. Internally, the product uses hashes extensively. For example, each match is assigned a ‘key’. That key is actually a hash! This is different from traditional mechanisms where records, in this case matches, are assigned the next available sequential number as a key. Here’s why this is useful: because Golden Record knows the formula it used to create that hash, it can easily find any record / match because it also knows the data that was used to create that key. If, instead, the traditional, sequential number were used, the software would have to read through every record in its list of matches until it came to the one it needs.
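The change data capture steps, and the general idea of using a hash as a key, can be sketched in Python as follows. The records, field names and helper function are all made up for illustration, and this is not a description of how Golden Record works internally.

```python
import hashlib


def record_hash(record: dict) -> str:
    """Concatenate every field in a fixed order and hash the result."""
    # A separator keeps ("ab", "c") from looking the same as ("a", "bc").
    joined = "|".join(str(record[field]) for field in sorted(record))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical warehouse: record id -> hash stored at the last load.
warehouse_hashes = {
    101: record_hash({"id": 101, "name": "Ann", "city": "Ann Arbor"}),
    102: record_hash({"id": 102, "name": "Bob", "city": "Detroit"}),
}

# Hypothetical source rows; record 102 has changed since the last load.
source_rows = [
    {"id": 101, "name": "Ann", "city": "Ann Arbor"},
    {"id": 102, "name": "Bob", "city": "Lansing"},
]

# One comparison per record instead of one comparison per field.
changed_ids = [
    row["id"]
    for row in source_rows
    if record_hash(row) != warehouse_hashes.get(row["id"])
]
print(changed_ids)

# Hash as a key: recompute the key from the data and go straight to the
# record, with no need to scan the whole list.
records_by_key = {record_hash(row): row for row in source_rows}
found = records_by_key[record_hash({"id": 101, "name": "Ann", "city": "Ann Arbor"})]
print(found["name"])
```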
OK, this one got a little out of hand. I was asked to write a short paragraph for our monthly email and ended up with four pages of text. Thanks for hearing me out. I just think the concept of and uses for hashes are way cooler than most people realize.
If you’d like to talk about hashes, Python, data science, big data, or World War II aviation, please get in touch – I’d love to chat!
So, not being very savvy about these things, my question is: suppose one has a 15-character password, but 3 of the characters are the same, e.g. 3 “e”s in the password. Does that reduce it to only a 12-character password, and hence make it easier to hack?
No, each additional character makes the password harder to hack so, all things being equal, longer passwords are always better than shorter ones. Of course, if you go from a password of five random characters to one that’s an English word that’s six characters long, longer isn’t better. But, if you go from five random characters to six random characters, your password will be tighter.
In the end, most passwords are stored as hash values. So, to get a sense of how they work, you might want to play with a simple hashing tool. I use this one and it’s free: http://www.miraclesalad.com/webtools/md5.php. With every character you add, the hash changes. And, if you change one of your e’s to an uppercase E, the hash will change.
Hope that helps. Thanks for the question!
I would like to say thank you for this clear and simple explanation. If you write a book about the subject, I will buy it.
Wonderfully clear explanation of concepts!
Wonderful explanation that even a non-techie can get. If you wrote about hashes and crypto mining, I would love to read it.
Why does hashing, as it relates to blockchain building, create value that translates into bitcoin and other crypto? I fail to understand the value and how it’s created.
Ben, thank you for this comprehensive explanation. I do have another question: I did a little testing and found that generating the HASH for a file (I used a text file with one word in it “test”) then adding a space at the end “test “, expectedly resulted in two different HASH values. When I changed the file back by deleting the space (going back to “test”), I got the (not unexpected) same HASH value. However, when using this concept for your virus example, could a bad actor not insert malicious code, execute said code, then as part of the execution delete that code to return to the “original” version, thereby defeating the HASH verification? If what I’m asking doesn’t make sense, please let me know.
Thanks for the question. I’m glad the hashing worked, you had me worried there for a second :)
The answer to your question is that the hash must be checked before the code is run, not after. In practice, you usually do this when you download the software to your hard drive. After that, most people assume that it hasn’t changed and run it without checking. I suspect, though, that certain virus scanning programs do periodically validate hash totals, maybe even every time you run a program.
It might help to quickly check out this link: https://dev.mysql.com/downloads/connector/python/ This is the download site for the MySQL Python connector. You’ll see that, for each download, they publish an MD5 checksum and specifically suggest that you check it after you download. They also offer an even more reliable check, GnuPG signatures (they provide a link for more info on that).
I hope this helps. Thanks for the question!
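For anyone who wants to try the checksum comparison discussed in this thread, here is a minimal Python sketch. The function names are assumptions, and a throwaway temporary file containing the word “test” stands in for a real download and its published checksum.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare a file's hash to the checksum published by its distributor."""
    return file_md5(path) == published_hex.strip().lower()


# A throwaway file stands in for the download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"test")
    demo_path = tmp.name

# Stand-in for the checksum shown on the download page.
published = hashlib.md5(b"test").hexdigest()
ok_before = matches_published(demo_path, published)

# Add a trailing space, as in the experiment described above.
with open(demo_path, "ab") as handle:
    handle.write(b" ")
ok_after = matches_published(demo_path, published)

os.remove(demo_path)
print(ok_before, ok_after)
```

The check is meaningful only if it runs before the file is executed, and only if the published checksum comes from a source you trust.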