We here at Dataspace thought it would be helpful to share some intriguing examples of how data can be easily manipulated to bring efficiency and value into our day-to-day lives. This week we will focus on how the daunting and time-consuming task of learning a foreign language can be made easier by taking a data-driven approach.
With a little help from my Friends
A few weeks ago, data analyst and researcher Tomi Mester of data36.com posted a detailed article about his unorthodox method for quickly understanding the Swedish language. Mester knew that he didn’t have the time it takes to truly gain fluency in a foreign language, much less a language as difficult as Swedish. But he wanted to pursue it as a “hobby language” to make his time in Sweden a more rewarding and accessible experience.
Instead of picking up a phrasebook or downloading a language-learning app, Mester found all of the Swedish subtitles for two popular sitcoms: How I Met Your Mother and Friends. A few lines of code later, he had every word used in these series organized by frequency of usage. He made a cut-off at the top 1000 words and asked, “How much will I understand if I learn only these 1000 words?”
By comparing the total occurrences of these top words to the total number of words counted in the scripts, he realized that learning just 1000 words would cover as much as 85% of the words used in the sitcoms!
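The coverage calculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mester's actual code (he worked in Bash). The function names and the tiny Swedish sample text are made up for the example.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word appears, ignoring case and punctuation."""
    # [^\W\d_]+ matches runs of letters, including Swedish å, ä and ö.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


def coverage(counts, top_n):
    """Share of all word occurrences covered by the top_n most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_total = sum(n for _, n in counts.most_common(top_n))
    return top_total / total


# Hypothetical sample standing in for the full subtitle collection.
sample = "jag vet inte. jag vet! vet du det? nej, jag vet inte det."
counts = word_counts(sample)

print(counts.most_common(2))          # the two most frequent words
print(f"{coverage(counts, 2):.0%}")   # share of the text they cover
```

With real subtitles you would pass the full text and set `top_n` to 1000 to answer the same question Mester asked.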
So what tools do I need to do this myself?
Mester carried out almost all of the cleaning and analysis for this task in Bash. Bash is a command-line shell and scripting language that lets you interact with your operating system by typing commands. With a few lines of Bash code, he was able to scour the compilation of subtitles, sort its contents into an alphabetized list of individual words, clean the words into a uniform format and print the frequency of each word next to it in a .csv export.
But, you say, “I don’t know Bash. How can this be easy?” Two answers. First, if you look carefully at the lines of Tomi’s code, you start to realize that you could duplicate his program in a tool you probably do know: Excel. Second, if you’ve worked in a command-line environment (anyone remember DOS?), you’ll find that Bash isn’t too hard to pick up. Give it an hour or two and you’ll know the basics. Here’s a great place to start.
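For readers more comfortable in Python than Bash, the same clean, count and export steps might look roughly like the sketch below. It assumes the subtitles are in the common SRT format (numbered cues with timestamp lines), which the original article may or may not have used. The helper names and the sample subtitle text are hypothetical.

```python
import csv
import io
import re
from collections import Counter

TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")  # SRT timestamp line
TAG = re.compile(r"<[^>]+>")                            # markup such as <i>...</i>
WORD = re.compile(r"[^\W\d_]+")                         # runs of letters


def clean_lines(srt_text):
    """Yield only the dialogue lines, lowercased, with markup removed."""
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue  # skip blank lines, cue numbers and timestamps
        yield TAG.sub("", line).lower()


def frequency_rows(srt_text):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter()
    for line in clean_lines(srt_text):
        counts.update(WORD.findall(line))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical two-cue subtitle snippet.
sample_srt = """1
00:00:01,000 --> 00:00:02,500
<i>Hej!</i> Hur mår du?

2
00:00:03,000 --> 00:00:04,000
Hej, jag mår bra.
"""

rows = frequency_rows(sample_srt)
print(to_csv(rows))
```

One known limitation of this sketch: a line of dialogue consisting only of digits would be mistaken for a cue number and skipped.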
Is data science really this easy?
Mester emphatically notes that, unfortunately, this is lightweight data science at best. But it’s a start, and it does provide a nice introduction to the steps in the data science process. Learning tools such as Python, R, Bash and SQL represents another meaningful step in the direction of becoming a data scientist, or at least a citizen data scientist.
The point here is that data can be useful in unexpected places, and making it useful is not always a complicated endeavor.