If nothing else, the data science industry is good at coming up with new, unique, confusing names and terms. ZooKeeper, MapReduce, Hadoop, Pig, Storm, Mahout, MongoDB…the list keeps growing, and it’s totally understandable if you can’t always identify or explain the different technologies and tools of the industry.
In each article in this series, we will select a term and give you a quick background on what it means and its implications for your data science efforts and broader data strategy. This week we begin with Kafka!
Apache Kafka: What is it?
Originally developed at LinkedIn, Kafka is now an open-source project of the Apache Software Foundation, the well-known steward of open-source software. It is a self-described “distributed streaming platform.” The project’s website reads: “Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”
What problems does it solve?
Kafka allows you to pass data between systems in real time, as transactions occur. Imagine a situation where you have multiple web properties, all kicking out transactions, and multiple downstream systems that need to read those transactions (e.g., a CRM, a data warehouse, and an order management system). Each of those systems could build a connection directly to the sources, creating a brittle, spaghetti-like architecture of interwoven systems.
With Kafka, however, each of those sources (known in Kafka as producers) writes its data only to Kafka, and each of the downstream systems (known in Kafka as consumers) reads the data from Kafka. Producers and consumers only need to know about Kafka, not about each other, so the data stays organized and easy to access.
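To make that concrete, here is a minimal sketch using the Python confluent-kafka client. The broker address, the “orders” topic, and the group id are illustrative assumptions, not details from any particular deployment:

```python
from confluent_kafka import Producer, Consumer

# A producer (e.g., one of the web properties) writes each transaction
# to a Kafka topic. Broker address and topic name are assumptions.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("orders", key="order-1001", value='{"sku": "A42", "qty": 2}')
producer.flush()  # block until the message is delivered

# A consumer (e.g., the CRM or data warehouse) reads from the same topic.
# Each downstream system uses its own consumer group.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "crm-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```

Note that the producer never learns which systems consume its data; adding a new downstream reader is just a matter of starting another consumer group against the same topic.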
Kafka also provides mechanisms for transforming data as it flows through, called stream processors. It’s starting to sound a bit like a combination of our traditional staging area and ETL, isn’t it?
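Kafka’s native Streams API for building such processors is Java/Scala; purely as an illustration of the idea, the same consume-transform-produce pattern can be sketched with the Python client used above (topic names and the enrichment step are again hypothetical):

```python
import json
from confluent_kafka import Consumer, Producer

# Sketch of the processor concept: read raw events, transform them, and
# write the result to a new topic. Kafka Streams does this natively with
# stronger processing guarantees; this loop only shows the pattern.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders"])  # assumed input topic

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    order = json.loads(msg.value())
    order["total"] = order["qty"] * 19.99  # hypothetical enrichment
    producer.produce("orders-enriched", json.dumps(order))  # assumed output topic
    producer.flush()
```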
Kafka can also retain this data for a configurable period, allowing downstream systems to replay history should they ever lose it.
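In practice, replaying history can be as simple as starting a fresh consumer group with auto.offset.reset set to earliest, so reading begins at the oldest message Kafka still retains. A rough sketch, with assumed names as before:

```python
from confluent_kafka import Consumer

# A brand-new group id has no saved offsets, so with
# auto.offset.reset="earliest" this consumer starts from the beginning
# of the retained log and can rebuild downstream state from history.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-rebuild",  # fresh group => no committed offsets
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:
        break  # roughly: caught up with the retained history
    if msg.error() is None:
        pass  # re-apply this historical record to the downstream store
consumer.close()
```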
Who’s using it?
Kafka is used by thousands of companies, including Twitter, PayPal, Netflix, and many other significant players.
One good example of Kafka in action is Walmart, a particularly massive company with a wide variety of data sources and users. Walmart began evaluating options for scalable data processing systems three years ago and ultimately adopted Kafka as a company-wide, multi-tenant data hub.
It has allowed Walmart to onboard sellers and launch product listings faster by speeding up the processing of data from its various sources. By centralizing all incoming data updates in Kafka, Walmart can use the same data for activities such as reprocessing catalog information, analytics, and A/B testing, instead of each activity pulling data directly from the source systems.