As you may know, many big data technologies are defined as schema on read. What this means is that you can throw whatever you want on the disk and then, when you need the data, you tell the data store what that data means (e.g. the second column contains price per unit). Traditional relational databases, on the other hand, are schema on write – you tell the system what your data will look like before you put the data into it. And, if your data doesn’t look like you said, the system will reject it (I originally said, ‘barf it back to you’ but that line was removed in editing).
I was recently reminded of the value of schema on write systems when I ported data from an old forecasting system (ex-Dataspace cadets may remember a system called PrOps) to a new, cloud-based system we’ve developed. I created the new schema in Google Cloud SQL, exported data from our old, MS Access database, and tried to import it into the new schema. Lo and behold, much of the data got rejected. Here’s just one reason why: MS Access had prefixed every dollar value with a dollar sign. Thus, the new database saw those values as strings, not floating point numbers. I then had to put in some effort to fix the load routines to address this.
Suppose for a moment that our new database was in a schema on read technology, like Hadoop. It would have just let the data in, dollar signs and all. The problem wouldn’t be exposed until later, when someone tried to work with that data, getting strings when they expected floats and potentially getting a faulty result to their query.
It’s not that schema-on-read systems are bad, it’s just that they have their place. When developing applications where exactness is paramount, think very carefully about whether or not schema-on-read makes the most sense.