UPDATE: 2020 January 20 – When I originally wrote this post, the term data lake wasn’t in vogue. Well, times have changed! Nowadays, the term sandbox has largely been replaced with data lake. The concept is the same. Now, the text is updated!
A few weeks ago I wrote about the idea that there are really only two uses for BI: analysis and business process improvement. Today, let’s focus a bit more on analytic users / analysts. In particular, let’s consider the concept of the data ‘sandbox’ or data lake. A sandbox / data lake is an area of storage where a few highly skilled users can import and manipulate large volumes of data. The whole point is that these users frequently need data other than what’s in the warehouse. We’ve encountered the sandbox / data lake concept a few times recently at Dataspace, both in our consulting engagements and in conversations with other data warehousing professionals.
As a rule, sandbox / data lake data is only needed by a small user group and it’s only needed temporarily. Once the necessary knowledge is squeezed out, the data can be discarded. If the data doesn’t fit these criteria, it may be a good candidate for addition to the data warehouse for widespread access.
Think of a sandbox / data lake as a grown up version of Microsoft Access; it’s a place without a ton of IT controls and where folks can throw and manipulate data. With this in mind, there are two basic ways to provide a sandbox / data lake. The first is as a completely separate database where users can store and play with data. This is a common use case for Hadoop and other noSQL technologies.
The other option, and one we recently recommended to a client, is to set aside a portion of a data warehouse as a sandbox / data lake. This area would only be accessible to analysts and only under certain circumstances. This approach has one major advantage: new data sets that are loaded can easily be tied to existing data warehouse tables. There’s no need to transfer warehouse data to some other environment, and then have to go back for more if the first grab doesn’t contain all the necessary information.
One danger of sandboxes and data lakes is that end-users might use them to build permanent applications outside the purview of IT. We strongly discourage this since the proper home for permanent analytic data structures is in the data warehouse itself. However, organizations frequently do need to examine data and prototype table designs before building permanent structures. A sandbox / data lake is an ideal environment for this type of work. In fact, Tom Carroll, one of our Dataspace consultants, spent a number of years in the telematics division of an auto manufacturer and used this technique extensively.
Regardless of how you implement a sandbox or data lake, it is important to establish ground rules before opening the environment to user access. Starting point guidelines might include:
- The environment is for short term studies and prototyping only, and is not to be used as a permanent home for applications or data
- Each user is assigned a controlled amount of storage, and governors are put in place to limit the computing resources they can consume
- Users must demonstrate both technical expertise and a willingness to abide by the environment’s rules before participating
- All sandbox / data lake users, and recipients of analyses produced in that environment, must understand that the analyses were produced without the level of validation and quality control provided by applications built on the “managed” portion of the data warehouse.
In order to keep the sandbox / data lake from becoming a permanent home for databases, some organizations run system reports on their tables to see how long they have existed. If something was built, say, six months ago, IT is justified in asking its owner what the purpose of the structure is and what the time frame is for the table to become a ‘certified’ piece of the data warehouse.
As for how much space to give users, the answer is determined by balancing the cost of sandbox / data lake space with the need to ensure enough space is available for promising prototyping and analysis activities.
In the end, the “democratization” of technology has made it likely that some end users will have significant database and SQL skills. It can be game changing for both these users, and their organizations, to provide them with a place to pursue data-based ideas and possibilities. However, these users must understand that they are working inside IT-supported environments. In fact, we recommend that a written agreement between these users, their managements, and IT be put in place before they are granted sandbox or data lake access. This agreement should outline the ground rules for the database, the space to be granted to the user, allowed and disallowed activities, and any other guidelines that will make for a smooth-running environment.
Does your organization have a data sandbox or data lake in place? What’s working? What’s not? Leave your comments on this post or send me an email at Benjamin.Taub@Dataspace.com.
Thanks for reading!
PS: Since this post was originally written, Dataspace has delivered a service that provides a critical service to data lake users. Golden Record can identify when the same records occur in multiple data sets. For example, it can link the customer records in your data lake collected from your CRM system to the attendee records captured in a spreadsheet that you you also loaded into your data lake. Check it out on its dedicated website.