Thursday, July 21, 2016
Data Lakes: What are they, and what are they used for?
“Data Lakes” is a term that was coined in the last decade. This form of storage grew with the advent of Big Data. Both the name and the analogy were coined by James Dixon, CTO of Pentaho. The idea behind a Data Lake is to have a place to store massive amounts of raw data that can be used in a variety of ways, depending on the need.
Data flows into the lake like a stream, filling it up. In its raw form, the lake isn’t good for anything specific other than wading around. But a user can examine the data, take samples, filter out the needed parts, or troll through it, gathering the bits of data that are helpful for a specific purpose. The data can then be cleaned and packaged into something that is useful to a consumer: think of a bottle of water.
The bottled water is like a Data Warehouse. The source of that water is the Data Lake.
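To make the analogy a little more concrete, here is a minimal Python sketch of "bottling" water from a lake. Everything in it is an assumption made up for illustration: the sample lines in `RAW_LAKE`, the `sensor` and `temp_c` field names, and the `parse_reading` and `bottle` helpers. It is not how any particular product does it, just the general shape of reading raw data and keeping only the clean, useful part.

```python
import json

# Hypothetical lake contents: raw lines from different sources, stored as-is.
RAW_LAKE = [
    '{"sensor": "a1", "temp_c": 21.5}',    # JSON reading
    'a2,19.0',                             # CSV reading from another source
    'corrupted ### line',                  # debris
    '{"sensor": "a3", "temp_c": null}',    # reading with a missing value
    '{"user": "bob", "action": "login"}',  # unrelated event
]


def parse_reading(line):
    """Return (sensor, temp_c) if the line holds a usable reading, else None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None

    if isinstance(record, dict):
        sensor, temp = record.get("sensor"), record.get("temp_c")
    else:
        # Not a JSON object, so try the two-column CSV layout instead.
        parts = line.split(",")
        if len(parts) != 2:
            return None
        sensor, temp = parts[0].strip(), parts[1].strip()

    if not sensor or temp is None:
        return None
    try:
        return sensor, float(temp)
    except (TypeError, ValueError):
        return None


def bottle(lake):
    """Filter and clean the raw lines into uniform (sensor, temp_c) rows."""
    parsed = (parse_reading(line) for line in lake)
    return [row for row in parsed if row is not None]


print(bottle(RAW_LAKE))
```

Notice that the lake itself is never changed. The debris and the unrelated events stay where they are, in case someone else needs them for a different bottle.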
Unlike a Data Warehouse, all the water, including contaminants, fish and everything else, flows into the Data Lake. It does not matter what the source was or what structure the data had when it arrived. It is stored in the Data Lake in its raw form, at the leaf level. Because the data contains everything, some people prefer to refer to it as a Data Swamp.
The data is stored as it arrives, with no structure imposed on it up front. This allows for the use of inexpensive servers and storage devices. Terabytes of data that may or may not ever be used are kept virtually indefinitely because the economics of keeping it allow for this.
The nature of a Data Lake encourages exploration. Someone with a specific need that is not satisfied by the Data Warehouse can dive in and play. Very few resources are required to simply explore or troll the Lake. If something useful is found, a more structured analysis can be devised.
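One cheap way to explore is to pull a small random sample rather than read everything. The sketch below uses reservoir sampling (the classic "Algorithm R"), which keeps only `k` items in memory no matter how much data streams past. The `reservoir_sample` function and its parameters are my own illustrative names, not part of any Data Lake tool.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Only k items are held in memory, so the stream can be arbitrarily long.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an old one
            # only if that slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Take five records from a (pretend) million-record stream.
print(reservoir_sample(range(1_000_000), 5, seed=42))
```

If the sample turns up something interesting, that is the point at which it becomes worth writing a proper query against the full data.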
In Azure, Microsoft has provided tools that are specific to exploring a Data Lake. Azure Data Lake Analytics frees you from system and hardware tuning and allows you to get right down to writing queries using U-SQL.
Using Azure, you will never be short of power. Azure Data Lake Analytics will scale from a small pond (gigabytes) to a vast lake (exabytes) of data.
Writing, debugging and tuning can all be done in the Studio, which I hope you are familiar with by now.
Finally, using Azure Data Lake Analytics is very cost-effective. You only pay for the processing time you use.
A free tutorial on exactly how to fish in a Data Lake can be found here: https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-get-started-portal/
In my next blog post, I will paddle over to the Data Lake Store to see how that is used.