“Data Lake” is a term coined in the last decade by James Dixon, CTO of Pentaho, and this form of storage grew with the advent of Big Data. Both the name and the analogy are his. The idea behind a Data Lake is to have a place to store massive amounts of raw data that can be used in a variety of ways, depending on the need.
Data flows into the lake like a stream, filling it up. In its raw form, the lake isn’t good for anything specific other than wading around. But a user can examine the data, take samples, filter out the parts that are needed, or troll through it looking for and gathering the bits of data that are helpful for a specific purpose. The data can be cleaned and packaged into something that is useful to a consumer - think of a bottle of water. The bottled water is like a Data Warehouse. The source of that water is the Data Lake.
Unlike a Data Warehouse, all the water, including contaminants, fish, and everything else, flows into the Data Lake. It does not matter what the source was or how the data was structured when it arrived. It is stored in the Data Lake in its raw form, at the leaf level. Because the data contains everything, some people prefer to call it a Data Swamp.
Because the data is unstructured, it can be kept on inexpensive servers and storage devices. Terabytes of data that may or may not ever be used are kept virtually indefinitely, because the economics of storage allow for it.
The nature of a Data Lake encourages exploration. Someone with a specific need that is not satisfied by the Data Warehouse can dive in and play. Very few resources are required simply to explore or troll the Lake. If something useful is found, a more structured analysis can be devised.
In Azure, Microsoft has provided tools specifically for exploring a Data Lake. Azure Data Lake Analytics frees you from system and hardware tuning and lets you get right down to writing queries in U-SQL.
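To give a feel for the language, here is a minimal sketch of a U-SQL script, along the lines of the one in the tutorial linked at the end of this post. It assumes the SearchLog.tsv sample file (and its column layout) that the tutorial uses; adjust the paths and schema for your own data.

// Scoop a sample out of the lake: read the raw search log.
// (Assumes the SearchLog.tsv sample data and its seven-column
// layout from the getting-started tutorial linked below.)
@searchlog =
    EXTRACT UserId      int,
            Start       DateTime,
            Region      string,
            Query       string,
            Duration    int?,
            Urls        string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Filter out the part we need - queries from one region.
@ukQueries =
    SELECT UserId, Start, Query
    FROM @searchlog
    WHERE Region == "en-gb";

// Bottle the result: write it back out as a tab-separated file.
OUTPUT @ukQueries
    TO "/output/SearchLog-en-gb.tsv"
    USING Outputters.Tsv();

Submitted as a job, a script like this runs in parallel across however many Analytics Units you assign to it. The rowset expressions compose like SQL, while the expressions inside them use C# syntax and types.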
Using Azure, you will never be short of
power. Azure Data Lake Analytics will
scale from a small pond (gigabytes) to a vast lake (exabytes) of data.
Writing, debugging and tuning can all be
done in the Studio, which I hope you are familiar with by now.
Finally, using Azure Data Lake Analytics is very cost-effective. You pay only for the processing time you use.
A free tutorial on exactly how to fish in a
Data Lake can be found here: https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-get-started-portal/
In my next blog, I will paddle over to the
Data Lake Store to see how that is used.