Wednesday, September 14, 2016

Introduction to HDInsight

Again, I would like to start by bringing you up to date on some Microsoft news you may not have heard.  According to a recent Microsoft news release, out of the almost $12 billion they spent on R&D in 2015, over $1 billion was dedicated to “cybersecurity efforts”.  With their focus on developing Azure, clients should feel secure in storing their data in the cloud.  

Now, let’s answer the question on everyone’s mind:  What is a “hadoop” and will it level up to a Squirtle?

For decades, Microsoft fought the onslaught of open source software.  Who could have guessed that Microsoft would now embrace it?  This is just another example of Microsoft’s commitment to providing the best solutions in their Azure cloud.  Rather than using their old playbook and only providing Microsoft-branded solutions, they have adopted a suite of open source products managed by the Apache Software Foundation.

If you are not comfortable leaving the Microsoft family, the Data Lake Analytics system will let you swim in your lake using the familiar U-SQL.  But if you want to swim using the tools that big data crunchers have adopted, then head over to HDInsight in Azure Cloud.

HDInsight is Microsoft’s cloud service that provides a managed platform for Apache Hadoop, Spark, R, HBase, Giraph, and Storm.  With the exception of R, these projects are managed by an open source foundation, the Apache Software Foundation.

HDInsight brings all these powerful tools together in one place.  Combined, they allow you to manage and make sense of any big data project.

Hadoop is an open source program designed to work with very large data pools.  It provides a framework for distributed storage and processing.  The framework for how Hadoop works originates from a couple of Google research papers that were released in 2003 and 2004.  Development of Hadoop didn’t start until 2006.  Douglass Cutting headed up the project and named the software after his son’s toy elephant.  The Hadoop platform makes big data easier to manage.  (Sorry, it turns out it isn’t a Pokemon, so you will have to continue your search for a Squirtle elsewhere.)
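To make "distributed processing" a little less abstract, here is a minimal sketch of the map, shuffle, and reduce idea described in the 2004 Google paper.  It is plain single-machine Python, not Hadoop's actual API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this illustration.  On a real cluster, the map and reduce steps would be spread across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Hypothetical driver: on a cluster, each document (or block of a file)
    # would be mapped on a different node instead of in this loop.
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    docs = ["big data is big", "data lakes hold data"]
    print(word_count(docs))
```

The point of splitting the work this way is that each map call only looks at its own document, so the calls can run in parallel without talking to each other.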

Spark is a very fast engine used for large-scale data processing.  It will run on top of Hadoop and allow you to write applications in Java, Scala, Python, and R.  According to the Spark project, applications can run up to 100 times faster in Spark, when working in memory, than they would in Hadoop MapReduce.

R is a programming language designed for statistical computing and graphics, and it can use data lakes as its data source.  If you are working with big data, R is one of the languages you will need to familiarize yourself with.

HBase is a non-relational, distributed database.  It is designed to provide a fault-tolerant way of storing large quantities of sparse data.  It doesn’t replace SQL, and, in fact, Apache Phoenix provides a SQL layer for HBase if you need one.  HBase is a very efficient storage database that is tuned for use with big data.
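If "sparse data" sounds fuzzy, here is a rough sketch of the idea: only cells that actually hold a value are stored, so a row with two columns costs nothing for the thousands of columns it doesn't use.  This is a toy in-memory class, not the HBase client API.  The SparseTable class and the sample rows are hypothetical, although the "family:qualifier" style of column name does follow HBase's convention.

```python
class SparseTable:
    """Toy table that stores only the cells that actually have a value."""

    def __init__(self):
        # row key -> {"family:qualifier": value}; missing cells take no space
        self._rows = {}

    def put(self, row_key, column, value):
        """Store one cell, creating the row on first use."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column, default=None):
        """Return one cell, or `default` if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column, default)

    def cell_count(self):
        """Number of cells physically stored across all rows."""
        return sum(len(cells) for cells in self._rows.values())


if __name__ == "__main__":
    table = SparseTable()
    table.put("user1", "info:name", "Ada")
    table.put("user2", "info:email", "grace@example.com")
    print(table.get("user1", "info:name"))
    print(table.get("user1", "info:email"))
    print(table.cell_count())
```

A relational table with the same two users would carry empty name and email slots for each row.  Here nothing is stored for the gaps.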

Giraph is a graph processing platform designed to work with big data.  Graph structures, such as flight routes connecting airports, computers connected via the internet, and hypertext links between web pages, all show relational structures that can provide insight into how systems work and how to make them more efficient.  Giraph hit the mainstream when Facebook published a paper showing how they used it to analyze a graph with one trillion edges (links) in under four minutes per iteration.
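As a small taste of what "graph processing" means, here is a sketch of PageRank, a well-known graph algorithm, run over a handful of flight routes.  This is ordinary single-machine Python and does not use Giraph's API.  The pagerank function, its parameters, and the airport codes are all assumptions picked for illustration.

```python
def pagerank(edges, iterations=50, damping=0.85):
    """Rank the vertices of a directed graph given as (source, target) pairs."""
    vertices = sorted({vertex for edge in edges for vertex in edge})
    out_links = {vertex: [] for vertex in vertices}
    for source, target in edges:
        out_links[source].append(target)

    n = len(vertices)
    rank = {vertex: 1.0 / n for vertex in vertices}
    for _ in range(iterations):
        # Every vertex starts the round with its "random jump" share.
        incoming = {vertex: (1.0 - damping) / n for vertex in vertices}
        for vertex in vertices:
            # A vertex with no outgoing edges spreads its rank over all vertices.
            targets = out_links[vertex] or vertices
            share = damping * rank[vertex] / len(targets)
            for target in targets:
                incoming[target] += share
        rank = incoming
    return rank


if __name__ == "__main__":
    # Hypothetical routes: three airports fly into ORD, and ORD flies to JFK.
    routes = [("SEA", "ORD"), ("LAX", "ORD"), ("JFK", "ORD"), ("ORD", "JFK")]
    ranks = pagerank(routes)
    for airport in sorted(ranks, key=ranks.get, reverse=True):
        print(airport, round(ranks[airport], 3))
```

Each round, every vertex passes a share of its score along its outgoing edges.  That "send a message along each edge, then repeat" pattern is the kind of work a graph platform spreads across a cluster.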

Storm is a system that allows for real-time processing of unbounded streams of data.  Using Storm, you can perform real-time analytics on your data stream.
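Here is a minimal sketch of what processing an unbounded stream can look like: a rolling count over the most recent events, updated as each one arrives, without ever waiting for the stream to end.  It is plain Python and does not use Storm's API.  The rolling_counts function and the window_size parameter are hypothetical names for this example.

```python
from collections import Counter, deque


def rolling_counts(events, window_size=3):
    """Yield counts over the most recent `window_size` events after each arrival."""
    window = deque()
    counts = Counter()
    for event in events:
        window.append(event)
        counts[event] += 1
        if len(window) > window_size:
            # The oldest event falls out of the window, so undo its count.
            expired = window.popleft()
            counts[expired] -= 1
            if counts[expired] == 0:
                del counts[expired]
        yield dict(counts)


if __name__ == "__main__":
    clicks = ["home", "cart", "home", "checkout"]
    for snapshot in rolling_counts(clicks, window_size=3):
        print(snapshot)
```

Because the function is a generator, it works the same way on a list of four events or on a source that never stops producing them.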
