Thursday, October 13, 2016

Data Factory

What is it and what is it used for?

Microsoft in the News:

As a netizen, odds are that there are few things that you like more than your cat (or dog, depending on your proclivities).  Some even believe that the internet was created for pet lovers to share photos, videos and stories.  So, this week I thought I’d lighten things up with something I stumbled across and thought was an interesting IoT device.  All work and no play … actually kinda describes my life lately.  Anyway, I want to tell you about a device that allows you to track your pet.  It’s called G-Paws.

G-Paws is a device that attaches to your pet’s collar.  It doesn't track them in real time, since that would mean a lot more weight and some kind of subscription service.  It will, however, allow you to download the stored data to the G-Paws website, which is hosted on Azure.  The download can be done through your smartphone or your computer.  The G-Paws website uses Azure’s Internet of Things services to store and process the data and give you a visual presentation of what your little fluff ball has been up to.  Perhaps the Internet of Things will become useful to the average person after all?

The steady stream of structured and unstructured data that comes in from all of G-Paws’ customers needs to be automatically processed and then presented back to the client in a meaningful format.  In order to automate this, G-Paws set up a data factory in Azure.

Now, put on your hard hat; we are going to stroll through the factory.

As we all know, a factory is a place where a steady stream of raw material is brought in and processed in order to produce a steady stream of finished product.  The materials don’t all necessarily enter the same pipeline.  The parts to build the chassis of a car will go in one pipeline and the parts to build the engine will enter a different pipeline.  At some point within the factory, the finished product from one pipeline (the engine) is combined with the product from the other pipeline (the chassis) to produce the final output.

A Data Factory does the same thing.  The raw material comes in initially as a stream.  With a little processing, some, most, or perhaps all of that data is fed into a specific pipeline that is directed towards one or more processes that will take place within the Data Factory. 

Other data may be fed into a different pipeline and undergo a different process.  Each process may require a series of transformations, or perhaps just a single one.  Some of the processes may run in parallel, others in series.  These are all things that you will define as you build your factory.

The data is processed through one or more pipelines, and when it reaches the end, it is combined to produce the usable finished product.  The factory contains all the processes necessary to automatically produce a steady stream of finished products: in this case, processed data that is usable by the client.
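The flow described above can be sketched in a few lines of Python.  This is just a toy illustration of the concept, not how you actually build a Data Factory (real Azure Data Factory pipelines are defined declaratively in the Azure portal, not in Python), and every name below is invented for the example:

```python
# Toy illustration: two independent pipelines each transform their own
# input stream, and the finished products are combined at the end --
# just like the chassis and engine pipelines in the factory analogy.

def pipeline(*transforms):
    """Compose a series of transformations into a single pipeline."""
    def run(data):
        for transform in transforms:
            data = transform(data)
        return data
    return run

# Pipeline 1: clean up raw location readings (two transformations in series).
gps_pipeline = pipeline(
    lambda records: [r for r in records if r],            # drop empty readings
    lambda records: [r.strip().upper() for r in records], # normalize the rest
)

# Pipeline 2: tally activity events (a single transformation).
count_pipeline = pipeline(
    lambda events: {e: events.count(e) for e in set(events)},
)

# Each pipeline processes its own raw stream...
locations = gps_pipeline(["park ", "", "home", "garden "])
activity = count_pipeline(["walk", "nap", "walk"])

# ...and the outputs are combined into the finished product for the client.
report = {"locations": locations, "activity": activity}
print(report)
```

The key idea the sketch captures is that each pipeline is just an ordered chain of transformations, and the factory is the thing that wires the pipelines together and merges their outputs.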

Don’t you just love it when analogies from the real world we are all familiar with translate so nicely into the digital world?

Microsoft has a number of tutorials that will walk you through the process of building some sample Data Factories.  The really nice thing about Azure is that it provides you with all kinds of raw materials and tools to let you play for free.  You can learn to build a Data Factory knowing that there are no hazardous materials or red tape that may impede your progress.  Just some fun to be had while learning a new skill.

If you are ready to get started, here are some links to some tutorials:

Process data using Hadoop cluster:
Copy data from Blob Storage to SQL:
Move your data to the cloud:
