I talked recently about how choosing the right algorithm involves
a lot of trial and error. Using the Algorithm Cheat Sheet provided by Microsoft
is a great start, but having a basic understanding of how the various
algorithms work will certainly help guide you towards the one that is right for
your situation.
There are three basic types of ML algorithms: Supervised, Unsupervised, and
Reinforcement Learning. Azure ML works with Supervised and
Unsupervised. Reinforcement Learning is primarily used in robotics to
teach the robot how to do something. Self-driving cars are a great
example of where reinforcement learning is used.
Supervised ML is by far the most popular, and that is reflected in the fact that all but
one algorithm in Azure is Supervised. Supervised algorithms are
“predictive”. They are used to predict a future outcome based on
historical data. With this type of algorithm, you must be clear about what
you want to learn and how to go about learning it. You supervise the
process. You get to choose what data is most important. Because you
are only guessing as to what data is most important, you would run several experiments
using a variety of combinations of that data.
There are three subsets of Supervised algorithms: Classification,
Regression, and Anomaly Detection. You would choose a Classification model if you
are trying to choose between different things: Red or Blue; aquatic or non-aquatic;
Small, Medium or Large; a team in a tournament.
Regression algorithms are used when you need to predict a number or value. This
could be a sale percentage, a stock price, or a number of units sold.
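
To make the difference between the two concrete, here is a minimal sketch in plain Python. It is not how Azure ML does it; the toy data and the helper names `classify_size` and `fit_line` are invented for illustration. The point is only that the classifier hands back a category while the regression hands back a number.

```python
# Toy illustration only: the data and function names are invented for this sketch.

def classify_size(weight_kg, labeled_examples):
    """Classification: return the label of the closest known example (1-nearest neighbor)."""
    nearest = min(labeled_examples, key=lambda ex: abs(ex[0] - weight_kg))
    return nearest[1]

def fit_line(xs, ys):
    """Regression: ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Historical data with known answers (this is what makes it "supervised").
size_examples = [(2.0, "Small"), (10.0, "Medium"), (30.0, "Large")]
ad_spend = [1.0, 2.0, 3.0, 4.0]
units_sold = [12.0, 22.0, 32.0, 42.0]

predicted_label = classify_size(8.5, size_examples)   # a category
slope, intercept = fit_line(ad_spend, units_sold)
predicted_units = slope * 5.0 + intercept             # a number

print(predicted_label, predicted_units)
```

In both cases the historical data already contains the answer you care about (the size label, the units sold), and the algorithm learns to reproduce it for new inputs.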
Anomaly detection algorithms are used when you are looking for the outliers in a set of
data. This is when you have a large set of data where most of it is as
you would expect or want, but you want to find out where you can expect
trouble. This type of algorithm is commonly used in fraud
detection. The algorithm will learn what “normal” looks like, and then
separate out the data points that do not follow that normal pattern.
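
As a rough sketch of the "learn normal, then flag what does not fit" idea, the snippet below uses a simple standard-deviation rule. This is a stand-in technique chosen for brevity, not the algorithm Azure ML uses, and the transaction amounts, the function names, and the threshold of 3 are all assumptions made up for the example.

```python
import statistics

def learn_normal(amounts):
    """Summarize what "normal" looks like as a mean and a standard deviation."""
    return statistics.mean(amounts), statistics.stdev(amounts)

def find_anomalies(amounts, mean, stdev, threshold=3.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    return [a for a in amounts if abs(a - mean) > threshold * stdev]

# Invented history of ordinary transaction amounts.
history = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 18.0, 22.0, 21.0, 24.0]
mean, stdev = learn_normal(history)

# New transactions to check against the learned picture of "normal".
new_transactions = [21.0, 19.5, 480.0, 22.5]
suspicious = find_anomalies(new_transactions, mean, stdev)

print(suspicious)
```

With this data only the 480.0 transaction falls outside the learned range; the rest look like business as usual.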
Unsupervised algorithms are “descriptive” models. They are used to find patterns in
what looks like random data. For example, a retailer could use an
unsupervised algorithm to figure out what combinations of products are often
purchased together. A medical authority may use one to see which diseases are
likely to occur when one specific disease is present. In this model, you have no
specific target in mind and you do not pick or choose any specific features as
particularly important.
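
The retailer example can be sketched in a few lines of Python. This is a deliberately naive pair count, not a full association-rules or clustering algorithm, and the baskets and the `count_pairs` helper are made up for illustration. Notice that there is no target column anywhere: the code just reports what tends to show up together.

```python
from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    """Count how often each pair of products appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort so ("bread", "milk") and ("milk", "bread") count as the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Invented shopping baskets.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
]
pair_counts = count_pairs(baskets)
top_pair, top_count = pair_counts.most_common(1)[0]

print(top_pair, top_count)
```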
Another consideration when choosing an algorithm is how linear your data is. If
you try to generalize a non-linear data set, like traffic congestion over the
course of a day, you are not going to get an accurate picture of traffic
patterns. When you are dealing with non-linear data with peaks and valleys
like rush hour and 3 AM traffic, you need to choose an algorithm that will not
reduce the problem down to an average. Having said that, linear
algorithms are still a great place to start. They are simple and fast to
train, so they can give you a quick overview of your data. In addition,
if you are trying too hard to be accurate, you may end up over-fitting.
This is where your model is built to fit your limited set of data
points too closely. You end up describing the random error or noise instead of the
underlying relationships. Sometimes a set of data that is best described
by a linear function does not look linear due to the outliers in the
data. The outliers may cluster, giving it a non-linear feel.
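
Here is a small sketch of the traffic problem, assuming an invented set of hourly traffic counts with a morning and an evening rush hour. It fits a straight line with ordinary least squares (the `fit_line` helper is written out by hand for the example) and compares the line to the real values at 8 AM and 3 AM.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented traffic counts for hours 0 through 23, peaking at 8 AM and 5 PM.
hours = list(range(24))
traffic = [5, 4, 3, 3, 5, 15, 40, 80, 100, 70, 45, 40,
           42, 40, 42, 50, 75, 100, 85, 50, 30, 20, 12, 8]

slope, intercept = fit_line(hours, traffic)

def line_at(hour):
    """Traffic level the straight-line model predicts for a given hour."""
    return slope * hour + intercept

rush_hour_error = traffic[8] - line_at(8)   # how far the line undershoots 8 AM
quiet_hour_error = line_at(3) - traffic[3]  # how far the line overshoots 3 AM

print(round(line_at(8), 1), traffic[8])
print(round(line_at(3), 1), traffic[3])
```

The line stays close to the daily average the whole way through, so it badly underestimates rush hour and overestimates the middle of the night. It trains instantly and tells you roughly how busy the road is overall, which is about all a linear model can say about data shaped like this.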
When setting up your experiment, you get to play with the various parameters that
affect the algorithm’s behavior. Choosing things like error tolerance or
number of iterations can have an enormous effect. The more
parameters you choose, the more trial and error you introduce to the
experiment. Azure ML has a parameter sweeping feature that automatically
tries all parameter combinations (you choose the granularity). Keep in
mind that the time required for a sweep increases exponentially with the
number of parameters, because every value of each parameter is tried with
every value of the others.
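
To show why sweeps get expensive, here is a minimal grid sweep in plain Python. It is not the Azure ML feature or its API; the `sweep` function, the parameter names, and the `fake_score` function that stands in for actually training a model are all hypothetical.

```python
from itertools import product

def sweep(grid, train_and_score):
    """Try every combination in `grid`; return (best_params, best_score, runs)."""
    names = list(grid)
    best_params, best_score, runs = None, float("-inf"), 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        runs += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, runs

def fake_score(params):
    # Stand-in for training a model: scores best at tolerance=0.01, iterations=100.
    return -abs(params["tolerance"] - 0.01) - abs(params["iterations"] - 100) / 1000

# The "granularity" is how many values you list for each parameter.
grid = {
    "tolerance": [0.1, 0.01, 0.001],
    "iterations": [10, 100, 1000],
}
best_params, best_score, runs = sweep(grid, fake_score)

print(best_params, runs)
```

Two parameters with three values each means 9 training runs. Add a third parameter with three values and it becomes 27, a fourth makes it 81, and every one of those runs is a full training pass.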