As you peruse your options in Azure ML, it is helpful
to understand what it is you are looking at. For those of us who are not
Data Scientists, this world has a lot of new language and processes to learn.
Understanding how and when to use the various types of ML algorithms is your
entrance key to this world. Machine learning is all about finding the
structures hiding within data sets (pattern recognition). The ML
algorithm you need will depend on the nature of the data and the structure you
are looking for.
As I mentioned in my last post, Linear Regression
algorithms are the workhorses of this industry. When you have two
variables that are related, they are easily represented in the X-Y plane and
then reduced to a linear equation. Your Y variable is the dependent
variable that you want to predict. The X variable is independent and is
the “seed” that you use to predict Y. Linear relationships are the
simplest and easiest to work with. Even if they are not “true”
relationships, they are often close enough to be useful.
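As a quick illustration, here is a minimal sketch in Python using scikit-learn (my choice of tool for the example, not something specific to Azure ML, and the numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: X is the independent "seed" variable,
# y is the dependent variable we want to predict.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 57, 61, 68, 71])

model = LinearRegression().fit(X, y)

# The fit reduces the relationship to a linear equation, y = mx + b.
print(f"y = {model.coef_[0]:.1f}x + {model.intercept_:.1f}")
print(model.predict([[6]]))  # predict y for a new x value
```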
Logistic Regression is used when you have
two mutually exclusive possible outcomes (usually described as “dichotomous”
or “binary”) that can result from some explanatory variables. This is
where you want your results described by discrete classifications.
Unlike in linear regression where your dependent variable can have a range of
correct answers, in logistic regression, the dependent variable falls into unique
categories. You may want to predict whether a shopper who buys a certain
basket of goods is male or female. The government may look
at a wide variety of data to predict whether a person will be violent or
non-violent. Students might be interested in the likelihood of passing
or failing an exam, given a variety of possible hours spent
studying. Logistic regression can be considered a close relative of
the linear regression algorithm: it fits a linear model and then maps the
result to a probability between 0 and 1. It is fast and simple to use, just like
linear regression. The results are similar to a linear approximation, so,
again, you have to contend with approximate results. The resultant curve
looks like an “S” with the transition from one state to the other taking place
in the middle of the graph.
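Sticking with the exam example above, here is a minimal sketch using scikit-learn (again my choice of tool, with invented hours-studied/pass data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied -> pass (1) or fail (0).
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5],
              [3.0], [3.5], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Plotting P(pass) against hours studied traces the "S" curve
# described above, with the transition in the middle of the data.
print(model.predict([[2.25]]))        # predicted class: pass or fail
print(model.predict_proba([[2.25]]))  # [P(fail), P(pass)]
```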
Decision Trees classify data by sorting it down the tree from
the root to a final “leaf” node. Each node specifies a test of some
attribute, and each branch corresponds to one of the possible values for this
attribute. This process is repeated until a leaf node is
reached. Decision trees are commonly used to classify medical patients by
their disease, equipment malfunctions by their cause, and loan applicants by
their likelihood of defaulting. Each time a node is split by a test,
the data set is also split. So, for example, if you are looking to find
the likelihood of a person defaulting on a loan, you could develop a decision
tree that splits your data first into male/female, then over/under 30 years
old, then over/under $50K gross income, etc. You enter your data at the root
and begin splitting it up. At the final node, using your historical data, you
can assign a likelihood that people similar to this will default on a loan.
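Here is a minimal sketch of a single tree using scikit-learn (not Azure ML specific; the applicant data is invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented loan data: [age, gross income]; label 1 = defaulted, 0 = repaid.
X = [[25, 30000], [45, 80000], [35, 40000], [52, 95000],
     [23, 28000], [40, 60000], [29, 33000], [60, 70000]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned test at each node (e.g., "income <= ...").
print(export_text(tree, feature_names=["age", "income"]))

# Estimated likelihood of default for a new 30-year-old earning $45K.
print(tree.predict_proba([[30, 45000]]))
```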
can be fairly weak in their predictive value. To remedy this, data
scientists will use a Decision Forest. The Decision Forest will use
several decision trees with a variety of different test nodes. No two
trees are alike, and each tree individually may be weak in its predictive
value. By averaging the predictions from all the trees, the “Forest”
tends to have a much stronger correlation to the predicted outcomes.
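And a sketch of the forest idea, using scikit-learn's RandomForestClassifier (one common implementation of a decision forest) on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for loan data: 1,000 applicants, 6 features.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One tree versus a forest of 100 trees, each trained on a random
# sample of the data (so no two trees are alike).
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy:", tree.score(X_test, y_test))
print("forest accuracy:     ", forest.score(X_test, y_test))
```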
Next week I will look at
Algorithms part two: neural networks, SVMs, and Bayesian methods.