Thursday, May 26, 2016
Algorithms part one: seeing the forest through the trees.
As you peruse your options in Azure ML, it helps to understand what you are looking at. For those of us who are not data scientists, this world has a lot of new language and processes to learn. Understanding how and when to use the various types of ML algorithms is your key to entering this world. Machine learning is all about finding the structures hiding within data sets (pattern recognition). The ML algorithm you need will depend on the nature of the data and the structure you are looking for.
As I mentioned in my last post, Linear Regression algorithms are the workhorses of this industry. When you have two related variables, you can plot them on the X-Y plane and approximate the relationship with a linear equation. Y is the dependent variable, the one you want to predict. X is the independent variable, the “seed” you use to predict Y. Linear relationships are the simplest and easiest to work with. Even when the true relationship is not exactly linear, a linear fit is often close enough to be useful.
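To make the X and Y idea concrete, here is a minimal sketch of a least-squares line fit in plain Python. The data (hours studied versus exam score) and the function name fit_line are made up for illustration; this is not how Azure ML implements its module, just the textbook formula.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of X and Y, and variance of X (both left unnormalized;
    # the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (X) and exam score (Y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 69, 75]

slope, intercept = fit_line(hours, scores)

# Use the fitted line to predict Y for a new X (6 hours of study).
predicted = slope * 6 + intercept
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 6 hours: {predicted:.1f}")
```

The real scores do not sit exactly on the line, but the line is close enough to give a usable prediction, which is the whole point of the approach.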
Logistic Regression is used when some explanatory variables lead to one of two mutually exclusive outcomes (usually described as “dichotomous” or “binary”). This is where you want your results described by distinct classifications. Unlike linear regression, where the dependent variable can take any value in a continuous range, in logistic regression the dependent variable falls into discrete categories. You may want to predict whether a shopper who buys a certain basket of goods is male or female. The government may look at a wide variety of data to predict whether a person will be violent or non-violent. Students might be interested in the likelihood of passing or failing an exam, given the number of hours spent studying. Logistic regression can be thought of as a close relative of linear regression: it takes a linear combination of the explanatory variables and passes it through an S-shaped function to produce a probability between 0 and 1. It is fast and simple to use, just like linear regression. As with a linear fit, the model is an approximation, so again you have to contend with approximate results. The resulting curve looks like an “S”, with the transition from one state to the other taking place in the middle of the graph.
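The pass/fail example can be sketched in a few lines of plain Python. The data below is invented, and the fitting method (simple gradient descent with a hand-picked learning rate and step count) is my own assumption for illustration; real tools typically use more sophisticated solvers.

```python
import math


def sigmoid(z):
    """The S-shaped curve: maps any number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y  # predicted probability minus truth
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical data: hours studied, and whether the student passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)

p_low = sigmoid(w * 1.0 + b)   # probability of passing after 1 hour
p_high = sigmoid(w * 5.0 + b)  # probability of passing after 5 hours
print(f"P(pass | 1 hour)  = {p_low:.2f}")
print(f"P(pass | 5 hours) = {p_high:.2f}")
```

The output is a probability, not a category. To get a pass/fail answer you pick a cutoff (0.5 is the usual default) and classify everything above it as a pass.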
Decision Trees classify data by sorting it down the tree from the root to a final “leaf” node. Each node specifies a test of some attribute, and each branch corresponds to one of the possible values of that attribute. This process is repeated until a leaf, which holds the final classification, is reached. Decision trees are commonly used to classify medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting. Each time a node is split by a test, the data set is split as well. So, for example, if you are looking to find the likelihood of a person defaulting on a loan, you could develop a decision tree that splits your data first into male/female, then over 30 years old/under 30 years old, then over $50K gross income/under $50K gross income, etc. You enter your data at the root and begin splitting it up. At the final node, using your historical data, you can assign a likelihood that people similar to this one will default on a loan. Individual trees like this can be fairly weak in their predictive value. To remedy this, data scientists will use a Decision Forest. The Decision Forest uses several decision trees with a variety of different test nodes. No two trees are alike, and each tree individually may be weak in its predictive value. By combining the predictions of all the trees (for example, by averaging them or taking a majority vote), the “Forest” tends to match actual outcomes much more closely than any single tree.
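Here is a rough sketch of the loan example in plain Python. Everything in it is hypothetical: the two trees are written by hand rather than learned from data, the default rates at the leaves are invented, and the helper names (node, leaf, predict_tree, predict_forest) are mine. A real Decision Forest would build many trees automatically from historical records.

```python
def node(test, if_true, if_false):
    """An internal node: a test on one attribute plus two branches."""
    return {"test": test, "yes": if_true, "no": if_false}


def leaf(rate):
    """A leaf: the default rate observed for people who end up here."""
    return {"default_rate": rate}


# Tree A splits on age first, then on income. All rates are made up.
tree_a = node(
    lambda p: p["age"] >= 30,
    node(lambda p: p["income"] >= 50_000, leaf(0.05), leaf(0.15)),
    node(lambda p: p["income"] >= 50_000, leaf(0.10), leaf(0.30)),
)

# Tree B uses a different arrangement of tests, so it disagrees slightly.
tree_b = node(
    lambda p: p["income"] >= 50_000,
    leaf(0.08),
    node(lambda p: p["age"] >= 30, leaf(0.20), leaf(0.25)),
)


def predict_tree(tree, person):
    """Walk from the root down to a leaf, following one branch per test."""
    current = tree
    while "default_rate" not in current:
        current = current["yes"] if current["test"](person) else current["no"]
    return current["default_rate"]


def predict_forest(trees, person):
    """Combine the trees by averaging their individual predictions."""
    return sum(predict_tree(t, person) for t in trees) / len(trees)


applicant = {"age": 27, "income": 42_000}
single = predict_tree(tree_a, applicant)
forest = predict_forest([tree_a, tree_b], applicant)
print(f"tree A alone: {single:.3f}")
print(f"forest of 2:  {forest:.3f}")
```

With only two trees the averaging does not buy much, but the mechanics are the same with hundreds of trees: each one gets a vote, and the individual errors tend to cancel out.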
Next week I will look at Algorithms part two: neural networks, SVMs, and Bayesian methods.