Decision trees are a graphical method of supporting the decision-making process. The decision tree algorithm is also used in machine learning to derive knowledge from examples. A decision tree is built iteratively from the root to the leaves: in each iteration a node is added that splits the data on a suitably chosen attribute. Attributes are chosen by an attribute selection measure so as to maximize the information gain at a given node. The process ends when all the objects in a leaf belong to the same class or when no attributes are left to split on. Growing a complete tree carries a high risk of over-fitting, which is a common problem with the basic decision tree and can hurt its results, so it is good practice to prune the tree to a fixed depth. Over-fitting can also be avoided in other ways by building a modified version of the algorithm.
from IPython.display import Image
Image(filename="img/tree.png")
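As a side note, the depth of a single tree can be limited in scikit-learn with the max_depth parameter; the following minimal sketch (using the iris data that also appears later in this section, with an arbitrary depth of 3) shows the idea:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris_data = load_iris()
# limiting max_depth prunes the tree to a fixed depth, which helps against over-fitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris_data.data, iris_data.target)
tree.get_depth()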
Random forests are a generalization of the idea of decision trees as an ensemble method. A random forest classifies an observation with a group of decision trees, and the final decision is made by majority voting over the classes indicated by the individual trees. Each tree is constructed on a bootstrap sample, created by drawing N objects with replacement from a training set of size N. In each node, the split is made using only k randomly selected features, where k is much smaller than p, the number of all features. Thanks to this property, random forests can be used in problems with a huge number of features.
from IPython.display import Image
Image(filename="img/ensemble.png")
A random forest model is a collection of decision tree models that are combined to make predictions. When you create a random forest, you need to decide how many decision trees to use to build the model. The random forest algorithm takes random samples of observations from the training data and builds a decision tree for each sample. Random samples are usually taken with replacement, which means that the same observation can be drawn multiple times. The end result is a set of decision trees that are created from different subsets of observations of the original training data.
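As an illustration of sampling with replacement, a single bootstrap sample of indices can be drawn with NumPy; this is only a sketch of the idea, not what scikit-learn does internally:
import numpy as np
rng = np.random.default_rng(0)
N = 150  # size of the training set in this sketch
bootstrap_idx = rng.choice(N, size=N, replace=True)  # N draws with replacement
# some observations appear several times, others are left out entirely
len(np.unique(bootstrap_idx))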
Procedures for aggregating families of classifiers (ensemble methods) serve to strengthen them. These methods include (a short sketch of all three follows the list):
- bagging algorithm – groups of classifiers (not necessarily trees)
- boosting algorithm – groups of classifiers (not necessarily trees)
- random forests – groups of decision trees
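A minimal sketch of these three families in scikit-learn (parameter values are arbitrary and the models are fitted on the full iris data only to keep the example short):
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier
X, y = load_iris(return_X_y=True)
# bagging: classifiers (decision trees by default) trained on bootstrap samples, combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)
# boosting: classifiers trained sequentially, each focusing on the errors of the previous ones
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
# random forest: bagged decision trees with a random subset of features at each split
forest = RandomForestClassifier(n_estimators=50, random_state=0)
for model in (bagging, boosting, forest):
    model.fit(X, y)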
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
We will use a very popular data set – iris:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()
We will add a column with the name of the iris species:
df['name'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()
We will encode the species names as numbers and split the data into training and test sets:
df['name_cat'] = pd.factorize(df['name'])[0]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']], df['name_cat'],test_size=0.25)
And now it is time to build the model:
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(X_train, y_train)
clf.predict(X_test)
How reliable is the classifier for each observation?
clf.predict_proba(X_test[0:5])
We have three species of irises, so [1., 0., 0.] tells us that the classifier is sure that the plant belongs to the first class. Taking another example, [0.1, 0.4, 0.5] tells us that the classifier gives a 50% probability that the plant is in the third class, 40% in the second and 10% in the first class. Since 0.5 is greater than 0.4, the classifier predicts that the plant belongs to the third class.
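To make this explicit, the predicted class is simply the index of the largest probability in each row; a short sketch reusing the objects defined above:
proba = clf.predict_proba(X_test[0:5])
# the predicted class is the column with the highest probability
np.argmax(proba, axis=1)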
Class = iris.target_names[clf.predict(X_test)]
Class[0:5]
names = iris.target_names[y_test]
names[0:5]
Using a confusion matrix, we can see how good our classifier is. The columns are the species predicted by the classifier and the rows are the actual species of the test data. The values on the diagonal were classified correctly and everything outside the diagonal was classified incorrectly.
pd.crosstab(names, Class, rownames=['names'], colnames=['classification'])
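The same comparison can also be computed with scikit-learn's confusion_matrix (a sketch; it works on the numeric labels rather than the species names):
from sklearn.metrics import confusion_matrix
# rows: actual classes, columns: predicted classes
confusion_matrix(y_test, clf.predict(X_test))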
Random forests also offer a good way of ranking features. The scikit-learn library provides an additional indicator that shows the relative importance, or contribution, of each variable to the prediction. It automatically calculates the relevance of each feature during the training phase and then scales the importances so that they sum to 1. These scores help us select the most important variables and remove the least important ones when building the model. A random forest uses the Gini impurity (the measure of node impurity used when growing the trees) to calculate the significance of each feature; this Gini importance describes the overall explanatory power of a feature.
list(zip(X_train, clf.feature_importances_))
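For a more readable view, the importances can be wrapped in a pandas Series and sorted; a small sketch using the objects defined above:
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False)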
Clearly, petal width is the most important feature in the classification.
To summarize: random forests can be used for both classification and regression. They are also flexible and easy to use. A forest is made of trees, and the more trees, the stronger the forest. Random forests build decision trees on randomly selected data samples, take a prediction from each tree and select the final answer by voting. They also provide a pretty good feature importance rating.