We divide machine learning problems into two types: supervised and unsupervised. Both types are further divided; here we deal with the supervised ones, which have two subgroups – regression and classification. In regression problems, the value we want to predict is continuous. In classification problems, the value we want to predict is discrete. We take a closer look at a regression problem, using as example data several attributes that make up the price of a house – the price of the house being the continuous value we want to predict. The data is fictitious.
Every such problem is an optimization problem: we are trying to find the maximum or minimum of a specific function. The function we want to optimize is usually called the loss (or cost) function. A loss function is defined for each algorithm used and is the main measure for judging the accuracy of a trained model.
This simply means that our model’s loss is the sum of the squared distances between the predicted house price and the ground truth. This loss function is called the squared loss, or least squares. We want to minimize the loss function as much as possible so that the predictions are as close to the truth as possible. This is simple if we only have one independent variable, but with many variables optimizing the coefficients is not so simple.
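For illustration only (made-up numbers, not taken from the dataset used below), the squared loss can be computed directly:
import numpy as np

# Hypothetical true and predicted house prices (illustrative values)
y_true = np.array([300000, 450000, 210000])
y_pred = np.array([320000, 430000, 205000])

# Least squares loss: sum of squared differences between predictions and the truth
loss = np.sum((y_true - y_pred) ** 2)
print(loss)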
Having more variables may make it easier to train the model, but on the other hand the model might become over-trained (overfitted). The data needs to be explored and filtered for anomalies; some features can be more harmful than helpful, for example by repeating information already expressed by other features and adding a high level of noise to the dataset.
from IPython.display import Image
Image(filename="img/variance.png")
A model with multiple features:
y = β0 + β1X1 + β2X2 + β3X3
Overfitting problems are very common and there are various ways to avoid them. It is best to simplify the model without losing information, which is a trade-off between overfitting and over-simplifying. One of the most common mechanisms to avoid overfitting is called regularization. A regularized model is a model whose loss function contains an additional element that is also minimized. The loss function then has two elements: the sum of the distances between each prediction and its true value, and the regularization element, which sums the squared values of the β coefficients and multiplies them by another parameter λ. The result is a "penalty" on the loss function for high values of the β coefficients. Our task is to simplify the model as much as possible while still minimizing the loss function; by penalizing large β values we add a constraint that keeps them as small as possible.
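A rough sketch of such a regularized (ridge-style) loss, again with hypothetical numbers and an arbitrarily chosen λ:
import numpy as np

# Hypothetical prices (same illustrative values as before) and coefficients
y_true = np.array([300000, 450000, 210000])
y_pred = np.array([320000, 430000, 205000])
beta = np.array([120.0, 15000.0, 8000.0])
lam = 0.1  # regularization strength lambda, arbitrary here

# Regularized loss: squared error plus lambda times the sum of squared beta coefficients
loss = np.sum((y_true - y_pred) ** 2) + lam * np.sum(beta ** 2)
print(loss)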
from IPython.display import Image
Image(filename="img/bias.png")
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
data = pd.read_csv(r'C:\Users\VFW863\Desktop\en\regres.csv'); data.head()
We prepare the data for modeling – we divide it into independent (X) and dependent (y) variables:
X = data[['area', 'rooms', 'bathrooms', 'city']]; X.head(3)
y = data['price']; y.head(3)
Ridge Regression
Ridge regression is an extension of linear regression. The parameter λ is a scalar that has to be chosen, preferably using cross-validation. Ridge regression enforces lower β coefficients, but does not force them to be zero. This means that we do not get rid of irrelevant features, but rather minimize their impact on the trained model.
- Shrinks the coefficients, therefore it is mainly used to prevent multicollinearity.
- Reduces the complexity of the model by shrinking the coefficients.
- Uses the L2 regularization technique.
The Ridge() function has an alpha (λ) argument, which is used to tune the model. A higher alpha produces a simpler model.
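A small sketch on synthetic data (not the housing CSV) illustrating this: the higher the alpha, the more the Ridge coefficients are shrunk towards zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.rand(100)

# The same data fitted with increasing alpha gives smaller and smaller coefficients
for a in [0.01, 1.0, 100.0]:
    print(a, np.round(Ridge(alpha=a).fit(X_demo, y_demo).coef_, 3))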
from IPython.display import Image
Image(filename="img/ridge.png")
Lasso Regression
Lasso is another extension of linear regression; it differs from Ridge regression in that the regularization term uses the absolute value of the coefficients. The Lasso method overcomes a drawback of Ridge regression by not only penalizing high β coefficients, but setting them to zero if they are not significant (de facto removing them). Therefore, we may end up with fewer attributes in the model than at the beginning.
- Uses the L1 regularization technique.
- Usually used when we have many variables.
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
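As a sketch of this behaviour on synthetic data (the third feature below is pure noise): with a large enough alpha, Lasso sets its coefficient exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)          # the third column is unrelated to y
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.1 * rng.rand(100)

# With a stronger penalty the irrelevant coefficient is driven exactly to 0.0
for a in [0.001, 0.1]:
    print(a, np.round(Lasso(alpha=a).fit(X_demo, y_demo).coef_, 3))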
from IPython.display import Image
Image(filename="img/reg.png")
Elastic Net Regression
Elastic Net regression combines the features of Ridge and Lasso by using both types of norms; their relative weight is controlled by a mixing parameter ρ (called l1_ratio in scikit-learn).
1 / (2 * n_samples) * ||y - Xw||^2_2
+ alpha * l1_ratio * ||w||_1
+ 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2
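In scikit-learn this mixing parameter is exposed as l1_ratio; a minimal usage sketch (values chosen arbitrarily):
from sklearn.linear_model import ElasticNet

# l1_ratio=1.0 corresponds to a pure L1 (Lasso-style) penalty,
# l1_ratio=0.0 to a pure L2 (Ridge-style) penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5); enet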
from IPython.display import Image
Image(filename="img/lasso.png")
We split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
We define our models. For the regressions with built-in cross-validation (ridgecv and lassocv) we provide several candidate values for alpha, and later we will see which one is best. If we don't choose alpha, the default value of 1 is assigned; plain linear regression corresponds to alpha = 0. The higher the alpha value, the stronger the 'punishment' (regularization).
regres = LinearRegression(normalize=True); regres
ridge = Ridge(normalize=True); ridge
ridgecv = RidgeCV(normalize=True, alphas=[0.1, 1.0, 10, 15]); ridgecv
lasso = Lasso(normalize=True); lasso
lassocv = LassoCV(normalize=True, alphas=[0.1, 1.0, 10, 15]); lassocv
elastic = ElasticNet(normalize=True); elastic
We fit the models to the training data:
regres.fit(X_train, y_train)
ridge.fit(X_train, y_train)
ridgecv.fit(X_train, y_train)
lasso.fit(X_train, y_train)
lassocv.fit(X_train, y_train)
elastic.fit(X_train, y_train)
We can see which alpha was selected for ridgecv and lassocv:
print('Alpha - ridge:', ridgecv.alpha_)
print('Alpha - lasso:', lassocv.alpha_)
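As a side check (using the lassocv model fitted above), we can also inspect its coefficients; features whose coefficient is exactly zero have effectively been dropped by the L1 penalty:
# Map each feature name to its LassoCV coefficient
print(dict(zip(X.columns, np.round(lassocv.coef_, 2))))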
After fitting these six models, we check which one is best using the R2 value:
print('R2 Linear Regression:', np.round(regres.score(X_test, y_test),2))
print('R2 Ridge Regression:', np.round(ridge.score(X_test, y_test),2))
print('R2 RidgeCV Regression:', np.round(ridgecv.score(X_test, y_test),2))
print('R2 Lasso Regression:', np.round(lasso.score(X_test, y_test),2))
print('R2 LassoCV Regression:', np.round(lassocv.score(X_test, y_test),2))
print('R2 Elastic Net Regression:', np.round(elastic.score(X_test, y_test),2))
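For context, R2 compares a model against simply predicting the mean of y_test, so a negative value means the model does worse than that constant baseline. The same number can be computed by hand (shown here for the elastic model fitted above):
# R^2 = 1 - SS_res / SS_tot; it is negative whenever the model's squared error
# exceeds that of a constant model always predicting the mean of y_test
ss_res = np.sum((y_test - elastic.predict(X_test)) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print('R2 Elastic Net (by hand):', np.round(1 - ss_res / ss_tot, 2))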
The best R2 value is obtained by the elastic net model, but since it is still negative, the model is weak. Now we will use the predict method and then look at the error of each regression:
pred_regres = regres.predict(X_test)
pred_ridge = ridge.predict(X_test)
pred_ridgecv = ridgecv.predict(X_test)
pred_lasso = lasso.predict(X_test)
pred_lassocv = lassocv.predict(X_test)
pred_elastic = elastic.predict(X_test)
from sklearn import metrics
print('MSE Linear Regression:', np.round(metrics.mean_squared_error(y_test, pred_regres),2))
print('MSE Ridge Regression:', np.round(metrics.mean_squared_error(y_test, pred_ridge),2))
print('MSE RidgeCV Regression:', np.round(metrics.mean_squared_error(y_test, pred_ridgecv),2))
print('MSE Lasso Regression:', np.round(metrics.mean_squared_error(y_test, pred_lasso),2))
print('MSE LassoCV Regression:', np.round(metrics.mean_squared_error(y_test, pred_lassocv),2))
print('MSE Elastic Net:', np.round(metrics.mean_squared_error(y_test, pred_elastic),2))
The error is also the smallest for the elastic net model, so of all these models it is the best one, although still poor.
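Since MSE is expressed in squared price units, it can also be reported as RMSE, which is in the same units as the price itself (shown here only for the elastic net predictions from above):
# RMSE is simply the square root of the MSE
print('RMSE Elastic Net:', np.round(np.sqrt(metrics.mean_squared_error(y_test, pred_elastic)), 2))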