Polynomial regression is another regression model, useful when the relationship between the variables is not linear. In such cases a straight line may not describe the data well or predict new values accurately, and a curve may fit the data better.
The equation for a polynomial of degree n in a single variable x is:
y = β0 + β1x + β2x^2 + β3x^3 + … + βnx^n
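To make the formula concrete, here is a minimal sketch that evaluates a degree-2 polynomial with NumPy. The coefficients in `beta` are made up for illustration and are not estimated from any data.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical coefficients β0, β1, β2, ordered from lowest degree to highest
beta = [1.0, 2.0, 3.0]
x = np.array([0.0, 1.0, 2.0])

# Evaluate y = β0 + β1*x + β2*x**2 at each x
y = P.polyval(x, beta)
print(y)  # values: 1, 6, 17
```

Note that y is a linear function of the coefficients β, which is why an ordinary linear regression can fit the model once the powers of x are available as columns.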
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib inline
Our data are fictitious prices for renting conference rooms with different numbers of seats:
data = pd.read_csv(r'C:\Users\VFW863\Desktop\en\poly.csv')
data.head()
X = data[['seats']].values
y = data[['price']].values
Let’s see how the linear function fits the data:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
plt.scatter(X, y, color = 'm')
plt.plot(X, lin_reg.predict(X), color = 'g')
plt.title('Linear Regression')
plt.xlabel('seats')
plt.ylabel('price')
plt.show()
lin_reg.predict(np.array([140]).reshape(1, 1))
The linear model predicts a price of 1056 for 140 seats, which is lower than the real prices, so we should use a different model for this data. The plot also shows that the data is not linear: prices do not increase at the same rate as the number of seats in the room. Let’s try polynomial functions of degree 2, 3, and 4:
from sklearn.preprocessing import PolynomialFeatures
poly_reg2 = PolynomialFeatures(degree = 2)
X_poly2 = poly_reg2.fit_transform(X)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly2, y)
poly_reg3 = PolynomialFeatures(degree = 3)
X_poly3 = poly_reg3.fit_transform(X)
lin_reg_3 = LinearRegression()
lin_reg_3.fit(X_poly3, y)
poly_reg4 = PolynomialFeatures(degree = 4)
X_poly4 = poly_reg4.fit_transform(X)
lin_reg_4 = LinearRegression()
lin_reg_4.fit(X_poly4, y)
PolynomialFeatures expands the matrix X into X_poly, where each column contains a power of x (for degree 2: 1, x, and x^2):
X_poly2[:3]
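As a small illustration of this expansion, the sketch below applies PolynomialFeatures to two made-up seat counts (not values from poly.csv), so the resulting columns are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up seat counts, shaped as a single-column matrix
X_small = np.array([[2], [3]])

# Degree 2 yields three columns: 1 (bias), x, x**2
expanded = PolynomialFeatures(degree=2).fit_transform(X_small)
print(expanded)
```

The first row is 1, 2, 4 and the second is 1, 3, 9. The leading column of ones comes from the default `include_bias=True` setting.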
plt.scatter(X, y, color = 'y')
plt.plot(X, lin_reg_2.predict(X_poly2), color = 'r')
plt.plot(X, lin_reg_3.predict(X_poly3), color = 'g')
plt.plot(X, lin_reg_4.predict(X_poly4), color = 'b')
plt.title('Polynomial Regression')
plt.xlabel('seats')
plt.ylabel('price')
plt.show()
# Use transform (not fit_transform): the transformers were already fitted on X
print('degree 2:', lin_reg_2.predict(poly_reg2.transform([[140]])))
print('degree 3:', lin_reg_3.predict(poly_reg3.transform([[140]])))
print('degree 4:', lin_reg_4.predict(poly_reg4.transform([[140]])))
The polynomial models predict prices between 1067 and 1072 for a room with 140 seats, which is closer to the real prices.
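The transform-then-predict steps can also be bundled into a single scikit-learn pipeline, so that new inputs are expanded automatically before prediction. The sketch below is one possible way to do this, not the approach used above. It runs on made-up data that follows an exact quadratic, so `seats`, `price`, and the coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data following an exact quadratic (NOT the poly.csv data)
seats = np.arange(10, 110, 10).reshape(-1, 1)
price = 100 + 8 * seats.ravel() - 0.02 * seats.ravel() ** 2

# The pipeline expands the features and fits the linear model in one step
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(seats, price)

# New inputs go straight into predict; no manual transform call is needed
pred = model.predict([[140]])
print(pred)
```

Because the made-up prices are exactly quadratic, the prediction for 140 seats should match 100 + 8*140 - 0.02*140**2 = 828 up to floating-point error. On real, noisy data the fit would only be approximate.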