Logistic regression is similar to linear regression, but the curve is constructed using the natural logarithm of the odds of the target variable, rather than the probability itself. Moreover, the predictors do not have to be normally distributed, nor do they need to have equal variance in each group.
The model is built on the logistic (sigmoid) function, which maps any real number to a value between 0 and 1:
1 / (1 + e ^ -value)
Where e is the base of the natural logarithm (Euler's number, or the EXP() function in a spreadsheet) and value is the actual numeric value you want to transform.
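As a minimal sketch of the transform above, the function below evaluates the formula in plain Python. The name `sigmoid` is chosen here for illustration and does not come from the original text.

```python
import math

def sigmoid(value):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

# Large negative inputs approach 0, large positive inputs approach 1,
# and 0 maps exactly to 0.5.
for v in (-6, -2, 0, 2, 6):
    print(v, round(sigmoid(v), 4))
```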
The input values (x) are combined linearly using weights or coefficients (usually denoted by the Greek letter beta) to predict the output value (y). The key difference from linear regression is that the modeled output is a binary value (0 or 1), not a continuous numeric value.
The following is an example of a logistic regression equation:
y = e ^ (b0 + b1 * x) / (1 + e ^ (b0 + b1 * x))
Where y is the predicted output value, b0 is the bias or intercept term, and b1 is the coefficient for the single input value (x). Each column in the input data has an associated b coefficient (a constant real value) that must be learned from the training data.
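To make the equation concrete, here is a small sketch with made-up coefficients (`b0 = -4.0` and `b1 = 0.8` are hypothetical values, not learned from any data). It also checks numerically that the logarithm of the odds, ln(y / (1 - y)), equals the linear part b0 + b1 * x, which is the relationship the opening paragraph refers to.

```python
import math

# Hypothetical coefficients, for illustration only.
b0, b1 = -4.0, 0.8

def predict_probability(x):
    """Logistic regression equation for a single input value x."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

for x in (0.0, 5.0, 10.0):
    y = predict_probability(x)
    log_odds = math.log(y / (1.0 - y))
    # The log-odds recover the linear combination b0 + b1 * x.
    print(f"x={x:>4}  y={y:.4f}  log-odds={log_odds:.4f}  b0+b1*x={b0 + b1 * x:.4f}")
```

To obtain the 0 or 1 output described above, a threshold is applied to y, commonly 0.5.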
from IPython.display import Image
Image(filename="img/log1.png")
Logistic regression is one of the simplest and most widely used machine learning algorithms for two-class classification. It is easy to implement and can serve as a baseline for any binary classification problem. Logistic regression describes and estimates the relationship between one binary dependent variable and one or more independent variables.
Types of logistic regression:
- Binary logistic regression: the target variable has only two possible outcomes, such as spam or not spam, cancer or no cancer.
- Multinomial logistic regression: the target variable has three or more nominal categories, such as predicting the type of wine.
- Ordinal logistic regression: the target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.
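As an illustration of the multinomial case, the sketch below fits scikit-learn's `LogisticRegression` to the small wine dataset bundled with scikit-learn (`load_wine`, three classes). The dataset, the scaling step and the `max_iter=1000` setting are choices made for this example and are not part of the walkthrough that follows. Note that this estimator treats the classes as unordered.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# Scale the features, then fit a logistic regression on the 3-class target.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_wine, y_wine)

# One probability per class for each sample; each row sums to 1.
proba = clf.predict_proba(X_wine[:3])
print(clf.classes_)
print(proba.round(3))
```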
For this classification problem, we will use credit card data from the bank and try to detect fraud:
import numpy as np
import pandas as pd
data = pd.read_csv(r'C:\Users\VFW863\Desktop\en\creditcard.csv')
data.head()
5 rows × 31 columns
We divide the data into the independent variables (X) and the dependent variable (y), and check how many observations fall into each class:
X = data.iloc[:, :-2]  # features: every column except the last two
y = data['Class']      # target: the Class column
y.value_counts()
0    284315
1       492
Name: Class, dtype: int64
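A quick back-of-the-envelope check on the counts above shows how skewed the target is. The sketch uses only the two numbers printed by `value_counts()` and assumes that class 1 marks the fraudulent transactions; the variable names are illustrative.

```python
# Class counts as printed above (assumption: class 1 = fraud).
n_legit, n_fraud = 284315, 492
n_total = n_legit + n_fraud

fraud_share = n_fraud / n_total
# Accuracy of a trivial "model" that always predicts class 0.
baseline_accuracy = n_legit / n_total

print(f"fraud share:       {fraud_share:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

Fraud makes up roughly 0.17% of the rows, so a classifier that never flags anything would still be about 99.83% accurate.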
As you can see, the classes are heavily imbalanced, and this imbalance has to be taken into account. Before we get to that, we split the data into training and test sets:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
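With classes this skewed, a purely random split can leave the test set with an unrepresentative share of the rare class. One common option, not used in the original code, is to pass `stratify=y` (and a fixed `random_state` for reproducibility) to `train_test_split`. The sketch below demonstrates this on small synthetic data; the arrays `X_demo` and `y_demo` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 980 negatives and 20 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 980 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

# Stratification keeps the class ratio (2% positives) in both parts.
print(y_tr.mean(), y_te.mean())
```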
We standardize the data (zero mean and unit variance per column):
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)  # learn mean and std on the training set
X_test = sc_x.transform(X_test)        # reuse the training statistics
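The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the mean and standard deviation should be learned from the training data alone. The sketch below reproduces that arithmetic with NumPy on toy arrays (`train` and `test` are invented for the example). `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# "fit": learn per-column statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, i.e. population standard deviation

# "transform": apply the same statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```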
Then we model the data:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
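Under the hood, the fitted model stores the intercept (b0) in `intercept_` and the per-column coefficients in `coef_`, and for a binary target `predict` returns class 1 when the estimated probability exceeds 0.5. The sketch below checks this on synthetic data generated with `make_classification`; the dataset and the variable names are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear part b0 + b1*x1 + ... + bn*xn, computed by hand.
z = demo_model.intercept_ + X_demo @ demo_model.coef_.ravel()
# The logistic function turns it into the probability of class 1.
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = demo_model.predict_proba(X_demo)[:, 1]
labels_manual = (p_manual > 0.5).astype(int)

print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(labels_manual, demo_model.predict(X_demo)))
```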
And let’s check how good our model is:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
An accuracy of over 99% is probably too good to be true; it is a side effect of the imbalanced classes. The confusion matrix shows why:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_predicted)
array([[56857,    11],
       [   37,    57]], dtype=int64)
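Reading the matrix with scikit-learn's convention (rows are true classes, columns are predicted classes), the four cells can be turned into metrics that focus on the rare class. The sketch below does the arithmetic by hand for the matrix above, assuming class 1 is the fraud class; the variable names are illustrative.

```python
# Cells of the confusion matrix shown above.
tn, fp = 56857, 11
fn, tp = 37, 57

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud cases that were caught

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

Accuracy comes out above 0.999, yet recall is only about 0.61: 37 of the 94 fraud cases in the test set are missed.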
Like many other algorithms, logistic regression has a built-in way of handling imbalanced classes. If the classes are highly imbalanced and we have not corrected for this during preprocessing, we can use the class_weight parameter to weight the classes so that each one contributes equally to the fit. In particular, the 'balanced' option automatically weights classes inversely proportional to their frequency. We model the data again:
bal = LogisticRegression(random_state=0, class_weight='balanced')
bal.fit(X_train, y_train)
y_predicted = bal.predict(X_test)
accuracy_score(y_test, y_predicted)
confusion_matrix(y_test, y_predicted)
array([[55601,  1267],
       [   11,    83]], dtype=int64)
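The results are now somewhat more believable: the model catches 83 of the 94 fraud cases in the test set instead of 57, at the cost of 1267 false alarms instead of 11. It is worth noting that accuracy should not be used as the evaluation metric when the classes are heavily imbalanced.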
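For reference, scikit-learn documents the 'balanced' weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the class counts of the full dataset shown earlier, purely as an illustration: the model above is fitted on the training split only, so its actual weights will differ slightly.

```python
import numpy as np

# Class counts of the full dataset (class 0, class 1), as printed earlier.
counts = np.array([284315, 492])
n_samples = counts.sum()
n_classes = len(counts)

# Formula documented by scikit-learn for class_weight='balanced'.
weights = n_samples / (n_classes * counts)

print(weights.round(3))
```

With these counts, each example of the rare class weighs roughly 578 times as much as an example of the majority class (the ratio of the two counts), which is consistent with the balanced model catching more fraud while raising many more false alarms.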