The Support Vector Machine (SVM) method offers very high accuracy compared to other classifiers such as logistic regression and decision trees. A key advantage of SVM is its ability to handle nonlinear data. It is used in a variety of applications – face detection, intrusion detection, classification of emails, news articles and websites, gene classification, and handwriting recognition.

SVM can easily handle many continuous and categorical variables. SVM constructs a hyperplane in a multidimensional space to separate the different classes. It generates the optimal hyperplane iteratively, minimizing the classification error. The basic idea behind SVM is to find the maximum marginal hyperplane (MMH) that best divides the data set into classes. SVM can be used for both classification and regression tasks; however, it is most often used for classification problems.

How does SVM work?

The main goal is to segregate a given data set in the best possible way. The distance between the hyperplane and the closest points of each class is called the margin. The goal is to select a hyperplane with the maximum possible margin between the support vectors in the given data set. SVM searches for the maximum marginal hyperplane in the following steps:

- We generate hyperplanes that segregate the classes well, then select the hyperplane with the maximum distance from the nearest data points. – Simple SVM

```
from IPython.display import Image
Image(filename="img/svm1.png")
```

```
from IPython.display import Image
Image(filename="img/svm2.png")
```

- Dealing with nonlinear, non-separable data – Some problems cannot be solved with a linear hyperplane. In such situations, SVM uses the so-called kernel trick to transform the input space into a higher-dimensional space. – SVM kernel

```
from IPython.display import Image
Image(filename="img/svm3.png")
```

SVM solves this problem by introducing additional features. The functions that transform an inseparable problem into a separable one are called kernels. They are mainly useful for nonlinear separation: the kernel performs complex data transformations so that class separation becomes possible.

```
from IPython.display import Image
Image(filename="img/svm4.png")
```
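The kernel trick can be illustrated with a small sketch. The concentric-rings data set below (generated with scikit-learn's `make_circles`, which is not part of the original example) cannot be separated by any linear hyperplane, but the RBF kernel handles it easily:

```
# Sketch: the kernel trick on data a linear hyperplane cannot separate.
# make_circles produces two concentric rings - linearly inseparable in 2D.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)

# The linear kernel struggles here; the RBF kernel implicitly maps the
# points into a higher-dimensional space where the rings become separable.
print('linear accuracy:', linear_svm.score(X, y))
print('rbf accuracy:', rbf_svm.score(X, y))
```

On this data the linear kernel scores close to chance, while the RBF kernel separates the rings almost perfectly.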

Support vectors are the data points that are closest to the hyperplane. These points define the dividing line through the calculation of the margins, and they are essential for the construction of the classifier.

A hyperplane is a decision plane that separates a set of objects with different class memberships.

The margin is the gap between the two lines through the closest points of each class. It is computed as the perpendicular distance from the separating line to the support vectors, i.e. the closest points. A larger margin between the classes is considered a good margin.

SVM differs from other classification algorithms in that it selects the decision boundary that maximizes the distance from the closest data points of all classes. SVM does not merely find a decision boundary; it finds the optimal one – the boundary with the maximum margin from the nearest points of all classes.
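These concepts can be checked directly in code. For a linear SVM the margin width equals 2 / ||w||, where w is the weight vector of the fitted hyperplane, and the support vectors are exposed by scikit-learn as an attribute. The sketch below uses `make_blobs` to generate illustrative two-class data (not part of the original example):

```
# Sketch: margin width and support vectors of a linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs (illustrative data)
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# Large C so the fit approximates a hard-margin SVM
clf = SVC(kernel='linear', C=1000).fit(X, y)

# For a linear SVM the margin width is 2 / ||w||
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)
print('margin width:', margin)

# The support vectors are the training points closest to the hyperplane
print('support vectors:\n', clf.support_vectors_)
```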

Iris data example:

```
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()
```

We add new column with species name:

```
df['name'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['name_cat'] = pd.factorize(df['name'])[0]
```

We will divide the data into a test and training part:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']],
    df['name_cat'], test_size=0.25)
```

Now we can apply the SVM classifier:

```
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
```

Let’s see how this classifier worked:

```
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
```

The results are pretty good and similar to the decision tree classifier. To improve the result, we can always experiment with parameter optimization: changing the kernel (linear in the example above), the C parameter, or gamma.
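Parameter optimization can be automated with a grid search. The sketch below uses scikit-learn's `GridSearchCV` (not shown in the original example) with an illustrative parameter grid and a fixed `random_state` for the split; the specific values are assumptions, not recommendations:

```
# Sketch: tuning kernel, C and gamma with 5-fold cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Illustrative grid; gamma is ignored by the linear kernel but harmless
param_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.1, 1],
}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print('best parameters:', grid.best_params_)
print('test accuracy:', grid.score(X_test, y_test))
```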

Kernel – There are different types of kernel functions, such as linear, polynomial, and radial basis function (RBF). The polynomial and RBF kernels are useful for a nonlinear hyperplane: they compute the dividing line in a higher-dimensional space. In some applications a more complex kernel is recommended to separate curved or nonlinear classes; this transformation can lead to more accurate classifiers.

Regularization – The C parameter controls regularization. Here, C is the penalty parameter for misclassification: it tells the SVM optimization how much error can be tolerated, so you can control the trade-off between a wide decision margin and classifying the training points correctly. A smaller C value creates a hyperplane with a larger margin (at the cost of more misclassified training points), while a larger C value creates a hyperplane with a smaller margin that fits the training data more closely.

Gamma – A lower gamma value fits the training data set loosely, while a higher gamma value fits it exactly, which leads to overfitting. In other words, a high gamma value means that only nearby points influence the decision boundary, while a low gamma value means that distant points are also taken into account.
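One rough way to see the effect of C and gamma is to count support vectors: a loosely fitted model (small C) and an overfitted model (large gamma) both lean on many support vectors. This sketch refits the iris classifier with a few illustrative values (the specific values are assumptions):

```
# Sketch: how C and gamma change the fitted model, seen through the
# number of support vectors the model retains.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Small C tolerates errors -> wide margin -> many support vectors
for C in [0.01, 1, 100]:
    n_sv = SVC(kernel='rbf', C=C).fit(X, y).n_support_.sum()
    print(f'C={C}: {n_sv} support vectors')

# Large gamma -> each point has very local influence -> overfitting,
# with almost every training point becoming a support vector
for gamma in [0.01, 1, 100]:
    n_sv = SVC(kernel='rbf', gamma=gamma).fit(X, y).n_support_.sum()
    print(f'gamma={gamma}: {n_sv} support vectors')
```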

Let’s try how the results change with a different kernel:

```
svclassifier = SVC(kernel='poly', degree=4)
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

```
svclassifier = SVC(kernel='rbf')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

The results are similar for all three kernels.
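The three-kernel comparison above can be condensed into a single loop. This sketch rebuilds the split with a fixed `random_state` (an assumption, so the run is reproducible) and prints one accuracy per kernel:

```
# Sketch: comparing the three kernels on the same iris train/test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f'{kernel}: test accuracy = {clf.score(X_test, y_test):.3f}')
```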