Unsupervised learning is a class of machine learning techniques for finding patterns in data. The data provided to an unsupervised algorithm is not labelled: we only have input (X) fields, without corresponding output fields. In unsupervised mode, the algorithms are left to their own devices to discover interesting structures and patterns in the data.

Cluster analysis is a classic example of unsupervised learning. It is a method of grouping elements into relatively homogeneous classes. In most algorithms, grouping is based on the similarity between elements, expressed by a similarity function.

There are several clustering methods, of which K-means is the most popular. It initially divides the population into a predetermined number of classes (clusters), and then corrects this division by moving elements between classes so as to minimise the variance within each of them. The aim is the greatest possible similarity of elements within each cluster, and at the same time the greatest possible difference between the clusters themselves. The algorithm repeats two steps:

1. Each element is assigned to the class (cluster) whose centre is closest to it; the measure of similarity here is the distance between the element and the centroid, most often the Euclidean distance, its square, or the Chebyshev distance.
2. New cluster centres are calculated; most often the new centre of a class (cluster) is the point whose coordinates are the arithmetic means of the coordinates of the elements belonging to that class.

These steps are repeated until a convergence criterion is met, most often a step in which the assignment of points to classes has not changed, or reaching a preset number of iterations.

```
from IPython.display import Image
Image(filename="img/kmean.png")
```
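The iteration described above can be sketched in plain NumPy. This is a minimal illustration of the two alternating steps, not the scikit-learn implementation used later; `kmeans_naive` is a name invented here:

```
import numpy as np

def kmeans_naive(points, k, n_iter=100, seed=0):
    """Minimal K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # initialise centroids as k randomly chosen points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance of every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)           # step 1: nearest-centroid assignment
        new_centroids = centroids.copy()
        for j in range(k):                      # step 2: recompute centres
            members = points[labels == j]
            if len(members):                    # keep the old centre if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # convergence: centres stopped moving
            break
        centroids = new_centroids
    return centroids, labels
```
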

To begin with, we will generate classification data to visualize the clusters:

```
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
```

```
data = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.8)
points = data[0]
```

```
plt.scatter(data[0][:,0], data[0][:,1], c=data[1])
plt.xlim(-15,15)
plt.ylim(-15,15)
```

The sklearn library has a KMeans function ready, which we will use for the next example. The data is also fictitious:

```
import pandas as pd
df = pd.read_csv(r'C:\Users\VFW863\Desktop\en\kmeans.csv');df.head()
```

```
X = df.iloc[:, [0,4]].values
X
```

The KMeans method requires specifying the number of clusters. If we do not know how many groups there should be in our data, we can use the so-called elbow method. The idea is to fit K-means on the dataset for a range of K values (K from 1 to 20 in the example below) and, for each K, compute the sum of squared errors (SSE):

```
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import KMeans

wb = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10)
    kmeans.fit(X)
    wb.append(kmeans.inertia_)  # SSE for this number of clusters
plt.plot(range(1, 20), wb)
```

From the graph above we can determine the best number of clusters for our model; the optimal number in this case is 3, the value at the "elbow" of the curve.

We can also look directly at the metric behind this plot: inertia (the SSE used above). Inertia estimates how close the data points are to their clusters. It is computed as the sum of the squared distances of each point to its nearest centre of gravity, i.e. its assigned cluster centre. The intuition is that clusters with lower inertia are better, because they consist of tightly grouped points. Inertia is computed by scikit-learn by default.
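To make this definition concrete, inertia can be recomputed by hand from a fitted model's `cluster_centers_` and `labels_`. The sketch below uses synthetic `make_blobs` data, since it only needs some fitted model:

```
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# inertia = sum over all points of the squared distance to the assigned centroid
manual = ((X_demo - km.cluster_centers_[km.labels_]) ** 2).sum()
print(manual, km.inertia_)  # the two values agree
```
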

```
kmeans_2=KMeans(n_clusters=2).fit(X)
kmeans_3=KMeans(n_clusters=3).fit(X)
kmeans_4=KMeans(n_clusters=4).fit(X)
kmeans_5=KMeans(n_clusters=5).fit(X)
```

```
print ("Inertia for K-mean with 2 clusters =", kmeans_2.inertia_)
print ("Inertia for K-mean with 3 clusters =", kmeans_3.inertia_)
print ("Inertia for K-mean with 4 clusters =", kmeans_4.inertia_)
print ("Inertia for K-mean with 5 clusters =", kmeans_5.inertia_)
```

The smaller the inertia, the smaller the error, but bear in mind that the error always decreases as more clusters are added. In our example above, the optimal value of K is 3 or 4; with K = 5 the error is smaller, but only minimally.
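One way to make "minimally reduced" concrete is to look at the percentage drop in inertia between successive values of K. A sketch on synthetic `make_blobs` data (it only needs a sequence of fitted models):

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=42)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
            for k in range(1, 7)]

# percentage decrease in inertia gained by each extra cluster
drops = [100 * (a - b) / a for a, b in zip(inertias, inertias[1:])]
for k, d in zip(range(2, 7), drops):
    print(f"K={k}: inertia drops by {d:.1f}%")
```

Once the drop becomes small relative to the earlier drops, adding more clusters is no longer paying for itself, which is exactly what the elbow plot shows visually.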

Another way to evaluate clustering is to visualise its results. However, the data we are working with has more than two dimensions, so it is difficult to plot directly. The first step towards visualisation is therefore to reduce the dimensionality of the data. For this we can use Principal Component Analysis (PCA), which reduces the dimensionality while keeping only the most important components. It is a commonly used technique for visualising data from multidimensional spaces.

```
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
x_pca=pca.fit_transform(X)
print(pca.explained_variance_ratio_)
```

In fact, the first component alone would do (it explains 98% of the variance), but we'll take two:

```
kmeans = KMeans(n_clusters=3)
X_clus = kmeans.fit_predict(x_pca)
```

```
x_pca
```

```
df = pd.DataFrame({'x1': x_pca[:,0], 'x2': x_pca[:,1]})
df['label'] = X_clus
plt.scatter(df.x1, df.x2, c=df['label'])
```

In the graph above many points are very similar and overlap, so only a few are visible. When the true labels are available, several metrics can be used to evaluate the model and its results, e.g. purity, accuracy, and recall.
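As an example, purity can be computed from the contingency matrix between the true labels and the cluster assignments: each cluster is credited with its majority true class, and the matches are summed. This is a sketch on made-up labels (scikit-learn has no built-in purity function; `purity_score` is a name invented here):

```
from sklearn.metrics.cluster import contingency_matrix

def purity_score(y_true, y_pred):
    """Purity: fraction of points that fall in their cluster's majority class."""
    cm = contingency_matrix(y_true, y_pred)  # rows = true classes, columns = clusters
    return cm.max(axis=0).sum() / cm.sum()

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth labels
y_pred = [0, 0, 1, 1, 1, 2]   # hypothetical cluster assignments
print(purity_score(y_true, y_pred))  # 5 of 6 points match their cluster's majority class -> ~0.833
```
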