Resampling is a method of drawing repeated samples from the original data sample. It is a non-parametric method of statistical inference. Why do we need it? Resampling uses the collected dataset to improve the estimate of a population parameter and helps to quantify the uncertainty of that estimate.
Common resampling techniques include sampling without replacement, bootstrapping (sampling with replacement), the jackknife (using subsets), cross-validation, and LOOCV (leave-one-out cross-validation).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
%precision 4
plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')
To start with, we’ll use simple sampling with replacement to generate a sample of 30 items with values from 0 to 4:
np.random.seed(222)
np.random.choice(5, 30)
We will use the same function to generate a sample without replacement. You just have to remember that the number of samples drawn (here 6) cannot be greater than the number of values (8).
np.random.choice(8, 6, replace=False)
Bootstrap¶
Having one set of data, we compute the statistic and get a single set of statistic values, but we don’t know how variable this statistic is. The bootstrap creates a large number of datasets that we "could have seen" and computes the statistic for each of those datasets. This way we get the distribution of the statistic. The key is the strategy for creating the data we "could have seen". The bootstrap is a resampling technique used to estimate population statistics by sampling the dataset with replacement. Resampling generates a sampling distribution based on the actual data, using an empirical rather than analytical method. It gives objective estimates because it is based on samples drawn directly from the data investigated by the researcher.
Importantly, the samples are constructed by drawing observations from the original data sample one at a time and returning each one to the data sample after it has been selected. This allows a given observation to be included in a given resample more than once.
from IPython.display import Image
Image(filename="img/boot.png")
Below is a simple bootstrapping example; our set has only 10 observations:
np.random.seed(2)
data = np.random.choice(5, 10);data
We randomly select one sample -> 3:
obs = [3]
After drawing and recording its value, this observation goes back into the dataset; we repeat the draw several times and we have our sample:
sample = [3, 2, 1, 1]
Each element of the set can be drawn several times or not at all; here 1 was drawn twice even though it appears in the set only once. We don’t need to implement this method manually, as scikit-learn provides a resample() function:
from sklearn.utils import resample
bootstrap = resample(data, replace=True, n_samples=4); bootstrap
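The point of the bootstrap is to repeat this resampling many times and look at the distribution of the statistic. Below is a minimal sketch; the number of replicates (1000) and the choice of the mean as the statistic are our assumptions for illustration, not part of the original example:
boot_means = np.array([np.mean(resample(data, replace=True, n_samples=len(data)))
                       for _ in range(1000)])   # 1000 bootstrap replicates of the mean
boot_means.mean(), boot_means.std()   # bootstrap estimate of the mean and its standard error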
Jackknife¶
Jackknifing, which is similar to the bootstrap, is used in statistical inference to estimate the bias and standard error (variance) of a statistic when a random sample of observations is used to compute it. The basic idea of the jackknife variance estimator is to systematically recalculate the statistic, leaving out one or more observations at a time from the sample. From this new set of replicates of the statistic, you can calculate an estimate of its bias and an estimate of its variance.
We will use the jackknife to estimate the standard deviation:
def jackknife(x, func):
    # average of the statistic computed with each observation left out in turn
    n = len(x)
    idx = np.arange(n)
    return np.mean([func(x[idx != i]) for i in range(n)])
x = np.random.normal(0, 2, 500)
jackknife(x, np.std)
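The same leave-one-out replicates can also be turned into a standard-error estimate via the usual jackknife variance formula. The helper below is a sketch; the name jackknife_se is ours, not part of any library:
def jackknife_se(x, func):
    # standard error of func(x) estimated from the leave-one-out replicates
    n = len(x)
    idx = np.arange(n)
    reps = np.array([func(x[idx != i]) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))
jackknife_se(x, np.std)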
Train and test split¶
A very popular and easy method: we divide our original data into a training set and a test set. After finding the appropriate coefficients for the model on the training set, we apply that model to the test set and measure its accuracy. This is the final accuracy before applying the model to unknown data. The greater this final accuracy, the greater the hope of obtaining accurate results on unknown data.
Image(filename="img/train.png")
From the generated variables X and y, we will divide the data into training and test sets in a ratio of 80% training to 20% test:
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
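To see the split in a full workflow, here is a minimal sketch; the iris dataset and logistic regression are illustrative choices, not part of the original example. We fit on the training part only and report accuracy on the held-out part:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # coefficients found on the training set only
model.score(X_te, y_te)   # accuracy on the 20% held-out test set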
Cross Validation Split¶
A slightly more complex split: we additionally divide the training set into its own training and test subsets, calculate the accuracy on each such subset, and repeat this for many subsets. Then we choose the coefficients that give the maximum accuracy across these subsets and hope that this model will also give the maximum accuracy on the final test set.
Image(filename="img/kfold.png")
from sklearn.model_selection import KFold
kf = KFold(n_splits=2)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    print('train:', train_index, 'test:', test_index)
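KFold only yields the index splits; to actually train and score a model on each fold we can use cross_val_score, which runs the loop for us. The dataset and estimator below are again illustrative assumptions for the sketch:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X_iris, y_iris = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_iris, y_iris,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
scores, scores.mean()   # accuracy on each fold and the average accuracy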
Leave One Out Cross Validation (LOOCV)¶
Another variation of the cross-validation method. In this type of cross-validation, the number of subsets is equal to the number of cases in the dataset: each model is trained on all observations except one, tested on the single observation that was left out, and the scores are averaged over all subsets. As we get a large number of training sets (equal to the number of samples), this method is computationally very expensive and should be used for small datasets. If the dataset is large, it is usually better to use another method such as k-fold.
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()
loo.get_n_splits(X)
for train_index, test_index in loo.split(X):
    print('train:', train_index, 'test:', test_index)
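Used with a real estimator, LOOCV is simply cross-validation with as many folds as observations. A sketch with an illustrative dataset and classifier (a k-nearest-neighbours model here, purely as an example):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
X_iris, y_iris = load_iris(return_X_y=True)
loo_scores = cross_val_score(KNeighborsClassifier(), X_iris, y_iris, cv=LeaveOneOut())   # one model per observation, each tested on the single left-out case
loo_scores.mean()   # average accuracy over all left-out observations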
The more samples/subsets we use, the lower the bias of the error estimate, but the higher its variance. The computational price also increases: the more subsets, the more time the computation takes and the more memory is required. Usually k = 3 is recommended for large datasets, while LOOCV is best used for smaller datasets.
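As a rough illustration of this trade-off (the dataset, estimator and fold counts below are assumptions for the sketch, not from the original text), we can compare the mean and spread of the per-fold scores for different numbers of folds:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
X_iris, y_iris = load_iris(return_X_y=True)
for k in (3, 5, 10):
    s = cross_val_score(KNeighborsClassifier(), X_iris, y_iris, cv=k)
    print(k, round(s.mean(), 3), round(s.std(), 3))   # more folds: more models to fit, smaller test folds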