The problem with a single validation split in Machine Learning is that it gives little indication of how the learner will generalize to unseen data. This is where Cross-Validation comes into the picture. This article covers the basic concepts of Cross-Validation in Machine Learning: what it is, the types of Cross-Validation, the Cross-Validation API in scikit-learn, how to estimate a model's bias and variance, and the limitations and applications of Cross-Validation.
For any model in Machine Learning, it is considered a best practice to test the model with an independent data set. Normally, a prediction model is fitted on a known data set, which is called the training set.
But in a real-life scenario, the model will be judged on its accuracy with an altogether different, unseen data set. Under those circumstances, you’d want your model to perform about as well as it does on the training set. Testing the model on data that it was not trained on, to check that it will hold up on future data, is what cross-validation in Machine Learning is about.
We can also call it a technique for assessing how a statistical model generalizes to an independent data set. Now that we know what cross-validation stands for, let us try to understand cross-validation in simple terms.
The basic purpose of cross-validation is to assess how the model will perform with an unknown data set. For instance, imagine you are trying to score into an empty goal. It looks pretty easy, and you could even score from a considerable distance. But the real test starts when there is a goalkeeper and a bunch of defenders. That’s why you need to get trained in a real match, facing all the heat, and still score the goal.
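There are two types of cross-validation techniques in Machine Learning.
Exhaustive Cross-Validation – This method involves testing the model on all possible ways of dividing the original data set into training and validation sets. Example: Leave-p-out Cross-Validation, Leave-one-out Cross-Validation.
Non-Exhaustive Cross-Validation – This method does not evaluate every possible split of the original data set; it uses only a limited number of splits. Example: K-fold Cross-Validation, Holdout method.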
Let’s get into more details about various types of cross-validation in Machine Learning.
In Machine Learning, there is rarely as much data as we would like for training the model. If we then set aside part of that data, we risk overfitting the Machine Learning model to the smaller sample that remains. It is also possible that the model will not recognize a dominant pattern if too little data is provided for the training phase.
By reducing the data, we also face the risk of reduced accuracy due to the error induced by bias. To overcome this problem, we need a method that would provide ample data for training and some data for testing. K-fold Cross-validation does exactly that.
How does it work?
In this cross-validation technique, the data is divided into k subsets (folds). We take one subset and treat it as the validation set for the model, and we use the remaining k-1 subsets to train the model. This is repeated k times, with a different subset held out each time.
The error estimates from the k trials are averaged to get the overall effectiveness of the model. Each of the k subsets is used as the validation set exactly once and is part of the training set in the other k-1 trials. This significantly reduces the error induced by bias, because most of the data is used for training in every trial. It also reduces the variance of the estimate, because every one of the k subsets is eventually used for validation.
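The procedure described above can be sketched by hand with scikit-learn's `KFold`. This is a minimal illustration, not a reference implementation: the synthetic data, the choice of `LinearRegression`, and the use of mean squared error as the error measure are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration): y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=30)

k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_errors = []                                # one error estimate per trial
times_validated = np.zeros(len(X), dtype=int)   # how often each sample is held out

for train_idx, val_idx in kfold.split(X):
    # Train on k-1 folds, validate on the remaining fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])
    fold_errors.append(mean_squared_error(y[val_idx], predictions))
    times_validated[val_idx] += 1

# Average the k error estimates
mean_error = float(np.mean(fold_errors))
print("per-fold MSE:", np.round(fold_errors, 3))
print("mean MSE:", round(mean_error, 3))
```

The `times_validated` counter is only there to make the bookkeeping visible: after the loop, every sample has been held out for validation exactly once.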
Stratified k-fold Cross-Validation makes a slight change to k-fold Cross-Validation: the folds are built so that each one contains approximately the same percentage of samples of each target class as the whole data set. In the case of regression problems, the folds are chosen so that the mean response value is approximately equal in all of them.
In some cases, there is a large imbalance in the responsive variables. Let us understand this with an example. In a house pricing problem, the prices of some houses can be much more than the other houses. Also, in classification problems, the samples may have more negative examples than the positive samples. To tackle this discrepancy we follow the stratified k-fold Cross-Validation technique in Machine Learning.
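As a small sketch of the stratified splitting described above, the snippet below runs scikit-learn's `StratifiedKFold` on a made-up, imbalanced label vector (16 negatives and 4 positives, an assumption chosen so the proportions are easy to check) and records how many positives land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 16 negative samples, 4 positive samples
y = np.array([0] * 16 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

positives_per_fold = []
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold should keep the 4:1 class ratio of the full set
    positives_per_fold.append(int(y[val_idx].sum()))
    print("validation labels:", y[val_idx])

print("positives per fold:", positives_per_fold)
```

Note that `split` needs the labels `y` as well as `X`, because the class of each sample decides which fold it is assigned to.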
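The holdout method is the simplest cross-validation method of all. In this method, we randomly assign the data points to two data sets, one for training and one for validation. The sizes of the two sets are arbitrary, although the training set is usually the larger one.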
The basic idea behind this is to remove a part from your training set and use it to get predictions from the model that is trained on the rest of the data. This method suffers from high variance since it takes only a single run to execute all this. It may also give misleading results.
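This single-split approach (often called the holdout method) can be sketched with scikit-learn's `train_test_split`. The synthetic data, the 75/25 split, and the linear model are assumptions for illustration. Repeating the split with different random seeds is not part of the method itself; it is done here only to show how much a single-run estimate can move.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 2x + noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 2, size=40)

holdout_errors = []
for seed in range(5):
    # One random split: 75% for training, 25% held out for validation
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    holdout_errors.append(mean_squared_error(y_val, model.predict(X_val)))

print("holdout MSE for each random split:", np.round(holdout_errors, 3))
```

Each entry in `holdout_errors` is what a single holdout run would have reported; the spread between them illustrates the high variance mentioned above.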
In Leave-p-out Cross-Validation, p data points are left out of the training data. Let’s say there are m data points in the data set; then m-p data points are used for the training phase, and the remaining p data points are kept as the validation set.
This technique is rather exhaustive because the above process is repeated for all the possible combinations in the original data set. To check the overall effectiveness of the model, the error is averaged for all the trials.
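To see why this is exhaustive, the sketch below enumerates every split that scikit-learn's `LeavePOut` generates for a tiny made-up array with m = 5 and p = 2. The number of splits equals the number of ways to choose p points out of m, which grows very quickly with m.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Tiny made-up data set with m = 5 points
data = np.array([0.10, 0.22, 0.31, 0.43, 0.52])
p = 2

lpo = LeavePOut(p)
splits = list(lpo.split(data))

for train_idx, val_idx in splits:
    # m - p points for training, p points for validation
    print("train:", data[train_idx], "validation:", data[val_idx])

n_splits = len(splits)
print("number of splits:", n_splits, "expected:", comb(len(data), p))
```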
Leave-one-out Cross-Validation is similar to Leave-p-out Cross-Validation; the only difference is that in this case p = 1. With m data points there are only m possible splits, so it saves a lot of time compared with larger values of p, which is a big advantage.
If the data set is very large, it can still take a lot of time, because the model has to be trained m times. But it would still be quicker than Leave-p-out Cross-Validation with p greater than 1.
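The difference in cost can be made concrete by counting splits. The sketch below uses scikit-learn's `LeaveOneOut` and `LeavePOut` on a placeholder array of 20 points; the array and the choice p = 3 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

# Placeholder data set with m = 20 points
data = np.arange(20)

# Leave-one-out: one split per data point
loo_count = LeaveOneOut().get_n_splits(data)

# Leave-p-out with p = 3: one split per combination of 3 points
lpo_count = LeavePOut(3).get_n_splits(data)

print("leave-one-out splits:", loo_count)
print("leave-3-out splits:", lpo_count)

# The first leave-one-out split holds out a single point
train_idx, val_idx = next(LeaveOneOut().split(data))
print("first split validates on:", data[val_idx])
```

Each split means training the model once, so the split count is a direct proxy for running time.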
We do not have to implement Cross-Validation manually. The scikit-learn library in Python provides a simple implementation that will split the data accordingly. It offers Cross-Validation iterators for the various Cross-Validation strategies:
k-fold Cross-Validation: KFold() scikit-learn class
Leave-one-out Cross-Validation: LeaveOneOut() scikit-learn class
Leave-p-out Cross-Validation: LeavePOut() scikit-learn class
Stratified K-Fold Cross-Validation: StratifiedKFold() scikit-learn class
For example, let us use KFold in Python to create training and validation sets.

from numpy import array
from sklearn.model_selection import KFold

# sample data
data = array([0.10, 0.22, 0.31, 0.43, 0.52, 0.63, 0.72, 0.85, 0.92, 0.99])

# split the data into 3 shuffled folds (fixed seed for reproducibility)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# enumerate the splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))

Similarly, we can choose other cross-validation iterators depending upon the requirement and the type of data. Now let us try to understand how we can calculate the model’s bias and variance.
If we do k-fold cross-validation, we will get k different error estimates. In an ideal situation, these errors would sum up to zero, but it is highly unlikely to get such results. As a rough estimate of the bias, we take the average of all the error estimates.
To calculate the model’s variance, we take the standard deviation of all the errors. If we get a low value of standard deviation it means that our model does not vary a lot with different sets of training data.
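A minimal sketch of this calculation uses `cross_val_score` to collect one score per fold and then takes the mean and the standard deviation. The Iris data set, the `LogisticRegression` model, and the use of "1 minus accuracy" as the error are assumptions chosen for illustration; any estimator and metric could take their place.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# For a classifier, cross_val_score returns accuracy per fold by default
scores = cross_val_score(model, X, y, cv=cv)
errors = 1.0 - scores  # error rate for each of the k folds

mean_error = errors.mean()  # average error across folds
std_error = errors.std()    # spread of the error across folds

print("error per fold:", np.round(errors, 3))
print("mean error: %.3f, standard deviation: %.3f" % (mean_error, std_error))
```

A small standard deviation suggests the model behaves similarly regardless of which folds it was trained on.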
The focus should be to maintain a balance between the bias and the variance of the model. This can be achieved by reducing the variance to the minimum and controlling the bias. This trade-off usually results in making better predictive models.
The following are a few limitations faced by Cross-Validation:
In an ideal situation, Cross-Validation will produce optimum results. But in case of inconsistent data, the results may vary drastically. It is quite uncertain what kind of data will be encountered by the model.
In predictive modeling, the data often evolves over time, and this can change the training and the validation sets drastically.
The results may vary depending upon the features of the data set. Let us say we make a predictive model to detect an ailment in a person and we train it on a specific subset of the population. Its results may differ for the general population, causing inconsistency and reduced accuracy.
Besides its main use in guarding a Machine Learning model against overfitting and underfitting, Cross-Validation has several other applications, listed below:
We can use it to compare the performances of a set of predictive modeling procedures.
Cross-Validation excels in the field of medical research.
It can be used in meta-analysis, since many data analysts already use cross-validation.
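As an illustration of the first application, comparing predictive modeling procedures, the sketch below scores two classifiers on the same folds. The Iris data set and the two models (`LogisticRegression` and `DecisionTreeClassifier`) are assumptions picked for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Use the same folds for every candidate so the comparison is fair
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv)  # accuracy per fold
    mean_scores[name] = scores.mean()
    print("%s: mean accuracy %.3f (std %.3f)" % (name, scores.mean(), scores.std()))
```

Fixing `random_state` on the splitter means both models see identical training and validation folds, so differences in the scores come from the models rather than from the split.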
This brings us to the end of this article where we have learned Cross-Validation in Machine Learning. I hope you are clear with all that has been shared with you in this tutorial.
If you found this article on “Cross-Validation In Machine Learning” relevant, check out the Machine Learning Certification Training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe.
We are here to help you with every step on your journey and come up with a curriculum that is designed for students and professionals who want to be a Machine Learning Engineer. The course is designed to give you a head start into Python programming and train you for both core and advanced Python concepts along with various Machine Learning Algorithms like SVM, Decision Tree, etc.
If you have any questions, feel free to ask them in the comments section of “Cross-Validation In Machine Learning” and our team will be glad to answer.