When scoring a logistic regression model, is it mandatory to have the predicted variable in the test dataset?

Question

Please help explain the scoring process for a model that was built on a training dataset and that I now want to apply to a test dataset to get the final results.

Follow-up question: I have created the model and the confusion matrix, and now I want to apply this model to a different, new TEST dataset. What should my approach be, and how can I predict yes/no for each record of the new dataset? · Oct 17, 2018

Anmol · Answer 1 · Oct 16, 2018

We need the actual values of the target variable only to compare the predicted output against them. If the test set has no output column, you can just create an empty one to hold the predictions.

In R, use: test_set$Output_Column_Name <- NA

In Python (pandas, with numpy imported as np), use: test_set['Output_Column_Name'] = np.nan

Next, after creating the target column, fill it with the output predicted by the logistic regression model you built.

Once you have the predicted values, which are binary in nature (either YES/NO or 1/0), compare them with the actual values using a confusion matrix.

The confusion matrix is built from the following counts:

True Positive: number of instances in which the actual and the predicted value both are True
True Negative: number of instances in which the actual and the predicted value both are False
False Positive: number of instances in which the actual is False but the predicted value is True
False Negative: number of instances in which the actual is True but the predicted value is False
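
As a rough illustration of the four counts above, here is a minimal Python sketch. The function name `confusion_counts` and the sample label lists are made up for this example, and labels are assumed to be coded as 1 (yes) and 0 (no).

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels coded as 1/0."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # actual True, predicted True
        elif a == 0 and p == 0:
            tn += 1          # actual False, predicted False
        elif a == 0 and p == 1:
            fp += 1          # actual False, predicted True
        else:
            fn += 1          # actual True, predicted False (labels assumed 0/1)
    return tp, tn, fp, fn

# Hypothetical labels, only for illustration
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)
```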

From these counts, the accuracy of the model is calculated with the formula:

(True Positive + True Negative) / Total number of Instances

For more details about the confusion matrix you can refer to the following link: https://bit.ly/2RSDtW8
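
A small worked example of this formula in Python, as a sketch only. The helper name `accuracy` and the counts passed to it are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / total number of instances."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no instances to score")
    return (tp + tn) / total

# Hypothetical counts: 50 TP, 35 TN, 10 FP, 5 FN, so 85 correct out of 100
print(accuracy(50, 35, 10, 5))
```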

Hope this helps :)

Anmol · Answer 2 · Oct 17, 2018

Answer to your follow up question:

We can never measure the accuracy of a model without the actual values. If you have created a predictive model and are running it on new data whose target variable values are unknown, then you are actually deploying the model to predict the outcome for new data rather than testing it.

Coming to the second part of your question - "how can I predict yes/no for each record of the new dataset ?"

Once the model is created, you can again add a placeholder column with NA values to the new dataset to hold the output, and then use the code below:

predicted_prob <- predict(model_name, newdata = new_test_data, type = 'response')

For a glm logistic regression model, type = 'response' returns a vector of predicted probabilities, one value for each instance/row you passed in. To get yes/no for each record, compare each probability with a cutoff such as 0.5.

The new_test_data can have any number of rows; even a single row would work.
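
To make the scoring step concrete, here is a minimal Python sketch of what happens when a fitted logistic regression model is applied to new rows: each row gets a probability, and a cutoff turns it into yes/no. The coefficients, the predictor names (`age`, `balance`), the helper names, and the 0.5 cutoff are all assumptions made up for illustration; in practice the coefficients come from your trained model.

```python
import math

# Hypothetical fitted coefficients; in practice these come from the trained model
INTERCEPT = -1.5
COEFS = {"age": 0.03, "balance": 0.0004}

def predict_probability(row):
    """Logistic regression score: sigmoid(intercept + sum(coef * value))."""
    z = INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(row, cutoff=0.5):
    """Turn the probability into yes/no using a cutoff (0.5 is an assumption)."""
    return "yes" if predict_probability(row) >= cutoff else "no"

# New records whose outcome is unknown; one row or many both work
new_test_data = [
    {"age": 25, "balance": 500},
    {"age": 60, "balance": 3000},
]
labels = [predict_label(row) for row in new_test_data]
print(labels)
```

The cutoff does not have to be 0.5; raising or lowering it trades false positives against false negatives.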