Machine Learning - Training/Testing
Evaluate the model
In machine learning, we create models to predict the outcome of certain events, as in the previous chapter, where we predicted a car's carbon dioxide emissions from its weight and engine displacement.
To measure whether the model is good enough, we can use a method called Train/Test.
What is Train/Test
Train/Test is a method to measure the accuracy of the model.
It is called Train/Test because we split the dataset into two groups: a training set and a test set.
80% is used for training, and 20% is used for testing.
You can use the training set to train the model.
You can use the test set to test the model.
Training the model means creating the model.
Testing the model means testing the accuracy of the model.
Start from the dataset.
Start with the dataset you want to test.
Our dataset represents 100 customers in a store and their shopping habits.
Example
import numpy
import matplotlib.pyplot as plt

numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

plt.scatter(x, y)
plt.show()
Result:
The x-axis represents the number of minutes a customer spends in the store before making a purchase.
The y-axis represents the amount of money spent on the purchase.

Split training/test
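The training set should be a random selection of 80% of the original data.
The test set should be the remaining 20%.
Because the values in this dataset were generated randomly, the first 80 values already form a random selection:
train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]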
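Real data is often sorted (by date, by customer ID, and so on), so slicing off the first 80% would not be a random selection. The following is a minimal sketch of a shuffled split, assuming NumPy only. The helper name `shuffled_split` and its parameters are hypothetical and not part of the tutorial, and the scores reported in this tutorial come from the slice-based split, not from this one.

```python
import numpy

numpy.random.seed(2)

# Same synthetic data as in the tutorial.
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

def shuffled_split(x, y, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle the indices, then cut at train_fraction."""
    rng = numpy.random.default_rng(seed)
    indices = rng.permutation(len(x))
    cut = int(round(len(x) * train_fraction))
    train_idx, test_idx = indices[:cut], indices[cut:]
    # Index x and y with the same indices so the pairs stay aligned.
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]

train_x, train_y, test_x, test_y = shuffled_split(x, y)
print(len(train_x), len(test_x))
```

Shuffling the indices rather than the arrays themselves keeps every x value paired with its own y value.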
Show the training set
Display the same scatter plot, this time with the training set only:
Example
plt.scatter(train_x, train_y)
plt.show()
Result:
It looks like the original dataset, so it seems to be a fair selection:

Show the test set
To ensure that the test set is not completely different, we also need to look at the test set.
Example
plt.scatter(test_x, test_y)
plt.show()
Result:
The test set also looks like the original dataset:
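Besides looking at the plots, the two sets can be compared numerically. The sketch below prints the mean and standard deviation of each set, assuming the same slice-based split as above. The `summary` dictionary is an illustrative name, not part of the tutorial.

```python
import numpy

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]

# Collect (mean, standard deviation) for each set so they can be compared.
summary = {}
for name, values in [("train_x", train_x), ("test_x", test_x),
                     ("train_y", train_y), ("test_y", test_y)]:
    summary[name] = (values.mean(), values.std())
    print(name, "mean:", round(summary[name][0], 2),
          "std:", round(summary[name][1], 2))
```

If the means or spreads of the two sets differ widely, the split is probably not representative and is worth redoing.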

Fit the dataset
What does the dataset look like? The shape of the scatter plot suggests that polynomial regression is the most suitable fit, so let's draw a polynomial regression line.
To draw a line through the data points, we use the plot() method of the matplotlib module:
Example
Draw a polynomial regression line through the data points:
import numpy
import matplotlib.pyplot as plt

numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]

# Fit a 4th-degree polynomial to the training data only
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

# 100 evenly spaced x values from 0 to 6 for drawing the line
myline = numpy.linspace(0, 6, 100)

plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
Result:

The result supports the suggestion that polynomial regression fits the dataset, even though the line gives some strange results when we try to predict values outside the dataset. For example, the line indicates that a customer who spends 6 minutes in the store would make a purchase worth 200. That is probably a sign of overfitting.
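To see where predictions become unreliable, it helps to compare each query point with the range of x values the model was trained on. This is a minimal sketch assuming the same data and 4th-degree fit as above. The query values are arbitrary choices for illustration.

```python
import numpy

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x, train_y = x[:80], y[:80]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

# Range of x values the model actually saw during training.
low, high = train_x.min(), train_x.max()
print("trained on x between", round(low, 2), "and", round(high, 2))

# Query points: some inside the training range, some beyond it.
for minutes in [1, 3, 5, 6, 8, 10]:
    label = "inside range" if low <= minutes <= high else "extrapolated"
    print(minutes, round(float(mymodel(minutes)), 2), label)
```

A high-degree polynomial is only constrained where it has data, so predictions labelled as extrapolated should be treated with suspicion.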
But what about the R-squared score? The R-squared score is a good indicator of how well the model fits the dataset.
R2
Remember R2, also known as R-squared?
It measures the relationship between the x-axis and the y-axis. For a least-squares fit scored on its own training data, the value ranges from 0 to 1, where 0 means no relationship and 1 means a perfect fit. On unseen data, the score can drop below 0 if the model predicts worse than simply using the mean.
The sklearn module has a method called r2_score() that will help us find this relationship.
Here, we want to measure the relationship between the time customers spend in the store and how much money they spend.
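For intuition, the score can also be computed by hand. The sketch below uses the standard coefficient of determination formula, 1 - SS_res / SS_tot, with NumPy only. The function name `r_squared` is a hypothetical stand-in written for this illustration, not sklearn's implementation.

```python
import numpy

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = numpy.asarray(y_true, dtype=float)
    y_pred = numpy.asarray(y_pred, dtype=float)
    ss_res = numpy.sum((y_true - y_pred) ** 2)         # error left after the fit
    ss_tot = numpy.sum((y_true - y_true.mean()) ** 2)  # error of always guessing the mean
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions -> 1.0
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always the mean -> 0.0
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean -> -3.0
```

The last case shows why a score on new data is not guaranteed to stay between 0 and 1.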
Example
How well does the polynomial regression fit our training data?
import numpy
from sklearn.metrics import r2_score

numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]

mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

# Score the model on the same data it was trained on
r2 = r2_score(train_y, mymodel(train_x))

print(r2)
Note: The result 0.799 shows a good relationship.
Introducing the test set
We now have a model that fits well, at least when it comes to the training data.
Next, we test the model with the test data to see whether it gives a similar result.
Example
Let's determine the R2 score when using test data:
import numpy
from sklearn.metrics import r2_score

numpy.random.seed(2)

x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x = x[:80]
train_y = y[:80]

test_x = x[80:]
test_y = y[80:]

mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

# Score the model on data it has not seen during training
r2 = r2_score(test_y, mymodel(test_x))

print(r2)
Note: The result of 0.809 shows that the model fits the test set as well, so we can be reasonably confident in using it to predict future values.
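Comparing the training and test scores is also a practical way to choose the polynomial degree. The sketch below repeats the fit for several degrees. The degree list is an arbitrary choice, and `r_squared` is a hand-written stand-in for r2_score() so that the example needs NumPy only.

```python
import numpy

def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = numpy.sum((y_true - y_pred) ** 2)
    ss_tot = numpy.sum((y_true - numpy.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]

# Fit on the training set only, then score on both sets.
scores = {}
for degree in [1, 2, 4, 6]:
    model = numpy.poly1d(numpy.polyfit(train_x, train_y, degree))
    train_score = r_squared(train_y, model(train_x))
    test_score = r_squared(test_y, model(test_x))
    scores[degree] = (train_score, test_score)
    print("degree", degree, "train:", round(train_score, 3),
          "test:", round(test_score, 3))
```

The training score can only improve as the degree grows, so it is the test score that shows whether the extra flexibility generalizes. A training score that keeps rising while the test score falls suggests overfitting.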
Predict values
Now that we have established that our model is reasonably good, we can start predicting new values.
Example
How much money will a customer spend if they stay in the store for 5 minutes?
print(mymodel(5))
The example predicts that the customer will spend $22.88, which seems to correspond to the chart:
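A poly1d model also accepts an array, so several durations can be predicted in one call. This is a minimal sketch assuming the same data and fit as in the earlier examples. The chosen durations are arbitrary but stay inside the range covered by the training data.

```python
import numpy

numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

train_x, train_y = x[:80], y[:80]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

# Evaluate the polynomial at several durations (in minutes) at once.
minutes = numpy.array([2, 3, 4, 5])
predictions = mymodel(minutes)

for m, p in zip(minutes, predictions):
    print(m, "minutes:", round(float(p), 2))
```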
