Machine Learning - Training/Testing

Evaluate the model

In machine learning, we create models to predict the outcome of certain events. In the previous chapter, for example, we predicted the carbon dioxide emissions of a car from its weight and engine displacement.

To measure whether a model is good enough, we can use a method called train/test.

What is train/test?

Train/test is a method for measuring the accuracy of a model.

It is called train/test because we split the dataset into two sets: a training set and a test set.

80% is used for training, and 20% is used for testing.

You can use the training set to train the model.

You can use the test set to test the model.

Training the model means creating the model.

Testing the model means testing the accuracy of the model.

Start with a dataset

Start with the dataset you want to test.

Our dataset describes 100 customers in a store and their shopping habits.

Example

import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
plt.scatter(x, y)
plt.show()

Result:

The x-axis represents the number of minutes before the purchase.

The y-axis represents the amount spent on the purchase.



Split training/test

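The training set should be a random selection of 80% of the original data. Because the values above were generated in random order, taking the first 80 points already gives a random selection.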

The test set should be the remaining 20%.

train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
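
When a dataset is not already in random order, slicing off the first 80 rows would not give a random selection. A minimal sketch of one way to shuffle before splitting is shown here; the generator `rng`, its seed 0, and the names `train_idx` and `test_idx` are illustrative assumptions, not part of the original example.

```python
import numpy

# Same synthetic dataset as in the example above.
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x

# Shuffle the row indices, then cut them 80/20.
# The separate generator and its seed are assumptions made for this sketch.
rng = numpy.random.default_rng(0)
indices = rng.permutation(len(x))
split = len(x) * 80 // 100

train_idx, test_idx = indices[:split], indices[split:]

# Index x and y with the same positions so each pair stays together.
train_x, train_y = x[train_idx], y[train_idx]
test_x, test_y = x[test_idx], y[test_idx]

print(len(train_x), len(test_x))
```

Using the same index arrays for x and y keeps every (x, y) pair intact, and no point ends up in both sets.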

Show the training set

Display the training set as a scatter plot, in the same way as the full dataset:

Example

plt.scatter(train_x, train_y)
plt.show()

Result:

It looks like the original dataset, so the split seems to be a reasonable choice.



Show the test set

To ensure that the test set is not completely different, we also need to look at the test set.

Example

plt.scatter(test_x, test_y)
plt.show()

Result:

The test set also looks like the original dataset.



Fit the dataset

What does the dataset look like? A polynomial regression appears to be the most suitable fit, so let's draw a polynomial regression line.

To draw a line through the data points, we use the plot() method of the matplotlib module:

Example

Draw a polynomial regression line through the data points:

import numpy
import matplotlib.pyplot as plt
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
myline = numpy.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()

Result:



The result supports our suggestion that a polynomial regression fits the dataset, although it can give strange results if we try to predict values outside the range of the data. For example, the line indicates that a customer who spends 6 minutes in the store would make a purchase worth 200. That is probably a sign of overfitting.
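
One way to notice this kind of problem is to check whether a value lies inside the range the model was trained on before trusting its prediction. The sketch below is only an illustration: the `queries` array and the `inside` mask are names introduced here, and the check is a simple range test, not a guarantee of prediction quality.

```python
import numpy

# Rebuild the dataset and the degree-4 model from the example above.
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))

# Hypothetical shopping times (in minutes) we would like predictions for.
queries = numpy.array([1.0, 3.0, 5.0, 6.0])
predictions = mymodel(queries)

# True where the query lies between the smallest and largest training x.
inside = (queries >= train_x.min()) & (queries <= train_x.max())

for q, p, ok in zip(queries, predictions, inside):
    label = "inside training range" if ok else "extrapolation"
    print(q, p, label)
```

Predictions labelled as extrapolation deserve extra caution, because the polynomial was never fitted to data in that region.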

But what about the R-squared score? The R-squared score is a good indicator of how well the model fits the dataset.

R2

Remember R2, also known as R-squared?

It measures the relationship between the x-axis and the y-axis. The value normally ranges from 0 to 1, where 0 means no relationship and 1 means the values are completely related.

The sklearn module has a method called r2_score() that will help us find this relationship.

Here, we want to measure the relationship between the time customers spend in the store and how much money they spend.
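
To make the score less of a black box, here is a minimal sketch that computes R-squared by hand with the usual definition: 1 minus the residual sum of squares divided by the total sum of squares. The helper name `r_squared` and the tiny `actual` list are assumptions made for illustration only.

```python
import numpy

def r_squared(actual, predicted):
    """R-squared as 1 - SS_res / SS_tot (hypothetical helper for this sketch)."""
    actual = numpy.asarray(actual, dtype=float)
    predicted = numpy.asarray(predicted, dtype=float)
    ss_res = numpy.sum((actual - predicted) ** 2)      # error left after the model
    ss_tot = numpy.sum((actual - actual.mean()) ** 2)  # error of predicting the mean
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]

print(r_squared(actual, actual))                # perfect predictions -> 1.0
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 0.0
```

With this definition, a model that predicts worse than the mean produces a negative score, so values below 0 are possible even though a useful model normally lands between 0 and 1.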

Example

How well does the polynomial regression fit our training data?

import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)


Note: The result 0.799 shows that there is a good relationship.

Introducing the test set

Now we have a model that is good, at least when it comes to the training data.

Next, we test the model with the test data to see whether it gives a similar result.

Example

Let's determine the R2 score when using test data:

import numpy
from sklearn.metrics import r2_score
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
mymodel = numpy.poly1d(numpy.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)


Note: The result 0.809 shows that the model fits the test set as well, so we can be reasonably confident in using it to predict future values.
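
To see why comparing the two scores matters, the sketch below fits several polynomial degrees to the same training data and prints the R-squared score on both sets. The degrees tried and the `r_squared` helper are illustrative assumptions; the helper uses the usual definition (1 minus residual sum of squares over total sum of squares) so the sketch needs only numpy.

```python
import numpy

# Same dataset and split as in the examples above.
numpy.random.seed(2)
x = numpy.random.normal(3, 1, 100)
y = numpy.random.normal(150, 40, 100) / x
train_x, train_y = x[:80], y[:80]
test_x, test_y = x[80:], y[80:]

def r_squared(actual, predicted):
    """R-squared as 1 - SS_res / SS_tot (hypothetical helper for this sketch)."""
    ss_res = numpy.sum((actual - predicted) ** 2)
    ss_tot = numpy.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Map each polynomial degree to its (training score, test score) pair.
scores = {}
for degree in (1, 2, 4, 6):
    model = numpy.poly1d(numpy.polyfit(train_x, train_y, degree))
    scores[degree] = (
        r_squared(train_y, model(train_x)),
        r_squared(test_y, model(test_x)),
    )
    print(degree, scores[degree])
```

The training score can only stay the same or rise as the degree grows, so it cannot reveal overfitting on its own. A test score that stalls or drops while the training score keeps rising would be the warning sign.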

Predict values

Now that we have established that our model is good, we can start predicting new values.

Example

How much money will the customer spend if they stay in the store for 5 minutes?

print(mymodel(5))


The example predicts that the customer will spend 22.88 dollars, which seems to correspond to the chart.