Machine Learning - Data Distribution

Data Distribution

Earlier in this tutorial, we used only very small amounts of data in our examples, just to illustrate the different concepts.

In the real world, data sets are much bigger, but it can be difficult to gather real-world data, at least at an early stage of a project.

How do we obtain large datasets?

To create a large dataset for testing, we use the Python module NumPy, which comes with many methods for creating random datasets of arbitrary size.

Example

Create an array containing 250 random floating-point numbers between 0 and 5:

import numpy
x = numpy.random.uniform(0.0, 5.0, 250)
print(x)
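
As a quick sanity check, you can confirm the size and value range of the generated array. This is a small optional sketch; uniform() samples from the half-open interval [0, 5), so every value is at least 0 and below 5:

import numpy

# Generate 250 random floats uniformly distributed between 0 and 5
x = numpy.random.uniform(0.0, 5.0, 250)

# Verify the size and the value range of the data set
print(len(x))            # 250
print(x.min(), x.max())  # both values lie within the interval [0, 5)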


Histogram

To visualize the dataset, we can plot a histogram of the collected data.

We will use the Python module Matplotlib to plot the histogram:

Example

Plot histogram:

import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()

Result: a histogram with 5 bars, one for each interval from 0 to 5.

Histogram Explanation

We use the array from the previous example to draw a histogram with 5 bars.

The first bar represents how many values in the array are between 0 and 1.

The second bar represents how many values are between 1 and 2.

And so on.

The results we get are:

52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5

Note: The array values are random numbers, so the results will not be exactly the same on your computer.
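
If you want the exact bin counts rather than reading them off the chart, numpy.histogram can compute them directly. A minimal sketch; the counts in the comment echo the example above and will differ on every run:

import numpy

x = numpy.random.uniform(0.0, 5.0, 250)

# Count how many values fall into each of the 5 intervals: 0-1, 1-2, 2-3, 3-4, 4-5
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))

print(counts)  # e.g. [52 48 49 51 50] - your numbers will vary
print(edges)   # [0. 1. 2. 3. 4. 5.]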

Big Data Distribution

An array containing 250 values is not considered very big, but now you know how to create a random set of values, and by changing the parameters, you can create a data set as big as you want.

Example

Create an array with 100,000 random numbers, and display them using a histogram with 100 bars:

import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()

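The same data set can also be produced with NumPy's newer Generator-based random API. This is an optional sketch, equivalent in behavior to the example above; numpy.random.default_rng is simply the currently recommended way to create random numbers in NumPy:

import numpy
import matplotlib.pyplot as plt

# Create a Generator instance (the newer NumPy random API)
rng = numpy.random.default_rng()

# 100,000 uniform values between 0 and 5, shown as a histogram with 100 bars
x = rng.uniform(0.0, 5.0, 100000)
plt.hist(x, 100)
plt.show()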