Machine Learning - Data Distribution
Data Distribution
In the earlier parts of this tutorial, we used only very small amounts of data in the examples, so that the concepts were easy to follow.
In the real world, datasets are much larger, but at least in the early stages of a project it is often difficult to collect real-world data.
How do we obtain large datasets?
To create a large dataset for testing, we use the Python module NumPy, which comes with many methods for creating random datasets of arbitrary size.
Example
Create an array containing 250 random floating-point numbers between 0 and 5:
import numpy

x = numpy.random.uniform(0.0, 5.0, 250)

print(x)
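As a quick sanity check (not part of the original example), you can confirm that every generated value really falls in the requested half-open interval, since numpy.random.uniform samples from [low, high):

```python
import numpy

# numpy.random.uniform(low, high, size) samples from the half-open
# interval [low, high): 0.0 can occur, 5.0 cannot.
x = numpy.random.uniform(0.0, 5.0, 250)

print(len(x))                          # 250
print(x.min() >= 0.0, x.max() < 5.0)   # True True
```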
Histogram
To visualize the dataset, we can plot a histogram of the collected data.
We will use the Python module Matplotlib to plot the histogram:
Example
Plot histogram:
import numpy
import matplotlib.pyplot as plt

x = numpy.random.uniform(0.0, 5.0, 250)

plt.hist(x, 5)
plt.show()
Result: a histogram with 5 bars (image not shown).
Histogram Explanation
The histogram above has 5 bars, drawn from the array in the previous example.
The first bar shows how many values in the array lie between 0 and 1.
The second bar shows how many values lie between 1 and 2.
And so on.
The results we get are:
52 values are between 0 and 1
48 values are between 1 and 2
49 values are between 2 and 3
51 values are between 3 and 4
50 values are between 4 and 5
Note: The array values are random numbers, so the results will not be exactly the same on your computer.
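The per-bin counts listed above can also be computed directly, without plotting, using numpy.histogram. This is a minimal sketch; your counts will differ from the list above because the data is random:

```python
import numpy

x = numpy.random.uniform(0.0, 5.0, 250)

# numpy.histogram returns the bin counts and the bin edges;
# range=(0.0, 5.0) fixes the bins at 0-1, 1-2, 2-3, 3-4, 4-5.
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{c} values are between {lo:g} and {hi:g}")
```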
Big Data Distribution
An array of 250 values is not very large, but you now know how to create a set of random values, and by changing the parameters you can create a dataset of any size you need.
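If you want the "random" dataset to come out the same on every run, which is handy when following along with a tutorial, you can seed the generator. The sketch below uses NumPy's Generator API (numpy.random.default_rng); the seed value 42 is arbitrary:

```python
import numpy

# A fixed seed makes the sequence of random numbers reproducible.
rng = numpy.random.default_rng(42)
x = rng.uniform(0.0, 5.0, 250)

# A second generator with the same seed produces identical data.
y = numpy.random.default_rng(42).uniform(0.0, 5.0, 250)

print(numpy.array_equal(x, y))  # True
```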
Example
Create an array of 100000 random numbers, and display them using a histogram with 100 bars:
import numpy
import matplotlib.pyplot as plt

x = numpy.random.uniform(0.0, 5.0, 100000)

plt.hist(x, 100)
plt.show()