Data Sampler

class pyfume.Sampler.Sampler(train_x, train_y, number_of_bins=5, histogram=False)

Bases: object

Creates a new Sampler object that can oversample an unbalanced data set to make it more balanced.

Parameters

train_x – The input data.
train_y – The output data (true labels/gold standard) on the basis of which the data will be sampled.
number_of_bins – Number of clusters that should be identified in the data (default = 5).
histogram – True/False flag that determines whether histograms of the output data frequencies are plotted for both the original and the new (= sampled) data (default = False). This functionality requires the package ‘matplotlib.pyplot’.

oversample()

Creates a more balanced data set by oversampling underrepresented data instances (based on the values of the output variable).

Returns

Tuple containing (new_train_x, new_train_y)
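
The idea behind oversampling on the output variable can be illustrated with a small standalone sketch. It is not pyFUME's implementation: the function name oversample_by_bins, the seed argument, the equal-width binning, and the rule of repeating rows until every non-empty bin matches the largest one are all assumptions made for illustration. Only the argument names train_x, train_y, number_of_bins and the returned tuple follow the documentation above.

```python
import random
from collections import defaultdict


def oversample_by_bins(train_x, train_y, number_of_bins=5, seed=0):
    """Hypothetical sketch: balance a data set by repeating rows from sparse output bins."""
    rng = random.Random(seed)
    low, high = min(train_y), max(train_y)
    # Fall back to a width of 1.0 when all output values are identical.
    width = (high - low) / number_of_bins or 1.0

    # Group row indices by the equal-width bin their output value falls into.
    bins = defaultdict(list)
    for index, value in enumerate(train_y):
        bin_id = min(int((value - low) / width), number_of_bins - 1)
        bins[bin_id].append(index)

    # Repeat randomly chosen rows until every non-empty bin matches the largest one.
    target = max(len(members) for members in bins.values())
    chosen = []
    for members in bins.values():
        chosen.extend(members)
        chosen.extend(rng.choice(members) for _ in range(target - len(members)))

    new_train_x = [train_x[i] for i in chosen]
    new_train_y = [train_y[i] for i in chosen]
    return new_train_x, new_train_y


# Five rows with small outputs and a single row with a large output.
train_x = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
train_y = [1.0, 1.1, 1.2, 1.3, 1.4, 10.0]
new_train_x, new_train_y = oversample_by_bins(train_x, train_y, number_of_bins=3)
```

In this toy data set the single row with output 10.0 is repeated until its bin holds as many rows as the bin with the five small outputs, so the result has ten rows. With pyFUME itself, the call documented above would be Sampler(train_x, train_y).oversample(), which returns the same kind of tuple; how it selects and repeats rows internally may differ from this sketch.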