Splitting a dataset
by: Kevin Broløs
(Feyn version 3.1.0 or newer)
The split function found in feyn.tools is suitable for splitting your data into random subsets prior to training a Model. This practice can aid you in cross-validation of the results, by allowing you to evaluate the Model on a dataset it has not been trained on.
We support splitting into as many subsets as the dataset supports, as well as stratification according to one or more columns.
The data is always shuffled prior to splitting, regardless of whether you stratify or not.
Example
from feyn.tools import split
from pandas import DataFrame

data = DataFrame(
    {
        "A": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
        "B": [1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    }
)

# Stratify by column B so both subsets keep its 50/50 balance of 0s and 1s
train, test = split(data, ratio=[0.8, 0.2], stratify=['B'], random_state=42)
In the example above, our dataset has 10 samples, and column B contains 5 samples with the value 1 and 5 samples with the value 0.
After splitting, train will contain 8 samples, with B evenly distributed: 4 samples with the value 0 and 4 with the value 1. test will contain 2 samples, also evenly distributed: one where B is 0 and one where B is 1.
We provide a random state to ensure we get the exact same split every time we run the function.
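To illustrate why a fixed random state makes the split repeatable, here is a minimal sketch using Python's standard random module. It is not how feyn.tools.split is implemented internally; the helper name shuffled_indices is hypothetical and only shows the general idea of a seeded shuffle.

```python
import random


def shuffled_indices(n_samples, random_state=None):
    """Return the row indices 0..n_samples-1 in a shuffled order.

    Hypothetical helper: a seeded generator always yields the same order.
    """
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, random_state=42)
second = shuffled_indices(10, random_state=42)
print(first == second)  # True
```

With random_state=None the generator is seeded from system entropy instead, so the order will generally differ between runs.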
Usage
You can choose how many subsets you want, as well as their relative sizes, using the ratio parameter. The ratio list is normalised before splitting, so [1., 1.] results in a 50/50 split, [1., 1., 1.] in an equal 3-way split, etc.
For readability, you'll often want to choose sensible splits that sum to 1 (or 100), such as [0.75, 0.25] or [75, 25].
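The normalisation described above can be sketched in a few lines of plain Python. The function names normalise_ratio and subset_sizes are hypothetical, and the rounding rule (round down, then hand out leftover samples starting from the first subset) is an assumption for illustration, not necessarily what feyn.tools.split does.

```python
def normalise_ratio(ratio):
    """Scale a list of positive weights so that they sum to 1."""
    total = sum(ratio)
    if total <= 0:
        raise ValueError("ratio must contain positive values")
    return [r / total for r in ratio]


def subset_sizes(n_samples, ratio):
    """Turn a ratio list into integer subset sizes that sum to n_samples."""
    fractions = normalise_ratio(ratio)
    sizes = [int(n_samples * f) for f in fractions]
    # Assumed rounding rule: give samples lost to rounding down
    # back one at a time, starting with the first subset.
    for i in range(n_samples - sum(sizes)):
        sizes[i % len(sizes)] += 1
    return sizes


print(normalise_ratio([1., 1.]))     # [0.5, 0.5]
print(normalise_ratio([75, 25]))     # [0.75, 0.25]
print(subset_sizes(10, [0.8, 0.2]))  # [8, 2]
```

Because only the relative weights matter, [0.75, 0.25], [75, 25] and [3, 1] all describe the same split.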
Stratification
Using the stratify parameter, you can stratify the splits according to one or more columns. This helps ensure that the target class in a classification problem is equally represented in your train and test sets, and it can also help balance columns you know are not evenly represented between sets.
As a result, each subset contains, as closely as possible, the same relative proportions of the distinct values in those columns.
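The idea behind stratification can be sketched in plain Python: group the rows by their value, shuffle each group, and cut every group by the same ratio. This is a simplified illustration for a single column, not the implementation used by feyn.tools.split; the function name stratified_split is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratio, random_state=None):
    """Return one list of row indices per subset, stratified by `labels`."""
    rng = random.Random(random_state)
    total = sum(ratio)
    fractions = [r / total for r in ratio]

    # Group row indices by their label value (one group per stratum).
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)

    subsets = [[] for _ in fractions]
    for indices in groups.values():
        rng.shuffle(indices)
        start = 0
        cumulative = 0.0
        for i, fraction in enumerate(fractions):
            cumulative += fraction
            if i == len(fractions) - 1:
                # The last subset takes whatever is left, so no row is dropped.
                end = len(indices)
            else:
                end = round(cumulative * len(indices))
            subsets[i].extend(indices[start:end])
            start = end
    return subsets


B = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
train_idx, test_idx = stratified_split(B, ratio=[0.8, 0.2], random_state=42)
print(len(train_idx), len(test_idx))  # 8 2
```

Each value of B contributes 4 rows to the first subset and 1 row to the second, so both subsets keep the original 50/50 balance.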
The function raises an error if you try to stratify in a way that leaves too few samples to populate each subset requested by the ratio parameter.
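As a rough illustration of this kind of check, the sketch below rejects a stratification when any distinct value has fewer rows than there are subsets. Both the exact condition and the exception type are assumptions made for the example; the real function may test something different and raise a different error. The name check_strata is hypothetical.

```python
from collections import Counter


def check_strata(labels, n_subsets):
    """Raise if any stratum is too small to reach every subset.

    Assumed condition: each distinct value needs at least n_subsets rows.
    """
    counts = Counter(labels)
    too_small = {value: n for value, n in counts.items() if n < n_subsets}
    if too_small:
        raise ValueError(
            f"Not enough samples to stratify into {n_subsets} subsets: {too_small}"
        )


check_strata([0, 0, 1, 1], n_subsets=2)  # passes silently

try:
    check_strata([0, 0, 0, 1], n_subsets=2)
except ValueError as error:
    print(error)
```

In practice, this usually means either reducing the number of subsets or stratifying by a column with fewer rare values.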
The stratification also takes care to distribute NaN values proportionally among the splits. This is particularly important for categorical NaN values when training with the QLattice, as they get assigned a weight and bucket just like any other category.
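Treating NaN as its own category needs a little care in Python, because NaN never compares equal to itself. The sketch below shows one way to map every NaN to a single shared key so that missing values form one stratum. The names NAN_KEY and stratum_key are hypothetical, and this is only an illustration of the idea, not the library's own code.

```python
import math

# Hypothetical sentinel: one shared object standing in for every NaN.
NAN_KEY = object()


def stratum_key(value):
    """Map any float NaN to NAN_KEY; return every other value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return NAN_KEY
    return value


labels = ["red", float("nan"), "blue", float("nan")]
keys = [stratum_key(v) for v in labels]

print(float("nan") == float("nan"))  # False: NaN values never compare equal
print(keys[1] is keys[3])            # True: both NaNs now share one key
```

Once the NaN rows share a key, they can be distributed across the subsets in the same way as any other category.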
Note: Since it works using distinct values, this parameter works best for ordinal and categorical values, but makes little sense for continuous variables.
feyn.tools.split
Parameters
data
The data to split.
ratio
Default: [0.75, 0.25]
The size ratio of the resulting subsets.
stratify
Default: None.
The names of columns to stratify by.
random_state
Default: None.
The random state of the split (integer).