Splitting a dataset · Feyn Documentation

by: Kevin Broløs
(Feyn version 3.1.0 or newer)

The split function found in feyn.tools is suitable for splitting your data into random subsets prior to training a Model. This practice can aid you in cross-validation of the results, by allowing you to evaluate the Model on a dataset it has not been trained on.

We support splitting into as many subsets as the dataset supports, as well as stratification according to one or more columns.

The data is always shuffled prior to splitting, regardless of whether you stratify or not.

Example

from feyn.tools import split
from pandas import DataFrame


data = DataFrame(
    {
        "A": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
        "B": [1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    }
)

train, test = split(data, ratio=[0.8, 0.2], stratify=['B'], random_state=42)

In the example above, our dataset has 10 samples, and B contains 5 samples with the value 1 and 5 samples with the value 0.

After splitting, train will contain 8 samples: 4 where B is 0 and 4 where B is 1. test will contain 2 samples, one with each value of B.

We provide a random state to ensure we get the exact same split every time we run the function.
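To illustrate why a fixed random state makes the split repeatable, here is a minimal sketch using only Python's standard library. It is not Feyn's implementation; the helper name shuffled_indices is hypothetical and only stands in for the shuffling step that precedes a split.

```python
import random


def shuffled_indices(n_rows, random_state=None):
    """Return the row indices 0..n_rows-1 in shuffled order (illustrative helper)."""
    rng = random.Random(random_state)  # private generator seeded with random_state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, random_state=42)
second = shuffled_indices(10, random_state=42)

# The same seed gives the same order, so any split cut from it is repeatable.
print(first == second)  # True
```

Leaving the seed as None in this sketch gives a different order on each run, which is why a fixed value is useful when you want to compare results across runs.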

Usage

You can choose as many subsets as you like, as well as their relative sizes, using the ratio parameter. The ratio list is normalised before splitting, so [1., 1.] results in a 50/50 split, [1., 1., 1.] in an equal 3-way split, etc.

For readability, you'll often want to choose sensible splits that sum to 1 (or 100), such as [0.75, 0.25] or [75, 25].
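The normalisation described above can be pictured with a small standard-library sketch. The function normalise_ratio is a hypothetical helper written for this illustration, not part of feyn.tools, and the input check it performs is an assumption of the sketch.

```python
def normalise_ratio(ratio):
    """Scale a list of non-negative weights so that they sum to 1 (illustrative helper)."""
    total = sum(ratio)
    if total <= 0 or any(r < 0 for r in ratio):
        raise ValueError("ratio must contain non-negative weights with a positive sum")
    return [r / total for r in ratio]


print(normalise_ratio([1.0, 1.0]))  # [0.5, 0.5]
print(normalise_ratio([75, 25]))    # [0.75, 0.25]
```

Only the relative sizes matter, so [0.75, 0.25], [75, 25] and [3, 1] all describe the same split.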

Stratification

Using the stratify parameter, you can stratify the splits according to one or more columns. This helps ensure that the target class in a classification problem is equally represented in your train and test sets, and it can help balance columns that you know are not evenly represented between sets.

As a result, each subset has, as closely as possible, the same relative proportions of the distinct values in those columns.

The function raises an error if you try to stratify in a way that leaves too few samples to go into each subset requested by the ratio parameter.
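The idea behind stratification can be sketched in plain Python: group the rows by the distinct values of a column, shuffle each group, and cut each group according to the normalised ratio. This is an illustration under stated assumptions, not Feyn's implementation. The function stratified_split, its rounding rule, and the ValueError it raises when a group cannot fill every subset are all choices made for this sketch.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, ratio, random_state=None):
    """Split a list of dict rows into len(ratio) lists, stratified by rows[i][key]."""
    rng = random.Random(random_state)
    total = sum(ratio)
    fractions = [r / total for r in ratio]

    # Group the rows by the distinct value found in the stratification column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    subsets = [[] for _ in ratio]
    for value, members in groups.items():
        rng.shuffle(members)
        start, cumulative = 0, 0.0
        for i, fraction in enumerate(fractions):
            cumulative += fraction
            is_last = i == len(fractions) - 1
            # Cumulative fractions become cut points inside this group.
            end = len(members) if is_last else round(cumulative * len(members))
            if end <= start:
                # Assumption of this sketch: every group must reach every subset.
                raise ValueError(f"not enough samples with {key}={value!r}")
            subsets[i].extend(members[start:end])
            start = end

    # Shuffle once more so rows are not ordered by group inside a subset.
    for subset in subsets:
        rng.shuffle(subset)
    return subsets


A = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
B = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
rows = [{"A": a, "B": b} for a, b in zip(A, B)]

train, test = stratified_split(rows, key="B", ratio=[0.8, 0.2], random_state=42)
print(len(train), len(test))                                  # 8 2
print(sum(r["B"] for r in train), sum(r["B"] for r in test))  # 4 1
```

With 5 rows per value of B and an 80/20 ratio, each group contributes 4 rows to the first subset and 1 to the second, whatever the seed. A group with a single row cannot reach both subsets, so the sketch raises instead of producing a subset that is missing that value.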

The stratification takes care to also proportionally distribute NaN values among splits. This is particularly important for categorical NaN values when training with the QLattice, as they get assigned a weight and bucket just like any other category.
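One way to picture this is to treat a missing value as a category of its own when counting the distinct values in a column. The sketch below uses only the standard library. The sentinel MISSING and the helper stratum_key are hypothetical names for this illustration and say nothing about how Feyn handles NaN values internally.

```python
import math
from collections import Counter

MISSING = "<missing>"  # hypothetical sentinel standing in for NaN / None


def stratum_key(value):
    """Map NaN and None to one shared key; leave every other value unchanged."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return MISSING
    return value


column = ["red", float("nan"), "blue", float("nan"), "red", float("nan")]

# NaN never compares equal to itself, so separate NaN objects would otherwise
# be counted as separate keys. Mapping them to one key makes them one category.
counts = Counter(stratum_key(v) for v in column)
print(counts[MISSING], counts["red"], counts["blue"])  # 3 2 1
```

Counting the same key in each subset after a split is a quick way to check that missing values ended up spread across the subsets rather than concentrated in one.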

Note: Since it works using distinct values, this parameter works best for ordinal and categorical values, but makes little sense for continuous variables.

Parameters of `feyn.tools.split`

data

The data to split.

ratio

Default: [0.75, 0.25]

The size ratio of the resulting subsets.

stratify

Default: None.

The names of columns to stratify by.

random_state