# Splitting a dataset

by: Kevin Broløs

(Feyn version 3.1.0 or newer)

The `split` function found in `feyn.tools` is suitable for splitting your data into random subsets prior to training a `Model`. This practice can aid you in cross-validation of the results, by allowing you to evaluate the `Model` on a dataset it has not been trained on.

We support splitting into as many subsets as the dataset supports, as well as stratification according to one or more columns.

The data is always shuffled prior to splitting, regardless of whether you stratify or not.

## Example

```python
from feyn.tools import split
from pandas import DataFrame

data = DataFrame(
    {
        "A": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
        "B": [1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    }
)

# Split 80/20, stratified on column A, with a fixed seed for reproducibility
train, test = split(data, ratio=[0.8, 0.2], stratify=['A'], random_state=42)
```

In the example above, our dataset has 10 samples, and the stratification column **A** contains 5 samples with the value 1 and 5 samples with the value 0.

After splitting, **train** will contain 8 samples, evenly distributed between **A** having the values *0* and *1*, with 4 of each. **test** will contain 2 samples, also evenly distributed, with one sample for each value of **A**.

We provide a random state to ensure we get the exact same split every time we run the function.

## Usage

You can choose however many sets you would like, as well as their relative sizes, using the `ratio` parameter. The ratio list is normalised before splitting, so `[1., 1.]` results in a 50/50 split, `[1., 1., 1.]` in an equal 3-way split, etc.

For readability, you'll often want to choose sensible splits that sum to 1 (or 100), such as [0.75, 0.25] or [75, 25].
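The normalisation described above can be illustrated with a few lines of plain Python. This is a minimal sketch of the arithmetic only, not feyn's implementation; the helper name `normalise_ratio` is hypothetical.

```python
def normalise_ratio(ratio):
    """Scale a list of relative sizes so that they sum to 1.

    Hypothetical helper for illustration; not part of feyn.
    """
    total = sum(ratio)
    if total <= 0:
        raise ValueError("ratio must contain positive sizes")
    return [r / total for r in ratio]


print(normalise_ratio([1.0, 1.0]))  # [0.5, 0.5]
print(normalise_ratio([75, 25]))    # [0.75, 0.25]
```

Both `[0.75, 0.25]` and `[75, 25]` describe the same split after normalisation, which is why either form can be used.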

### Stratification

Using the `stratify` parameter, you can stratify the splits according to one or more of the columns provided. This helps ensure that your target class in a classification problem is equally represented in your **train** and **test** sets, and helps balance columns you know are not evenly represented between sets.

As a result, each subset contains, as closely as possible, the same relative proportions of the distinct values in those columns.
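To make the idea concrete, here is a rough sketch of stratified splitting in plain Python. It is an illustration under simplifying assumptions, not how `feyn.tools.split` is implemented: the function `stratified_split` is hypothetical, it works on a list of dicts rather than a DataFrame, it stratifies on a single column, and it does not treat *NaN* values specially.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, ratio, random_state=None):
    """Illustrative only: split a list of dicts into len(ratio) subsets,
    keeping the distribution of rows[key] as even as possible."""
    rng = random.Random(random_state)  # seeded generator for reproducibility
    total = sum(ratio)
    shares = [r / total for r in ratio]  # normalise the ratio

    # Group the rows by the distinct value in the stratification column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    subsets = [[] for _ in shares]
    for members in groups.values():
        rng.shuffle(members)  # shuffle within each group before cutting
        start, cumulative = 0, 0.0
        for i, share in enumerate(shares):
            cumulative += share
            # Cut at the cumulative share; the last subset takes the remainder.
            if i == len(shares) - 1:
                end = len(members)
            else:
                end = round(cumulative * len(members))
            subsets[i].extend(members[start:end])
            start = end
    return subsets


a = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
b = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
rows = [{"A": x, "B": y} for x, y in zip(a, b)]

train, test = stratified_split(rows, key="A", ratio=[0.8, 0.2], random_state=42)
print(len(train), len(test))       # 8 2
print(sum(r["A"] for r in train))  # 4 (rows with A == 1 in train)
print(sum(r["A"] for r in test))   # 1 (rows with A == 1 in test)
```

Splitting each group of equal values separately is what keeps the proportions aligned across subsets; calling the function again with the same `random_state` reproduces the same subsets.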

The function returns an error if you try to stratify in a way that leaves too few samples to go into each chosen subset of the `ratio` parameter.
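If you want to spot such cases before splitting, one option is to count how often each distinct value occurs. The sketch below is an assumption about what "not enough samples" might look like (a value occurring fewer times than there are subsets); the exact rule feyn applies may differ, and the helper `groups_too_small` is hypothetical.

```python
from collections import Counter


def groups_too_small(values, n_subsets):
    """Return the distinct values that occur fewer than n_subsets times.

    Hypothetical pre-check; not part of feyn, and not necessarily the
    same condition that feyn.tools.split tests for.
    """
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n < n_subsets}


labels = ["a", "a", "a", "b", "b", "c"]
print(groups_too_small(labels, 2))  # {'c': 1}
print(groups_too_small(labels, 3))  # {'b': 2, 'c': 1}
```

Values flagged this way are candidates for merging into a broader category or for leaving out of the stratification columns.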

The stratification takes care to also proportionally distribute *NaN* values among splits. This is particularly important for categorical *NaN* values when training with the `QLattice`, as they get assigned a weight and bucket just like any other category.

Note: Since it works using distinct values, this parameter works best for ordinal and categorical values, but makes little sense for continuous variables.

## Parameters of `feyn.tools.split`

### data

The data to split.

### ratio

Default: [0.75, 0.25]

The size ratio of the resulting subsets.

### stratify

Default: None.

The names of columns to stratify by.

### random_state

Default: None.

The random state of the split (an integer).