Semantic types
by: Kevin Broløs
(Feyn version 3.4.0 or newer)
There are three types of input data that Models can interpret:
- numerical, which includes:
  - floating point numbers
  - integers
- categorical, which includes:
  - strings
- boolean, represented by either:
  - a discrete number 0 or 1
  - True or False
The Model handles transformation of inputs, and it uses the stype declarations to decide how. Numerical values learn a linear rescaling, and categorical values get assigned individual weights.
Boolean inputs can be assigned to either numerical or categorical, and the only difference is whether string representations of booleans (like yes and no) can be handled.
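To illustrate the difference, here is a minimal sketch (made-up column names, only pandas assumed): 0/1 booleans work with the numerical stype, while string representations require the categorical one.

```python
import pandas as pd

df = pd.DataFrame({
    "bool_numeric": [1, 0, 0, 1],                # 0/1 works with stype 'f'
    "bool_strings": ["yes", "no", "no", "yes"],  # strings require stype 'c'
})

# If you prefer the numerical stype, map the string representation to 0/1 first:
df["bool_numeric_from_strings"] = df["bool_strings"].map({"yes": 1, "no": 0})

stypes = {
    "bool_numeric": "f",
    "bool_strings": "c",
    "bool_numeric_from_strings": "f",
}
```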
Generally speaking, numerical inputs are a bit more efficient and the resulting equations are arguably simpler, so we recommend numerical over categorical where both apply.
The categorical stype helps maintain a simple and interpretable model by avoiding the dimensional expansion you would see in one-hot or dummy encoding. You can read more about how we treat Categorical features.
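The dimensional expansion is easy to see with pandas' own dummy encoding (a sketch with a made-up column; this is not how Feyn encodes categoricals internally):

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pear", "banana", "orange"]})

# One-hot/dummy encoding expands one column into one column per category:
one_hot = pd.get_dummies(df["fruit"])
print(one_hot.shape[1])  # 4 columns for 4 categories

# The categorical stype instead keeps it as a single input:
stypes = {"fruit": "c"}
```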
Stypes in auto_run
When you use auto_run, the stypes
will be automatically inferred from your data unless you specify them manually. It will also produce warnings if some columns appear to be unsuitable for training, or have issues like high cardinality (many unique values compared to the size of the dataset).
We try to be clever and efficient, but all datasets are different, so you can supply your own stypes to bypass this extra step entirely.
You can also call feyn.tools.infer_stypes yourself before running auto_run, use the result as a starting point, and change only the types you are not satisfied with.
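That override workflow is just a dictionary update. In this sketch the inferred dictionary is hard-coded for illustration; in practice it would come from feyn.tools.infer_stypes:

```python
# Hypothetical result, standing in for feyn.tools.infer_stypes(data, 'output_name'):
stypes = {"age": "f", "zip_code": "f", "colour": "c"}

# Override only the entries you disagree with before passing stypes to auto_run:
stypes["zip_code"] = "c"  # treat zip codes as categories rather than numbers
```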
Example
This simplifies the preprocessing task that would otherwise fall on the data scientist: you should not standardise inputs, nor should you one-hot encode categoricals. Instead, you assign the relevant stypes as shown below.
import feyn
import numpy as np
from pandas import DataFrame

data = DataFrame(
    {
        'numerical_input': np.random.rand(4),
        'categorical_input': np.array(['apple', 'pear', 'banana', 'orange']),
        'boolean_input': np.array([True, False, False, True]),
        'output_name': np.random.rand(4)
    }
)

stypes = {
    'numerical_input': 'f',
    'categorical_input': 'c',
    'boolean_input': 'f'
}

ql = feyn.QLattice()
models = ql.auto_run(
    data=data,
    output_name='output_name',
    stypes=stypes,
    n_epochs=1
)
If no stype is provided for an input, it is assumed to be numerical.
Infer stypes from the dataset
A quick way to define an stypes dictionary based on your data is to use our function feyn.tools.infer_stypes on your pandas dataframe, supplying your output column as well:
from feyn.tools import infer_stypes
stypes = infer_stypes(data, 'output_name')
If a column contains non-numerical values such as apple or pear (like the categorical_input above), it will be assigned the categorical stype.
We also have additional smart detection for numericals that might actually be ordinal/nominal, as well as behaviour to skip columns that are not suitable for training. Below is a short summary of some cases we look out for:
- binary data: stype='f'
- continuous data: stype='f'
- numerical data:
  - Ordinal/Nominal: stype='c' (if the number of distinct values is below a threshold related to the dataset size)
  - others: stype='f'
- strings/objects/category: stype='c'
- ID: skip (if the dataset is larger than 10)
  - also produces an info message
- ISO date: skip
  - also produces an info message
- constant values: skip (if the dataset is larger than 10)
  - also produces an info message
- mixed types: skip
  - also produces an info message
- any category with high cardinality (more than 50 distinct values, or a distinct ratio higher than 50% of the data):
  - produces a warning message and keeps the category type
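A few of these rules can be sketched as a toy heuristic in plain pandas. This is a simplified, hypothetical re-implementation for illustration only; the real feyn.tools.infer_stypes is more thorough.

```python
import pandas as pd

def naive_infer_stypes(df, output_name, cardinality_limit=50):
    """Toy approximation of a few of the rules above; not the real Feyn logic."""
    stypes = {}
    for col in df.columns:
        if col == output_name:
            continue
        series = df[col].dropna()
        n_unique = series.nunique()
        if n_unique <= 1:
            continue  # constant column: skip
        if pd.api.types.is_numeric_dtype(series):
            stypes[col] = "f"  # binary and continuous numericals
        else:
            stypes[col] = "c"  # strings/objects/category
            if n_unique > cardinality_limit or n_unique > 0.5 * len(series):
                print(f"Warning: '{col}' has high cardinality ({n_unique} distinct values)")
    return stypes

df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "fruit": ["apple", "pear", "banana", "apple"],
    "const": [1, 1, 1, 1],
    "y": [0.1, 0.4, 0.2, 0.9],
})
print(naive_infer_stypes(df, "y"))  # {'age': 'f', 'fruit': 'c'}
```

Note that the constant column is skipped entirely, and that the small dataset makes 'fruit' trip the distinct-ratio warning while still keeping its category type.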
Data validation
Note on missing values: we ignore missing values during most type inference and instead rely on downstream validation to detect data issues like this. That means that even when an stype is correctly assigned, the column may still have common data issues.
The validator we use in auto_run is feyn.validate_data, and you can use it separately if you want to screen your data for common data issues before training.
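The kinds of issues such a screen catches can be illustrated with a few plain pandas checks. This is an illustrative stand-in, not the actual feyn.validate_data logic:

```python
import pandas as pd

def naive_validate(df):
    """Report a couple of common data issues; a toy stand-in for real validation."""
    issues = []
    for col in df.columns:
        if df[col].isna().any():
            issues.append(f"'{col}' contains missing values")
        if df[col].dropna().nunique() <= 1:
            issues.append(f"'{col}' is constant")
    return issues

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [5, 5, 5]})
print(naive_validate(df))
# ["'a' contains missing values", "'b' is constant"]
```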
We recommend studying your data carefully rather than just trusting the inferred types, but this should give you a good starting point.