Semantic types
by: Kevin Broløs
(Feyn version 3.4.0 or newer)
There are three types of input data that Models can interpret:
- numerical, which includes:
  - floating point numbers
  - integers
- categorical, which includes:
  - strings
- boolean, represented by either:
  - a discrete number 0 or 1
  - True or False
The Model handles transformation of inputs, and it uses the stype declarations to decide how. Numerical values learn a linear rescaling, and categorical values get assigned individual weights.
Boolean inputs can be assigned to either numerical or categorical, and the only difference is whether string representations of booleans (like yes and no) can be handled.
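To illustrate the difference, here is a minimal sketch (made-up column names, only pandas assumed): 0/1 booleans work with the numerical stype, while string representations require the categorical one.

```python
import pandas as pd

df = pd.DataFrame({
    "bool_numeric": [1, 0, 0, 1],                # 0/1 works with stype 'f'
    "bool_strings": ["yes", "no", "no", "yes"],  # strings require stype 'c'
})

# If you prefer the numerical stype, map the string representation to 0/1 first:
df["bool_numeric_from_strings"] = df["bool_strings"].map({"yes": 1, "no": 0})

stypes = {
    "bool_numeric": "f",
    "bool_strings": "c",
    "bool_numeric_from_strings": "f",
}
```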
Generally speaking, numerical inputs are a bit more efficient and the resulting equations are arguably simpler, so we recommend numerical over categorical where both apply.
The categorical stype helps maintain a simple and interpretable model by avoiding the dimensional expansion you would see in one-hot or dummy encoding. You can read more about how we treat Categorical features.
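The dimensional expansion is easy to see with pandas' own dummy encoding (a sketch with a made-up column; this is not how Feyn encodes categoricals internally):

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pear", "banana", "orange"]})

# One-hot/dummy encoding expands one column into one column per category:
one_hot = pd.get_dummies(df["fruit"])
print(one_hot.shape[1])  # 4 columns for 4 categories

# The categorical stype instead keeps it as a single input:
stypes = {"fruit": "c"}
```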
Stypes in auto_run
When you use auto_run, the stypes
will be automatically inferred from your data unless you specify them manually. It will also produce warnings if some columns appear to be unsuitable for training, or have issues like high cardinality (many unique values compared to the size of the dataset).
We try to be clever and efficient, but all datasets are different, so you can supply your own stypes to bypass this extra step entirely.
You can also call feyn.tools.infer_stypes yourself before running auto_run, use the result as a starting point, and change only the types you are not satisfied with.
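That override workflow is just a dictionary update. In this sketch the inferred dictionary is hard-coded for illustration; in practice it would come from feyn.tools.infer_stypes:

```python
# Hypothetical result, standing in for feyn.tools.infer_stypes(data, 'output_name'):
stypes = {"age": "f", "zip_code": "f", "colour": "c"}

# Override only the entries you disagree with before passing stypes to auto_run:
stypes["zip_code"] = "c"  # treat zip codes as categories rather than numbers
```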
Example
This simplifies the preprocessing task that would otherwise fall on the data scientist: you should not standardise inputs, nor should you one-hot encode categoricals. Instead, you assign the relevant stypes as shown below.
import feyn
import numpy as np
from pandas import DataFrame

data = DataFrame(
    {
        'numerical_input': np.random.rand(4),
        'categorical_input': np.array(['apple', 'pear', 'banana', 'orange']),
        'boolean_input': np.array([True, False, False, True]),
        'output_name': np.random.rand(4)
    }
)

stypes = {
    'numerical_input': 'f',
    'categorical_input': 'c',
    'boolean_input': 'f'
}

ql = feyn.QLattice()
models = ql.auto_run(
    data=data,
    output_name='output_name',
    stypes=stypes,
    n_epochs=1
)
If no stype is provided for an input, it is assumed to be numerical.
Infer stypes from the dataset
A quick way to define an stypes dictionary based on your data is to use our function feyn.tools.infer_stypes on your pandas dataframe, supplying your output column as well:
from feyn.tools import infer_stypes
stypes = infer_stypes(data, 'output_name')
If a column contains non-numerical values such as apple or pear (like the categorical_input above), it will be assigned the categorical stype.
We also have additional smart detection for numericals that might actually be ordinal/nominal, as well as behaviour to skip columns that are not suitable for training. Below is a short summary of some cases we look out for:
- binary data: stype='f'
- continuous data: stype='f'
- numerical data:
  - Ordinal/Nominal: stype='c' (if the number of distinct values is below a threshold related to the dataset size)
  - others: stype='f'
- strings/objects/category: stype='c'
- ID: skip (if the dataset is larger than 10)
  - also produces an info message
- ISO date: skip
  - also produces an info message
- constant values: skip (if the dataset is larger than 10)
  - also produces an info message
- mixed types: skip
  - also produces an info message
- any category with high cardinality (more than 50 distinct values, or a distinct ratio higher than 50% of the data):
  - produces a warning message and keeps the category type
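A few of these rules can be sketched as a toy heuristic in plain pandas. This is a simplified, hypothetical re-implementation for illustration only; the real feyn.tools.infer_stypes is more thorough.

```python
import pandas as pd

def naive_infer_stypes(df, output_name, cardinality_limit=50):
    """Toy approximation of a few of the rules above; not the real Feyn logic."""
    stypes = {}
    for col in df.columns:
        if col == output_name:
            continue
        series = df[col].dropna()
        n_unique = series.nunique()
        if n_unique <= 1:
            continue  # constant column: skip
        if pd.api.types.is_numeric_dtype(series):
            stypes[col] = "f"  # binary and continuous numericals
        else:
            stypes[col] = "c"  # strings/objects/category
            if n_unique > cardinality_limit or n_unique > 0.5 * len(series):
                print(f"Warning: '{col}' has high cardinality ({n_unique} distinct values)")
    return stypes

df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "fruit": ["apple", "pear", "banana", "apple"],
    "const": [1, 1, 1, 1],
    "y": [0.1, 0.4, 0.2, 0.9],
})
print(naive_infer_stypes(df, "y"))  # {'age': 'f', 'fruit': 'c'}
```

Note that the constant column is skipped entirely, and that the small dataset makes 'fruit' trip the distinct-ratio warning while still keeping its category type.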
Data validation
Note on missing values: we ignore missing values during most type inference and instead rely on downstream validation to detect data issues like this. That means that even when an stype is correctly assigned, the column may still have common data issues.
The validator we use in auto_run is feyn.validate_data, and you can use it separately if you want to screen your data for common data issues before training.
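The kinds of issues such a screen catches can be illustrated with a few plain pandas checks. This is an illustrative stand-in, not the actual feyn.validate_data logic:

```python
import pandas as pd

def naive_validate(df):
    """Report a couple of common data issues; a toy stand-in for real validation."""
    issues = []
    for col in df.columns:
        if df[col].isna().any():
            issues.append(f"'{col}' contains missing values")
        if df[col].dropna().nunique() <= 1:
            issues.append(f"'{col}' is constant")
    return issues

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [5, 5, 5]})
print(naive_validate(df))
# ["'a' contains missing values", "'b' is constant"]
```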
We recommend studying your data carefully rather than just trusting the inferred types, but this should give you a good starting point.