Semantic types
by: Kevin Broløs & Chris Cave
(Feyn version 3.0 or newer)
There are three types of input data that Models can interpret:
- numerical, which includes:
  - floating point numbers
  - integers
- categorical, which includes:
  - strings
- boolean, represented by:
  - a discrete number, 0 or 1
  - True or False
The Model handles the transformation of inputs, and it uses the stype declarations to decide how. For a numerical input, the Model learns a linear rescaling; for a categorical input, each category is assigned its own weight. We recommend assigning boolean inputs the numerical stype, unless you have reasons to treat them as categories.
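The two treatments described above can be pictured with a small sketch. This is a conceptual illustration only, not Feyn's internal implementation: the function names, the scale, bias and weight values, and the default weight for an unseen category are all assumptions made up for the example.

```python
# Conceptual sketch only; this is not how Feyn is implemented internally.
# The scale, bias and category weights stand in for parameters that a
# Model would learn during training. All values here are made up.

def transform_numerical(x, scale, bias):
    """Linear rescaling of a numerical input: x -> scale * x + bias."""
    return scale * x + bias


def transform_categorical(category, weights, default=0.0):
    """Look up the individual weight assigned to a category.

    The default for unseen categories is an assumption of this sketch.
    """
    return weights.get(category, default)


# Hypothetical weights, one per category.
fruit_weights = {"apple": 0.3, "pear": -0.1, "banana": 0.8, "orange": 0.0}

rescaled = [transform_numerical(x, scale=2.0, bias=-1.0) for x in [0.0, 0.5, 1.0]]
encoded = [transform_categorical(c, fruit_weights) for c in ["apple", "banana", "pear"]]

print(rescaled)  # [-1.0, 0.0, 1.0]
print(encoded)   # [0.3, 0.8, -0.1]
```

In this picture a categorical input remains a single input with one number per category, while a numerical input is described by just two parameters.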
The categorical stype helps maintain a simple and interpretable model by avoiding dimensional expansion: a categorical input stays a single input instead of being expanded into one column per category.
Because the Model takes care of these transformations, part of the preprocessing that would otherwise fall on the data scientist disappears. You should not standardise inputs, nor should you one-hot encode categoricals.
Instead, you assign the relevant stypes as shown in the example below.
import feyn
import numpy as np
from pandas import DataFrame

# A small toy dataset with one input of each type and a numerical output
data = DataFrame(
    {
        'numerical_input': np.random.rand(4),
        'categorical_input': np.array(['apple', 'pear', 'banana', 'orange']),
        'boolean_input': np.array([True, False, False, True]),
        'output_name': np.random.rand(4)
    }
)

# 'f' marks a numerical input, 'c' a categorical one.
# The boolean input is treated as numerical, as recommended above.
stypes = {
    'numerical_input': 'f',
    'categorical_input': 'c',
    'boolean_input': 'f'
}

ql = feyn.QLattice()
models = ql.auto_run(
    data=data,
    output_name='output_name',
    stypes=stypes,
    n_epochs=1
)
If no stype is provided for an input, it is assumed to be numerical.
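If you prefer to see the default spelled out, the sketch below fills in the numerical stype for every input that was left out. The helper name complete_stypes is hypothetical and not part of Feyn, and leaving the output column out of the result is an assumption of this example.

```python
import pandas as pd


def complete_stypes(data, stypes, output_name):
    """Return a new dict where every unlisted input is marked numerical ('f').

    Hypothetical helper for illustration; it only makes the default explicit.
    """
    inputs = [column for column in data.columns if column != output_name]
    return {column: stypes.get(column, "f") for column in inputs}


data = pd.DataFrame(
    {
        "numerical_input": [0.1, 0.2],
        "categorical_input": ["apple", "pear"],
        "boolean_input": [True, False],
        "output_name": [1.0, 2.0],
    }
)

# Only the categorical input is declared; the rest fall back to 'f'.
full = complete_stypes(data, {"categorical_input": "c"}, output_name="output_name")
print(full)
```

Writing the defaults out like this changes nothing about how unlisted inputs are treated; it only makes the assignment visible when reviewing it.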
Quick way of defining stypes
A quick way to define an stypes dictionary based on your data is to use the dtype property of each column of your pandas DataFrame, like below:
stypes = {column: 'c' for column in data.columns if data[column].dtype == 'object'}
If a column contains non-numerical values such as apple or pear, like the categorical_input above, then the dtype will be object. In most cases this is enough to indicate that this is a categorical input.
Note: Make sure to study your data carefully before inferring the stypes automatically this way; otherwise you might end up assigning an input incorrectly. For instance, a column containing missing values can end up with an object dtype (for example a boolean column with None entries, or a numerical column whose values were read in as text), in which case the code above will mark it as categorical.
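One way to act on this note is to flag object columns whose values all parse as numbers, so that you can review them before trusting the inferred stypes. The sketch below is only an assumption about how such a check could look: infer_stypes is a hypothetical helper, not part of Feyn, and the columns are built with an explicit object dtype so that the example behaves the same across pandas versions.

```python
import pandas as pd


def infer_stypes(data):
    """Mark object-dtype columns as categorical and list the doubtful ones.

    Returns (stypes, suspicious), where suspicious holds object columns whose
    non-missing values all parse as numbers. Hypothetical helper.
    """
    stypes = {}
    suspicious = []
    for column in data.columns:
        if not pd.api.types.is_object_dtype(data[column]):
            continue  # left out on purpose: unlisted inputs default to numerical
        stypes[column] = "c"
        values = data[column].dropna()
        # Values that cannot be parsed as numbers become NaN.
        as_numbers = pd.to_numeric(values, errors="coerce")
        if len(values) > 0 and as_numbers.notna().all():
            suspicious.append(column)
    return stypes, suspicious


data = pd.DataFrame(
    {
        "fruit": pd.Series(["apple", "pear", None, "orange"], dtype="object"),
        "amount_as_text": pd.Series(["1.5", "2.0", None, "4.0"], dtype="object"),
        "amount": [1.5, 2.0, 3.5, 4.0],
    }
)

stypes, suspicious = infer_stypes(data)
print(stypes)      # {'fruit': 'c', 'amount_as_text': 'c'}
print(suspicious)  # ['amount_as_text']
```

A flagged column is not necessarily wrong; the list is only a prompt to look at the data and decide whether the column should be converted to numbers and left as numerical instead.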