# Semantic types

by: Kevin Broløs & Chris Cave

(Feyn version 3.2 or newer)

There are three types of input data that `Model`s can interpret:

- numerical, which includes:
  - floating point numbers
  - integers
- categorical, which includes:
  - strings
- boolean, represented by:
  - a discrete number `0` or `1`
  - `True` or `False`

The `Model` handles transformation of inputs, and it uses the `stype` declarations to decide how. **Numerical** values learn a linear rescaling and **categorical** values get assigned individual weights.
**Boolean** inputs can be assigned to either **numerical** or **categorical**, and the only difference will be whether or not it can handle `string` representations of booleans (like `yes` and `no`).
Generally speaking, numerical inputs are a bit more efficient and the resulting equations are arguably simpler, so we recommend declaring boolean inputs as **numerical** unless they are stored as strings.
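
To make the difference between the two transformations concrete, here is a minimal sketch in plain Python. It is not Feyn's implementation: the function names and the scale, bias and weight values are all hypothetical, chosen only to show that a numerical input passes through a linear rescaling while a categorical input is looked up in a table of per-category weights.

```python
# Illustrative sketch only; not Feyn's internals. All names and values are hypothetical.

def numerical_transform(x, scale, bias):
    """Linear rescaling of a numerical input."""
    return scale * x + bias


def categorical_transform(category, weights, default=0.0):
    """Look up the individual weight assigned to a category."""
    return weights.get(category, default)


# Parameters that would normally be learned during training (made up here).
scale, bias = 0.5, -1.0
fruit_weights = {"apple": 0.25, "pear": -0.5, "banana": 1.0}

rescaled = numerical_transform(4.0, scale, bias)
weighted = categorical_transform("pear", fruit_weights)

# A boolean declared as numerical is simply treated as 0 or 1 before rescaling.
rescaled_bool = numerical_transform(True, scale, bias)

print(rescaled, weighted, rescaled_bool)
```

In this picture, a string such as `yes` cannot be multiplied by a scale, which is why string representations of booleans need the categorical route.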

The **categorical** `stype` helps maintain a simple and interpretable model by avoiding dimensional expansion like you would see in one-hot or dummy encoding. You can read more about how we treat Categorical features.
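
The dimensional expansion that one-hot encoding causes can be seen in a small comparison. The sketch below uses only the standard library; `one_hot` is a hypothetical helper written for this illustration, and the point is only to count how many model inputs each representation would need.

```python
# Illustrative comparison; `one_hot` is a hypothetical helper, not part of Feyn.

def one_hot(values):
    """Expand a list of categories into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


fruits = ["apple", "pear", "banana", "orange", "pear"]

expanded = one_hot(fruits)
n_one_hot_inputs = len(expanded)  # one input per distinct category
n_categorical_inputs = 1          # a single input declared with stype 'c'

print(n_one_hot_inputs, n_categorical_inputs)
```

The one-hot representation grows with every new category, whereas a single categorical input keeps the model at one input regardless of how many fruits appear.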

## Stypes in auto_run

When you use `auto_run`, the `stypes` will be automatically inferred from your data unless you specify them manually. It will also produce warnings if some columns appear to be unsuitable for training, or have issues like high cardinality (many unique values compared to the size of the dataset).

We try to be clever and efficient, but all data sets are different, so you can supply your own `stypes` to bypass this extra step entirely.
You can also call the function `feyn.tools.infer_stypes` yourself before running `auto_run`, use the result as a starting point, and change only the types you are not satisfied with.
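
One way to use inferred types as a starting point is to merge them with a small dictionary of manual corrections. The helper below, `override_stypes`, is hypothetical and not part of Feyn; the `inferred` dictionary is hard-coded here with assumed values to stand in for whatever type inference returns for your data.

```python
# Sketch: `override_stypes` is a hypothetical helper, not a Feyn function.

def override_stypes(inferred, overrides):
    """Return a copy of the inferred stypes with manual corrections applied."""
    return {**inferred, **overrides}


# Stand-in for the result of type inference on your own data (assumed values).
inferred = {"numerical_input": "f", "categorical_input": "c", "boolean_input": "c"}

# Change only the types you are not satisfied with.
stypes = override_stypes(inferred, {"boolean_input": "f"})

print(stypes)
```

The merged dictionary can then be passed as the `stypes` argument of `auto_run`, and the original inferred dictionary is left untouched for comparison.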

## Example

Declaring `stypes` simplifies the preprocessing that would otherwise fall on the data scientist. This means that you should **not** standardise inputs, **nor** should you one-hot encode categoricals.

Instead, you assign the relevant `stypes` as shown below.

```
import feyn
import numpy as np
from pandas import DataFrame

# A small toy dataset with one input of each kind
data = DataFrame(
    {
        'numerical_input': np.random.rand(4),
        'categorical_input': np.array(['apple','pear','banana','orange']),
        'boolean_input': np.array([True, False, False, True]),
        'output_name': np.random.rand(4)
    }
)

# Declare how each input is interpreted: 'f' is numerical, 'c' is categorical
stypes = {
    'numerical_input': 'f',
    'categorical_input': 'c',
    'boolean_input': 'f'
}

ql = feyn.QLattice()
models = ql.auto_run(
    data=data,
    output_name='output_name',
    stypes=stypes,
    n_epochs=1
)
```

If no `stype` is provided for an input, it is assumed to be **numerical**.

## Infer stypes from the dataset

A quick way to define an `stypes` dictionary based on your data is to use our function `feyn.tools.infer_stypes` on your pandas DataFrame, supplying your output column as well:

```
from feyn.tools import infer_stypes
stypes = infer_stypes(data, 'output_name')
```

If a column contains non-**numerical** values such as `apple` or `pear`, like the `categorical_input` above, then it will be assigned the **categorical** stype.

We also have additional smart detection for numericals that might actually be ordinal/nominal, as well as behaviour to skip columns that are not suitable for training. Below is a short summary of some cases we look out for:

- binary data: stype='c'
- continuous data: stype='f'
- numerical data:
  - Ordinal/Nominal: stype='c' (if number of distinct values is below a number related to the dataset size)
  - others: stype='f'
- strings/objects/category: stype='c'
- ID: skip (if dataset is larger than 10)
  - Also produces an info message
- ISO Date: skip
  - Also produces an info message
- Constant values: skip (if dataset is larger than 10)
  - Also produces an info message
- Mixed types: skip
  - Also produces an info message
- Any category with high cardinality (more than 50 distinct values, or a distinct ratio higher than 50% of the data)
  - Produces a warning message and keeps the category type.
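
The high-cardinality rule at the end of the list is the most concretely specified, so it can be sketched directly. The function below illustrates the two thresholds as described above and is not Feyn's code; the name `is_high_cardinality` and its parameters are hypothetical, and the handling of edge cases such as empty columns is an assumption.

```python
# Sketch of the high-cardinality rule; hypothetical helper, not Feyn's implementation.

def is_high_cardinality(values, max_distinct=50, max_ratio=0.5):
    """Flag a categorical column with too many distinct values.

    A column is flagged when it has more than `max_distinct` distinct values,
    or when distinct values make up more than `max_ratio` of the rows.
    """
    if not values:
        return False  # assumption: an empty column is not flagged
    n_distinct = len(set(values))
    return n_distinct > max_distinct or n_distinct / len(values) > max_ratio


# 4 distinct values in 100 rows: neither threshold is exceeded.
few_categories = ["apple", "pear", "banana", "orange"] * 25

# 41 distinct values in 50 rows: the distinct ratio is above 50%.
mostly_unique = [f"user_{i}" for i in range(40)] + ["guest"] * 10

print(is_high_cardinality(few_categories))
print(is_high_cardinality(mostly_unique))
```

Note that on very small datasets the ratio threshold is easy to exceed: four rows with four different fruits would be flagged under this reading of the rule.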

### Data validation

Note on missing values: We ignore the missing values during most type inference and instead rely on downstream validation to detect data issues like this.

This means that even when an stype is correctly assigned, the column might still have common data issues such as missing values.
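
Because a correct stype says nothing about missing values, it can help to count them yourself before training. The following is a minimal standard-library sketch; `count_missing` is a hypothetical helper, it treats only `None` and NaN as missing (an assumption), and it does not reproduce the checks that Feyn's own validation performs.

```python
import math

# Sketch of a missing-value screen; hypothetical helper, not part of Feyn.

def count_missing(columns):
    """Count None and NaN entries per column in a dict of lists."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {name: sum(is_missing(v) for v in values) for name, values in columns.items()}


columns = {
    "numerical_input": [0.1, float("nan"), 0.3, 0.4],
    "categorical_input": ["apple", None, "banana", None],
    "boolean_input": [True, False, False, True],
}

missing = count_missing(columns)
print(missing)
```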

The validator we use in `auto_run` is `feyn.validate_data`, and you can use that separately if you want to screen your data for common data issues before training.

We recommend studying your data carefully rather than simply trusting the inferred types, but this should give you a good starting point.