# Categorical features

by: Chris Cave and Emil Larsen

(Feyn version 2.0 or newer)

## Categorical features

A feature is **categorical** if there is no clear ordering in the values the feature can take. The values a categorical feature can take are called **categories**. Below is an example dataset containing categorical features and their categories:

Country | Favourite colour | Gender | Smoker/non-smoker |
---|---|---|---|
Denmark | Red | Male | 0 |
Spain | Yellow | Female | 1 |
UK | Blue | Male | 1 |
Brazil | Green | Male | 0 |
USA | Yellow | Female | 1 |
Italy | Red | Female | 0 |

In the example above, none of the features has an obvious ordering.

## How the `QLattice` treats categorical features

Before we pass the dataset above through the `QLattice`, we need to specify the semantic types of the categorical features.

```
import feyn
import pandas as pd

data = pd.DataFrame({
    "Country": ["Denmark", "Spain", "UK", "Brazil", "USA", "Italy"],
    "Favourite colour": ["Red", "Yellow", "Blue", "Green", "Yellow", "Red"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "Smoker": [0, 1, 1, 0, 1, 0]
})

stypes = {
    'Country': 'c',
    'Favourite colour': 'c',
    'Gender': 'c',
}
```

We can then pass the dataset and the semantic types into the `QLattice`.

```
ql = feyn.connect_qlattice()
ql.reset(random_seed=42)

models = ql.auto_run(
    data=data,
    output_name="Smoker",
    stypes=stypes,
    n_epochs=1,
    max_complexity=2
)
model = models[0]
```

Here is one model from the output of `auto_run`.

How has the `QLattice` interpreted the categories of this feature as a number, so that it can be an input to a mathematical function?

Each category is associated with a numerical value, which we call its **weight**. The weights are learnt while the model is being fitted to the data.

We can see the weight associated with each category by inspecting the input node's `params`:

```
input_node = model[2]
print(f"category weights: {input_node.params['categories']}")
print(f"bias: {input_node.params['bias']}")
```

```
category weights: [
('Yellow', 0.26123558974851063),
('Blue', 0.26123554327444204),
('Red', -0.25637270012528196),
('Green', -0.25637278717475576)
]
bias: 0.12080737359066837
```

When we predict on a sample containing one of the categories above, the category is first converted to the numerical value **category_weight** + **bias**.

This value is then passed into the functions inside the model and behaves like any other numerical input.
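As a minimal sketch of this conversion, using the weights printed above (the helper function below is our own, not part of the Feyn API):

```python
# Map each category to its learned weight (values copied from the output above).
category_weights = {
    "Yellow": 0.26123558974851063,
    "Blue": 0.26123554327444204,
    "Red": -0.25637270012528196,
    "Green": -0.25637278717475576,
}
bias = 0.12080737359066837

def encode_category(category, weights, bias):
    """Convert a category to the numerical value that enters the model."""
    return weights[category] + bias

# A sample with "Yellow" enters the model as weight + bias:
print(encode_category("Yellow", category_weights, bias))
```

Note that "Yellow" and "Blue" have learnt nearly identical weights, as have "Red" and "Green": with `n_epochs=1` on six samples, the weights mostly reflect how often each colour co-occurs with `Smoker = 1`.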

If a categorical feature contains NaN values for some observations, the NaN values will be interpreted as a category in their own right.
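Feyn handles this internally, but the idea can be sketched as a plain dictionary lookup in which NaN gets its own entry. The weights and the sentinel key below are illustrative, not taken from the library:

```python
import math

NAN_KEY = "__nan__"  # sentinel; NaN values don't compare equal to themselves

# Illustrative weights: NaN is treated as just another category with its own weight.
weights = {"Denmark": 0.4, "Spain": -0.1, NAN_KEY: 0.2}

def category_weight(value, weights):
    """Look up a category's weight, routing NaN to its own category."""
    if isinstance(value, float) and math.isnan(value):
        return weights[NAN_KEY]
    return weights[value]

print(category_weight(float("nan"), weights))  # the NaN "category" has its own weight
```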