Categorical features
by: Kevin Broløs, Chris Cave and Emil Larsen
(Feyn version 3.0 or newer)
A feature is categorical if there is no clear ordering in the values the feature can take. The values a categorical feature can take are called categories. Below is an example dataset containing categorical features and their categories:
Country | Favourite colour | Gender | Smoker/non-smoker |
---|---|---|---|
Denmark | Red | Male | 0 |
Spain | Yellow | Female | 1 |
UK | Blue | Male | 1 |
Brazil | Green | Male | 0 |
USA | Yellow | Female | 1 |
Italy | Red | Female | 0 |
In the example above, none of the features has an obvious ordering of its values.
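A quick way to spot candidate categorical columns in a pandas DataFrame is to look at the column dtypes: text columns show up as object. This is only a rough heuristic sketch, mirroring the table above:
import pandas as pd

df = pd.DataFrame({
    "Country": ["Denmark", "Spain", "UK"],
    "Favourite colour": ["Red", "Yellow", "Blue"],
    "Smoker": [0, 1, 1],
})

# Text columns appear as dtype 'object'; these are typically the ones
# to treat as categorical.
print(df.dtypes)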
How the QLattice treats categorical features
When the QLattice samples models, it uses semantic types to decide which inputs are numerical and which are categorical. These types are inferred automatically when you use auto_run, but you can also specify them yourself.
Here's an example where we explicitly specify them.
import feyn
import numpy as np
import pandas as pd

# Example dataset with three categorical inputs and a binary output
data = pd.DataFrame({
    "Country": ["Denmark", "Spain", "UK", "Brazil", "USA", "Italy"],
    "Favourite colour": ["Red", "Yellow", "Blue", "Green", "Yellow", "Red"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "Smoker": [0, 1, 1, 0, 1, 0]
})
# Mark the text columns as categorical ('c'); numerical columns need no entry
stypes = {
    'Country': 'c',
    'Favourite colour': 'c',
    'Gender': 'c',
}
We then pass them into the QLattice through auto_run.
ql = feyn.QLattice(random_seed=42)
models = ql.auto_run(
    data=data,
    output_name="Smoker",
    stypes=stypes,
    n_epochs=1,
    max_complexity=2
)
model = models[0]
Here is one model from the output of auto_run.
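If you prefer to inspect the model programmatically rather than through the plot, here is a minimal sketch using the model and data defined above (assuming the usual Model.predict and Model.sympify methods):
# Predictions on the training data; the categorical inputs are handled
# automatically because they were declared in stypes.
print(model.predict(data))

# The fitted model can also be rendered as a symbolic expression.
print(model.sympify())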
Categorical weights
When the QLattice fits a model to the data, each category is assigned a learned numerical value: a weight. This converts the categories into numbers that can be used as inputs to the mathematical functions in the model.
We can see the weight associated with each category by picking out the input variable in the model and calling get_parameters:
params = model.get_parameters("Favourite colour")
print(params)
          Favourite colour
category
Green             0.242616
Red               0.242616
Yellow           -0.299821
Blue             -0.299821
When we predict on a sample containing one of the categories listed above, that category is first converted to category_weight + bias. The resulting value is passed into the functions inside the model and behaves like an ordinary numerical value.
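As a plain-Python illustration of that conversion, here is a sketch where the weights mirror the table above, while the bias is a made-up value and not one read from the fitted model:
# Hypothetical parameters for the "Favourite colour" input node
category_weights = {
    "Green": 0.242616,
    "Red": 0.242616,
    "Yellow": -0.299821,
    "Blue": -0.299821,
}
bias = 0.1  # made-up bias, for illustration only

# A sample with Favourite colour == "Yellow" enters the model as:
value = category_weights["Yellow"] + bias
print(value)  # -0.199821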
If a categorical feature contains NaN values for some observations, the NaN values are interpreted as a category of their own.
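For example, in a minimal sketch like the one below, the rows with a missing Country would share one extra category (and one learned weight) when fitted with stypes={'Country': 'c'}:
import numpy as np
import pandas as pd

data_with_nan = pd.DataFrame({
    "Country": ["Denmark", np.nan, "UK", np.nan, "Italy"],
    "Smoker": [0, 1, 1, 0, 1],
})
# The two NaN entries are treated as one extra category during fitting.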