Categorical features
by: Chris Cave and Emil Larsen
(Feyn version 3.0 or newer)
A feature is categorical if there is no clear ordering in the values the feature can take. The values a categorical feature can take are called categories. Below is an example dataset containing categorical features and their categories:
Country | Favourite colour | Gender | Smoker/non-smoker |
---|---|---|---|
Denmark | Red | Male | 0 |
Spain | Yellow | Female | 1 |
UK | Blue | Male | 1 |
Brazil | Green | Male | 0 |
USA | Yellow | Female | 1 |
Italy | Red | Female | 0 |
In the example above, none of the features has an obvious ordering among its values.
How the QLattice treats categorical features
Before we pass the dataset above through the QLattice, we need to specify the semantic types of the categorical features.
import feyn
import numpy as np
import pandas as pd
data = pd.DataFrame({
"Country": ["Denmark", "Spain", "UK", "Brazil", "USA", "Italy"],
"Favourite colour": ["Red", "Yellow", "Blue", "Green", "Yellow", "Red"],
"Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
"Smoker": [0,1,1,0,1,0]
})
stypes = {
'Country': 'c',
'Favourite colour': 'c',
'Gender': 'c',
}
We can then pass the dataset, together with the stypes dictionary, into the QLattice.
ql = feyn.QLattice(random_seed=42)
models = ql.auto_run(
data=data,
output_name="Smoker",
stypes=stypes,
n_epochs=1,
max_complexity=2
)
model = models[0]
Here is one model from the output of auto_run.
How has the QLattice interpreted the categories of this feature as a number so that it can be an input to a mathematical function?
Each category is associated with a numerical value, which we call its weight. The weights are learnt while the model is being fitted to the data.
We can see the weight associated with each category by inspecting the input node's params attribute:
input_node = model[2]
print(f"category weights: {input_node.params['categories']}")
print(f"bias: {input_node.params['bias']}")
category weights: [
('Yellow', 0.26123558974851063),
('Blue', 0.26123554327444204),
('Red', -0.25637270012528196),
('Green', -0.25637278717475576)
]
bias: 0.12080737359066837
When we predict on a sample containing one of the categories above, the category is first converted to the value category_weight + bias. This value is then passed into the functions inside the model and behaves like an ordinary numerical value.
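The conversion can be sketched in plain Python. This is not Feyn's internal code, just an illustration of the lookup, using the weights and bias reported above:

```python
# Category weights and bias as reported by the fitted input node above.
category_weights = {
    "Yellow": 0.26123558974851063,
    "Blue": 0.26123554327444204,
    "Red": -0.25637270012528196,
    "Green": -0.25637278717475576,
}
bias = 0.12080737359066837

def encode(category):
    """Convert a category to the numerical value fed into the model."""
    return category_weights[category] + bias

print(encode("Yellow"))  # weight for "Yellow" plus the bias
```

Note that "Yellow" and "Blue" end up with nearly identical encoded values, while "Red" and "Green" do too: the model has effectively learnt to group these categories.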
If a categorical feature contains NaN values for some observations, the NaN values will be interpreted as a category in their own right.
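A minimal sketch of that behaviour (not Feyn's internal code, and the weights are made-up illustrative numbers): missing values get their own entry in the weight table, just like any other category.

```python
import math

# Hypothetical weight table where NaN has been learnt its own weight,
# alongside the ordinary categories.
category_weights = {
    "Male": 0.31,
    "Female": -0.12,
    "nan": 0.05,  # the NaN "category"
}

def lookup(value):
    """Map a value (possibly NaN) to its category key."""
    if isinstance(value, float) and math.isnan(value):
        return "nan"
    return value

observations = ["Male", float("nan"), "Female"]
weights = [category_weights[lookup(v)] for v in observations]
print(weights)  # the NaN observation gets the NaN category's weight
```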