Validate data
by: Kevin Broløs
(Feyn version 2.0.7 or newer)
validate_data is a function that helps discover a few common data errors that might give unwanted effects with feyn. We advise running it once after loading your data, to ensure that the data is in good enough condition.
To validate your data properly, you need to specify the kind of problem you intend to solve, the output column and, if any of the columns are categorical, the stypes that you'll use for sample_models.
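If your data has categorical columns, the stypes dictionary can be built before validating. The following is only a minimal sketch: build_stypes and the toy DataFrame are hypothetical names used for illustration, and it assumes that 'c' is the stype label for a categorical input. It marks every non-numerical column as categorical, which may not be what you want for other column types such as dates.

```python
import pandas as pd


def build_stypes(data: pd.DataFrame) -> dict:
    """Mark every non-numerical column as categorical ('c').

    Hypothetical helper, not part of feyn. Boolean columns count as
    numerical here, so they are left out of the dictionary.
    """
    return {
        col: "c"
        for col in data.columns
        if not pd.api.types.is_numeric_dtype(data[col])
    }


# Toy data for illustration only
data = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "city": ["Oslo", "Lund", "Oslo", "Aarhus"],
    "y": [0, 1, 1, 0],
})

stypes = build_stypes(data)
print(stypes)  # {'city': 'c'}
```

The resulting dictionary is what you would pass as stypes to validate_data, and later to sample_models.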
Example
from feyn.datasets import make_classification
from feyn import validate_data
train, test = make_classification()
validate_data(data=train, kind='classification', output_name='y', stypes={})
Here's an example that doesn't validate, because we're using a continuous numerical output to do a classification:
from feyn.datasets import make_regression
from feyn import validate_data
train, test = make_regression()
try:
    validate_data(data=train, kind='classification', output_name='y', stypes={})
except ValueError as e:
    print(e)
y must be an iterable of booleans or 0s and 1s
In the examples we run it on the training data only, but we recommend running it on the full dataset.
validate_data will raise a ValueError in the following cases:
- If the output column does not consist of only numerical values for a regression case.
- If the output column does not consist of boolean-like values for a classification case.
- If any of the columns are object types, but have not been declared as categorical in stypes.
- If columns contain NaN values, and are not declared as categorical in stypes.
  - Note: categoricals support NaN values by assigning them their own weights, so we allow this. You should still consider whether that's the behaviour you want, and handle it yourself if it isn't.
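To make the cases above concrete, here is a rough pure-Python sketch of the same rules. It is not how feyn implements validate_data: precheck and is_missing are hypothetical helpers, the data is a plain dictionary of column lists rather than a DataFrame, "object type" is approximated as "contains a string", and 'c' is assumed to be the stype label for categoricals. Unlike validate_data, it collects the problems instead of raising a ValueError.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def precheck(columns, kind, output_name, stypes):
    """Collect rule violations; an empty list means no rule was hit."""
    problems = []
    categorical = {name for name, stype in stypes.items() if stype == "c"}

    # Cases 1 and 2: the output column must match the kind of problem.
    output = columns[output_name]
    if kind == "regression":
        if not all(isinstance(v, (int, float)) for v in output):
            problems.append(f"{output_name}: output must be numerical")
    elif kind == "classification":
        if not all(v in (0, 1) for v in output):  # True == 1, False == 0
            problems.append(f"{output_name}: output must be boolean-like")

    # Cases 3 and 4 only apply to columns not declared as categorical.
    for name, values in columns.items():
        if name in categorical:
            continue
        if any(isinstance(v, str) for v in values):
            problems.append(f"{name}: string values, but not declared categorical")
        if any(is_missing(v) for v in values):
            problems.append(f"{name}: missing values, but not declared categorical")
    return problems


# Toy data for illustration only
columns = {
    "age": [23.0, float("nan"), 41.0],
    "city": ["Oslo", "Lund", "Oslo"],
    "y": [0, 1, 1],
}

problems = precheck(columns, kind="classification", output_name="y", stypes={})
for problem in problems:
    print(problem)
# age: missing values, but not declared categorical
# city: string values, but not declared categorical
```

Declaring city as categorical removes the second message. The NaN in age still has to be handled, for example by dropping or imputing the row, since age is a numerical column.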