Stats
The stats package contains experimental tools to help perform statistical analysis on the models produced from the QLattice. They can be found in feyn.__future__.contrib.stats.
graph_log_likelihood
def graph_log_likelihood(graph, data):
"""
This computes the log-likelihood of the graph evaluated on the data set.
Arguments:
graph {[feyn.Graph]} -- Graph to evaluate log-likelihood.
data {[dict of numpy arrays or pandas DataFrame]} -- Data to evaluate the log-likelihood on.
Returns:
[scalar] -- The log-likelihood of the graph on the data set.
"""
Basic usage:
import feyn
from feyn.__future__.contrib.stats import graph_log_likelihood
ql = feyn.QLattice()
qgraph = ql.get_regressor(data.columns, output)
qgraph.fit(data)
graph = qgraph[0]
loglik = graph_log_likelihood(graph, data)
This is suitable for regressors and classifiers.
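For intuition, the log-likelihood of a regression model is often defined under the assumption of independent Gaussian residuals. The sketch below illustrates that common definition and is not the package's implementation. The function name gaussian_log_likelihood and the choice of the maximum-likelihood noise variance (the mean squared error) are assumptions made for illustration only.

```python
import math


def gaussian_log_likelihood(y_true, y_pred):
    """Log-likelihood of predictions, assuming i.i.d. Gaussian residuals.

    Assumption (not taken from feyn): the noise variance is set to its
    maximum-likelihood estimate, i.e. the mean squared error.
    """
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    if mse == 0:
        raise ValueError("perfect fit: the Gaussian log-likelihood is unbounded")
    # With sigma^2 = mse, the log-likelihood reduces to
    # -n/2 * (log(2 * pi * mse) + 1)
    return -0.5 * n * (math.log(2 * math.pi * mse) + 1)


# Two samples, each predicted with an error of exactly 1 (so mse == 1)
example_ll = gaussian_log_likelihood([0.0, 2.0], [1.0, 1.0])
```

A higher (less negative) value indicates predictions that are more probable under the assumed noise model.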
graph_f_score
def graph_f_score(graph,data):
"""
This computes the F-statistic associated with a feyn graph under the null hypothesis.
The null hypothesis is that every weight on each feature and category is equal to zero.
If the hypothesis is true then the F-score is distributed by F(q, n - p),
the Fisher distribution with q and n - p degrees of freedom. Here:
* q is the number of weights we assume are equal to zero
* n is the number of samples in data
* p is the number of parameters in the graph.
The F-score is calculated by:
num = (sum((data[target].mean() - data[target])**2) - graph.mse(data) * n) * (n - p)
denom = (graph.mse(data) * n) * q
F = num / denom
Arguments:
graph {[feyn.Graph]} -- Graph to test null hypothesis.
data {[dict of numpy arrays or pandas DataFrame]} -- Data to test significance of graph on.
Returns:
tuple -- The F-score of the hypothesis and the p-value
"""
Basic usage:
import feyn
from feyn.__future__.contrib.stats import graph_f_score
ql = feyn.QLattice()
qgraph = ql.get_regressor(data.columns, output)
qgraph.fit(data)
graph = qgraph[0]
F, p_value = graph_f_score(graph, data)
This is only suitable for regressors.
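The F-score formula from the docstring can be reproduced by hand as a sanity check. The following is a minimal sketch in plain Python, not the package's code. The function name f_statistic is hypothetical, and it takes the targets and predictions as plain lists instead of a graph and a data set. Here rss plays the role of graph.mse(data) * n.

```python
def f_statistic(y_true, y_pred, q, p):
    """F-score following the formula in the graph_f_score docstring.

    q: number of weights assumed to be zero under the null hypothesis
    p: number of parameters in the model
    """
    n = len(y_true)
    mean = sum(y_true) / n
    # Total sum of squares: error of the constant (mean-only) model
    tss = sum((mean - t) ** 2 for t in y_true)
    # Residual sum of squares: equals mse * n for the fitted model
    rss = sum((t - yp) ** 2 for t, yp in zip(y_true, y_pred))
    num = (tss - rss) * (n - p)
    denom = rss * q
    return num / denom


# Toy data: tss = 10 and rss = 0.1, with n = 5, q = 1, p = 2
F_example = f_statistic([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], q=1, p=2)
```

If SciPy is available, the matching p-value could be obtained with scipy.stats.f.sf(F_example, q, n - p), the survival function of the Fisher distribution.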
graph_g_score
def graph_g_score(graph, data):
"""
This computes the G-statistic associated with a feyn graph under the null hypothesis.
The null hypothesis is that every weight on each feature and category is equal to zero.
If the hypothesis is true then the G-score is distributed by chi2(q),
with q degrees of freedom. Here:
* q is the number of weights we assume are equal to zero
The G-statistic is calculated by:
G = 2 * (graph_log_likelihood(graph, data) - log-likelihood of constant model)
where
log-likelihood of constant model = #neg_class * np.log(#neg_class) + #pos_class * np.log(#pos_class) - #samples * np.log(#samples)
Arguments:
graph {[feyn.Graph]} -- Graph to test null hypothesis.
data {[dict of numpy arrays or pandas DataFrame]} -- Data to test significance of graph on.
Returns:
tuple -- The G-score of the hypothesis and the p-value
"""
Basic usage:
import feyn
from feyn.__future__.contrib.stats import graph_g_score
ql = feyn.QLattice()
qgraph = ql.get_classifier(data.columns, output)
qgraph.fit(data)
graph = qgraph[0]
G, p_value = graph_g_score(graph, data)
This is only suitable for classifiers.
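The G-statistic formula from the docstring can likewise be checked by hand. The sketch below is plain Python and not the package's code. All function names are hypothetical, and it assumes a Bernoulli log-likelihood for the model, which the docstring does not state but which matches the constant-model formula it gives.

```python
import math


def _xlogx(x):
    # Convention: 0 * log(0) is treated as 0
    return x * math.log(x) if x > 0 else 0.0


def constant_model_log_likelihood(n_neg, n_pos):
    """Log-likelihood of a model that always predicts the class frequency."""
    return _xlogx(n_neg) + _xlogx(n_pos) - _xlogx(n_neg + n_pos)


def bernoulli_log_likelihood(y_true, p_pred):
    """Assumed model log-likelihood: sum of log-probabilities of the true labels."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )


def g_statistic(y_true, p_pred):
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return 2 * (
        bernoulli_log_likelihood(y_true, p_pred)
        - constant_model_log_likelihood(n_neg, n_pos)
    )


# Two negative and two positive samples with predicted probabilities
G_example = g_statistic([0, 0, 1, 1], [0.2, 0.3, 0.7, 0.8])
```

If SciPy is available, the matching p-value could be obtained with scipy.stats.chi2.sf(G_example, q), where q is the number of weights assumed to be zero.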
plot_graph_p_value
def plot_graph_p_value(graph, data, title = 'Significance of graph', ax=None):
"""
Plots the probability density function under the null hypothesis.
The null hypothesis is that every weight on each feature and category is equal to zero.
If the graph is a regressor, this plots the Fisher distribution.
Under the null hypothesis the F-score is approximately distributed by F(q, n - p),
with q and n - p degrees of freedom. Here:
* q is the number of weights we assume are equal to zero
* n is the number of samples in data
* p is the number of parameters in the graph.
If the graph is a classifier, this plots the chi2 distribution.
Under the null hypothesis the G-score is distributed by chi2(q),
with q degrees of freedom. Here:
* q is the number of weights we assume are equal to zero
This also plots vertical lines intercepting the x-axis at the F scores or G scores under each hypothesis.
Arguments:
graph {[feyn.Graph]} -- Graph to calculate p-values of under the null hypothesis
data {[dict of numpy arrays or pandas DataFrame]} -- Data to test significance of graph on.
Keyword Arguments:
title {str} -- [Title of axes] (default: {'Significance of graph'})
ax {[matplotlib.Axes]} -- [Axes to plot on] (default: {None})
Returns:
[matplotlib.Axes] -- Plots of distributions under null hypothesis
"""
Basic usage:
import feyn
from feyn.__future__.contrib.stats import plot_graph_p_value
ql = feyn.QLattice()
qgraph = ql.get_regressor(data.columns, output)
qgraph.fit(data)
graph = qgraph[0]
plot_graph_p_value(graph, data)
This is suitable for regressors and classifiers.