Asking the right questions
by: Kevin Broløs & Meera Machado
Following on from the workflow overview, we'll go through the first two steps: making observations and posing interesting questions.
Let's bring up an example:
Make observations
Let's take the following simple dataset. This is the well-known UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.
Features are computed from a digitized image of a fine needle aspiration (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The defined target variable is the diagnosis, as 'Malignant' or 'Benign'.
import sklearn.datasets
import pandas as pd
breast_cancer = sklearn.datasets.load_breast_cancer()
input_columns = breast_cancer.feature_names
# Load into a pandas dataframe
data = pd.DataFrame(breast_cancer.data, columns=input_columns)
data['target'] = pd.Series(breast_cancer.target)
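Before asking questions, it helps to know how the diagnosis is encoded. The sketch below is our own sanity check rather than part of the workflow itself; the names `label_names` and `class_counts` are just illustrative. In scikit-learn's copy of the dataset, label 0 stands for malignant and label 1 for benign.

```python
import pandas as pd
import sklearn.datasets

breast_cancer = sklearn.datasets.load_breast_cancer()

# target_names is ordered by label, so position 0 names label 0
label_names = dict(enumerate(breast_cancer.target_names))

# Count how many samples fall under each diagnosis
class_counts = (
    pd.Series(breast_cancer.target)
    .map(label_names)
    .value_counts()
)

print(label_names)
print(class_counts)
```

Keeping this mapping in mind avoids reading a relationship the wrong way around later on.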
Preparation
This dataset comes already prepared, so we don't have to do quite as much. Our workflow typically starts after data preparation. It is worth mentioning, however, that with the QLattice you don't need to do any normalization of input features, and we have an input that explicitly handles categorical variables without the need for one-hot encoding.
The features
Let's print out the head of the dataframe and see what we're working with.
data.head().T
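If you want a little more than the first few rows, a quick summary can round off the observations. This is a minimal sketch using plain pandas; the variable names `n_rows`, `n_columns`, `missing_values` and `summary` are our own choices.

```python
import pandas as pd
import sklearn.datasets

breast_cancer = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data['target'] = pd.Series(breast_cancer.target)

# Size of the dataset: rows are samples, columns are features plus the target
n_rows, n_columns = data.shape

# Total number of missing values across the whole dataframe
missing_values = int(data.isna().sum().sum())

# Range and spread of each column, one row per column
summary = data.describe().T[['mean', 'std', 'min', 'max']]

print(n_rows, n_columns, missing_values)
print(summary)
```

The features live on quite different scales, which is worth noticing even though no normalization is needed here.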
Asking Questions
We'll now go through some examples of interesting questions we could pose to this dataset. This comes down to your domain expertise and figuring out what you want to learn about, such as:
- What is our measurable, also known as the target variable? This forms the basis of our questions, as it is the variable we want to explain.
- What are we trying to learn?
- Example: What is evidence for a malignant or benign tumor?
- Example: We have a specific feature that, based on previous studies, we suspect is evidence for malignant tumors. We could ask: does this feature relate to the diagnosis, and how?
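To make the second example concrete, here is a rough first look at how one feature differs between the two diagnoses. The choice of `mean radius` is purely hypothetical, standing in for whichever feature your domain knowledge points to, and a grouped summary like this is only a starting observation, not an answer to the question.

```python
import pandas as pd
import sklearn.datasets

breast_cancer = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data['target'] = pd.Series(breast_cancer.target)

# Hypothetical feature of interest, picked only for illustration
feature = 'mean radius'

# Summarise the feature separately for each diagnosis
summary = data.groupby('target')[feature].agg(['mean', 'median', 'std'])

# In scikit-learn's encoding, 0 is malignant and 1 is benign
summary.index = summary.index.map({0: 'malignant', 1: 'benign'})

print(summary)
```

A clear difference between the groups suggests the feature relates to the diagnosis; the "how" is what the later steps are for.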
If we don't have specific knowledge, but are trying to learn things about the problem domain, maybe we'll ask more general questions:
- Is there a single feature that captures a large part of the signal?
- If there is, how does it capture it, i.e. linearly or non-linearly?
- Are there some features that tend to explain the same signal?
- Are there some features that tend to explain different parts of the signal?
- If there's one feature that captures part of the signal, does it then combine with another to capture even more?
- How do the features relate to explain the measurable?
- Does it make sense from a domain perspective?
- Would that give way for new questions to ask?
- Or whichever other question you might have.
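A couple of the general questions above can be roughly probed with standard tools before forming hypotheses. The sketch below is not the QLattice approach. It is a simple stand-in that assumes the single-feature ROC AUC is an acceptable proxy for "how much signal one feature captures", and that a high rank correlation between two features hints that they explain the same signal.

```python
import pandas as pd
import sklearn.datasets
from sklearn.metrics import roc_auc_score

breast_cancer = sklearn.datasets.load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data['target'] = pd.Series(breast_cancer.target)

X = data.drop(columns='target')
y = data['target']

# Score each feature on its own, using the raw value as the ranking score.
# max(auc, 1 - auc) ignores whether high values point to label 1 or label 0.
auc_by_feature = {}
for column in X.columns:
    auc = roc_auc_score(y, X[column])
    auc_by_feature[column] = max(auc, 1 - auc)

single_feature_auc = pd.Series(auc_by_feature).sort_values(ascending=False)

# Absolute Spearman correlation between features: values near 1 suggest
# two features carry largely the same information
feature_correlation = X.corr(method='spearman').abs()

print(single_feature_auc.head())
print(feature_correlation.loc['mean radius', 'mean perimeter'])
```

Note that AUC only reflects how well a feature ranks the samples, so it says nothing about whether the relationship is linear or non-linear, and it cannot show how features combine. Those questions are left for the hypotheses.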
You are now thinking in questions!
Other datasets might have multiple interesting targets to measure against, so remember to choose the measurable(s) that most directly reflect the questions you're going to ask.
You can always go back and change this as you learn more!
On to hypotheses
On the next page we will be using the QLattice to pose questions and formulate hypotheses as answers to some of these questions.