The Display Module
The diogenes.display module provides tools for summarizing and exploring data and for examining the performance of trained classifiers.
Exploring data
The display module provides a number of tools for examining data before a classifier has been fit to them.
We’ll start by pulling and organizing the wine dataset. We read a CSV from the Internet using diogenes.read.read.open_csv_url().
%matplotlib inline
import diogenes
import numpy as np
wine_data = diogenes.read.open_csv_url(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv',
    delimiter=';')
We will then separate labels from features using diogenes.utils.remove_cols().
labels = wine_data['quality']
M = diogenes.utils.remove_cols(wine_data, 'quality')
Finally, we convert the labels to booleans (True where quality is below the average quality) to make this a binary classification problem. (At this point, all Diogenes features are available for binary classification, but other kinds of ML have more limited support.)
labels = labels < np.average(labels)
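As a rough illustration of the two steps above (splitting off the label column, then binarizing it), here is a sketch that uses plain NumPy rather than Diogenes. The toy array and its values are made up, and numpy.lib.recfunctions.drop_fields only stands in for diogenes.utils.remove_cols(); we do not claim the two behave identically.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Toy structured array standing in for the wine data (values are made up).
data = np.array(
    [(7.0, 3.0, 6), (6.3, 3.3, 5), (8.1, 3.2, 7), (7.2, 3.1, 6)],
    dtype=[('fixed acidity', 'f8'), ('pH', 'f8'), ('quality', 'i8')])

# Pull the label column out by field name.
labels = data['quality']

# Drop the label field; usemask=False returns a plain (unmasked) array.
features = rfn.drop_fields(data, 'quality', usemask=False)

# Binarize: True where quality is below the mean quality.
binary_labels = labels < np.average(labels)

print(features.dtype.names)
print(binary_labels)
```

Note that the comparison is strict, so a wine whose quality equals the mean is labeled False.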
We can look at our summary statistics with diogenes.display.display.describe_cols(). Like most functions in Diogenes, describe_cols produces a NumPy structured array.
summary_stats = diogenes.display.describe_cols(M)
print(summary_stats.dtype)
[('Column Name', 'S20'), ('Count', '<i8'), ('Mean', '<f8'), ('Standard Dev', '<f8'), ('Minimum', '<f8'), ('Maximum', '<f8')]
print(summary_stats)
[('fixed acidity', 4898, 6.854787668436097, 0.8437820791264506, 3.8, 14.2)
('volatile acidity', 4898, 0.27824111882400976, 0.10078425854188974, 0.08, 1.1)
('citric acid', 4898, 0.33419150673744386, 0.12100744957029214, 0.0, 1.66)
('residual sugar', 4898, 6.391414863209474, 5.071539989333933, 0.6, 65.8)
('chlorides', 4898, 0.04577235606369947, 0.02184573768505638, 0.009000000000000001, 0.34600000000000003)
('free sulfur dioxide', 4898, 35.30808493262556, 17.005401105808414, 2.0, 289.0)
('total sulfur dioxide', 4898, 138.36065741118824, 42.49372602475034, 9.0, 440.0)
('density', 4898, 0.9940273764801959, 0.0029906015821480306, 0.98711, 1.03898)
('pH', 4898, 3.1882666394446715, 0.15098518431212068, 2.72, 3.82)
('sulphates', 4898, 0.48984687627603113, 0.1141141831056649, 0.22, 1.08)
('alcohol', 4898, 10.514267047774602, 1.2304949365418656, 8.0, 14.2)]
With the default structured array printing, it is hard to tell which number belongs to which statistic, so we provide diogenes.display.display.pprint_sa() to print small structured arrays more readably.
diogenes.display.pprint_sa(summary_stats)
    Column Name           Count  Mean             Standard Dev      Minimum  Maximum
0   fixed acidity         4898   6.85478766844    0.843782079126    3.8      14.2
1   volatile acidity      4898   0.278241118824   0.100784258542    0.08     1.1
2   citric acid           4898   0.334191506737   0.12100744957     0.0      1.66
3   residual sugar        4898   6.39141486321    5.07153998933     0.6      65.8
4   chlorides             4898   0.0457723560637  0.0218457376851   0.009    0.346
5   free sulfur dioxide   4898   35.3080849326    17.0054011058     2.0      289.0
6   total sulfur dioxide  4898   138.360657411    42.4937260248     9.0      440.0
7   density               4898   0.99402737648    0.00299060158215  0.98711  1.03898
8   pH                    4898   3.18826663944    0.150985184312    2.72     3.82
9   sulphates             4898   0.489846876276   0.114114183106    0.22     1.08
10  alcohol               4898   10.5142670478    1.23049493654     8.0      14.2
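To make the contents of the summary array concrete, the following sketch computes the same six statistics per column with plain NumPy. describe_columns is a hypothetical helper written for this example, not the Diogenes implementation, and the toy input is made up. We use NumPy's default (population) standard deviation here; whether describe_cols does the same is an assumption.

```python
import numpy as np

def describe_columns(sa):
    """Summarize each numeric field of a structured array (illustrative)."""
    out_dtype = [('Column Name', 'U20'), ('Count', 'i8'), ('Mean', 'f8'),
                 ('Standard Dev', 'f8'), ('Minimum', 'f8'), ('Maximum', 'f8')]
    rows = []
    for name in sa.dtype.names:
        col = sa[name]
        # col.std() uses ddof=0, i.e. the population standard deviation.
        rows.append((name, col.size, col.mean(), col.std(),
                     col.min(), col.max()))
    return np.array(rows, dtype=out_dtype)

# Made-up two-column input.
toy = np.array([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
               dtype=[('a', 'f8'), ('b', 'f8')])
stats = describe_columns(toy)
print(stats['Column Name'], stats['Mean'])
```

Because the result is itself a structured array, individual statistics can be pulled out by field name, e.g. stats['Mean'].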
Similarly, we have a number of tools that visualize data. They all return figures, in case the user wants to save them or plot them later.
figure = diogenes.display.plot_correlation_matrix(M)
figure = diogenes.display.plot_correlation_scatter_plot(M)


There are also a number of tools for exploring the distribution of data in a single column (i.e., a 1-dimensional NumPy array).
chlorides = M['chlorides']
figure = diogenes.display.plot_box_plot(chlorides)
figure = diogenes.display.plot_kernel_density(chlorides)
figure = diogenes.display.plot_simple_histogram(chlorides)
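The functions above draw pictures of a column's distribution. As a hedged sketch of the numbers that sit behind a histogram and a box plot, the following uses numpy.histogram and numpy.percentile on a made-up column; the bin count and range are arbitrary choices for this example, not Diogenes defaults.

```python
import numpy as np

# Made-up column values, for illustration only.
values = np.array([0.1, 0.2, 0.2, 0.3, 0.9])

# Histogram: counts per bin over four equal-width bins on [0, 1].
counts, edges = np.histogram(values, bins=4, range=(0.0, 1.0))

# Box plot ingredients: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])

print(counts, edges)
print(q1, median, q3)
```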



We can also cross-tabulate a column against the labels with diogenes.display.crosstab().
diogenes.display.pprint_sa(diogenes.display.crosstab(np.round(chlorides, 1), labels))
   col1_value  False  True
0  0.0         2668   1067
1  0.1         562    541
2  0.2         27     28
3  0.3         1      4
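The table above counts how many rows fall into each combination of rounded chlorides value and label. A minimal sketch of that counting, using only the standard library, might look like this. crosstab_counts is a hypothetical helper and the input values are made up; the real crosstab returns a structured array rather than a Counter.

```python
from collections import Counter

def crosstab_counts(col1, col2):
    """Count co-occurrences of values in two equal-length sequences."""
    if len(col1) != len(col2):
        raise ValueError('columns must have the same length')
    return Counter(zip(col1, col2))

# Made-up chloride values and labels, for illustration only.
chlorides = [0.04, 0.06, 0.12, 0.09, 0.21, 0.03]
labels = [False, True, True, False, True, False]

# Round first so that nearby values share a row, as in the example above.
rounded = [round(value, 1) for value in chlorides]
counts = crosstab_counts(rounded, labels)

for (value, label), count in sorted(counts.items()):
    print(value, label, count)
```

Combinations that never occur are simply absent from the Counter and read back as zero.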
Examining classifier performance
First, we will arrange and execute a quick grid_search experiment with diogenes.grid_search.experiment.Experiment. This will run Random Forest on our data with a number of different hyper-parameters and a number of different train/test splits. See the documentation for grid_search for more detail.
from sklearn.ensemble import RandomForestClassifier
clfs = [{'clf': RandomForestClassifier, 'n_estimators': [10, 50],
         'max_features': ['sqrt', 'log2'], 'random_state': [0]}]
exp = diogenes.grid_search.experiment.Experiment(M, labels, clfs=clfs)
_ = exp.run()
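To see what "a number of different hyper-parameters" amounts to here, the sketch below expands the parameter lists from the clfs entry into individual settings. It assumes that every combination of the listed values is tried, which is the usual grid-search convention; expand_grid is a hypothetical helper, not part of Diogenes, so check the grid_search documentation for the actual behavior.

```python
from itertools import product

def expand_grid(param_lists):
    """Return one dict per combination of the listed parameter values."""
    names = sorted(param_lists)
    combos = product(*(param_lists[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

# Same hyper-parameter lists as in the clfs entry above, minus the 'clf' key.
params = {'n_estimators': [10, 50],
          'max_features': ['sqrt', 'log2'],
          'random_state': [0]}

settings = expand_grid(params)
for setting in settings:
    print(setting)
```

Under that assumption, two values for n_estimators and two for max_features give four settings per train/test split.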
Now, we will extract a single run, which gives us a single fitted classifier and a single set of test data.
run = exp.trials[0].runs[0][0]
fitted_classifier = run.clf
# Sadly, scikit-learn doesn't accept structured arrays, so we have to convert to a regular (homogeneous) ndarray
M_test = diogenes.utils.cast_np_sa_to_nd(M[run.test_indices])
labels_test = labels[run.test_indices]
scores = fitted_classifier.predict_proba(M_test)[:,1]
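For readers unfamiliar with the conversion in the code above, here is a sketch of the same idea using NumPy alone. It relies on numpy.lib.recfunctions.structured_to_unstructured, which is available in NumPy 1.16 and later; we use it as a stand-in for diogenes.utils.cast_np_sa_to_nd() and do not claim the two are equivalent in every case. The toy values are made up.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Structured array: one named field per column (made-up values).
sa = np.array([(7.0, 3.0), (6.3, 3.3)],
              dtype=[('fixed acidity', 'f8'), ('pH', 'f8')])

# Regular 2-D array: one row per record, one column per field,
# with a single homogeneous dtype.
nd = rfn.structured_to_unstructured(sa)

print(nd.shape, nd.dtype)
```

The column names are lost in the conversion, so keep sa.dtype.names around if they are needed later.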
We can use our fitted classifier and test data to make an ROC curve or a precision-recall curve showing us how well the classifier performs.
roc_fig = diogenes.display.plot_roc(labels_test, scores)
prec_recall_fig = diogenes.display.plot_prec_recall(labels_test, scores)
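For intuition about what an ROC plot shows, the following standard-library sketch computes the (false positive rate, true positive rate) points by sweeping a threshold down through the scores. roc_points is a hypothetical helper for this example and the inputs are made up; it is not how plot_roc is implemented.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct score threshold."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError('need at least one positive and one negative label')
    # Highest score first: lowering the threshold admits items in this order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a group of tied scores.
        last_of_ties = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_ties:
            points.append((fp / float(n_neg), tp / float(n_pos)))
    return points

# Made-up labels and scores.
points = roc_points([True, False, True, False], [0.9, 0.8, 0.7, 0.1])
print(points)
```

A curve that rises quickly toward the top-left corner means positives tend to receive higher scores than negatives.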


For classifiers that offer feature importances, we provide a convenience method to get the top n features.
top_features = diogenes.display.get_top_features(fitted_classifier, M=M)
    feat_name             score
0   alcohol               0.138773360606
1   density               0.124570994958
2   volatile acidity      0.122195201728
3   free sulfur dioxide   0.0899555971065
4   total sulfur dioxide  0.0862687591382
5   chlorides             0.0857157042913
6   residual sugar        0.0852635347622
7   citric acid           0.0848357469315
8   pH                    0.069405648153
9   fixed acidity         0.0586184186208
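A ranking like the one above can be sketched by sorting importances in descending order and keeping the first n. In the example below, top_n_features is a hypothetical helper and the importances are made up; with a fitted scikit-learn forest, the real values come from its feature_importances_ attribute.

```python
import numpy as np

def top_n_features(importances, names, n=10):
    """Return (name, score) pairs for the n largest importances, descending."""
    importances = np.asarray(importances, dtype=float)
    # argsort is ascending, so reverse it before taking the first n.
    order = np.argsort(importances)[::-1][:n]
    return [(names[i], float(importances[i])) for i in order]

# Hypothetical importances, for illustration only.
names = ['alcohol', 'pH', 'density']
importances = [0.5, 0.2, 0.3]
top_two = top_n_features(importances, names, n=2)
print(top_two)
```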
For random forest classifiers, we also provide a function to examine consecutive occurrences of features in decision trees. See diogenes.display.display.feature_pairs_in_rf() for more detail.
results = diogenes.display.feature_pairs_in_rf(fitted_classifier, n=3)
================================================================================
RF Subsequent Pair Analysis
================================================================================
--------------------------------------------------------------------------------
Overall Occurrences
--------------------------------------------------------------------------------
feature pair : occurrences
(1, 3) : 97
(3, 7) : 94
(1, 5) : 86
--------------------------------------------------------------------------------
Average depth
--------------------------------------------------------------------------------
feature pair : average depth
(1, 5) : 7.01162790698
(1, 7) : 7.12698412698
(1, 10) : 7.17567567568
* Max depth was 24
--------------------------------------------------------------------------------
Occurrences weighted by depth
--------------------------------------------------------------------------------
feature pair : sum weight
(9, 9) : 11.48
(3, 3) : 12.56
(2, 2) : 12.96
* Weights for depth 0, 1, 2, ... were: [1.0, 0.96, 0.92, 0.88, 0.84, 0.8, 0.76, 0.72, 0.68, 0.64, 0.6, 0.56, 0.52, 0.48, 0.44, 0.4, 0.36, 0.32, 0.28, 0.24, 0.2, 0.16, 0.12, 0.08, 0.04]
--------------------------------------------------------------------------------
Occurrences at depth 0
--------------------------------------------------------------------------------
feature pair : occurrences
(5, 7) : 2
(6, 10) : 2
(2, 10) : 2
--------------------------------------------------------------------------------
Occurrences at depth 1
--------------------------------------------------------------------------------
feature pair : occurrences
(1, 3) : 4
(1, 2) : 3
(1, 5) : 3
--------------------------------------------------------------------------------
Occurrences at depth 2
--------------------------------------------------------------------------------
feature pair : occurrences
(1, 10) : 7
(1, 5) : 5
(5, 10) : 4
--------------------------------------------------------------------------------
Occurrences at depth 3
--------------------------------------------------------------------------------
feature pair : occurrences
(1, 10) : 8
(4, 7) : 6
(4, 10) : 6
--------------------------------------------------------------------------------
Occurrences at depth 4
--------------------------------------------------------------------------------
feature pair : occurrences
(0, 10) : 9
(3, 10) : 8
(0, 1) : 8
--------------------------------------------------------------------------------
Occurrences at depth 5
--------------------------------------------------------------------------------
feature pair : occurrences
(1, 5) : 14
(6, 10) : 9
(4, 5) : 9
--------------------------------------------------------------------------------
Occurrences at depth 6
--------------------------------------------------------------------------------
feature pair : occurrences
(3, 7) : 14
(5, 7) : 13
(1, 5) : 12
--------------------------------------------------------------------------------
Occurrences at depth 7
--------------------------------------------------------------------------------
feature pair : occurrences
(0, 10) : 13
(1, 3) : 12
(3, 4) : 12
--------------------------------------------------------------------------------
Occurrences at depth 8
--------------------------------------------------------------------------------
feature pair : occurrences
(3, 5) : 15
(3, 7) : 14
(4, 7) : 13
--------------------------------------------------------------------------------
Occurrences at depth 9
--------------------------------------------------------------------------------
feature pair : occurrences
(4, 6) : 14
(3, 4) : 14
(3, 10) : 12
--------------------------------------------------------------------------------
Occurrences at depth 10
--------------------------------------------------------------------------------
feature pair : occurrences
(6, 9) : 15
(7, 8) : 13
(0, 2) : 12
--------------------------------------------------------------------------------
Occurrences at depth 11
--------------------------------------------------------------------------------
feature pair : occurrences
(4, 7) : 9
(4, 5) : 9
(0, 9) : 9
--------------------------------------------------------------------------------
Occurrences at depth 12
--------------------------------------------------------------------------------
feature pair : occurrences
(7, 10) : 9
(1, 3) : 8
(1, 9) : 8
--------------------------------------------------------------------------------
Occurrences at depth 13
--------------------------------------------------------------------------------
feature pair : occurrences
(0, 3) : 8
(3, 7) : 6
(6, 7) : 6
--------------------------------------------------------------------------------
Occurrences at depth 14
--------------------------------------------------------------------------------
feature pair : occurrences
(2, 5) : 5
(1, 2) : 5
(3, 8) : 5
--------------------------------------------------------------------------------
Occurrences at depth 15
--------------------------------------------------------------------------------
feature pair : occurrences
(2, 8) : 4
(6, 9) : 3
(2, 9) : 3
--------------------------------------------------------------------------------
Occurrences at depth 16
--------------------------------------------------------------------------------
feature pair : occurrences
(0, 6) : 4
(6, 6) : 3
(1, 6) : 3
--------------------------------------------------------------------------------
Occurrences at depth 17
--------------------------------------------------------------------------------
feature pair : occurrences
(6, 7) : 5
(4, 5) : 2
(0, 7) : 2
--------------------------------------------------------------------------------
Occurrences at depth 18
--------------------------------------------------------------------------------
feature pair : occurrences
(1, 3) : 2
(4, 7) : 2
(1, 6) : 2
--------------------------------------------------------------------------------
Occurrences at depth 19
--------------------------------------------------------------------------------
feature pair : occurrences
(4, 7) : 1
(1, 3) : 1
(6, 6) : 1
--------------------------------------------------------------------------------
Occurrences at depth 20
--------------------------------------------------------------------------------
feature pair : occurrences
(0, 0) : 1
(4, 9) : 1
(4, 6) : 1
--------------------------------------------------------------------------------
Occurrences at depth 21
--------------------------------------------------------------------------------
feature pair : occurrences
(0, 0) : 1
(0, 2) : 1
(3, 8) : 1
--------------------------------------------------------------------------------
Occurrences at depth 22
--------------------------------------------------------------------------------
feature pair : occurrences
(2, 6) : 1
(0, 7) : 1
(8, 8) : 1
--------------------------------------------------------------------------------
Occurrences at depth 23
--------------------------------------------------------------------------------
feature pair : occurrences
(3, 7) : 1
(6, 7) : 1
--------------------------------------------------------------------------------
Occurrences at depth 24
--------------------------------------------------------------------------------
feature pair : occurrences
(3, 3) : 1
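To help read the report above, here is a sketch of one way such an analysis could work on a single tree. It rests on several assumptions that the report itself does not confirm: that a "pair" is the split feature of a node together with the split feature of one of its children, that pairs are reported with the smaller index first, and that the depth weights fall linearly as 1 - depth / (max_depth + 1), which matches the weights printed above. The tree arrays mimic the layout of scikit-learn's tree_ attribute (feature, children_left, children_right) but are made up.

```python
from collections import Counter

LEAF = -1  # child index used for "no child", as in scikit-learn's tree arrays

def parent_child_pairs(feature, left, right):
    """Yield (depth, (feat_a, feat_b)) for each split whose child also splits."""
    stack = [(0, 0)]  # (node index, depth), starting at the root
    while stack:
        node, depth = stack.pop()
        if left[node] == LEAF:  # leaves have no split feature
            continue
        for child in (left[node], right[node]):
            if left[child] != LEAF:  # the child is itself a split node
                yield depth, tuple(sorted((feature[node], feature[child])))
            stack.append((child, depth + 1))

def depth_weights(max_depth):
    """Linearly decreasing weights for depths 0..max_depth (assumed form)."""
    return [1.0 - depth / float(max_depth + 1) for depth in range(max_depth + 1)]

# Made-up tree: node 0 splits on feature 1, node 1 on 3, nodes 2 and 3 on 1.
feature = [1, 3, 1, 1, -2, -2, -2, -2, -2]
left = [1, 3, 5, 7, -1, -1, -1, -1, -1]
right = [2, 4, 6, 8, -1, -1, -1, -1, -1]

pairs = list(parent_child_pairs(feature, left, right))
counts = Counter(pair for _, pair in pairs)

weights = depth_weights(max(depth for depth, _ in pairs))
weighted = Counter()
for depth, pair in pairs:
    weighted[pair] += weights[depth]

print(counts.most_common())
print(sorted(weighted.items()))
```

For a whole forest, the same counting would be repeated over every tree and the counters summed.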
Making PDF Reports
Finally, diogenes.display provides a simple way to make PDF reports using diogenes.display.display.Report.
- Add headings with diogenes.display.display.Report.add_heading()
- Add text blocks with diogenes.display.display.Report.add_text()
- Add tables with diogenes.display.display.Report.add_table()
- Add figures with diogenes.display.display.Report.add_fig()
- Build the report with diogenes.display.display.Report.to_pdf()
report = diogenes.display.Report(report_path='display_sample_report.pdf')
report.add_heading('My Great Report About RF', level=1)
report.add_text('I did an experiment with the wine data set '
                '(http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)')
report.add_heading('Top Features', level=2)
report.add_table(top_features)
report.add_heading('ROC Plot', level=2)
report.add_fig(roc_fig)
full_report_path = report.to_pdf(verbose=False)
Here’s the result:
from IPython.display import HTML
HTML('<iframe src=display_sample_report.pdf width=700 height=350></iframe>')