diogenes.grid_search package

Submodules

diogenes.grid_search.experiment module

Provides classes necessary for organizing an Experiment

class diogenes.grid_search.experiment.Experiment(M, labels, clfs=[{'clf': <class 'sklearn.ensemble.forest.RandomForestClassifier'>}], subsets=[{'subset': <class 'diogenes.grid_search.subset.SubsetNoSubset'>}], cvs=[{'cv': <class 'sklearn.cross_validation.KFold'>}], trials=None)

Bases: object

Class to execute and organize grid searches.

Several of the init arguments are of type list of dict. Experiment expects these to be in a particular format:

[{CLASS_SPECIFIER: CLASS_1,
  CLASS_1_PARAM_1: [CLASS_1_PARAM_1_VALUE_1, CLASS_1_PARAM_1_VALUE_2, ...
                    CLASS_1_PARAM_1_VALUE_N],
  CLASS_1_PARAM_2: [CLASS_1_PARAM_2_VALUE_1, CLASS_1_PARAM_2_VALUE_2, ...
                    CLASS_1_PARAM_2_VALUE_N],
  ...
  CLASS_1_PARAM_M: [CLASS_1_PARAM_M_VALUE_1, CLASS_1_PARAM_M_VALUE_2, ...
                    CLASS_1_PARAM_M_VALUE_N]},
 {CLASS_SPECIFIER: CLASS_2,
  CLASS_2_PARAM_1: [CLASS_2_PARAM_1_VALUE_1, CLASS_2_PARAM_1_VALUE_2, ...
                    CLASS_2_PARAM_1_VALUE_N],
  CLASS_2_PARAM_2: [CLASS_2_PARAM_2_VALUE_1, CLASS_2_PARAM_2_VALUE_2, ...
                    CLASS_2_PARAM_2_VALUE_N],
  ...
  CLASS_2_PARAM_M: [CLASS_2_PARAM_M_VALUE_1, CLASS_2_PARAM_M_VALUE_2, ...
                    CLASS_2_PARAM_M_VALUE_N]},
 ...
 {CLASS_SPECIFIER: CLASS_L,
  CLASS_L_PARAM_1: [CLASS_L_PARAM_1_VALUE_1, CLASS_L_PARAM_1_VALUE_2, ...
                    CLASS_L_PARAM_1_VALUE_N],
  CLASS_L_PARAM_2: [CLASS_L_PARAM_2_VALUE_1, CLASS_L_PARAM_2_VALUE_2, ...
                    CLASS_L_PARAM_2_VALUE_N],
  ...
  CLASS_L_PARAM_M: [CLASS_L_PARAM_M_VALUE_1, CLASS_L_PARAM_M_VALUE_2, ...
                    CLASS_L_PARAM_M_VALUE_N]}]

CLASS_SPECIFIER is a different string constant for each argument. For clfs, it is ‘clf’. For subsets, it is ‘subset’, and for cvs it is ‘cv’.

CLASS_* is a class object which will be used to either classify data (in clfs), take a subset of data (in subsets) or specify train/test splits (in cvs). In clfs, it should be a subclass of sklearn.base.BaseEstimator. In subsets, it should be a subclass of diogenes.grid_search.subset.BaseSubsetIter. In cvs, it should be a subclass of sklearn.cross_validation._PartitionIterator.

CLASS_X_PARAM_* is an init argument of CLASS_X. For example, if CLASS_1 is sklearn.ensemble.RandomForestClassifier, CLASS_1_PARAM_1 might be the string literal ‘n_estimators’ or the string literal ‘max_features’.

CLASS_X_PARAM_Y_VALUE_* is a single value to try as the argument for CLASS_X_PARAM_Y. For example, if CLASS_1_PARAM_1 is ‘n_estimators’, then CLASS_1_PARAM_1_VALUE_1 could be 10 and CLASS_1_PARAM_1_VALUE_2 could be 100.

When we run the Experiment with Experiment.run, Experiment will create a Trial for each element in the Cartesian product of all parameters for each class. If we have {‘clf’ : RandomForestClassifier, ‘n_estimators’ : [10, 100], ‘max_features’: [3, 5]}, then there will be a Trial for each of RandomForestClassifier(n_estimators=10, max_features=3), RandomForestClassifier(n_estimators=100, max_features=3), RandomForestClassifier(n_estimators=10, max_features=5), and RandomForestClassifier(n_estimators=100, max_features=5).
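
The expansion of one dict into parameter combinations can be sketched with itertools.product. This is only an illustration of the idea, not the library's implementation: expand_grid is a hypothetical helper, the parameter names are chosen for the example, and a string stands in for the classifier class so the snippet runs without scikit-learn.

```python
import itertools


def expand_grid(spec, class_key):
    """Yield (cls, params) for every combination of parameter values.

    Hypothetical helper: `spec` is one dict from a clfs/subsets/cvs list and
    `class_key` is its CLASS_SPECIFIER ('clf', 'subset', or 'cv').
    """
    cls = spec[class_key]
    # Every key other than the class specifier names an init argument.
    param_names = sorted(k for k in spec if k != class_key)
    value_lists = [spec[name] for name in param_names]
    for combo in itertools.product(*value_lists):
        yield cls, dict(zip(param_names, combo))


# A string stands in for the classifier class (assumption for this sketch).
spec = {'clf': 'RandomForestClassifier',
        'n_estimators': [10, 100],
        'max_depth': [3, 5]}
combos = list(expand_grid(spec, 'clf'))
for cls, params in combos:
    print(cls, params)
```

A dict that lists no parameters still yields one combination with empty params, which matches the idea of a single Trial using the class defaults.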

For examples of how to create these arguments, look at diogenes.grid_search.standard_clfs.py.

Parameters:
  • M (numpy.ndarray) – structured array corresponding to features to experiment on
  • labels (numpy.ndarray) – vector of labels
  • clfs (list of dict) – classifiers to run
  • subsets (list of dict) – subsetting operations to perform
  • cvs (list of dict) – directives to make train and test sets
  • trials (list of Trial or None) – If a number of Trials have already been run, and we just want to collect them into an Experiment rather than starting from scratch, Experiment can be initialized with a list of already run Trials
trials

list of Trial

Trials corresponding to this experiment.

average_score()

Get average score for all Trials in this experiment

Returns:the average score of each Trial, keyed by Trial
Return type:dict of Trial : float
static csv_header()

Returns the header required to make a csv

has_run()

Returns boolean specifying whether this experiment has been run

iterate_over_dimension(dimension)

Iterates over sets of Trials with respect to dimension

For example, if we iterate across CLF, each iteration will include all the Trials that use a given CLF. If the experiment has RandomForestClassifier and SVC trials, then one iteration will have all trials with RandomForestClassifier and the other iteration will have all trials with SVC.

Parameters:dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS}) –

dimension to iterate over:

CLF
iterate over Trials on classifier
CLF_PARAMS
iterate over Trials on classifier parameters
SUBSET
iterate over Trials on subset iterator
SUBSET_PARAMS
iterate over Trials on subset iterator parameters
CV
iterate over Trials on cross-validation partition iterator
CV_PARAMS
iterate over Trials on partition iterator parameters
Returns:The first element of each tuple is the value of dimension shared by all trials in the second element of the tuple.

The second element of each tuple is an Experiment containing all trials where the given dimension is equal to the value in the first element of the tuple.

Return type:iterator of (?, Experiment)
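
The grouping that iterate_over_dimension describes can be sketched as follows. This is a stand-in under stated assumptions, not the library's code: trials are represented as plain dicts, group_by_dimension is a hypothetical helper, and each group is a list rather than an Experiment.

```python
from collections import OrderedDict


def group_by_dimension(trials, dimension):
    """Group trial records by the value they hold for `dimension`.

    Hypothetical stand-in: each trial is a plain dict, whereas the real
    method yields (value, Experiment) pairs.
    """
    groups = OrderedDict()
    for trial in trials:
        groups.setdefault(trial[dimension], []).append(trial)
    return list(groups.items())


trials = [
    {'clf': 'RandomForestClassifier', 'clf_params': 'n_estimators=10'},
    {'clf': 'RandomForestClassifier', 'clf_params': 'n_estimators=100'},
    {'clf': 'SVC', 'clf_params': 'C=1.0'},
]
grouped = group_by_dimension(trials, 'clf')
for value, members in grouped:
    print(value, len(members))
```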
make_csv(file_name='report.csv')

Creates a csv summarizing the experiment

Parameters:file_name (str) – path of csv to be generated
Returns:path of generated csv
Return type:str
make_report(report_file_name='report.pdf', dimension=None, return_report_object=False, verbose=True)

Creates a pdf report of this experiment

Parameters:
  • report_file_name (str) – path of file for report output
  • dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS, None}) – If not None, will make a subreport for each unique value of dimension
  • return_report_object (boolean) – Iff True, this function returns the report file name and the diogenes.display.Report object. Otherwise, just returns the report file name.
  • verbose (boolean) – iff True, gives output about report generation
Returns:

If return_report_object is False, returns the file name of the generated report. Else, returns a tuple of the filename of the generated report as well as the Report object representing the report

Return type:

str or (str, diogenes.display.Report)

roc_auc()

Get average area under the roc curve for all Trials in this experiment

Returns:the average area under the roc curve of each Trial, keyed by Trial
Return type:dict of Trial : float
run()

Runs the experiment. Fits all classifiers

Returns:Trials with fitted classifiers
Return type:list of Trial
slice_by_best_score(dimension)

Returns trials that have the best score across dimension

Parameters:dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS}) –

dimension to find best trials over

CLF
find best Trials over classifier
CLF_PARAMS
find best Trials over classifier parameters
SUBSET
find best Trials over subset iterator
SUBSET_PARAMS
find best Trials over subset iterator parameters
CV
find best Trials over cross-validation partition iterator
CV_PARAMS
find best Trials over partition iterator parameters
Returns:Experiment with only the trials that have the best scores over the selected dimension
Return type:Experiment
slice_on_dimension(dimension, value)

Select Trials where dimension == value

Parameters:
  • dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS}) –

    dimension to slice on:

    CLF
    select Trials where the classifier == value
    CLF_PARAMS
    select Trials where the classifier parameters == value
    SUBSET
    select Trials where the subset iterator == value
    SUBSET_PARAMS
    select Trials where the subset iterator parameters == value
    CV
    select Trials where the cross-validation partition iterator == value
    CV_PARAMS
    select Trials where the partition iterator params == value
  • value – Value to match
Returns:

containing only the specified Trials

Return type:

Experiment
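
Selecting Trials where dimension == value amounts to a filter. The sketch below assumes trials are plain dicts and uses a hypothetical helper named select_trials; the real method returns an Experiment rather than a list.

```python
def select_trials(trials, dimension, value):
    """Keep only the trial records whose `dimension` equals `value`.

    Hypothetical stand-in for illustration; trials are plain dicts here.
    """
    return [trial for trial in trials if trial[dimension] == value]


trials = [
    {'clf': 'RandomForestClassifier', 'cv': 'KFold'},
    {'clf': 'SVC', 'cv': 'KFold'},
    {'clf': 'SVC', 'cv': 'NoCV'},
]
svc_trials = select_trials(trials, 'clf', 'SVC')
print(len(svc_trials))
```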

class diogenes.grid_search.experiment.Run(M, labels, col_names, clf, train_indices, test_indices, sub_col_names, sub_col_inds, subset_note, cv_note, M_test=None, labels_test=None)

Bases: object

Object encapsulating a single fitted classifier and specific data

Parameters:
  • M (numpy.ndarray) – Homogeneous (not structured) array. The array of features. If M_test is None, M contains both train and test sets. If M_test is not None, M contains only the training set.
  • labels (numpy.ndarray) – Array of labels. If labels_test is None, labels contains both train and test sets. If labels_test is not None, labels contains only the training set.
  • col_names (list of str) – Names of features
  • clf (sklearn.base.BaseEstimator) – clf fitted with training data
  • train_indices (np.ndarray or None) – If M_test and labels_test are None, the indices of M and labels that comprise the training set
  • test_indices (np.ndarray or None) – If M_test and labels_test are None, the indices of M and labels that comprise the testing set
  • sub_col_names (list of str) – If subset takes a subset of columns, these are the column names involved in this subset
  • sub_col_inds (np.ndarray) – If subset takes a subset of columns, these are the indices of the columns involved in this subset
  • subset_note (dict of str : ?) – Extra information about this Run provided by the subsetter
  • cv_note (dict of str : ?) – Extra information about this run provided by the partition iterator
  • M_test (np.ndarray or None) – If not None, the features in the test set
  • labels_test (np.ndarray or None) – If not None, the labels in the test set
M

numpy.ndarray

Homogeneous (not structured) array. The array of features. If M_test is None, M contains both train and test sets. If M_test is not None, M contains only the training set.

labels

numpy.ndarray

Array of labels. If labels_test is None, labels contains both train and test sets. If labels_test is not None, labels contains only the training set.

col_names

list of str

Names of features

clf

sklearn.base.BaseEstimator

clf fitted with training data

train_indices

np.ndarray or None

If M_test and labels_test are None, the indices of M and labels that comprise the training set

test_indices

np.ndarray or None

If M_test and labels_test are None, the indices of M and labels that comprise the testing set

sub_col_names

list of str

If subset takes a subset of columns, these are the column names involved in this subset

sub_col_inds

np.ndarray

If subset takes a subset of columns, these are the indices of the columns involved in this subset

subset_note

dict of str : ?

Extra information about this Run provided by the subsetter

cv_note

dict of str : ?

Extra information about this run provided by the partition iterator

M_test

np.ndarray or None

If not None, the features in the test set

labels_test

np.ndarray or None

If not None, the labels in the test set

static csv_header()

Returns a portion of the header necessary in constructing the csv

csv_row()

Returns this Run’s portion of its row in the produced csv

f1_score()

Returns f1 score

prec_recall_curve()

Returns matplotlib.figure.Figure of precision/recall curve

precision_at_thresholds(query_thresholds)

Returns precision at given thresholds

Parameters:query_thresholds (list of float) – for each element, 0 <= thresh <= 1
Returns:precision at each of the given thresholds
Return type:list of float
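
As an illustration of precision at thresholds, the sketch below assumes each threshold is a cut-off on the predicted score, so rows with score >= threshold count as predicted positive. That reading is an assumption, and the library may define the threshold differently. precision_at_score_thresholds is a hypothetical helper written in plain Python.

```python
def precision_at_score_thresholds(y_true, y_score, query_thresholds):
    """Return precision for each threshold in `query_thresholds`.

    Assumption: a row is predicted positive when its score >= threshold.
    Precision is nan when no row is predicted positive.
    """
    precisions = []
    for thresh in query_thresholds:
        # True labels of the rows predicted positive at this threshold.
        predicted_pos = [t for t, s in zip(y_true, y_score) if s >= thresh]
        if not predicted_pos:
            precisions.append(float('nan'))
            continue
        precisions.append(sum(predicted_pos) / float(len(predicted_pos)))
    return precisions


y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.1]
result = precision_at_score_thresholds(y_true, y_score, [0.5, 0.85])
print(result)
```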
roc_auc()

Returns area under ROC curve

roc_curve()

Returns matplotlib.figure.Figure of ROC curve

score()

Returns score of fitted clf

sorted_top_feat_importance(n=25)

Returns top feature importances

Parameters:n (int) – number of feature importances to return
Returns:names and scores of top features
Return type:[list of str, list of floats]
class diogenes.grid_search.experiment.Trial(M, labels, col_names, clf=<class 'sklearn.ensemble.forest.RandomForestClassifier'>, clf_params={}, subset=<class 'diogenes.grid_search.subset.SubsetNoSubset'>, subset_params={}, cv=<class 'diogenes.grid_search.partition_iterator.NoCV'>, cv_params={}, runs=None)

Bases: object

Object encapsulating all Runs for a given configuration

Parameters:
  • M (numpy.ndarray) – Homogeneous (not structured) array of features
  • labels (numpy.ndarray) – Array of labels
  • col_names (list of str) – Names of features
  • clf (sklearn.base.BaseEstimator class) – Classifier for this trial
  • clf_params (dict of str : ?) – init parameters for clf
  • subset (diogenes.grid_search.subset.BaseSubsetIter class) – class of object making subsets
  • subset_params (dict of str : ?) – init parameters for subset
  • cv (sklearn.cross_validation._PartitionIterator class) – class used to produce train and test sets
  • cv_params (dict of str : ?) – init parameters of cv
  • runs (list of list of Run or None) – if not None, can initialize this trial with a list of Runs that have already been created.
M

numpy.ndarray

Homogeneous (not structured) array of features

labels

numpy.ndarray

Array of labels

col_names

list of str

Names of features

clf

sklearn.base.BaseEstimator class

Classifier for this trial

clf_params

dict of str : ?

init parameters for clf

subset

diogenes.grid_search.subset.BaseSubsetIter class

class of object making subsets

subset_params

dict of str : ?

init parameters for subset

cv

sklearn.cross_validation._PartitionIterator class

class used to produce train and test sets

cv_params

dict of str : ?

init parameters of cv

runs

list of list of Run or None

Runs in this Trial. The outer list signifies different subsets. The inner lists signify different train/test splits using the same subset

average_score()

Returns average score across all Runs

static csv_header()

Returns portion of header used when creating csv

csv_rows()

Returns portions of rows used in creating csv

has_run()

Returns True iff this Trial has been run

median_run()

Returns Run with median score

prec_recall_curve()

Returns matplotlib.figure.Figure of precision/recall curve of median run

roc_auc()

Returns area under roc curve of median run

roc_curve()

Returns matplotlib.figure.Figure of roc_curve of median run

run()

Run the Trial

runs_flattened()

Returns list of all Runs (rather than list of list of Runs)
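
Flattening the nested runs list and picking a median can be sketched as below. This is an illustration only: plain floats stand in for Run objects and their scores, flatten_runs and median_by_score are hypothetical helpers, and taking the upper middle element for an even count is an assumption rather than documented behavior.

```python
import itertools


def flatten_runs(runs):
    """Turn a list of lists (subsets x train/test splits) into one list."""
    return list(itertools.chain.from_iterable(runs))


def median_by_score(flat_runs, score):
    """Return the element whose score is in the middle of the sorted order.

    Assumption: for an even count, the upper of the two middle elements.
    """
    ordered = sorted(flat_runs, key=score)
    return ordered[len(ordered) // 2]


# Floats stand in for Runs; the outer list is subsets, the inner lists splits.
nested = [[0.70, 0.90], [0.80, 0.60, 0.75]]
flat = flatten_runs(nested)
median = median_by_score(flat, score=lambda run: run)
print(flat, median)
```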

diogenes.grid_search.partition_iterator module

Custom objects used to produce train/test splits.

Also, anything in sklearn.cross_validation should work in Experiments

class diogenes.grid_search.partition_iterator.NoCV(n, indices=None)

Bases: sklearn.cross_validation._PartitionIterator

Partition iterator that just returns the entire set as the training set

Parameters:n (int) – The number of rows in the data
class diogenes.grid_search.partition_iterator.SlidingWindowIdx(n, train_start, train_window_size, test_start, test_window_size, inc_value, expanding_train=False)

Bases: sklearn.cross_validation._PartitionIterator

Partition iterator that iterates across indices of array

Has a moving window of indices for the training set and a moving window of indices for the testing set.

Parameters:
  • n (int) – Number of rows in the data
  • train_start (int) – start index of training window
  • train_window_size (int) – number of rows in the (initial) training window
  • test_start (int) – start index of testing window
  • test_window_size (int) – number of rows in testing window
  • inc_value (int) – number of rows to increment train and test sets in each iteration
  • expanding_train (bool) –

    If True, the end of the train window moves forward with each iteration, but the beginning of the train window does not. Consequently, more rows are added to the training set with each iteration

    If False, the beginning and end of the train window both move forward, so the training set remains the same size.

cv_note()

dict of str providing extra info about the current iteration
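
How the SlidingWindowIdx parameters interact can be sketched with a small generator. This is not the library's implementation: sliding_window_indices is a hypothetical function, windows are treated as half-open ranges, and iteration is assumed to stop once the testing window would run past row n.

```python
def sliding_window_indices(n, train_start, train_window_size, test_start,
                           test_window_size, inc_value, expanding_train=False):
    """Yield (train_indices, test_indices) as lists of row indices.

    Assumptions: windows are half-open [begin, end) and iteration stops
    when the testing window would extend past row n.
    """
    if inc_value <= 0:
        raise ValueError('inc_value must be positive')
    train_begin = train_start
    train_end = train_start + train_window_size
    test_begin = test_start
    test_end = test_start + test_window_size
    while test_end <= n:
        yield (list(range(train_begin, train_end)),
               list(range(test_begin, test_end)))
        # With expanding_train the start stays put, so the window grows.
        if not expanding_train:
            train_begin += inc_value
        train_end += inc_value
        test_begin += inc_value
        test_end += inc_value


splits = list(sliding_window_indices(10, 0, 4, 4, 2, 2))
expanding = list(sliding_window_indices(10, 0, 4, 4, 2, 2,
                                        expanding_train=True))
for train, test in splits:
    print(train, test)
```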

class diogenes.grid_search.partition_iterator.SlidingWindowValue(M, col_names, guide_col_name, train_start, train_window_size, test_start, test_window_size, inc_value, expanding_train=False)

Bases: sklearn.cross_validation._PartitionIterator

Partition iterator that iterates across values of a column in an array

Has a moving window of guide column values for the training set and a moving window of guide column values for the testing set.

Parameters:
  • M (numpy.ndarray) – homogeneous (not structured) array. Feature array from which to draw train and test sets
  • col_names (list of str) – names of features in M
  • guide_col_name (str) – name of feature to use to determine train and test sets
  • train_start (number) – start value for guide_col_name in training window
  • train_window_size (number) – size of the (initial) training window
  • test_start (number) – start value for guide_col_name in testing window
  • test_window_size (number) – size of the testing window
  • inc_value (number) – value to increment train and test sets in each iteration
  • expanding_train (bool) –

    If True, the end of the train window moves forward with each iteration, but the beginning of the train window does not. Consequently, more rows are added to the training set with each iteration

    If False, the beginning and end of the train window both move forward, so the training window remains the same size.

cv_note()

dict of str providing extra info about the current iteration
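
The value-based variant can be sketched the same way, selecting rows by the value of the guide column instead of by position. The function below is hypothetical and makes two assumptions: windows are half-open intervals of guide values, and iteration stops once the testing window starts past the largest guide value. It takes the guide column directly as a list rather than M, col_names and guide_col_name.

```python
def sliding_window_values(guide_values, train_start, train_window_size,
                          test_start, test_window_size, inc_value,
                          expanding_train=False):
    """Yield (train_indices, test_indices) chosen by guide column value.

    Assumptions: windows are half-open [lo, hi) in guide-value units and
    iteration stops when the testing window starts past the largest value.
    """
    if inc_value <= 0:
        raise ValueError('inc_value must be positive')
    train_lo, train_hi = train_start, train_start + train_window_size
    test_lo, test_hi = test_start, test_start + test_window_size
    max_value = max(guide_values)
    while test_lo <= max_value:
        train = [i for i, v in enumerate(guide_values)
                 if train_lo <= v < train_hi]
        test = [i for i, v in enumerate(guide_values)
                if test_lo <= v < test_hi]
        yield train, test
        if not expanding_train:
            train_lo += inc_value
        train_hi += inc_value
        test_lo += inc_value
        test_hi += inc_value


# The guide column holds a year for each row (illustrative data).
years = [2010, 2010, 2011, 2012, 2012, 2013]
value_splits = list(sliding_window_values(years, 2010, 2, 2012, 1, 1))
for train, test in value_splits:
    print(train, test)
```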

diogenes.grid_search.standard_clfs module

Provides a standard set of classifiers and params to try with Experiment

diogenes.grid_search.standard_clfs.std_clfs

list of dict

A set of supervised, binary classifiers

diogenes.grid_search.standard_clfs.rg_clfs

list of dict

A much more extensive list of classifiers and parameters

diogenes.grid_search.subset module

This module provides different ways to take subsets of data

class diogenes.grid_search.subset.BaseSubsetIter(y, col_names)

Bases: object

class diogenes.grid_search.subset.SubsetNoSubset(y, col_names)

Bases: diogenes.grid_search.subset.BaseSubsetIter

Generates a single subset consisting of all data

class diogenes.grid_search.subset.SubsetRandomRowsActualDistribution(y, col_names, subset_size, n_subsets=3, random_state=None)

Bases: diogenes.grid_search.subset.BaseSubsetIter

Generates subsets reflecting actual distribution of labels

Parameters:
  • y (np.ndarray) – labels in data
  • col_names (list of str) – names of all features in data
  • subset_size (int) – number of rows in each subset
  • n_subsets (int) – number of subsets to pick
  • random_state (int) – random seed
class diogenes.grid_search.subset.SubsetRandomRowsEvenDistribution(y, col_names, subset_size, n_subsets=3, random_state=None)

Bases: diogenes.grid_search.subset.BaseSubsetIter

Generates subsets where each label appears at about the same frequency

Parameters:
  • y (np.ndarray) – labels in data
  • col_names (list of str) – names of all features in data
  • subset_size (int) – number of rows in each subset
  • n_subsets (int) – number of subsets to pick
  • random_state (int) – random seed
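
One way to draw subsets in which each label appears at about the same frequency is to sample an equal number of rows per label. The sketch below is an assumption about the general technique, not the library's code; even_distribution_subsets is a hypothetical function built on the standard random module.

```python
import random


def even_distribution_subsets(y, subset_size, n_subsets=3, random_state=None):
    """Return lists of row indices with roughly equal counts per label.

    Assumption: each label contributes subset_size // n_labels rows, or all
    of its rows if it has fewer than that.
    """
    rng = random.Random(random_state)
    by_label = {}
    for idx, label in enumerate(y):
        by_label.setdefault(label, []).append(idx)
    labels = sorted(by_label)
    per_label = subset_size // len(labels)
    subsets = []
    for _ in range(n_subsets):
        chosen = []
        for label in labels:
            pool = by_label[label]
            chosen.extend(rng.sample(pool, min(per_label, len(pool))))
        subsets.append(sorted(chosen))
    return subsets


y = [0] * 8 + [1] * 4
subsets = even_distribution_subsets(y, subset_size=6, n_subsets=3,
                                    random_state=0)
for subset in subsets:
    print(subset)
```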
class diogenes.grid_search.subset.SubsetSweepNumRows(y, col_names, num_rows, random_state=None)

Bases: diogenes.grid_search.subset.BaseSubsetIter

Generates subsets with varying number of rows

Parameters:
  • y (np.ndarray) – labels in data
  • col_names (list of str) – names of all features in data
  • num_rows (list of int) – number of rows in each subset
  • random_state (int) – random seed
class diogenes.grid_search.subset.SubsetSweepVaryStratification(y, col_names, proportions_positive, subset_size, random_state=None)

Bases: diogenes.grid_search.subset.BaseSubsetIter

Generates subsets with varying proportion of True and False labels

Parameters:
  • y (np.ndarray) – labels in data
  • col_names (list of str) – names of all features in data
  • proportions_positive (list of float) – proportions of positive labels in each subset
  • subset_size (int) – number of rows in each subset
  • random_state (int) – random seed
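
Varying the proportion of positive labels across subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: vary_stratification_subsets is a hypothetical function, labels are treated as truthy (positive) or falsy (negative), and the positive count is rounded to the nearest integer.

```python
import random


def vary_stratification_subsets(y, proportions_positive, subset_size,
                                random_state=None):
    """Return one list of row indices per requested positive proportion.

    Assumption: sampling is without replacement, so random.sample raises
    ValueError if a class has too few rows for the requested proportion.
    """
    rng = random.Random(random_state)
    positives = [i for i, label in enumerate(y) if label]
    negatives = [i for i, label in enumerate(y) if not label]
    subsets = []
    for proportion in proportions_positive:
        n_pos = int(round(proportion * subset_size))
        n_neg = subset_size - n_pos
        chosen = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
        subsets.append(sorted(chosen))
    return subsets


y = [1] * 10 + [0] * 10
strat_subsets = vary_stratification_subsets(y, [0.2, 0.5, 0.8],
                                            subset_size=10, random_state=0)
for subset in strat_subsets:
    print(sum(y[i] for i in subset), len(subset))
```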