diogenes.grid_search package¶
Submodules¶
diogenes.grid_search.experiment module¶
Provides classes necessary for organizing an Experiment
-
class
diogenes.grid_search.experiment.
Experiment
(M, labels, clfs=[{'clf': <class 'sklearn.ensemble.forest.RandomForestClassifier'>}], subsets=[{'subset': <class 'diogenes.grid_search.subset.SubsetNoSubset'>}], cvs=[{'cv': <class 'sklearn.cross_validation.KFold'>}], trials=None)¶ Bases:
object
Class to execute and organize grid searches.
Several of the init arguments are of type list of dict. Experiment expects these to be in a particular format:
[{CLASS_SPECIFIER: CLASS_1,
  CLASS_1_PARAM_1: [CLASS_1_PARAM_1_VALUE_1, CLASS_1_PARAM_1_VALUE_2, ...
                    CLASS_1_PARAM_1_VALUE_N],
  CLASS_1_PARAM_2: [CLASS_1_PARAM_2_VALUE_1, CLASS_1_PARAM_2_VALUE_2, ...
                    CLASS_1_PARAM_2_VALUE_N],
  ...
  CLASS_1_PARAM_M: [CLASS_1_PARAM_M_VALUE_1, CLASS_1_PARAM_M_VALUE_2, ...
                    CLASS_1_PARAM_M_VALUE_N]},
 {CLASS_SPECIFIER: CLASS_2,
  CLASS_2_PARAM_1: [CLASS_2_PARAM_1_VALUE_1, CLASS_2_PARAM_1_VALUE_2, ...
                    CLASS_2_PARAM_1_VALUE_N],
  CLASS_2_PARAM_2: [CLASS_2_PARAM_2_VALUE_1, CLASS_2_PARAM_2_VALUE_2, ...
                    CLASS_2_PARAM_2_VALUE_N],
  ...
  CLASS_2_PARAM_M: [CLASS_2_PARAM_M_VALUE_1, CLASS_2_PARAM_M_VALUE_2, ...
                    CLASS_2_PARAM_M_VALUE_N]},
 ...
 {CLASS_SPECIFIER: CLASS_L,
  CLASS_L_PARAM_1: [CLASS_L_PARAM_1_VALUE_1, CLASS_L_PARAM_1_VALUE_2, ...
                    CLASS_L_PARAM_1_VALUE_N],
  CLASS_L_PARAM_2: [CLASS_L_PARAM_2_VALUE_1, CLASS_L_PARAM_2_VALUE_2, ...
                    CLASS_L_PARAM_2_VALUE_N],
  ...
  CLASS_L_PARAM_M: [CLASS_L_PARAM_M_VALUE_1, CLASS_L_PARAM_M_VALUE_2, ...
                    CLASS_L_PARAM_M_VALUE_N]}]
CLASS_SPECIFIER is a different string constant for each argument. For clfs, it is ‘clf’. For subsets, it is ‘subset’, and for cvs, it is ‘cv’.
CLASS_* is a class object which will be used to either classify data (in clfs), take a subset of data (in subsets) or specify train/test splits (in cvs). In clfs, it should be a subclass of sklearn.base.BaseEstimator. In subsets, it should be a subclass of diogenes.grid_search.subset.BaseSubsetIter. In cvs, it should be a subclass of sklearn.cross_validation._PartitionIterator.
CLASS_X_PARAM_* is the name of an init argument of CLASS_X. For example, if CLASS_1 is sklearn.ensemble.RandomForestClassifier, CLASS_1_PARAM_1 might be the string literal ‘n_estimators’ or the string literal ‘max_features’.
CLASS_X_PARAM_Y_VALUE_* is a single value to try as the argument for CLASS_X_PARAM_Y. For example, if CLASS_1_PARAM_1 is ‘n_estimators’, then CLASS_1_PARAM_1_VALUE_1 could be 10 and CLASS_1_PARAM_1_VALUE_2 could be 100.
When we run the Experiment with Experiment.run, Experiment will create a Trial for each element in the cartesian product of all parameters for each class. If we have {‘clf’: RandomForestClassifier, ‘n_estimators’: [10, 100], ‘max_features’: [3, 5]}, then there will be a Trial for each of RandomForestClassifier(n_estimators=10, max_features=3), RandomForestClassifier(n_estimators=100, max_features=3), RandomForestClassifier(n_estimators=10, max_features=5), and RandomForestClassifier(n_estimators=100, max_features=5).
For examples of how to create these arguments, look at the diogenes.grid_search.standard_clfs module.
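The expansion of one list-of-dict argument into individual configurations can be sketched with itertools.product. This is a minimal illustration, not the Experiment implementation: FakeForest and expand_grid are hypothetical names introduced here, and the sketch covers a single argument only (it does not combine clfs with subsets and cvs).

```python
from itertools import product


class FakeForest:
    """Hypothetical stand-in for a classifier class; stores its init kwargs."""

    def __init__(self, **params):
        self.params = params


def expand_grid(specs, specifier):
    """Yield (cls, params) for every parameter combination in specs."""
    for spec in specs:
        cls = spec[specifier]
        # Every key other than the specifier names an init argument.
        names = sorted(key for key in spec if key != specifier)
        value_lists = [spec[name] for name in names]
        # product() of no lists yields one empty tuple, so a spec without
        # parameters still produces exactly one configuration.
        for values in product(*value_lists):
            yield cls, dict(zip(names, values))


clfs = [{'clf': FakeForest, 'n_estimators': [10, 100], 'max_features': [3, 5]}]
combos = list(expand_grid(clfs, 'clf'))
models = [cls(**params) for cls, params in combos]
```

Each entry of combos pairs the class with one dict of init arguments, which is the shape a Trial takes as clf and clf_params.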
Parameters: - M (numpy.ndarray) – structured array corresponding to features to experiment on
- labels (numpy.ndarray) – vector of labels
- clfs (list of dict) – classifiers to run
- subsets (list of dict) – subsetting operations to perform
- cvs (list of dict) – directives to make train and test sets
- trials (list of Trial or None) – If a number of Trials have already been run, and we just want to collect them into an Experiment rather than starting from scratch, Experiment can be initialized with a list of already run Trials
-
trials
¶ list of Trial
Trials corresponding to this experiment.
-
average_score
()¶ Get average score for all Trials in this experiment
Returns: Provides the average score of each trial Return type: dict of Trial : float
-
static
csv_header
()¶ Returns the header required to make a csv
-
has_run
()¶ Returns boolean specifying whether this experiment has been run
-
iterate_over_dimension
(dimension)¶ Iterates over sets of Trials with respect to dimension
For example, if we iterate across CLF, each iteration will include all the Trials that use a given CLF. If the experiment has RandomForestClassifier and SVC trials, then one iteration will have all trials with RandomForestClassifier and the other iteration will have all trials with SVC
Parameters: dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS}) – dimension to iterate over:
- CLF
- iterate over Trials on classifier
- CLF_PARAMS
- iterate over Trials on classifier parameters
- SUBSET
- iterate over Trials on subset iterator
- SUBSET_PARAMS
- iterate over Trials on subset iterator parameters
- CV
- iterate over Trials on cross-validation partition iterator
- CV_PARAMS
- iterate over Trials on partition iterator parameters
Returns: The first element of the tuple is the value of dimension that all trials in the second element of the tuple have. The second element of the tuple is an Experiment containing all trials where the given dimension is equal to the value in the first element of the tuple.
Return type: iterator of (?, Experiment)
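The grouping behaviour described above can be sketched with plain dicts standing in for Trials. This is only an illustration under stated assumptions: the trial records, the lowercase dimension keys, and the iterate_over helper are hypothetical, and the real method yields Experiment objects rather than lists.

```python
from collections import OrderedDict

# Hypothetical stand-ins: each trial is a plain dict keyed by dimension name.
trials = [
    {'clf': 'RandomForestClassifier', 'clf_params': {'n_estimators': 10}},
    {'clf': 'RandomForestClassifier', 'clf_params': {'n_estimators': 100}},
    {'clf': 'SVC', 'clf_params': {'C': 1.0}},
]


def iterate_over(trials, dimension):
    """Return (value, matching_trials) pairs, one per distinct value."""
    groups = OrderedDict()
    for trial in trials:
        value = trial[dimension]
        # repr() gives a hashable key even when the value is a dict.
        key = repr(value)
        if key not in groups:
            groups[key] = (value, [])
        groups[key][1].append(trial)
    return list(groups.values())


by_clf = iterate_over(trials, 'clf')
by_params = iterate_over(trials, 'clf_params')
```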
-
make_csv
(file_name='report.csv')¶ Creates a csv summarizing the experiment
Parameters: file_name (str) – path of csv to be generated Returns: path of generated csv Return type: str
-
make_report
(report_file_name='report.pdf', dimension=None, return_report_object=False, verbose=True)¶ Creates a pdf report of this experiment
Parameters: - report_file_name (str) – path of file for report output
- dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS, None}) – If not None, will make a subreport for each unique value of dimension
- return_report_object (boolean) – Iff True, this function returns the report file name and the diogenes.display.Report object. Otherwise, just returns the report file name.
- verbose (boolean) – iff True, gives output about report generation
Returns: If return_report_object is False, returns the file name of the generated report. Else, returns a tuple of the filename of the generated report as well as the Report object representing the report
Return type: str or (str, diogenes.display.Report)
-
roc_auc
()¶ Get average area under the roc curve for all Trials in this experiment
Returns: Provides the average area under the roc curve of each trial Return type: dict of Trial : float
-
run
()¶ Runs the experiment. Fits all classifiers
Returns: Trials with fitted classifiers Return type: list of Trial
-
slice_by_best_score
(dimension)¶ Returns trials that have the best score across dimension
Parameters: dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS}) – dimension to find best trials over
- CLF
- find best Trials over classifier
- CLF_PARAMS
- find best Trials over classifier parameters
- SUBSET
- find best Trials over subset iterator
- SUBSET_PARAMS
- find best Trials over subset iterator parameters
- CV
- find best Trials over cross-validation partition iterator
- CV_PARAMS
- find best Trials over partition iterator parameters
Returns: Experiment with only the trials that have the best scores over the selected dimension Return type: Experiment
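The text does not spell out how trials are grouped before the best one is chosen. One plausible reading, sketched below purely as an assumption, is that trials agreeing on every other dimension are compared and the top scorer of each group is kept. The reduced DIMENSIONS tuple, the dict records with a 'score' key, and best_over are all hypothetical.

```python
# Reduced, hypothetical set of dimensions for the sake of a short example.
DIMENSIONS = ('clf', 'clf_params')

# Hypothetical stand-ins for Trials, each carrying a precomputed score.
trials = [
    {'clf': 'RF', 'clf_params': {'n_estimators': 10}, 'score': 0.81},
    {'clf': 'RF', 'clf_params': {'n_estimators': 100}, 'score': 0.86},
    {'clf': 'SVC', 'clf_params': {'C': 1.0}, 'score': 0.79},
]


def best_over(trials, dimension):
    """Keep the top-scoring trial among those equal on all other dimensions."""
    others = [dim for dim in DIMENSIONS if dim != dimension]
    best = {}
    for trial in trials:
        # Trials sharing this key differ only in the chosen dimension.
        key = repr([trial[dim] for dim in others])
        if key not in best or trial['score'] > best[key]['score']:
            best[key] = trial
    return list(best.values())


winners = best_over(trials, 'clf_params')
```

Under this assumption, slicing over the classifier parameters leaves one trial per classifier: the better of the two RF configurations and the single SVC configuration.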
-
slice_on_dimension
(dimension, value)¶ Select Trials where dimension == value
Parameters: - dimension ({CLF, CLF_PARAMS, SUBSET, SUBSET_PARAMS, CV, CV_PARAMS}) –
dimension to slice on:
- CLF
- select Trials where the classifier == value
- CLF_PARAMS
- select Trials where the classifier parameters == value
- SUBSET
- select Trials where the subset iterator == value
- SUBSET_PARAMS
- select Trials where the subset iterator parameters == value
- CV
- select Trials where the cross-validation partition iterator == value
- CV_PARAMS
- select Trials where the partition iterator params == value
- value – Value to match
Returns: Experiment containing only the specified Trials
Return type: Experiment
-
class
diogenes.grid_search.experiment.
Run
(M, labels, col_names, clf, train_indices, test_indices, sub_col_names, sub_col_inds, subset_note, cv_note, M_test=None, labels_test=None)¶ Bases:
object
Object encapsulating a single fitted classifier and specific data
Parameters: - M (numpy.ndarray) – Homogeneous (not structured) array. The array of features. If M_test is None, M contains both train and test sets. If M_test is not None, M contains only the training set.
- labels (numpy.ndarray) – Array of labels. If labels_test is None, labels contains both train and test sets. If labels_test is not None, labels contains only the training set.
- col_names (list of str) – Names of features
- clf (sklearn.base.BaseEstimator) – clf fitted with training data
- train_indices (np.ndarray or None) – If M_test and labels_test are None, the indices of M and labels that comprise the training set
- test_indices (np.ndarray or None) – If M_test and labels_test are None, the indices of M and labels that comprise the testing set
- sub_col_names (list of str) – If subset takes a subset of columns, these are the column names involved in this subset
- sub_col_inds (np.ndarray) – If subset takes a subset of columns, these are the indices of the columns involved in this subset
- subset_note (dict of str : ?) – Extra information about this Run provided by the subsetter
- cv_note (dict of str : ?) – Extra information about this run provided by the partition iterator
- M_test (np.ndarray or None) – If not None, the features in the test set
- labels_test (np.ndarray or None) – If not None, the labels in the test set
-
M
¶ numpy.ndarray
Homogeneous (not structured) array. The array of features. If M_test is None, M contains both train and test sets. If M_test is not None, M contains only the training set.
-
labels
¶ numpy.ndarray
Array of labels. If labels_test is None, labels contains both train and test sets. If labels_test is not None, labels contains only the training set.
-
col_names
¶ list of str
Names of features
-
clf
¶ sklearn.base.BaseEstimator
clf fitted with training data
-
train_indices
¶ np.ndarray or None
If M_test and labels_test are None, the indices of M and labels that comprise the training set
-
test_indices
¶ np.ndarray or None
If M_test and labels_test are None, the indices of M and labels that comprise the testing set
-
sub_col_names
¶ list of str
If subset takes a subset of columns, these are the column names involved in this subset
-
sub_col_inds
¶ np.ndarray
If subset takes a subset of columns, these are the indices of the columns involved in this subset
-
subset_note
¶ dict of str : ?
Extra information about this Run provided by the subsetter
-
cv_note
¶ dict of str : ?
Extra information about this run provided by the partition iterator
-
M_test
¶ np.ndarray or None
If not None, the features in the test set
-
labels_test
¶ np.ndarray or None
If not None, the labels in the test set
-
static
csv_header
()¶ Returns a portion of the header necessary in constructing the csv
-
csv_row
()¶ Returns this Run’s portion of its row in the produced csv
-
f1_score
()¶ Returns f1 score
-
prec_recall_curve
()¶ Returns matplotlib.figure.Figure of precision/recall curve
-
precision_at_thresholds
(query_thresholds)¶ Returns precision at given thresholds
Parameters: query_thresholds (list of float) – for each element, 0 <= thresh <= 1 Returns: precision at each of the given thresholds Return type: list of float
-
roc_auc
()¶ Returns area under ROC curve
-
roc_curve
()¶ Returns matplotlib.figure.Figure of ROC curve
-
score
()¶ Returns score of fitted clf
-
sorted_top_feat_importance
(n=25)¶ Returns top feature importances
Parameters: n (int) – number of feature importances to return Returns: names and scores of top features Return type: [list of str, list of floats]
-
class
diogenes.grid_search.experiment.
Trial
(M, labels, col_names, clf=<class 'sklearn.ensemble.forest.RandomForestClassifier'>, clf_params={}, subset=<class 'diogenes.grid_search.subset.SubsetNoSubset'>, subset_params={}, cv=<class 'diogenes.grid_search.partition_iterator.NoCV'>, cv_params={}, runs=None)¶ Bases:
object
Object encapsulating all Runs for a given configuration
Parameters: - M (numpy.ndarray) – Homogeneous (not structured) array of features
- labels (numpy.ndarray) – Array of labels
- col_names (list of str) – Names of features
- clf (sklearn.base.BaseEstimator class) – Classifier for this trial
- clf_params (dict of str : ?) – init parameters for clf
- subset (diogenes.grid_search.subset.BaseSubsetIter class) – class of object making subsets
- subset_params (dict of str : ?) – init parameters for subset
- cv (sklearn.cross_validation._PartitionIterator class) – class used to produce train and test sets
- cv_params (dict of str : ?) – init parameters of cv
- runs (list of list of Run or None) – if not None, can initialize this trial with a list of Runs that have already been created.
-
M
¶ numpy.ndarray
Homogeneous (not structured) array of features
-
labels
¶ numpy.ndarray
Array of labels
-
col_names
¶ list of str
Names of features
-
clf
¶ sklearn.base.BaseEstimator class
Classifier for this trial
-
clf_params
¶ dict of str : ?
init parameters for clf
-
subset
¶ diogenes.grid_search.subset.BaseSubsetIter class
class of object making subsets
-
subset_params
¶ dict of str : ?
init parameters for subset
-
cv
¶ sklearn.cross_validation._PartitionIterator class
class used to produce train and test sets
-
cv_params
¶ dict of str : ?
init parameters of cv
-
runs
¶ list of list of Run or None
Runs in this Trial. The outer list signifies different subsets. The inner lists signify different train/test splits using the same subset
-
average_score
()¶ Returns average score across all Runs
-
static
csv_header
()¶ Returns portion of header used when creating csv
-
csv_rows
()¶ Returns portions of rows used in creating csv
-
has_run
()¶ Returns True iff this Trial has been run
-
median_run
()¶ Returns Run with median score
-
prec_recall_curve
()¶ Returns matplotlib.figure.Figure of precision/recall curve
(of median run)
-
roc_auc
()¶ Returns area under roc curve of median run
-
roc_curve
()¶ Returns matplotlib.figure.Figure of roc_curve of median run
-
run
()¶ Run the Trial
-
runs_flattened
()¶ Returns list of all Runs (rather than list of list of Runs)
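The relationship between the nested runs, the flattened list, the average score, and the median run can be sketched with plain numbers. This is an illustration only: the scores stand in for Run objects, and taking the upper-middle element as the median is an assumption, since the text does not say how ties or even counts are handled.

```python
from itertools import chain

# Hypothetical scores standing in for Run objects: the outer list is per
# subset, each inner list is per train/test split of that subset.
runs = [[0.70, 0.90], [0.80, 0.60, 0.75]]

# Flatten the list of lists into a single list of runs.
flat = list(chain.from_iterable(runs))

# Average over every run, regardless of which subset it came from.
average = sum(flat) / len(flat)

# Median by score; for an even count this picks the upper-middle element
# (an assumption made for this sketch).
ordered = sorted(flat)
median = ordered[len(ordered) // 2]
```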
diogenes.grid_search.partition_iterator module¶
Custom objects used to produce train/test splits.
Also, anything in sklearn.cross_validation should work in Experiments
-
class
diogenes.grid_search.partition_iterator.
NoCV
(n, indices=None)¶ Bases:
sklearn.cross_validation._PartitionIterator
Partition iterator that just returns the entire set as the training set
Parameters: n (int) – The number of rows in the data
-
class
diogenes.grid_search.partition_iterator.
SlidingWindowIdx
(n, train_start, train_window_size, test_start, test_window_size, inc_value, expanding_train=False)¶ Bases:
sklearn.cross_validation._PartitionIterator
Partition iterator that iterates across indices of array
Has a moving window of indices for the training set and a moving window of indices for the testing set.
Parameters: - n (int) – Number of rows in the data
- train_start (int) – start index of training window
- train_window_size (int) – number of rows in the (initial) training window
- test_start (int) – start index of testing window
- test_window_size (int) – number of rows in testing window
- inc_value (int) – number of rows to increment train and test sets in each iteration
- expanding_train (bool) –
If True, the end of the train window moves forward with each iteration, but the beginning of the train window does not. Consequently, more rows are added to the training set with each iteration
If False, the beginning and end of the train window both move forward, so the training set remains the same size.
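The effect of the window parameters can be sketched as a generator of index lists. This is a minimal sketch rather than the class itself: sliding_windows is a hypothetical helper, and both the half-open window ends and the rule of stopping once the test window would run past n are assumptions.

```python
def sliding_windows(n, train_start, train_window_size, test_start,
                    test_window_size, inc_value, expanding_train=False):
    """Yield (train_indices, test_indices) as lists of row indices."""
    train_end = train_start + train_window_size  # exclusive end (assumption)
    test_end = test_start + test_window_size      # exclusive end (assumption)
    # Stop as soon as the test window would extend past the data (assumption).
    while test_end <= n:
        yield (list(range(train_start, train_end)),
               list(range(test_start, test_end)))
        if not expanding_train:
            # Fixed-size window: the start moves along with the end.
            train_start += inc_value
        train_end += inc_value
        test_start += inc_value
        test_end += inc_value


fixed = list(sliding_windows(10, 0, 3, 3, 2, 2))
growing = list(sliding_windows(10, 0, 3, 3, 2, 2, expanding_train=True))
```

With expanding_train=True the training lists in growing get longer on each iteration, while those in fixed keep three rows throughout.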
-
cv_note
()¶ dict of str providing extra info about the current iteration
-
class
diogenes.grid_search.partition_iterator.
SlidingWindowValue
(M, col_names, guide_col_name, train_start, train_window_size, test_start, test_window_size, inc_value, expanding_train=False)¶ Bases:
sklearn.cross_validation._PartitionIterator
Partition iterator that iterates across values of a column in an array
Has a moving window of values for the training set and a moving window of values for the testing set.
Parameters: - M (numpy.ndarray) – homogeneous (not structured) array. Feature array from which to draw train and test sets
- col_names (list of str) – names of features in M
- guide_col_name (str) – name of feature to use to determine train and test sets
- train_start (number) – start value for guide_col_name in training window
- train_window_size (number) – size of the (initial) training window
- test_start (number) – start value for guide_col_name in testing window
- test_window_size (number) – size of the testing window
- inc_value (number) – value to increment train and test sets in each iteration
- expanding_train (bool) –
If True, the end of the train window moves forward with each iteration, but the beginning of the train window does not. Consequently, more rows are added to the training set with each iteration
If False, the beginning and end of the train window both move forward, so the training window remains the same size.
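Selecting rows by the value of a guide column, rather than by position, can be sketched as follows. The helper value_windows and the sample years are hypothetical, and the half-open value ranges and the stopping rule (stop once the test window starts beyond the largest guide value) are assumptions made for this sketch.

```python
def value_windows(guide_values, train_start, train_window_size, test_start,
                  test_window_size, inc_value, expanding_train=False):
    """Yield (train_rows, test_rows) chosen by guide-column value."""
    train_end = train_start + train_window_size  # exclusive end (assumption)
    test_end = test_start + test_window_size      # exclusive end (assumption)
    last = max(guide_values)
    while test_start <= last:
        # A row belongs to a window when its guide value falls inside it.
        train = [i for i, v in enumerate(guide_values)
                 if train_start <= v < train_end]
        test = [i for i, v in enumerate(guide_values)
                if test_start <= v < test_end]
        yield train, test
        if not expanding_train:
            train_start += inc_value
        train_end += inc_value
        test_start += inc_value
        test_end += inc_value


# Hypothetical guide column: the year attached to each row.
years = [2010, 2010, 2011, 2012, 2012, 2013]
splits = list(value_windows(years, 2010, 2, 2012, 1, 1))
```

Because windows are defined on values, the number of rows in each split depends on how many rows share a given guide value.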
-
cv_note
()¶ dict of str providing extra info about the current iteration
diogenes.grid_search.standard_clfs module¶
Provides a standard set of classifiers and params to try with Experiment
-
diogenes.grid_search.standard_clfs.
std_clfs
¶ list of dict
A set of supervised, binary classifiers
-
diogenes.grid_search.standard_clfs.
rg_clfs
¶ list of dict
A much more extensive list
diogenes.grid_search.subset module¶
This module provides different ways to take subsets of data
-
class
diogenes.grid_search.subset.
BaseSubsetIter
(y, col_names)¶ Bases:
object
-
class
diogenes.grid_search.subset.
SubsetNoSubset
(y, col_names)¶ Bases:
diogenes.grid_search.subset.BaseSubsetIter
Generates a single subset consisting of all data
-
class
diogenes.grid_search.subset.
SubsetRandomRowsActualDistribution
(y, col_names, subset_size, n_subsets=3, random_state=None)¶ Bases:
diogenes.grid_search.subset.BaseSubsetIter
Generates subsets reflecting actual distribution of labels
Parameters: - y (np.ndarray) – labels in data
- col_names (list of str) – names of all features in data
- subset_size (int) – number of rows in each subset
- n_subsets (int) – number of subsets to pick
- random_state (int) – random seed
-
class
diogenes.grid_search.subset.
SubsetRandomRowsEvenDistribution
(y, col_names, subset_size, n_subsets=3, random_state=None)¶ Bases:
diogenes.grid_search.subset.BaseSubsetIter
Generates subsets where each label appears at about the same frequency
Parameters: - y (np.ndarray) – labels in data
- col_names (list of str) – names of all features in data
- subset_size (int) – number of rows in each subset
- n_subsets (int) – number of subsets to pick
- random_state (int) – random seed
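Drawing one subset in which each label appears about equally often can be sketched with the standard library. This is not the class's implementation: even_subset is a hypothetical helper, and splitting subset_size evenly across labels (capped by how many rows each label has) is an assumption.

```python
import random


def even_subset(y, subset_size, random_state=None):
    """Return sorted row indices with roughly equal counts per label."""
    rng = random.Random(random_state)
    by_label = {}
    for idx, label in enumerate(y):
        by_label.setdefault(label, []).append(idx)
    # Split the requested size evenly across the labels that occur.
    per_label = subset_size // len(by_label)
    chosen = []
    for label in sorted(by_label):
        rows = by_label[label]
        # A label cannot contribute more rows than it actually has.
        chosen.extend(rng.sample(rows, min(per_label, len(rows))))
    return sorted(chosen)


# Imbalanced labels: eight negatives and four positives.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
rows = even_subset(y, subset_size=6, random_state=0)
```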
-
class
diogenes.grid_search.subset.
SubsetSweepNumRows
(y, col_names, num_rows, random_state=None)¶ Bases:
diogenes.grid_search.subset.BaseSubsetIter
Generates subsets with varying number of rows
Parameters: - y (np.ndarray) – labels in data
- col_names (list of str) – names of all features in data
- num_rows (list of int) – number of rows in each subset
- random_state (int) – random seed
-
class
diogenes.grid_search.subset.
SubsetSweepVaryStratification
(y, col_names, proportions_positive, subset_size, random_state=None)¶ Bases:
diogenes.grid_search.subset.BaseSubsetIter
Generates subsets with varying proportion of True and False labels
Parameters: - y (np.ndarray) – labels in data
- col_names (list of str) – names of all features in data
- proportions_positive (list of float) – proportions of positive labels in each subset
- subset_size (int) – number of rows in each subset
- random_state (int) – random seed
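Sweeping the proportion of positive labels can be sketched as one sampled subset per requested proportion. The helper stratified_subsets is hypothetical, and rounding the positive count and capping each draw at the rows available are assumptions of this sketch rather than documented behaviour.

```python
import random


def stratified_subsets(y, proportions_positive, subset_size, random_state=None):
    """Yield (proportion, row_indices) with the requested share of positives."""
    rng = random.Random(random_state)
    positives = [i for i, label in enumerate(y) if label]
    negatives = [i for i, label in enumerate(y) if not label]
    for proportion in proportions_positive:
        n_pos = int(round(proportion * subset_size))
        n_neg = subset_size - n_pos
        # Cap each draw at the number of rows available (assumption).
        rows = rng.sample(positives, min(n_pos, len(positives)))
        rows += rng.sample(negatives, min(n_neg, len(negatives)))
        yield proportion, sorted(rows)


y = [True] * 10 + [False] * 10
subsets = list(stratified_subsets(y, [0.2, 0.5, 0.8], subset_size=10,
                                  random_state=0))
```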