The Modify Module

diogenes.modify provides tools for manipulating arrays and generating features.

In-place Cleaning

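Diogenes provides two functions for data cleaning: diogenes.modify.modify.replace_missing_vals() and diogenes.modify.modify.label_encode(). Both are demonstrated below.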

For this example, we’ll look at Chicago’s “311 Service Requests - Tree Debris” data on the Chicago data portal (https://data.cityofchicago.org/).

import diogenes

data = diogenes.read.open_csv_url('https://data.cityofchicago.org/api/views/mab8-y9h3/rows.csv?accessType=DOWNLOAD',
                                  parse_datetimes=['Creation Date', 'Completion Date'])

The last row of this data set repeats the column labels, so we omit it.

data = data[:-1]
data.dtype
dtype((numpy.record, [('Creation Date', '<M8[ns]'), ('Status', 'O'), ('Completion Date', '<M8[ns]'), ('Service Request Number', 'O'), ('Type of Service Request', 'O'), ('If Yes, where is the debris located?', 'O'), ('Current Activity', 'O'), ('Most Recent Action', 'O'), ('Street Address', 'O'), ('ZIP Code', '<f8'), ('X Coordinate', '<f8'), ('Y Coordinate', '<f8'), ('Ward', '<f8'), ('Police District', '<f8'), ('Community Area', '<f8'), ('Latitude', '<f8'), ('Longitude', '<f8'), ('Location', 'O')]))

We’re going to predict whether a job is still open, so our label will ultimately be the “Status” column.

from collections import Counter
print Counter(data['Status']).most_common()
[('Completed', 94431), ('Completed - Dup', 13912), ('Open', 144), ('Open - Dup', 6)]

We’ll remove the label from the rest of the data later. First, let’s do some cleaning. Notice that some of our floating-point columns have missing data (encoded as numpy.nan):

import numpy as np
print sum(np.isnan(data['ZIP Code']))
print sum(np.isnan(data['Ward']))
print sum(np.isnan(data['X Coordinate']))
270
31
39
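To check every floating-point column at once rather than one at a time, we can loop over the dtype. The following is a minimal sketch on a made-up toy array; the values and the names toy and nan_counts are illustrative and are not part of the Chicago data or of Diogenes.

```python
import numpy as np

# Toy structured array standing in for the 311 data (values are made up).
toy = np.array(
    [(60601.0, 42.0), (np.nan, 10.0), (60614.0, np.nan), (np.nan, 3.0)],
    dtype=[('ZIP Code', '<f8'), ('Ward', '<f8')])

# Count missing values in every floating-point column.
nan_counts = {}
for name in toy.dtype.names:
    if np.issubdtype(toy.dtype[name], np.floating):
        nan_counts[name] = int(np.isnan(toy[name]).sum())

print(nan_counts)
```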

Scikit-Learn can’t tolerate these missing values, so we have to do something with them. The statistically sound choice would probably be to leave these rows out, but for pedagogical purposes, let’s assume it makes sense to impute the data. We can do that with diogenes.modify.modify.replace_missing_vals().

We could, for instance, replace every nan with a 0:

data_with_zeros = diogenes.modify.replace_missing_vals(data, strategy='constant', constant=0)
print sum(np.isnan(data_with_zeros['ZIP Code']))
print sum(data_with_zeros['ZIP Code'] == 0)
0
277

The count of zeros (277) exceeds the number of missing values (270), so seven entries already had 0 as their ZIP code.

For the purposes of this tutorial, we will go ahead and replace missing values with the most frequent value in the column:

data = diogenes.modify.replace_missing_vals(data, strategy='most_frequent')
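To make the 'most_frequent' strategy concrete, here is a rough plain-NumPy sketch of the idea for a single float column. The helper fill_with_most_frequent is hypothetical and is not how Diogenes implements replace_missing_vals(); it only illustrates replacing each nan with the most common observed value in the column.

```python
import numpy as np
from collections import Counter

def fill_with_most_frequent(col):
    """Return a copy of a float column with NaNs replaced by its mode."""
    missing = np.isnan(col)
    observed = col[~missing]
    # most_common(1) returns [(value, count)] for the most frequent value
    mode = Counter(observed.tolist()).most_common(1)[0][0]
    filled = col.copy()
    filled[missing] = mode
    return filled

# Toy ZIP code column (values are made up)
zips = np.array([60601.0, np.nan, 60614.0, 60601.0, np.nan])
filled_zips = fill_with_most_frequent(zips)
print(filled_zips)
```

The sketch returns a copy, so the original column keeps its nan entries.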

Our data also has a number of string columns. Strings must be converted to numbers before Scikit-Learn can analyze them, so we will use diogenes.modify.modify.label_encode() to convert them:

print Counter(data['If Yes, where is the debris located?']).most_common()
data, classes = diogenes.modify.label_encode(data)
print Counter(data['If Yes, where is the debris located?']).most_common()
print classes['If Yes, where is the debris located?']
[('Parkway', 44385), ('Alley', 43146), ('', 16145), ('Vacant Lot', 4817)]
[(2, 44385), (1, 43146), (0, 16145), (3, 4817)]
['' 'Alley' 'Parkway' 'Vacant Lot']

Note that classes is a dictionary keyed by column name. Each value is an array in which the string at index i is the category that the integer i represents. For example, if we wanted to find out what category 1 represents, we would look at:

classes['If Yes, where is the debris located?'][1]
'Alley'

and find that category 1 is ‘Alley’.
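The relationship between the integer codes and classes can be reproduced with plain NumPy. This sketch assumes codes are assigned in sorted order of the distinct strings, which matches the output above; it is an illustration, not the Diogenes implementation, and the names locations, classes_sketch and codes are made up.

```python
import numpy as np

# Toy string column (values taken from the category names printed above)
locations = np.array(['Parkway', 'Alley', '', 'Vacant Lot', 'Parkway'])

# np.unique returns the sorted distinct values and, with return_inverse,
# the index of each original entry within that sorted array.
classes_sketch, codes = np.unique(locations, return_inverse=True)

print(classes_sketch)
print(codes)

# Indexing the classes array with the codes recovers the original strings.
decoded = classes_sketch[codes]
```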

Selection

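Diogenes provides a number of functions that retain only the columns and rows matching specific criteria, including diogenes.modify.modify.remove_cols_where(), diogenes.modify.modify.choose_rows_where(), and diogenes.modify.modify.remove_rows_where(), all of which are used below.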

These are explained in detail in the module documentation for diogenes.modify.modify. Explaining all the different things you can do with these selection operators is outside the scope of this tutorial.

We’ll start by removing any column that has the same value in every row, using the diogenes.modify.modify.col_val_eq_any() column selection function:

print data.dtype.names
print
print Counter(data['Type of Service Request'])
print

arguments = [{'func': diogenes.modify.col_val_eq_any, 'vals': None}]
data = diogenes.modify.remove_cols_where(data, arguments)

print data.dtype.names
('Creation Date', 'Status', 'Completion Date', 'Service Request Number', 'Type of Service Request', 'If Yes, where is the debris located?', 'Current Activity', 'Most Recent Action', 'Street Address', 'ZIP Code', 'X Coordinate', 'Y Coordinate', 'Ward', 'Police District', 'Community Area', 'Latitude', 'Longitude', 'Location')

Counter({0: 108493})

('Creation Date', 'Status', 'Completion Date', 'Service Request Number', 'If Yes, where is the debris located?', 'Current Activity', 'Most Recent Action', 'Street Address', 'ZIP Code', 'X Coordinate', 'Y Coordinate', 'Ward', 'Police District', 'Community Area', 'Latitude', 'Longitude', 'Location')

Notice that “Type of Service Request” has been removed, since every value in the column was the same.
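The test for a constant column is simple to express directly. Below is a plain-NumPy sketch on a made-up toy array; it illustrates the idea of dropping columns whose entries are all equal and is not the code behind col_val_eq_any() or remove_cols_where().

```python
import numpy as np

# Toy structured array: the first column is constant, the second is not.
toy = np.array([(0, 1.5), (0, 2.5), (0, 3.5)],
               dtype=[('Type of Service Request', '<i8'), ('Ward', '<f8')])

# A column is constant when every entry equals the first one.
constant_cols = [name for name in toy.dtype.names
                 if np.all(toy[name] == toy[name][0])]

# Keep only the remaining columns.
keep = [name for name in toy.dtype.names if name not in constant_cols]
trimmed = toy[keep]

print(constant_cols)
print(trimmed.dtype.names)
```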

Next, let’s assume that we’re only interested in requests made during the year 2015 and select only those rows using the diogenes.modify.modify.row_val_between() row selection function:

print data.shape
print data['Creation Date'].min()
print data['Creation Date'].max()
print

arguments = [{'func': diogenes.modify.row_val_between,
              'vals': [np.datetime64('2015-01-01T00:00:00', 'ns'), np.datetime64('2016-01-01T00:00:00', 'ns')],
              'col_name': 'Creation Date'}]
data = diogenes.modify.choose_rows_where(data, arguments)

print data.shape
print data['Creation Date'].min()
print data['Creation Date'].max()
(108493,)
2004-07-20T19:00:00.000000000-0500
2015-11-04T18:00:00.000000000-0600

(15566,)
2015-01-01T18:00:00.000000000-0600
2015-11-04T18:00:00.000000000-0600
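A range selection like this boils down to a boolean mask over the column. The sketch below uses made-up timestamps and assumes a half-open interval (start included, end excluded); whether row_val_between() includes its endpoints is not stated here, so treat that choice as an assumption.

```python
import numpy as np

# Toy creation timestamps (made up), stored with nanosecond precision
created = np.array(['2014-12-31T23:00', '2015-01-01T00:00',
                    '2015-06-15T12:00', '2016-01-01T00:00'],
                   dtype='datetime64[ns]')

start = np.datetime64('2015-01-01T00:00:00', 'ns')
end = np.datetime64('2016-01-01T00:00:00', 'ns')

# True for rows with start <= timestamp < end
in_2015 = (created >= start) & (created < end)
selected = created[in_2015]

print(in_2015)
```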

Finally, let’s remove rows which the “Status” column claims are duplicates. We review our classes variable to find:

classes['Status']
array(['Completed', 'Completed - Dup', 'Open', 'Open - Dup'], dtype=object)

We want to remove rows that have either 1 or 3 in the status column. We don’t have a row selection function already defined to select rows that have one of several discrete values, so we will create one:

def row_val_in(M, col_name, vals):
    # True for rows whose value in col_name equals any entry of vals
    return np.logical_or.reduce([M[col_name] == val for val in vals])

print data.shape
print Counter(data['Status']).most_common()
print

arguments = [{'func': row_val_in, 'vals': [1, 3], 'col_name': 'Status'}]
data2 = diogenes.modify.remove_rows_where(data, arguments)

print data2.shape
print Counter(data2['Status']).most_common()
(15566,)
[(0, 13806), (1, 1612), (2, 144), (3, 4)]

(13950,)
[(0, 13806), (2, 144)]
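Rather than reading the codes 1 and 3 off the printed classes by eye, we could derive them from the class names. This is a small sketch using the class names shown above and a made-up encoded column; the names status_classes, dup_codes and status are illustrative, and np.isin requires NumPy 1.13 or later.

```python
import numpy as np

# Class names as printed for classes['Status'] above
status_classes = ['Completed', 'Completed - Dup', 'Open', 'Open - Dup']

# Codes whose class name marks the request as a duplicate
dup_codes = [code for code, name in enumerate(status_classes)
             if name.endswith('- Dup')]

# Toy encoded "Status" column (values are made up)
status = np.array([0, 1, 2, 3, 0])

# Mask of duplicate rows, then keep everything else
is_dup = np.isin(status, dup_codes)
kept = status[~is_dup]

print(dup_codes)
print(kept)
```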

Feature Generation

We can also create new features based on existing data. We’ll start out by generating a feature that calculates the distance of the service request from Cloud Gate in downtown Chicago (41.882773, -87.623304) using diogenes.modify.modify.distance_from_point().

dist_from_cloud_gate = diogenes.modify.distance_from_point(41.882773, -87.623304, data['Latitude'], data['Longitude'])
print dist_from_cloud_gate[:10]
[ 12.37824118  13.37711223  18.95988034  26.2061016   20.18279666
  12.08154288  17.63119029  21.29986356   8.09220669  11.27721597]
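For intuition about what such a distance feature computes, here is a sketch of a great-circle distance using the haversine formula. It assumes a spherical Earth of radius 6371 km and reports kilometers; the formula and units that distance_from_point() actually uses are not stated in this tutorial, so the helper haversine_km is hypothetical and its results need not match the output above.

```python
import numpy as np

def haversine_km(lat0, lon0, lats, lons):
    """Great-circle distance in kilometers from (lat0, lon0) to each point."""
    earth_radius_km = 6371.0  # assumed mean Earth radius
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    lats, lons = np.radians(lats), np.radians(lons)
    a = (np.sin((lats - lat0) / 2) ** 2 +
         np.cos(lat0) * np.cos(lats) * np.sin((lons - lon0) / 2) ** 2)
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Cloud Gate itself, and a point exactly one degree of latitude north of it
lats = np.array([41.882773, 42.882773])
lons = np.array([-87.623304, -87.623304])
dist = haversine_km(41.882773, -87.623304, lats, lons)

print(dist)
```

One degree of latitude on this sphere is about 111.19 km, which gives a quick sanity check on the second value.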

Now we’ll put those distances into 10 bins using diogenes.modify.modify.generate_bin().

dist_binned = diogenes.modify.generate_bin(dist_from_cloud_gate, 10)
print dist_binned[:10]
[4, 4, 6, 9, 7, 4, 6, 7, 2, 4]
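One common way to bin a numeric column is to cut its range into equal-width intervals. The sketch below shows that approach under the assumption that the column is not constant; it is not necessarily the rule generate_bin() follows, and the helper equal_width_bins is hypothetical.

```python
import numpy as np

def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins equal-width bins (0-based)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    # Scale to [0, num_bins] and truncate to an integer bin index
    scaled = (values - lo) / (hi - lo) * num_bins
    # The maximum would land in bin num_bins, so clip it into the last bin
    return np.minimum(scaled.astype(int), num_bins - 1)

# Toy distances (made up)
distances = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
bins = equal_width_bins(distances, 4)

print(bins)
```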

Now we’ll make a binary feature that is true if and only if the debris is in a parkway in Ward 10 using diogenes.modify.modify.where_all_are_true() (which has similar syntax to the selection functions).

print classes['If Yes, where is the debris located?']
['' 'Alley' 'Parkway' 'Vacant Lot']

We note that “Parkway” is category 2, so we will select items that equal 2 in the “If Yes, where is the debris located?” column and 10 in the “Ward” column.

arguments = [{'func': diogenes.modify.row_val_eq,
              'col_name': 'If Yes, where is the debris located?',
              'vals': 2},
             {'func': diogenes.modify.row_val_eq,
              'col_name': 'Ward',
              'vals': 10}]
parkway_in_ward_10 = diogenes.modify.where_all_are_true(data, arguments)
print np.where(parkway_in_ward_10)
(array([    3,   149,   174,   248,   274,   277,   598,   672,   675,
         698,   796,   945,   949,   963,  1061,  1184,  1206,  1408,
        1509,  1799,  1902,  2077,  2177,  2185,  2193,  2213,  2215,
        2230,  2341,  2439,  2444,  2562,  2668,  2683,  2790,  2807,
        2943,  3181,  3189,  3230,  3232,  3235,  3236,  3237,  3339,
        3345,  3603,  3609,  3624,  3639,  3824,  3950,  3979,  3998,
        4002,  4005,  4208,  4224,  4274,  4378,  4391,  4440,  4446,
        4460,  4486,  4557,  4558,  4630,  4751,  4893,  5074,  5190,
        5224,  5266,  5288,  5296,  5328,  5372,  5373,  5399,  5531,
        5603,  5613,  5728,  5729,  5819,  6040,  6052,  6056,  6191,
        6192,  6517,  6528,  6593,  6682,  7038,  7376,  7387,  7397,
        7398,  7600,  7883,  7884,  7948,  8132,  8177,  8344,  8499,
        8516,  8682,  8691,  8699,  8718,  8773,  8776,  8912,  8955,
        9333,  9424,  9435,  9501,  9518,  9681, 10363, 10731, 10732,
       10735, 11107, 11259, 11268, 11288, 11582, 11590, 11777, 11956,
       12064, 12135, 13028, 13067, 13402, 13493, 13603, 13787, 14093,
       14466, 14484, 14553, 14618, 14625, 14632, 14915, 15248, 15356,
       15364, 15407]),)
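Conceptually, the result is the element-wise AND of one boolean mask per condition. The following plain-NumPy sketch on a made-up toy array shows that combination; it illustrates the idea rather than the internals of where_all_are_true(), and the column name location is shortened for the example.

```python
import numpy as np

# Toy encoded data: location code and ward number (values are made up)
toy = np.array([(2, 10.0), (2, 4.0), (1, 10.0), (2, 10.0)],
               dtype=[('location', '<i8'), ('Ward', '<f8')])

# One boolean mask per condition
masks = [toy['location'] == 2, toy['Ward'] == 10]

# A row qualifies only if every mask is True for it
both = np.logical_and.reduce(masks)

print(np.where(both))
```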

Finally, we’ll add all of our generated features to our data using diogenes.utils.append_cols():

data = diogenes.utils.append_cols(data, [dist_from_cloud_gate, dist_binned, parkway_in_ward_10],
                                  ['dist_from_cloud_gate', 'dist_binned', 'parkway_in_ward_10'])
print data.dtype
[('Creation Date', '<M8[ns]'), ('Status', '<i8'), ('Completion Date', '<M8[ns]'), ('Service Request Number', '<i8'), ('If Yes, where is the debris located?', '<i8'), ('Current Activity', '<i8'), ('Most Recent Action', '<i8'), ('Street Address', '<i8'), ('ZIP Code', '<f8'), ('X Coordinate', '<f8'), ('Y Coordinate', '<f8'), ('Ward', '<f8'), ('Police District', '<f8'), ('Community Area', '<f8'), ('Latitude', '<f8'), ('Longitude', '<f8'), ('Location', '<i8'), ('dist_from_cloud_gate', '<f8'), ('dist_binned', '<i8'), ('parkway_in_ward_10', '?')]
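NumPy ships a comparable helper for structured arrays, numpy.lib.recfunctions.append_fields. The sketch below uses it on a made-up toy array to show what appending a column does to the dtype; it is NumPy's own function, not a description of how diogenes.utils.append_cols() is implemented.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Toy structured array (values are made up)
base = np.array([(41.9, -87.6), (41.8, -87.7)],
                dtype=[('Latitude', '<f8'), ('Longitude', '<f8')])

# New boolean feature, one entry per row
flag = np.array([True, False])

# usemask=False returns a plain structured array instead of a masked one
extended = rfn.append_fields(base, 'flag', flag, usemask=False)

print(extended.dtype.names)
```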

Last Steps

Now, all we have to do is remove the “Status” column from the rest of the data (along with the highly correlated “Completion Date”) and we’re ready to run an experiment.

labels = data['Status']
M = diogenes.utils.remove_cols(data, ['Status', 'Completion Date'])
exp = diogenes.grid_search.experiment.Experiment(M, labels)
exp.run()
[Trial(clf=<class 'sklearn.ensemble.forest.RandomForestClassifier'>, clf_params={}, subset=<class 'diogenes.grid_search.subset.SubsetNoSubset'>, subset_params={}, cv=<class 'sklearn.cross_validation.KFold'>, cv_params={})]