The Modify Module¶
diogenes.modify provides tools for manipulating arrays and generating features.
- Cleaning
- Selection
- Feature generation
In-place Cleaning¶
Diogenes provides two functions for data cleaning:
- diogenes.modify.modify.replace_missing_vals(), which replaces missing values with valid ones.
- diogenes.modify.modify.label_encode(), which replaces strings with corresponding integers.
For this example, we’ll look at Chicago’s “311 Service Requests - Tree Debris” data, available on the Chicago data portal (https://data.cityofchicago.org/).
import diogenes
data = diogenes.read.open_csv_url('https://data.cityofchicago.org/api/views/mab8-y9h3/rows.csv?accessType=DOWNLOAD',
                                  parse_datetimes=['Creation Date', 'Completion Date'])
The last row of this data set repeats the labels. We’re going to go ahead and omit it.
data = data[:-1]
data.dtype
dtype((numpy.record, [('Creation Date', '<M8[ns]'), ('Status', 'O'), ('Completion Date', '<M8[ns]'), ('Service Request Number', 'O'), ('Type of Service Request', 'O'), ('If Yes, where is the debris located?', 'O'), ('Current Activity', 'O'), ('Most Recent Action', 'O'), ('Street Address', 'O'), ('ZIP Code', '<f8'), ('X Coordinate', '<f8'), ('Y Coordinate', '<f8'), ('Ward', '<f8'), ('Police District', '<f8'), ('Community Area', '<f8'), ('Latitude', '<f8'), ('Longitude', '<f8'), ('Location', 'O')]))
We’re going to predict whether a job is still open, so our label will ultimately be the “Status” column.
from collections import Counter
print Counter(data['Status']).most_common()
[('Completed', 94431), ('Completed - Dup', 13912), ('Open', 144), ('Open - Dup', 6)]
We’ll remove the label from the rest of the data later. First, let’s do some cleaning. Notice that some of our floating point columns have missing values (encoded as numpy.nan):
import numpy as np
print sum(np.isnan(data['ZIP Code']))
print sum(np.isnan(data['Ward']))
print sum(np.isnan(data['X Coordinate']))
270
31
39
Scikit-Learn can’t tolerate these missing values, so we have to do something
with them. The statistically sound choice for this data would probably be
to leave these rows out, but for pedagogical purposes, let’s
assume it makes sense to impute the data. We can do that with
diogenes.modify.modify.replace_missing_vals().
We could, for instance, replace every nan with a 0:
data_with_zeros = diogenes.modify.replace_missing_vals(data, strategy='constant', constant=0)
print sum(np.isnan(data_with_zeros['ZIP Code']))
print sum(data_with_zeros['ZIP Code'] == 0)
0
277
The column now contains 277 zeros, but only 270 values were missing, so seven entries already had 0 for a ZIP code.
For the purposes of this tutorial, we will go ahead and replace missing values with the most frequent value in the column:
data = diogenes.modify.replace_missing_vals(data, strategy='most_frequent')
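For intuition about what the two strategies used above do to a single float column, here is a minimal NumPy-only sketch. It is not Diogenes’ implementation: the helper names fill_constant and fill_most_frequent are hypothetical, and the ZIP codes are made up.

```python
import numpy as np

def fill_constant(col, constant=0.0):
    """Return a copy of col with every nan replaced by constant."""
    out = col.copy()
    out[np.isnan(out)] = constant
    return out

def fill_most_frequent(col):
    """Return a copy of col with every nan replaced by the most common non-nan value."""
    out = col.copy()
    missing = np.isnan(out)
    # Count how often each distinct non-missing value occurs
    values, counts = np.unique(out[~missing], return_counts=True)
    out[missing] = values[np.argmax(counts)]
    return out

# Made-up ZIP codes with two missing entries
zips = np.array([60614.0, np.nan, 60614.0, 60647.0, np.nan])
zeros_filled = fill_constant(zips)
mode_filled = fill_most_frequent(zips)
```

Neither helper modifies its input, so the original column stays available for comparison.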
Our data also has a number of string columns. Strings must be converted
to numbers before Scikit-Learn can analyze them, so we will use
diogenes.modify.modify.label_encode() to convert them:
print Counter(data['If Yes, where is the debris located?']).most_common()
data, classes = diogenes.modify.label_encode(data)
print Counter(data['If Yes, where is the debris located?']).most_common()
print classes['If Yes, where is the debris located?']
[('Parkway', 44385), ('Alley', 43146), ('', 16145), ('Vacant Lot', 4817)]
[(2, 44385), (1, 43146), (0, 16145), (3, 4817)]
['' 'Alley' 'Parkway' 'Vacant Lot']
Note that classes is a dictionary of arrays, where each key is a
column name and each value is an array listing the string that each
integer code represents. For example, to find out what category 1
represents, we would look at:
classes['If Yes, where is the debris located?'][1]
'Alley'
and find that category 1 is ‘Alley’.
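The relationship between the integer codes and the class array can be reproduced for a single column with numpy.unique. This is only a sketch with made-up values, not Diogenes’ implementation, although the codes shown in the output above are consistent with a sorted ordering like this one.

```python
import numpy as np

# Made-up string column; '' stands for a missing answer
locations = np.array(['Parkway', 'Alley', '', 'Parkway', 'Vacant Lot'])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the position of each original entry within that sorted array
class_names, codes = np.unique(locations, return_inverse=True)

# Indexing the class array with the codes recovers the original strings
decoded = class_names[codes]
```

Here class_names plays the role of one entry of classes, and codes is the encoded column.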
Selection¶
Diogenes provides a number of functions to retain only the columns and rows that match specific criteria:
- diogenes.modify.modify.choose_cols_where()
- diogenes.modify.modify.remove_cols_where()
- diogenes.modify.modify.choose_rows_where()
- diogenes.modify.modify.remove_rows_where()
These are explained in detail in the module documentation for
diogenes.modify.modify. Explaining all the different things you
can do with these selection operators is outside the scope of this
tutorial.
We’ll start out by removing any columns in which every row has the same
value, using the diogenes.modify.modify.col_val_eq_any()
column selection function:
print data.dtype.names
print
print Counter(data['Type of Service Request'])
print
arguments = [{'func': diogenes.modify.col_val_eq_any, 'vals': None}]
data = diogenes.modify.remove_cols_where(data, arguments)
print data.dtype.names
('Creation Date', 'Status', 'Completion Date', 'Service Request Number', 'Type of Service Request', 'If Yes, where is the debris located?', 'Current Activity', 'Most Recent Action', 'Street Address', 'ZIP Code', 'X Coordinate', 'Y Coordinate', 'Ward', 'Police District', 'Community Area', 'Latitude', 'Longitude', 'Location')
Counter({0: 108493})
('Creation Date', 'Status', 'Completion Date', 'Service Request Number', 'If Yes, where is the debris located?', 'Current Activity', 'Most Recent Action', 'Street Address', 'ZIP Code', 'X Coordinate', 'Y Coordinate', 'Ward', 'Police District', 'Community Area', 'Latitude', 'Longitude', 'Location')
Notice that “Type of Service Request” has been removed, since every value in the column was the same.
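The idea of dropping constant columns can be sketched directly on a NumPy structured array. The array and its field names below are made up, and this is not how Diogenes implements remove_cols_where.

```python
import numpy as np

# Made-up structured array; 'kind' holds the same value in every row
M = np.array([(0, 1.5), (0, 2.5), (0, 3.5)],
             dtype=[('kind', '<i8'), ('value', '<f8')])

# A column is constant when it has exactly one distinct value
constant_cols = [name for name in M.dtype.names
                 if len(np.unique(M[name])) == 1]

# Indexing a structured array with a list of names keeps only those fields
kept = [name for name in M.dtype.names if name not in constant_cols]
trimmed = M[kept]
```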
Next, let’s assume that we’re only interested in requests made during
the year 2015 and select only those rows using the
diogenes.modify.modify.row_val_between()
row selection function:
print data.shape
print data['Creation Date'].min()
print data['Creation Date'].max()
print
arguments = [{'func': diogenes.modify.row_val_between,
              'vals': [np.datetime64('2015-01-01T00:00:00', 'ns'), np.datetime64('2016-01-01T00:00:00', 'ns')],
              'col_name': 'Creation Date'}]
data = diogenes.modify.choose_rows_where(data, arguments)
print data.shape
print data['Creation Date'].min()
print data['Creation Date'].max()
(108493,)
2004-07-20T19:00:00.000000000-0500
2015-11-04T18:00:00.000000000-0600
(15566,)
2015-01-01T18:00:00.000000000-0600
2015-11-04T18:00:00.000000000-0600
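A date-range selection like this comes down to a boolean mask over the date column. The sketch below uses made-up timestamps and a half-open interval; the tutorial does not say whether row_val_between includes its endpoints, so that choice is an assumption here.

```python
import numpy as np

# Made-up creation dates
created = np.array(['2014-12-31T23:00', '2015-01-01T00:00',
                    '2015-06-15T12:00', '2016-01-01T00:00'],
                   dtype='datetime64[ns]')

lower = np.datetime64('2015-01-01T00:00:00', 'ns')
upper = np.datetime64('2016-01-01T00:00:00', 'ns')

# Half-open interval [lower, upper): midnight on 2016-01-01 is excluded
in_2015 = (created >= lower) & (created < upper)
selected = created[in_2015]
```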
Finally, let’s remove rows which the “Status” column claims are
duplicates. We review our classes variable to find:
classes['Status']
array(['Completed', 'Completed - Dup', 'Open', 'Open - Dup'], dtype=object)
We want to remove rows that have either 1 or 3 in the “Status”
column. We don’t have a row selection function already defined to select
rows that have one of several discrete values, so we will create one:
def row_val_in(M, col_name, vals):
    # True for rows whose value in col_name equals either of the two entries in vals
    return np.logical_or(M[col_name] == vals[0], M[col_name] == vals[1])
print data.shape
print Counter(data['Status']).most_common()
print
arguments = [{'func': row_val_in, 'vals': [1, 3], 'col_name': 'Status'}]
data2 = diogenes.modify.remove_rows_where(data, arguments)
print data2.shape
print Counter(data2['Status']).most_common()
(15566,)
[(0, 13806), (1, 1612), (2, 144), (3, 4)]
(13950,)
[(0, 13806), (2, 144)]
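The row_val_in helper compares against exactly two values. One possible generalization, sketched here with made-up data, uses numpy.isin to accept any number of values. The name row_val_in_any is hypothetical, and although it keeps the same (M, col_name, vals) signature, it has not been tested with remove_rows_where.

```python
import numpy as np

def row_val_in_any(M, col_name, vals):
    """Boolean mask: True where M[col_name] equals any entry of vals."""
    return np.isin(M[col_name], vals)

# Made-up encoded statuses:
# 0=Completed, 1=Completed - Dup, 2=Open, 3=Open - Dup
M = np.array([(0,), (1,), (2,), (3,), (0,)], dtype=[('Status', '<i8')])

is_dup = row_val_in_any(M, 'Status', [1, 3])
without_dups = M[~is_dup]
```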
Feature Generation¶
We can also create new features based on existing data. We’ll start out
by generating a feature that calculates the distance of the service
request from Cloud Gate in downtown Chicago (41.882773, -87.623304)
using diogenes.modify.modify.distance_from_point().
dist_from_cloud_gate = diogenes.modify.distance_from_point(41.882773, -87.623304, data['Latitude'], data['Longitude'])
print dist_from_cloud_gate[:10]
[ 12.37824118 13.37711223 18.95988034 26.2061016 20.18279666
12.08154288 17.63119029 21.29986356 8.09220669 11.27721597]
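The tutorial does not say which formula or unit distance_from_point uses. As an illustration of how a distance from a fixed point can be computed from latitude and longitude, here is a sketch of the haversine great-circle formula in kilometers. The function name haversine_km and the Earth radius constant are assumptions of this sketch, so its results need not match the values printed above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat0, lng0, lat, lng):
    """Great-circle distance in km from (lat0, lng0) to each (lat, lng) pair."""
    lat0, lng0, lat, lng = [np.radians(np.asarray(v, dtype=float))
                            for v in (lat0, lng0, lat, lng)]
    a = (np.sin((lat - lat0) / 2.0) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lng - lng0) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Cloud Gate to itself, and to a point one degree of latitude further north
d = haversine_km(41.882773, -87.623304,
                 np.array([41.882773, 42.882773]),
                 np.array([-87.623304, -87.623304]))
```

One degree of latitude comes out at roughly 111 km, which is a quick sanity check on the formula.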
Now we’ll put those distances into 10 bins using
diogenes.modify.modify.generate_bin().
dist_binned = diogenes.modify.generate_bin(dist_from_cloud_gate, 10)
print dist_binned[:10]
[4, 4, 6, 9, 7, 4, 6, 7, 2, 4]
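The tutorial does not specify how generate_bin chooses its bin boundaries. One common approach, sketched below under the assumption of equal-width bins over the observed range, uses numpy.linspace and numpy.digitize. The helper name equal_width_bins and the sample distances are hypothetical.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width bins."""
    values = np.asarray(values, dtype=float)
    # n_bins + 1 evenly spaced edges spanning the observed range
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Compare against interior edges only, so the maximum lands in the last bin
    return np.digitize(values, edges[1:-1])

# Made-up distances; with 4 bins the edges are 0, 2.5, 5, 7.5, 10
distances = np.array([0.0, 1.0, 3.0, 6.0, 10.0])
bins = equal_width_bins(distances, 4)
```

Equal-frequency (quantile) bins are another reasonable choice when the values are heavily skewed.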
Now we’ll make a binary feature that is true if and only if the tree is
in a parkway in ward 10 using
diogenes.modify.modify.where_all_are_true() (which has similar
syntax to the selection functions).
print classes['If Yes, where is the debris located?']
['' 'Alley' 'Parkway' 'Vacant Lot']
We note that “Parkway” is category 2, so we will select items that equal 2 in the “If Yes, where is the debris located?” column and 10 in the “Ward” column.
arguments = [{'func': diogenes.modify.row_val_eq,
              'col_name': 'If Yes, where is the debris located?',
              'vals': 2},
             {'func': diogenes.modify.row_val_eq,
              'col_name': 'Ward',
              'vals': 10}]
parkway_in_ward_10 = diogenes.modify.where_all_are_true(data, arguments)
print np.where(parkway_in_ward_10)
(array([ 3, 149, 174, 248, 274, 277, 598, 672, 675,
698, 796, 945, 949, 963, 1061, 1184, 1206, 1408,
1509, 1799, 1902, 2077, 2177, 2185, 2193, 2213, 2215,
2230, 2341, 2439, 2444, 2562, 2668, 2683, 2790, 2807,
2943, 3181, 3189, 3230, 3232, 3235, 3236, 3237, 3339,
3345, 3603, 3609, 3624, 3639, 3824, 3950, 3979, 3998,
4002, 4005, 4208, 4224, 4274, 4378, 4391, 4440, 4446,
4460, 4486, 4557, 4558, 4630, 4751, 4893, 5074, 5190,
5224, 5266, 5288, 5296, 5328, 5372, 5373, 5399, 5531,
5603, 5613, 5728, 5729, 5819, 6040, 6052, 6056, 6191,
6192, 6517, 6528, 6593, 6682, 7038, 7376, 7387, 7397,
7398, 7600, 7883, 7884, 7948, 8132, 8177, 8344, 8499,
8516, 8682, 8691, 8699, 8718, 8773, 8776, 8912, 8955,
9333, 9424, 9435, 9501, 9518, 9681, 10363, 10731, 10732,
10735, 11107, 11259, 11268, 11288, 11582, 11590, 11777, 11956,
12064, 12135, 13028, 13067, 13402, 13493, 13603, 13787, 14093,
14466, 14484, 14553, 14618, 14625, 14632, 14915, 15248, 15356,
15364, 15407]),)
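Conceptually, a feature of this kind is the element-wise AND of one boolean mask per condition. The following sketch shows that idea in plain NumPy with made-up data and a shortened column name; it is not Diogenes’ implementation.

```python
import numpy as np

# Made-up encoded data: location code 2 means 'Parkway'
M = np.array([(2, 10.0), (1, 10.0), (2, 7.0), (2, 10.0)],
             dtype=[('location', '<i8'), ('Ward', '<f8')])

# One boolean mask per condition
masks = [M['location'] == 2, M['Ward'] == 10]

# A row qualifies only if every mask is True for it
all_true = np.logical_and.reduce(masks)
row_indices = np.where(all_true)[0]
```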
Finally, we’ll add all of our generated features to our data using
diogenes.utils.append_cols():
data = diogenes.utils.append_cols(data, [dist_from_cloud_gate, dist_binned, parkway_in_ward_10],
                                  ['dist_from_cloud_gate', 'dist_binned', 'parkway_in_ward_10'])
print data.dtype
[('Creation Date', '<M8[ns]'), ('Status', '<i8'), ('Completion Date', '<M8[ns]'), ('Service Request Number', '<i8'), ('If Yes, where is the debris located?', '<i8'), ('Current Activity', '<i8'), ('Most Recent Action', '<i8'), ('Street Address', '<i8'), ('ZIP Code', '<f8'), ('X Coordinate', '<f8'), ('Y Coordinate', '<f8'), ('Ward', '<f8'), ('Police District', '<f8'), ('Community Area', '<f8'), ('Latitude', '<f8'), ('Longitude', '<f8'), ('Location', '<i8'), ('dist_from_cloud_gate', '<f8'), ('dist_binned', '<i8'), ('parkway_in_ward_10', '?')]
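For readers working with structured arrays outside Diogenes, NumPy ships a comparable helper, numpy.lib.recfunctions.append_fields. The sketch below uses made-up data and is offered as a point of comparison, not as a description of how append_cols works internally.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Made-up base array and two new columns of matching length
M = np.array([(10.0,), (7.0,)], dtype=[('Ward', '<f8')])
dist = np.array([12.4, 8.1])
flag = np.array([True, False])

# usemask=False returns a plain structured array rather than a masked one
M2 = rfn.append_fields(M, ['dist', 'flag'], [dist, flag], usemask=False)
```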
Last steps¶
Now, all we have to do is remove the “Status” column from the rest of the data (along with the highly correlated “Completion Date”) and we’re ready to run an experiment.
labels = data['Status']
M = diogenes.utils.remove_cols(data, ['Status', 'Completion Date'])
exp = diogenes.grid_search.experiment.Experiment(M, labels)
exp.run()
[Trial(clf=<class 'sklearn.ensemble.forest.RandomForestClassifier'>, clf_params={}, subset=<class 'diogenes.grid_search.subset.SubsetNoSubset'>, subset_params={}, cv=<class 'sklearn.cross_validation.KFold'>, cv_params={})]
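The same split of labels from features can be sketched with NumPy alone, which may be useful if you want to pass the data to Scikit-Learn directly instead of using Experiment. The data below is made up, and the final conversion to a plain 2-D array is an extra step that the Diogenes workflow above does not require.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Made-up cleaned data with a label column and a leaky date column
data = np.array(
    [(0, np.datetime64('2015-03-01'), 10.0, 3.5),
     (2, np.datetime64('2015-04-01'), 7.0, 4.5)],
    dtype=[('Status', '<i8'), ('Completion Date', '<M8[ns]'),
           ('Ward', '<f8'), ('dist', '<f8')])

labels = data['Status']

# Drop the label and the column that is highly correlated with it
features = rfn.drop_fields(data, ['Status', 'Completion Date'], usemask=False)

# Convert the remaining float fields to an ordinary 2-D array
X = rfn.structured_to_unstructured(features)
```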