diogenes.modify package¶
Submodules¶
diogenes.modify.modify module¶
This module provides a number of operations to modify structured arrays.
A number of functions take a parameter called “arguments” of type list of dict. Diogenes expects these parameters to be expressed in the following format for functions operating on columns (choose_cols_where, remove_cols_where):
[{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1},
 {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2},
 ...
 {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N}]
and in this format for functions operating on rows (choose_rows_where, remove_rows_where, where_all_are_true):
[{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1, 'col_name': LAMBDA_COL_1},
 {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2, 'col_name': LAMBDA_COL_2},
 ...
 {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N, 'col_name': LAMBDA_COL_N}]
In either case, the user can think of arguments as a query to be matched against certain rows or columns. Some operation will then be performed on the matched rows or columns. For example, in choose_rows_where, an array will be returned that has only the rows that matched the query. In remove_cols_where, all columns that matched the query will be removed.
Each dictionary is a single directive. The value assigned to the ‘func’ key is a function that returns a boolean array signifying the rows or columns that pass a certain check. The value assigned to the ‘vals’ key is an argument to be passed to the function assigned to the ‘func’ key. For queries affecting rows, the value assigned to the ‘col_name’ key is the name of the column over which the ‘func’ function should be applied. For example, in order to pick all rows for which the ‘year’ column is between 1990 and 2000, we would create the directive:
{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
 'col_name': 'year'}
To pick columns where every cell in the column is 0, we would create the directive:
{'func': diogenes.modify.col_val_eq, 'vals': 0}
Ultimately, diogenes will pick the columns or rows for which all directives in the passed list are True. For example, if we want to pick rows for which the ‘year’ column is between 1990 and 2000, we use:
arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
            'col_name': 'year'}]
If we want to pick rows for which the ‘year’ column is between 1990 and 2000 and the ‘gender’ column is ‘F’ we use:
arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
            'col_name': 'year'},
           {'func': diogenes.modify.row_val_eq, 'vals': 'F',
            'col_name': 'gender'}]
Note that arguments must always be a list of dict, so even if there is only one directive it must be in a list.
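The way a list of directives is combined can be sketched in plain NumPy. This is a minimal illustration, not the diogenes implementation: row_val_between and row_val_eq below are local stand-ins that only mirror the documented signatures, and all_directives_true is a hypothetical helper name.

```python
import numpy as np

# Local stand-ins mirroring the documented (M, col_name, boundary) signature.
def row_val_between(M, col_name, boundary):
    col = M[col_name]
    return np.logical_and(boundary[0] <= col, col <= boundary[1])

def row_val_eq(M, col_name, boundary):
    return M[col_name] == boundary

def all_directives_true(M, arguments):
    """AND together the boolean row mask produced by each directive."""
    mask = np.ones(M.shape[0], dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['col_name'], directive['vals'])
    return mask

M = np.array(
    [(1985, 'F'), (1992, 'F'), (1995, 'M'), (2000, 'F')],
    dtype=[('year', int), ('gender', 'U1')],
)
arguments = [
    {'func': row_val_between, 'vals': [1990, 2000], 'col_name': 'year'},
    {'func': row_val_eq, 'vals': 'F', 'col_name': 'gender'},
]

mask = all_directives_true(M, arguments)
selected = M[mask]    # what a "choose rows" operation would keep
removed = M[~mask]    # what a "remove rows" operation would keep
```

Here only the 1992 and 2000 rows satisfy both directives, so selected holds those two rows and removed holds the other two.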
-
diogenes.modify.modify.
choose_cols_where
(M, arguments)¶ Returns a structured array containing only columns adhering to a query
Parameters: - M (numpy.ndarray) – Structured array
- arguments (list of dict) – See module documentation
Returns: Structured array with only specified columns
Return type: numpy.ndarray
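As a rough illustration of column selection, the sketch below applies column directives to a small structured array. It assumes each 'func' is called as func(M, vals) and returns one boolean per column; col_val_eq here is a local stand-in and choose_cols_sketch is a hypothetical name, not the library function.

```python
import numpy as np

# Local stand-in mirroring the documented (M, boundary) signature.
def col_val_eq(M, boundary):
    return np.array([np.all(M[name] == boundary) for name in M.dtype.names])

def choose_cols_sketch(M, arguments):
    """Keep only the columns for which every directive returns True."""
    mask = np.ones(len(M.dtype.names), dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['vals'])
    keep = [name for name, picked in zip(M.dtype.names, mask) if picked]
    return M[keep]

M = np.array(
    [(0, 1, 0), (0, 2, 0)],
    dtype=[('a', int), ('b', int), ('c', int)],
)
all_zero_cols = choose_cols_sketch(M, [{'func': col_val_eq, 'vals': 0}])
```

Columns 'a' and 'c' contain only zeros, so they are the only ones retained.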
-
diogenes.modify.modify.
choose_rows_where
(M, arguments)¶ Returns a structured array containing only rows adhering to a query
Parameters: - M (numpy.ndarray) – Structured array
- arguments (list of dict) – See module documentation
Returns: Structured array with only specified rows
Return type: numpy.ndarray
-
diogenes.modify.modify.
col_fewer_than_n_nonzero
(M, boundary=2)¶ Pick columns that have fewer than a specified number of nonzeros
To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
Parameters: - M (numpy.ndarray) – structured array
- boundary (int) – If the number of nonzeros is at or above boundary, the column will not be picked
Returns: boolean array: True if column picked, False if not.
Return type: numpy.ndarray
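The "fewer than boundary" rule can be illustrated with np.count_nonzero. This is a sketch under the assumption that each column is counted independently; the function name is hypothetical.

```python
import numpy as np

def col_fewer_than_n_nonzero_sketch(M, boundary=2):
    """True for each column whose count of nonzero cells is below boundary."""
    return np.array(
        [np.count_nonzero(M[name]) < boundary for name in M.dtype.names]
    )

M = np.array(
    [(0, 1, 5), (0, 2, 6), (3, 0, 7)],
    dtype=[('a', int), ('b', int), ('c', int)],
)
sparse_mask = col_fewer_than_n_nonzero_sketch(M, boundary=2)
```

Column 'a' has one nonzero and is picked; 'b' has exactly two, which is at the boundary, so it is not picked.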
-
diogenes.modify.modify.
col_has_lt_threshold_unique_values
(M, threshold)¶ Pick columns that have fewer than a specified number of unique values
To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
Parameters: - M (numpy.ndarray) – structured array
- threshold (int) – If the number of unique values is at or above threshold, the column will not be picked
Returns: boolean array: True if column picked, False if not.
Return type: numpy.ndarray
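A comparable sketch for the unique-value check uses np.unique. The function name is hypothetical, and the per-column counting is an assumption about how the check is applied.

```python
import numpy as np

def col_has_lt_threshold_unique_values_sketch(M, threshold):
    """True for each column with fewer than threshold distinct values."""
    return np.array(
        [np.unique(M[name]).size < threshold for name in M.dtype.names]
    )

M = np.array(
    [(1, 'x'), (1, 'y'), (1, 'z')],
    dtype=[('const', int), ('id', 'U1')],
)
low_cardinality = col_has_lt_threshold_unique_values_sketch(M, threshold=2)
```

With threshold=2 this flags constant columns such as 'const', which is a common reason to drop a column.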
-
diogenes.modify.modify.
col_random
(M, boundary)¶ Pick random columns
To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
Parameters: - M (numpy.ndarray) – structured array
- boundary (int) – number of columns to pick
Returns: boolean array: True if column picked, False if not.
Return type: numpy.ndarray
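One way to build such a mask is to draw column indices without replacement. The sketch below is an assumption about the behaviour, not the library code; the rng parameter is a hypothetical addition that makes the draw reproducible.

```python
import numpy as np

def col_random_sketch(M, boundary, rng=None):
    """Boolean mask with exactly `boundary` randomly chosen columns set True."""
    rng = np.random.default_rng() if rng is None else rng
    n_cols = len(M.dtype.names)
    mask = np.zeros(n_cols, dtype=bool)
    mask[rng.choice(n_cols, size=boundary, replace=False)] = True
    return mask

M = np.zeros(3, dtype=[('a', int), ('b', int), ('c', int), ('d', int)])
random_mask = col_random_sketch(M, boundary=2, rng=np.random.default_rng(0))
```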
-
diogenes.modify.modify.
col_val_eq
(M, boundary)¶ Pick columns where every cell equals specified value
To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
Parameters: - M (numpy.ndarray) – structured array
- boundary (number or str or bool) – if every cell==boundary, the column will be picked
Returns: boolean array: True if column picked, False if not.
Return type: numpy.ndarray
-
diogenes.modify.modify.
col_val_eq_any
(M, boundary=None)¶ Pick columns for which every cell has the same value
To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
Parameters: - M (numpy.ndarray) – structured array
- boundary (None) – ignored
Returns: boolean array: True if column picked, False if not.
Return type: numpy.ndarray
-
diogenes.modify.modify.
combine_cols
(M, lambd, col_names)¶ Returns an array that is a function of existing columns
Parameters: - M (numpy.ndarray) – structured array
- lambd (function: list of np.ndarray -> np.ndarray) – Function that takes a list of columns and produces a single column.
- col_names (list of str) – Names of columns to combine
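The idea can be sketched as follows. Whether the library passes the columns to lambd as one list or as separate positional arguments is not stated here; this sketch passes a single list, and combine_cols_sketch is a hypothetical name.

```python
import numpy as np

def combine_cols_sketch(M, lambd, col_names):
    """Apply lambd to the list of named columns and return the new column."""
    return lambd([M[name] for name in col_names])

M = np.array(
    [(1.0, 2.0), (3.0, 4.0)],
    dtype=[('height', float), ('width', float)],
)
# Multiply two columns element-wise to derive an 'area' column.
area = combine_cols_sketch(M, lambda cols: cols[0] * cols[1], ['height', 'width'])
```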
-
diogenes.modify.modify.
distance_from_point
(lat_origin, lng_origin, lat_col, lng_col)¶ Generates a column of how far each record is from the origin
Parameters: - lat_origin (number) –
- lng_origin (number) –
- lat_col (np.ndarray) –
- lng_col (np.ndarray) –
Returns: Return type: np.ndarray
-
diogenes.modify.modify.
generate_bin
(col, num_bins)¶ Generates a column of categories, where each category is a bin.
Parameters: col (np.ndarray) –
Returns: Return type: np.ndarray
Examples
>>> M = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
>>> generate_bin(M, 3)
[0 3 0 1 2 1 2]
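The example output is consistent with scaling each value linearly onto the range 0 to num_bins and truncating to an integer. The sketch below uses that formula as an assumption; it reproduces the example but is not confirmed to be the library's implementation, and generate_bin_sketch is a hypothetical name.

```python
import numpy as np

def generate_bin_sketch(col, num_bins):
    """Map each value to int((x - min) / (max - min) * num_bins)."""
    col = np.asarray(col, dtype=float)
    minimum = col.min()
    span = col.max() - minimum
    return ((col - minimum) / span * num_bins).astype(int)

M = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
bins = generate_bin_sketch(M, 3)
```

Under this formula the maximum value lands in category num_bins itself, which is why the example with num_bins=3 contains the label 3 alongside 0, 1 and 2.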
-
diogenes.modify.modify.
label_encode
(M, force_columns=[])¶ Changes string cols to ints so that there is a 1-1 mapping between strings and ints
Parameters: - M (numpy.ndarray) – structured array
- force_columns (list of str) – By default, label_encode will only encode string columns. If the name of a numerical column is also present in force_columns, then that column will also be label encoded
Returns: A tuple: the first element is a structured array with strings mapped to ints. The second element is a dictionary where keys are column names and values are arrays of the strings that belong to each class, as in: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Return type: (numpy.ndarray, dict of str: array)
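For a single column, the string-to-int mapping and its array of classes can be sketched with np.unique. This illustrates the idea only; label_encode_col is a hypothetical helper, and the library's handling of whole structured arrays is not shown.

```python
import numpy as np

def label_encode_col(col):
    """Return integer codes and the sorted array of classes for one column."""
    classes, codes = np.unique(col, return_inverse=True)
    return codes, classes

col = np.array(['red', 'blue', 'red', 'green'])
codes, classes = label_encode_col(col)

# Indexing the classes with the codes recovers the original strings.
decoded = classes[codes]
```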
-
diogenes.modify.modify.
normalize
(col, mean=None, stddev=None, return_fit=False)¶ Generate a normalized column.
Normalize both mean and std dev.
Parameters: - col (np.ndarray) –
- mean (float or None) – Mean to use for fit. If none, will use 0
- stddev (float or None) –
- return_fit (boolean) – If True, returns tuple of fitted col, mean, and standard dev of fit. If False, only returns fitted col
Returns: Return type: np.ndarray or (np.array, float, float)
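The purpose of returning the fit is to reuse one column's mean and standard deviation on another column. The sketch below assumes the usual (col - mean) / stddev transformation and takes both statistics explicitly; normalize_with_fit is a hypothetical name, and the library's defaults are not modelled.

```python
import numpy as np

def normalize_with_fit(col, mean, stddev):
    """Shift by mean and scale by stddev."""
    return (col - mean) / stddev

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test = np.array([3.0, 6.0])

# Fit on the training column, then apply the same fit to the test column.
mean, stddev = train.mean(), train.std()
train_norm = normalize_with_fit(train, mean, stddev)
test_norm = normalize_with_fit(test, mean, stddev)
```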
-
diogenes.modify.modify.
remove_cols_where
(M, arguments)¶ Returns a structured array containing columns not adhering to a query
Parameters: - M (numpy.ndarray) – Structured array
- arguments (list of dict) – See module documentation
Returns: Structured array without specified columns
Return type: numpy.ndarray
-
diogenes.modify.modify.
remove_rows_where
(M, arguments)¶ Returns a structured array containing rows not adhering to a query
Parameters: - M (numpy.ndarray) – Structured array
- arguments (list of dict) – See module documentation
Returns: Structured array without specified rows
Return type: numpy.ndarray
-
diogenes.modify.modify.
replace_missing_vals
(M, strategy, missing_val=nan, constant=0)¶ Replace values signifying missing data with some substitute
Parameters: - M (numpy.ndarray) – structured array
- strategy ({‘mean’, ‘median’, ‘most_frequent’, ‘constant’}) – method to use to replace missing data
- missing_val – value that M uses to represent missing data, e.g. numpy.nan for floats or -999 for integers
- constant (int) – If the ‘constant’ strategy is chosen, this is the value to replace missing_val with
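The 'mean' strategy on a single column might look like the sketch below. It is written under the assumption that the substitute is computed from the non-missing cells only; replace_missing_mean_sketch is a hypothetical name and the other strategies are not shown.

```python
import numpy as np

def replace_missing_mean_sketch(col, missing_val=np.nan):
    """Replace cells equal to missing_val with the mean of the other cells."""
    col = np.asarray(col, dtype=float)
    # NaN never compares equal to itself, so it needs np.isnan.
    if np.isnan(missing_val):
        missing = np.isnan(col)
    else:
        missing = col == missing_val
    filled = col.copy()
    filled[missing] = col[~missing].mean()
    return filled

filled_nan = replace_missing_mean_sketch(np.array([1.0, np.nan, 3.0]))
filled_sentinel = replace_missing_mean_sketch(np.array([1, -999, 5]), missing_val=-999)
```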
-
diogenes.modify.modify.
row_is_nan
(M, col_name, boundary=None)¶ Picks rows for which cell is np.nan
To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
Parameters: - M (numpy.ndarray) – structured array
- col_name (str) – name of column to check
- boundary (None) – unused
Returns: boolean array: True if row picked, False if not.
Return type: numpy.ndarray
-
diogenes.modify.modify.
row_is_outlier
(M, col_name, boundary=3.0)¶ Picks rows that are not within a given number of standard deviations of the mean
To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
Parameters: - M (numpy.ndarray) – structured array
- col_name (str) – name of column to check
- boundary (float) – number of standard deviations from mean required to be considered an outlier
Returns: boolean array: True if row picked, False if not.
Return type: numpy.ndarray
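The outlier rule can be sketched as a comparison against boundary standard deviations. The use of the population standard deviation (np.std with its default) and a strict inequality are assumptions; row_is_outlier_sketch is a hypothetical name.

```python
import numpy as np

def row_is_outlier_sketch(M, col_name, boundary=3.0):
    """True where a cell lies more than boundary std devs from the column mean."""
    col = M[col_name]
    return np.abs(col - np.mean(col)) > boundary * np.std(col)

# Ten ordinary values and one extreme value.
M = np.array([(10.0,)] * 10 + [(1000.0,)], dtype=[('income', float)])
outliers = row_is_outlier_sketch(M, 'income', boundary=3.0)
```

Only the last row is flagged, so passing this mask to a row-removal step would drop the extreme value and keep the rest.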
-
diogenes.modify.modify.
row_is_within_region
(M, col_names, boundary)¶ Picks rows for which cell is within a spatial region
To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
Parameters: - M (numpy.ndarray) – structured array
- col_names (list of str) – pair of column names signifying x and y coordinates
- boundary (array of points) – shape which cell must be within
Returns: boolean array: True if row picked, False if not.
Return type: numpy.ndarray
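How the library tests containment is not described here. As one possible approach, the sketch below uses even-odd ray casting against a polygon given as a list of (x, y) vertices; both function names are hypothetical, and points lying exactly on an edge are not treated specially.

```python
import numpy as np

def points_in_polygon(x, y, polygon):
    """Even-odd ray casting for arrays of x and y coordinates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    inside = np.zeros(x.shape, dtype=bool)
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if y1 == y2:
            continue  # horizontal edges never cross a horizontal ray
        # Does the edge straddle the horizontal line through each point?
        straddles = (y1 > y) != (y2 > y)
        # x coordinate where the edge meets that horizontal line
        x_cross = (x2 - x1) * (y - y1) / (y2 - y1) + x1
        # Each crossing to the right of the point flips inside/outside.
        inside ^= straddles & (x < x_cross)
    return inside

def row_is_within_region_sketch(M, col_names, boundary):
    return points_in_polygon(M[col_names[0]], M[col_names[1]], boundary)

M = np.array(
    [(2.0, 2.0), (5.0, 2.0), (-1.0, 1.0)],
    dtype=[('lng', float), ('lat', float)],
)
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
in_region = row_is_within_region_sketch(M, ['lng', 'lat'], square)
```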
-
diogenes.modify.modify.
row_val_between
(M, col_name, boundary)¶ Picks rows for which cell is between the specified values
To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
Parameters: - M (numpy.ndarray) – structured array
- col_name (str) – name of column to check
- boundary ((number, number)) – To pick a row, the cell must be greater than or equal to boundary[0] and less than or equal to boundary[1]
Returns: boolean array: True if row picked, False if not.
Return type: numpy.ndarray
-
diogenes.modify.modify.
row_val_eq
(M, col_name, boundary)¶ Picks rows for which cell is equal to a specified value
To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
Parameters: - M (numpy.ndarray) – structured array
- col_name (str) – name of column to check
- boundary (number or str) – value to which cell must be equal
Returns: boolean array: True if row picked, False if not.
Return type: numpy.ndarray
-
diogenes.modify.modify.
row_val_gt
(M, col_name, boundary)¶ Picks rows for which cell is greater than a specified value
To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
Parameters: - M (numpy.ndarray) – structured array
- col_name (str) – name of column to check
- boundary (number) – value which cell must be greater than
Returns: boolean array: True if row picked, False if not.
Return type: numpy.ndarray
-
diogenes.modify.modify.
row_val_lt
(M, col_name, boundary)¶ Picks rows for which cell is less than a specified value
To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
Parameters: - M (numpy.ndarray) – structured array
- col_name (str) – name of column to check
- boundary (number) – value which cell must be less than
Returns: boolean array: True if row picked, False if not.
Return type: numpy.ndarray
-
diogenes.modify.modify.
where_all_are_true
(M, arguments)¶ Returns a boolean array which specifies which rows pass a query
Parameters: - M (numpy.ndarray) – Structured array
- arguments (list of dict) – See module documentation
Returns: boolean array specifying which rows pass a query
Return type: numpy.ndarray