diogenes.modify package
Submodules
diogenes.modify.modify module
This module provides a number of operations to modify structured arrays.
A number of functions take a parameter called “arguments” of type list of dict. Diogenes expects these parameters to be expressed in the following format for functions operating on columns (choose_cols_where, remove_cols_where):
    [{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1},
     {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2},
     ...
     {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N}]
and in this format for functions operating on rows (choose_rows_where, remove_rows_where, where_all_are_true):
    [{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1, 'col_name': LAMBDA_COL_1},
     {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2, 'col_name': LAMBDA_COL_2},
     ...
     {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N, 'col_name': LAMBDA_COL_N}]
In either case, the user can think of arguments as a query to be matched against certain rows or columns. Some operation will then be performed on the matched rows or columns. For example, in choose_rows_where, an array will be returned that has only the rows that matched the query. In remove_cols_where, all columns that matched the query will be removed.
Each dictionary is a single directive. The value assigned to the ‘func’ key is a function that returns a boolean array signifying the rows or columns that pass a certain check. The value assigned to the ‘vals’ key is the argument to be passed to that function. For queries affecting rows, the value assigned to the ‘col_name’ key is the column over which the ‘func’ function should be applied. For example, in order to pick all rows for which the ‘year’ column is between 1990 and 2000, we would create the directive:
    {'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
     'col_name': 'year'}
To pick columns where every cell in the column is 0, we would create the directive:
    {'func': diogenes.modify.col_val_eq, 'vals': 0}
Ultimately, diogenes will pick the columns or rows for which all directives in the passed list are True. For example, if we want to pick rows for which the ‘year’ column is between 1990 and 2000, we use:
    arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
                'col_name': 'year'}]
If we want to pick rows for which the ‘year’ column is between 1990 and 2000 and the ‘gender’ column is ‘F’ we use:
    arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
                'col_name': 'year'},
               {'func': diogenes.modify.row_val_eq, 'vals': 'F',
                'col_name': 'gender'}]
Note that arguments must always be a list of dict, so even if there is only one directive it must be in a list.
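The way a list of row directives turns into a selection can be sketched with plain NumPy. The functions below are simplified stand-ins written for this illustration, not the diogenes source: `row_val_between` and `row_val_eq` mirror the documented signatures, while `rows_matching` is a hypothetical helper that assumes each ‘func’ is called as `func(M, col_name, vals)` and that the resulting boolean arrays are combined with a logical AND.

```python
import numpy as np

# Stand-in for diogenes.modify.row_val_between (inclusive on both ends).
def row_val_between(M, col_name, boundary):
    col = M[col_name]
    return np.logical_and(col >= boundary[0], col <= boundary[1])

# Stand-in for diogenes.modify.row_val_eq.
def row_val_eq(M, col_name, boundary):
    return M[col_name] == boundary

# Hypothetical evaluator: a row passes only if every directive is True for it.
def rows_matching(M, arguments):
    mask = np.ones(M.shape[0], dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['col_name'], directive['vals'])
    return mask

M = np.array(
    [(1985, 'F'), (1992, 'M'), (1995, 'F'), (2000, 'F'), (2003, 'M')],
    dtype=[('year', int), ('gender', 'U1')])

arguments = [
    {'func': row_val_between, 'vals': [1990, 2000], 'col_name': 'year'},
    {'func': row_val_eq, 'vals': 'F', 'col_name': 'gender'},
]

mask = rows_matching(M, arguments)
selected = M[mask]          # what a choose-style function would keep
print(selected['year'])     # [1995 2000]
```

Indexing with `M[~mask]` instead would give the complementary, remove-style result.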
diogenes.modify.modify.choose_cols_where(M, arguments)
- Returns a structured array containing only columns adhering to a query
- Parameters:
  - M (numpy.ndarray) – Structured array
  - arguments (list of dict) – See module documentation
- Returns: Structured array with only specified columns
- Return type: numpy.ndarray
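As a rough illustration of a column query, the sketch below evaluates one directive against the field names of a structured array. `col_val_eq` here is a stand-in with the documented signature, and `cols_matching` is a hypothetical helper that assumes each ‘func’ is called as `func(M, vals)` and returns one boolean per column.

```python
import numpy as np

# Stand-in for diogenes.modify.col_val_eq: one boolean per column,
# True when every cell in that column equals `boundary`.
def col_val_eq(M, boundary):
    return np.array([np.all(M[name] == boundary) for name in M.dtype.names])

# Hypothetical evaluator: a column passes only if every directive is True for it.
def cols_matching(M, arguments):
    mask = np.ones(len(M.dtype.names), dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['vals'])
    return mask

M = np.array([(1, 0, 0.0), (2, 0, 0.0), (3, 0, 1.5)],
             dtype=[('id', int), ('empty', int), ('score', float)])

mask = cols_matching(M, [{'func': col_val_eq, 'vals': 0}])
names = M.dtype.names

# Columns that match the query (choose-style) and those that do not (remove-style).
chosen = M[[n for n, keep in zip(names, mask) if keep]]
remaining = M[[n for n, keep in zip(names, mask) if not keep]]
print(chosen.dtype.names)     # ('empty',)
print(remaining.dtype.names)  # ('id', 'score')
```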
diogenes.modify.modify.choose_rows_where(M, arguments)
- Returns a structured array containing only rows adhering to a query
- Parameters:
  - M (numpy.ndarray) – Structured array
  - arguments (list of dict) – See module documentation
- Returns: Structured array with only specified rows
- Return type: numpy.ndarray
diogenes.modify.modify.col_fewer_than_n_nonzero(M, boundary=2)
- Pick columns that have fewer than a specified number of nonzeros
- To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
- Parameters:
  - M (numpy.ndarray) – structured array
  - boundary (int) – If the number of nonzeros in a column is at or above boundary, the column will not be picked
- Returns: boolean array: True if column picked, False if not.
- Return type: numpy.ndarray
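The check described above can be approximated by counting nonzero cells per field. This is a minimal sketch that reuses the documented name and signature but is not the library's implementation.

```python
import numpy as np

# Stand-in: True for each column whose nonzero count is strictly below `boundary`.
def col_fewer_than_n_nonzero(M, boundary=2):
    return np.array([np.count_nonzero(M[name]) < boundary
                     for name in M.dtype.names])

M = np.array([(0, 1, 5), (0, 0, 6), (3, 0, 7)],
             dtype=[('a', int), ('b', int), ('c', int)])

mask = col_fewer_than_n_nonzero(M, boundary=2)
print(mask)  # [ True  True False]
```

Columns ‘a’ and ‘b’ each hold a single nonzero value, so they are picked; ‘c’ has three and is not.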
diogenes.modify.modify.col_has_lt_threshold_unique_values(M, threshold)
- Pick columns that have fewer than a specified number of unique values
- To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
- Parameters:
  - M (numpy.ndarray) – structured array
  - threshold (int) – If the number of unique values in a column is at or above threshold, the column will not be picked
- Returns: boolean array: True if column picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.col_random(M, boundary)
- Pick random columns
- To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
- Parameters:
  - M (numpy.ndarray) – structured array
  - boundary (int) – number of columns to pick
- Returns: boolean array: True if column picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.col_val_eq(M, boundary)
- Pick columns where every cell equals specified value
- To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
- Parameters:
  - M (numpy.ndarray) – structured array
  - boundary (number or str or bool) – if every cell==boundary, the column will be picked
- Returns: boolean array: True if column picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.col_val_eq_any(M, boundary=None)
- Pick columns for which every cell has the same value
- To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)
- Parameters:
  - M (numpy.ndarray) – structured array
  - boundary (None) – ignored
- Returns: boolean array: True if column picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.combine_cols(M, lambd, col_names)
- Return an array that is a function of existing columns
- Parameters:
  - M (numpy.ndarray) – structured array
  - lambd (list of np.array -> np.array) – Function that takes a list of columns and produces a single column.
  - col_names (list of str) – Names of columns to combine
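A small sketch of how such a combining function might be used follows. It assumes, as the parameter description says, that `lambd` receives the selected columns as a single list; the library may instead unpack them as separate positional arguments, so check the source before relying on this. The stand-in `combine_cols` below is written for this example only.

```python
import numpy as np

# Stand-in, assuming lambd is called with one list holding the named columns.
def combine_cols(M, lambd, col_names):
    return lambd([M[name] for name in col_names])

M = np.array([(1.0, 2.0), (3.0, 5.0)],
             dtype=[('height', float), ('width', float)])

# Multiply the two columns element-wise to produce a new column.
area = combine_cols(M, lambda cols: cols[0] * cols[1], ['height', 'width'])
print(area)  # [ 2. 15.]
```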
 
diogenes.modify.modify.distance_from_point(lat_origin, lng_origin, lat_col, lng_col)
- Generates a column of how far each record is from the origin
- Parameters:
  - lat_origin (number)
  - lng_origin (number)
  - lat_col (np.ndarray)
  - lng_col (np.ndarray)
- Return type: np.ndarray
diogenes.modify.modify.generate_bin(col, num_bins)
- Generates a column of categories, where each category is a bin.
- Parameters:
  - col (np.ndarray)
- Return type: np.ndarray
- Examples:
    >>> M = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
    >>> generate_bin(M, 3)
    [0 3 0 1 2 1 2]
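The documented example is consistent with equal-width bins spanning the column's minimum and maximum. The sketch below is one way to reproduce that output with `np.linspace` and `np.digitize`; the binning rule is an assumption inferred from the example, and `generate_bin_sketch` is a hypothetical name, not the library function.

```python
import numpy as np

# Assumed rule: num_bins equal-width bins between min(col) and max(col),
# with each value labeled by the index of the bin edge at or below it.
def generate_bin_sketch(col, num_bins):
    edges = np.linspace(np.min(col), np.max(col), num_bins + 1)
    return np.digitize(col, edges) - 1

col = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
bins = generate_bin_sketch(col, 3)
print(bins)  # [0 3 0 1 2 1 2]
```

Under this rule the maximum value sits exactly on the last edge and receives the label `num_bins`, which is why the example shows a label of 3 when 3 bins are requested.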
diogenes.modify.modify.label_encode(M, force_columns=[])
- Changes string cols to ints so that there is a 1-1 mapping between strings and ints
- Parameters:
  - M (numpy.ndarray) – structured array
  - force_columns (list of str) – By default, label_encode will only encode string columns. If the name of a numerical column is also present in force_columns, then that column will also be label encoded
- Returns: A tuple: the first element is the structured array with strings mapped to ints. The second element is a dictionary where keys are column names and values are arrays of the strings that belong to each class, as in: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
- Return type: (numpy.ndarray, dict of str: array)
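To make the shape of the return value concrete, here is a minimal sketch of label encoding on a structured array using `np.unique(..., return_inverse=True)`. The function name `label_encode_sketch` and its internals are assumptions for illustration; the library may build its result differently.

```python
import numpy as np

def label_encode_sketch(M, force_columns=()):
    classes = {}        # column name -> array of original values, one per class
    new_dtype = []
    new_cols = []
    for name in M.dtype.names:
        col = M[name]
        if col.dtype.kind in ('U', 'S', 'O') or name in force_columns:
            # codes[i] is the index of col[i] within the sorted unique values
            uniq, codes = np.unique(col, return_inverse=True)
            classes[name] = uniq
            new_cols.append(codes)
            new_dtype.append((name, int))
        else:
            new_cols.append(col)
            new_dtype.append((name, col.dtype))
    out = np.empty(M.shape, dtype=new_dtype)
    for name, col in zip(M.dtype.names, new_cols):
        out[name] = col
    return out, classes

M = np.array([('F', 1990), ('M', 1995), ('F', 2000)],
             dtype=[('gender', 'U1'), ('year', int)])

encoded, classes = label_encode_sketch(M)
print(encoded['gender'])   # [0 1 0]
print(classes['gender'])   # ['F' 'M']
```

The original string for any code can be recovered with `classes['gender'][encoded['gender']]`.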
diogenes.modify.modify.normalize(col, mean=None, stddev=None, return_fit=False)
- Generate a normalized column.
- Normalize both mean and std dev.
- Parameters:
  - col (np.ndarray)
  - mean (float or None) – Mean to use for fit. If None, will use 0
  - stddev (float or None)
  - return_fit (boolean) – If True, returns tuple of fitted col, mean, and standard dev of fit. If False, only returns fitted col
- Return type: np.ndarray or (np.array, float, float)
diogenes.modify.modify.remove_cols_where(M, arguments)
- Returns a structured array containing only columns not adhering to a query
- Parameters:
  - M (numpy.ndarray) – Structured array
  - arguments (list of dict) – See module documentation
- Returns: Structured array without specified columns
- Return type: numpy.ndarray
diogenes.modify.modify.remove_rows_where(M, arguments)
- Returns a structured array containing only rows not adhering to a query
- Parameters:
  - M (numpy.ndarray) – Structured array
  - arguments (list of dict) – See module documentation
- Returns: Structured array without specified rows
- Return type: numpy.ndarray
diogenes.modify.modify.replace_missing_vals(M, strategy, missing_val=nan, constant=0)
- Replace values signifying missing data with some substitute
- Parameters:
  - M (numpy.ndarray) – structured array
  - strategy ({‘mean’, ‘median’, ‘most_frequent’, ‘constant’}) – method to use to replace missing data
  - missing_val – value that M uses to represent missing data, e.g. numpy.nan for floats or -999 for integers
  - constant (int) – If the ‘constant’ strategy is chosen, this is the value to replace missing_val with
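The strategies can be illustrated with a per-column fill in plain NumPy. This sketch is an assumption about the general behavior rather than the library's code: `replace_missing_sketch` is a hypothetical name, it computes fill values per column from the non-missing cells, and it leaves out the ‘most_frequent’ strategy for brevity.

```python
import numpy as np

def replace_missing_sketch(M, strategy, missing_val=np.nan, constant=0):
    out = M.copy()
    for name in out.dtype.names:
        col = out[name]  # a view, so assignments below modify `out`
        if isinstance(missing_val, float) and np.isnan(missing_val):
            # NaN never compares equal to itself, so it needs np.isnan;
            # only float columns can hold NaN.
            if col.dtype.kind == 'f':
                missing = np.isnan(col)
            else:
                missing = np.zeros(col.shape, dtype=bool)
        else:
            missing = col == missing_val
        if not missing.any():
            continue
        present = col[~missing]
        if strategy == 'mean':
            fill = present.mean()
        elif strategy == 'median':
            fill = np.median(present)
        elif strategy == 'constant':
            fill = constant
        else:
            raise ValueError('strategy not covered by this sketch: %s' % strategy)
        col[missing] = fill
    return out

M = np.array([(1.0, 10), (np.nan, -999), (3.0, 30)],
             dtype=[('x', float), ('n', int)])

filled = replace_missing_sketch(M, 'mean')
print(filled['x'])       # [1. 2. 3.]

filled_int = replace_missing_sketch(M, 'constant', missing_val=-999, constant=0)
print(filled_int['n'])   # [10  0 30]
```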
 
diogenes.modify.modify.row_is_nan(M, col_name, boundary=None)
- Picks rows for which cell is np.nan
- To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
- Parameters:
  - M (numpy.ndarray) – structured array
  - col_name (str) – name of column to check
  - boundary (None) – unused
- Returns: boolean array: True if row picked, False if not.
- Return type: numpy.ndarray
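A typical use of this check is dropping rows with a missing value in one column. The stand-in below keeps the documented signature, including the unused `boundary` parameter that lets it be called the same way as the other row functions; it is a sketch, not the library source.

```python
import numpy as np

# Stand-in: `boundary` is accepted but ignored.
def row_is_nan(M, col_name, boundary=None):
    return np.isnan(M[col_name])

M = np.array([(1, 0.5), (2, np.nan), (3, 0.9)],
             dtype=[('id', int), ('score', float)])

mask = row_is_nan(M, 'score')
without_nan = M[~mask]      # remove-style selection: keep rows that did not match
print(without_nan['id'])    # [1 3]
```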
diogenes.modify.modify.row_is_outlier(M, col_name, boundary=3.0)
- Picks rows that are not within a given number of standard deviations of the mean
- To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
- Parameters:
  - M (numpy.ndarray) – structured array
  - col_name (str) – name of column to check
  - boundary (float) – number of standard deviations from mean required to be considered an outlier
- Returns: boolean array: True if row picked, False if not.
- Return type: numpy.ndarray
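The outlier rule can be sketched as a comparison of each cell's distance from the column mean against a multiple of the standard deviation. The stand-in below assumes the population standard deviation (`np.std` with its default `ddof=0`) and a strict comparison; the library may differ on either point.

```python
import numpy as np

# Stand-in: True where the cell is more than `boundary` standard deviations
# away from the column mean.
def row_is_outlier(M, col_name, boundary=3.0):
    col = M[col_name]
    return np.abs(col - np.mean(col)) > boundary * np.std(col)

M = np.array([(10.0,)] * 9 + [(100.0,)], dtype=[('income', float)])

mask = row_is_outlier(M, 'income', boundary=2.0)
print(mask.nonzero()[0])  # [9]
```

The example passes `boundary=2.0` because, with the population standard deviation, no value among 10 rows can lie more than 3 deviations from the mean, so the default of 3.0 would flag nothing on a sample this small.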
diogenes.modify.modify.row_is_within_region(M, col_names, boundary)
- Picks rows for which cell is within a spatial region
- To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
- Parameters:
  - M (numpy.ndarray) – structured array
  - col_names (list of str) – pair of column names signifying x and y coordinates
  - boundary (array of points) – shape which cell must be within
- Returns: boolean array: True if row picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.row_val_between(M, col_name, boundary)
- Picks rows for which cell is between the specified values
- To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
- Parameters:
  - M (numpy.ndarray) – structured array
  - col_name (str) – name of column to check
  - boundary ((number, number)) – To pick a row, the cell must be greater than or equal to boundary[0] and less than or equal to boundary[1]
- Returns: boolean array: True if row picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.row_val_eq(M, col_name, boundary)
- Picks rows for which cell is equal to a specified value
- To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
- Parameters:
  - M (numpy.ndarray) – structured array
  - col_name (str) – name of column to check
  - boundary (number) – value to which cell must be equal
- Returns: boolean array: True if row picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.row_val_gt(M, col_name, boundary)
- Picks rows for which cell is greater than a specified value
- To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
- Parameters:
  - M (numpy.ndarray) – structured array
  - col_name (str) – name of column to check
  - boundary (number) – value which cell must be greater than
- Returns: boolean array: True if row picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.row_val_lt(M, col_name, boundary)
- Picks rows for which cell is less than a specified value
- To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).
- Parameters:
  - M (numpy.ndarray) – structured array
  - col_name (str) – name of column to check
  - boundary (number) – value which cell must be less than
- Returns: boolean array: True if row picked, False if not.
- Return type: numpy.ndarray
diogenes.modify.modify.where_all_are_true(M, arguments)
- Returns a boolean array which specifies which rows pass a query
- Parameters:
  - M (numpy.ndarray) – Structured array
  - arguments (list of dict) – See module documentation
- Returns: boolean array specifying which rows pass a query
- Return type: numpy.ndarray