diogenes.modify package

Submodules

diogenes.modify.modify module

This module provides a number of operations to modify structured arrays.

A number of functions take a parameter called “arguments” of type list of dict. Diogenes expects these parameters to be expressed in the following format for functions operating on columns (choose_cols_where, remove_cols_where):

[{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1},
 {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2},
 ...
 {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N}]

and in this format for functions operating on rows (choose_rows_where, remove_rows_where, where_all_are_true):

[{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1, 'col_name': LAMBDA_COL_1},
 {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2, 'col_name': LAMBDA_COL_2},
 ...
 {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N, 'col_name': LAMBDA_COL_N}]

In either case, the user can think of arguments as a query to be matched against certain rows or columns. Some operation will then be performed on the matched rows or columns. For example, in choose_rows_where, an array will be returned that has only the rows that matched the query. In remove_cols_where, all columns that matched the query will be removed.

Each dictionary is a single directive. The value assigned to the ‘func’ key is a function that returns a boolean array signifying the rows or columns that pass a certain check. The value assigned to the ‘vals’ key is an argument to be passed to the function assigned to the ‘func’ key. For queries affecting rows, the value assigned to the ‘col_name’ key is the column over which the ‘func’ function should be applied. For example, in order to pick all rows for which the ‘year’ column is between 1990 and 2000, we would create the directive:

{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
 'col_name': 'year'}

To pick columns where every cell in the column is 0, we would create the directive:

{'func': diogenes.modify.col_val_eq, 'vals': 0}

Ultimately, diogenes will pick the columns or rows for which all directives in the passed list are True. For example, to pick rows for which the ‘year’ column is between 1990 and 2000, we use:

arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
            'col_name': 'year'}]

If we want to pick rows for which the ‘year’ column is between 1990 and 2000 and the ‘gender’ column is ‘F’ we use:

arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
            'col_name': 'year'},
           {'func': diogenes.modify.row_val_eq, 'vals': 'F',
            'col_name': 'gender'}]

Note that arguments must always be a list of dict, so even if there is only one directive it must be in a list.
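
As a rough illustration of how a list of row directives might be evaluated, the sketch below ANDs together the boolean mask returned by each directive. The helper functions are hypothetical stand-ins written for this example (they mirror the documented (M, col_name, boundary) signature), and rows_matching is not a diogenes function; the library's own implementation may differ.

```python
import numpy as np

# Hypothetical stand-ins for the row checks described above; they follow
# the documented (M, col_name, boundary) signature but are not diogenes code.
def row_val_between(M, col_name, boundary):
    return np.logical_and(M[col_name] >= boundary[0],
                          M[col_name] <= boundary[1])

def row_val_eq(M, col_name, boundary):
    return M[col_name] == boundary

def rows_matching(M, arguments):
    """AND together the boolean mask produced by every directive."""
    mask = np.ones(M.shape[0], dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['col_name'], directive['vals'])
    return mask

M = np.array(
    [(1985, 'F'), (1992, 'F'), (1995, 'M'), (2000, 'F'), (2003, 'M')],
    dtype=[('year', int), ('gender', 'U1')])

arguments = [
    {'func': row_val_between, 'vals': [1990, 2000], 'col_name': 'year'},
    {'func': row_val_eq, 'vals': 'F', 'col_name': 'gender'},
]

mask = rows_matching(M, arguments)
selected = M[mask]  # rows that satisfy every directive
```

Here only the 1992 and 2000 rows satisfy both directives, since the range check is inclusive at both ends.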

diogenes.modify.modify.choose_cols_where(M, arguments)

Returns a structured array containing only columns adhering to a query

Parameters:
  • M (numpy.ndarray) – Structured array
  • arguments (list of dict) – See module documentation
Returns:

Structured array with only specified columns

Return type:

numpy.ndarray
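
A column query can be pictured the same way: each directive yields one boolean per column, and the columns for which every directive is True are kept. The sketch below is a numpy-only approximation under that assumption; col_val_eq here is a hypothetical stand-in with the documented (M, boundary) signature, and cols_matching is an illustrative helper, not part of diogenes.

```python
import numpy as np

# Hypothetical stand-in for a column check; not the library's implementation.
def col_val_eq(M, boundary):
    return np.array([np.all(M[name] == boundary) for name in M.dtype.names])

def cols_matching(M, arguments):
    """Return the names of columns for which every directive is True."""
    mask = np.ones(len(M.dtype.names), dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['vals'])
    return [name for name, keep in zip(M.dtype.names, mask) if keep]

M = np.array([(0, 1, 0.0), (0, 2, 0.0), (0, 3, 0.0)],
             dtype=[('a', int), ('b', int), ('c', float)])

all_zero = cols_matching(M, [{'func': col_val_eq, 'vals': 0}])
chosen = M[all_zero]                      # what a choose-style call keeps
remaining = [n for n in M.dtype.names if n not in all_zero]
removed = M[remaining]                    # what a remove-style call keeps
```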

diogenes.modify.modify.choose_rows_where(M, arguments)

Returns a structured array containing only rows adhering to a query

Parameters:
  • M (numpy.ndarray) – Structured array
  • arguments (list of dict) – See module documentation
Returns:

Structured array with only specified rows

Return type:

numpy.ndarray

diogenes.modify.modify.col_fewer_than_n_nonzero(M, boundary=2)

Pick columns that have fewer than a specified number of nonzeros

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:
  • M (numpy.ndarray) – structured array
  • boundary (int) – If the number of nonzeros is at or above boundary, the column will not be picked
Returns:

boolean array: True if column picked, False if not.

Return type:

numpy.ndarray
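
The check can be sketched in plain numpy as a per-column nonzero count compared against boundary. The function name fewer_than_n_nonzero is made up for this example, and the code is an approximation of the documented behaviour rather than the library's source.

```python
import numpy as np

def fewer_than_n_nonzero(M, boundary=2):
    # One boolean per column: True when the column has < boundary nonzeros.
    return np.array([np.count_nonzero(M[name]) < boundary
                     for name in M.dtype.names])

M = np.array([(0, 1, 5), (0, 0, 6), (3, 0, 0)],
             dtype=[('a', int), ('b', int), ('c', int)])

picked = fewer_than_n_nonzero(M, boundary=2)
```

Columns a and b each contain a single nonzero and are picked; column c contains two, which is at the boundary, so it is not.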

diogenes.modify.modify.col_has_lt_threshold_unique_values(M, threshold)

Pick columns that have fewer than a specified number of unique values

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:
  • M (numpy.ndarray) – structured array
  • threshold (int) – If the number of unique values is at or above threshold, the column will not be picked
Returns:

boolean array: True if column picked, False if not.

Return type:

numpy.ndarray
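
A minimal sketch of this check, assuming uniqueness is counted with numpy.unique on each column. The helper name has_lt_threshold_unique is hypothetical and the code is only an approximation of the documented behaviour.

```python
import numpy as np

def has_lt_threshold_unique(M, threshold):
    # One boolean per column: True when it has < threshold distinct values.
    return np.array([len(np.unique(M[name])) < threshold
                     for name in M.dtype.names])

M = np.array([(1, 'x', 0.5), (1, 'y', 1.5), (1, 'x', 2.5)],
             dtype=[('const', int), ('label', 'U1'), ('score', float)])

picked = has_lt_threshold_unique(M, threshold=3)
```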

diogenes.modify.modify.col_random(M, boundary)

Pick random columns

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:
  • M (numpy.ndarray) – structured array
  • boundary (int) – number of columns to pick
Returns:

boolean array: True if column picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.col_val_eq(M, boundary)

Pick columns where every cell equals specified value

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:
  • M (numpy.ndarray) – structured array
  • boundary (number or str or bool) – if every cell==boundary, the column will be picked
Returns:

boolean array: True if column picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.col_val_eq_any(M, boundary=None)

Pick columns for which every cell has the same value

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:
  • M (numpy.ndarray) – structured array
  • boundary (None) – ignored
Returns:

boolean array: True if column picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.combine_cols(M, lambd, col_names)

Return an array that is a function of existing columns

Parameters:
  • M (numpy.ndarray) – structured array
  • lambd (list of np.array -> np.array) – Function that takes a list of columns and produces a single column.
  • col_names (list of str) – Names of columns to combine
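
The following sketch shows one way such a combination could work, assuming lambd is called with a single list holding the named columns, as the parameter description suggests. The helper name combine is invented for this example and the actual calling convention in diogenes may differ.

```python
import numpy as np

def combine(M, lambd, col_names):
    # Hand the named columns to lambd as a list and return its result.
    return lambd([M[name] for name in col_names])

M = np.array([(2.0, 3.0), (4.0, 0.5)],
             dtype=[('width', float), ('height', float)])

area = combine(M, lambda cols: cols[0] * cols[1], ['width', 'height'])
```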
diogenes.modify.modify.distance_from_point(lat_origin, lng_origin, lat_col, lng_col)

Generates a column of how far each record is from the origin

Parameters:
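  • lat_origin (number) – latitude of the origin
  • lng_origin (number) – longitude of the origin
  • lat_col (np.ndarray) – latitude of each record
  • lng_col (np.ndarray) – longitude of each record
Returns:

Distance of each record from the origin

Return type:

np.ndarray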
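
The documentation does not say which distance formula or unit is used. Purely as an illustration, the sketch below computes great-circle distances with the haversine formula in kilometres; the function name haversine_km, the choice of formula, the Earth radius and the assumption that coordinates are in degrees are all assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the unit is an assumption

def haversine_km(lat_origin, lng_origin, lat_col, lng_col):
    # Convert degrees to radians before applying the haversine formula.
    lat1, lng1 = np.radians(lat_origin), np.radians(lng_origin)
    lat2, lng2 = np.radians(lat_col), np.radians(lng_col)
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

lat_col = np.array([0.0, 0.0, 90.0])
lng_col = np.array([0.0, 90.0, 0.0])
dist = haversine_km(0.0, 0.0, lat_col, lng_col)
```

The second and third points are a quarter of the way around the globe from the origin, so both distances equal one quarter of the assumed circumference.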

diogenes.modify.modify.generate_bin(col, num_bins)

Generates a column of categories, where each category is a bin.

Parameters:
  • col (np.ndarray) – column to bin
  • num_bins (int) – number of bins
Returns:

Column of bin indices, one per element of col

Return type:

np.ndarray

Examples

>>> M = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
>>> generate_bin(M, 3)
[0 3 0 1 2 1 2]
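
The output above is consistent with equal-width binning between the column minimum and maximum, where the maximum value itself lands in bin num_bins. The sketch below reproduces the example under that assumption; equal_width_bins is a hypothetical helper, not the library's implementation, and it does not handle a constant column.

```python
import numpy as np

def equal_width_bins(col, num_bins):
    col = np.asarray(col, dtype=float)
    lo, hi = col.min(), col.max()
    # Scale values to [0, num_bins] and truncate to an integer bin index;
    # the maximum value ends up in bin num_bins.
    return ((col - lo) * num_bins / (hi - lo)).astype(int)

M = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
bins = equal_width_bins(M, 3)
```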
diogenes.modify.modify.label_encode(M, force_columns=[])

Changes string cols to ints so that there is a 1-1 mapping between strings and ints

Parameters:
  • M (numpy.ndarray) – structured array
  • force_columns (list of str) – By default, label_encode will only encode string columns. If the name of a numerical column is also present in force_columns, then that column will also be label encoded
Returns:

A tuple: the first element is a structured array with strings mapped to ints. The second element is a dictionary where keys are column names and values are arrays of the strings that belong to each class, as in: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Return type:

(numpy.ndarray, dict of str -> array)
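
The string-to-int mapping can be illustrated with numpy.unique, which returns the sorted class labels together with the integer code of each element. This is a sketch of the idea only; it assumes codes are assigned in sorted order (as sklearn's LabelEncoder does) and does not build the encoded structured array that label_encode returns.

```python
import numpy as np

M = np.array([('red', 1.0), ('green', 2.0), ('red', 3.0), ('blue', 4.0)],
             dtype=[('color', 'U5'), ('size', float)])

classes = {}        # column name -> array of class labels
encoded_cols = {}   # column name -> integer codes
for name in M.dtype.names:
    if M[name].dtype.kind in ('U', 'S'):  # string columns only
        classes[name], encoded_cols[name] = np.unique(
            M[name], return_inverse=True)

codes = encoded_cols['color']
decoded = classes['color'][codes]  # indexing the classes inverts the encoding
```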

diogenes.modify.modify.normalize(col, mean=None, stddev=None, return_fit=False)

Generate a normalized column.

Normalize both mean and std dev.

Parameters:
  • col (np.ndarray) – column to normalize
  • mean (float or None) – Mean to use for fit. If None, will use 0
  • stddev (float or None) – Standard deviation to use for fit
  • return_fit (boolean) – If True, returns tuple of fitted col, mean, and standard dev of fit. If False, only returns fitted col
Returns:

Return type:

np.ndarray or (np.array, float, float)
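
Passing mean and stddev explicitly makes it possible to fit on one column and apply the same transformation to another. The sketch below shows that pattern with the usual (x - mean) / stddev formula; standardize is a hypothetical helper that always takes explicit values, so it says nothing about how normalize treats None.

```python
import numpy as np

def standardize(col, mean, stddev):
    # Shift by the given mean and scale by the given standard deviation.
    return (col - mean) / stddev

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test = np.array([3.0, 6.0])

# Fit on the training column, then reuse the fit on the test column.
mean, stddev = train.mean(), train.std()
train_norm = standardize(train, mean, stddev)
test_norm = standardize(test, mean, stddev)
```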

diogenes.modify.modify.remove_cols_where(M, arguments)

Returns a structured array containing columns not adhering to a query

Parameters:
  • M (numpy.ndarray) – Structured array
  • arguments (list of dict) – See module documentation
Returns:

Structured array without specified columns

Return type:

numpy.ndarray

diogenes.modify.modify.remove_rows_where(M, arguments)

Returns a structured array containing rows not adhering to a query

Parameters:
  • M (numpy.ndarray) – Structured array
  • arguments (list of dict) – See module documentation
Returns:

Structured array without specified rows

Return type:

numpy.ndarray

diogenes.modify.modify.replace_missing_vals(M, strategy, missing_val=nan, constant=0)

Replace values signifying missing data with some substitute

Parameters:
  • M (numpy.ndarray) – structured array
  • strategy ({‘mean’, ‘median’, ‘most_frequent’, ‘constant’}) – method to use to replace missing data
  • missing_val (number) – value that M uses to represent missing data, e.g. numpy.nan for floats or -999 for integers
  • constant (int) – If the ‘constant’ strategy is chosen, this is the value to replace missing_val with
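
A minimal sketch of the replacement idea for a single numeric column, assuming the substitute is computed from the non-missing values only. The helper fill_missing is hypothetical, works on one column rather than a structured array, and leaves out the ‘most_frequent’ strategy for brevity.

```python
import numpy as np

def fill_missing(col, strategy='mean', missing_val=np.nan, constant=0):
    col = np.asarray(col, dtype=float)
    # NaN never compares equal to itself, so it needs np.isnan.
    if isinstance(missing_val, float) and np.isnan(missing_val):
        missing = np.isnan(col)
    else:
        missing = col == missing_val
    present = col[~missing]
    if strategy == 'mean':
        fill = present.mean()
    elif strategy == 'median':
        fill = np.median(present)
    elif strategy == 'constant':
        fill = constant
    else:
        raise ValueError('unsupported strategy: %s' % strategy)
    out = col.copy()      # leave the input untouched
    out[missing] = fill
    return out

by_mean = fill_missing(np.array([1.0, np.nan, 3.0, np.nan, 8.0]), 'mean')
by_const = fill_missing(np.array([4, -999, 6]), 'constant',
                        missing_val=-999, constant=0)
```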
diogenes.modify.modify.row_is_nan(M, col_name, boundary=None)

Picks rows for which cell is np.nan

To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).

Parameters:
  • M (numpy.ndarray) – structured array
  • col_name (str) – name of column to check
  • boundary (None) – unused
Returns:

boolean array: True if row picked, False if not.

Return type:

numpy.ndarray
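
The sketch below shows the kind of mask this check produces and how it could be used to keep or drop rows. It uses numpy.isnan directly on a float column and is an illustration only, not the library's code.

```python
import numpy as np

M = np.array([(1985, 1.5), (1995, np.nan), (2005, 2.5)],
             dtype=[('year', int), ('score', float)])

# NaN is not equal to itself, so M['score'] == np.nan is always False;
# np.isnan is the reliable test.
mask = np.isnan(M['score'])
with_nan = M[mask]       # rows a choose-style call would keep
without_nan = M[~mask]   # rows a remove-style call would keep
```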

diogenes.modify.modify.row_is_outlier(M, col_name, boundary=3.0)

Picks rows that are not within a given number of standard deviations of the mean

To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).

Parameters:
  • M (numpy.ndarray) – structured array
  • col_name (str) – name of column to check
  • boundary (float) – number of standard deviations from mean required to be considered an outlier
Returns:

boolean array: True if row picked, False if not.

Return type:

numpy.ndarray
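
One way to express this check for a single column is to compare each value's absolute deviation from the mean against boundary times the standard deviation. The helper is_outlier is hypothetical; the strict inequality and the use of the population standard deviation are assumptions made for this example.

```python
import numpy as np

def is_outlier(col, boundary=3.0):
    col = np.asarray(col, dtype=float)
    # True where the value lies more than boundary std devs from the mean.
    return np.abs(col - col.mean()) > boundary * col.std()

col = np.array([10.0] * 9 + [100.0])
mask = is_outlier(col, boundary=2.0)
```

In this column the mean is 19 and the standard deviation is 27, so only the value 100 lies more than two deviations from the mean.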

diogenes.modify.modify.row_is_within_region(M, col_names, boundary)

Picks rows for which cell is within a spatial region

To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).

Parameters:
  • M (numpy.ndarray) – structured array
  • col_names (list of str) – pair of column names signifying x and y coordinates
  • boundary (array of points) – shape which cell must be within
Returns:

boolean array: True if row picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.row_val_between(M, col_name, boundary)

Picks rows for which cell is between the specified values

To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).

Parameters:
  • M (numpy.ndarray) – structured array
  • col_name (str) – name of column to check
  • boundary ((number, number)) – To pick a row, the cell must be greater than or equal to boundary[0] and less than or equal to boundary[1]
Returns:

boolean array: True if row picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.row_val_eq(M, col_name, boundary)

Picks rows for which cell is equal to a specified value

To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).

Parameters:
  • M (numpy.ndarray) – structured array
  • col_name (str) – name of column to check
  • boundary (number or str) – value to which cell must be equal
Returns:

boolean array: True if row picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.row_val_gt(M, col_name, boundary)

Picks rows for which cell is greater than a specified value

To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).

Parameters:
  • M (numpy.ndarray) – structured array
  • col_name (str) – name of column to check
  • boundary (number) – value which cell must be greater than
Returns:

boolean array: True if row picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.row_val_lt(M, col_name, boundary)

Picks rows for which cell is less than a specified value

To be used as a ‘func’ argument in choose_rows_where, remove_rows_where, or where_all_are_true (see module documentation).

Parameters:
  • M (numpy.ndarray) – structured array
  • col_name (str) – name of column to check
  • boundary (number) – value which cell must be less than
Returns:

boolean array: True if row picked, False if not.

Return type:

numpy.ndarray

diogenes.modify.modify.where_all_are_true(M, arguments)

Returns a boolean array which specifies which rows pass a query

Parameters:
  • M (numpy.ndarray) – Structured array
  • arguments (list of dict) – See module documentation
Returns:

boolean array specifying which rows pass a query

Return type:

numpy.ndarray

Module contents