diogenes.modify package

Submodules

diogenes.modify.modify module

This module provides a number of operations to modify structured arrays.

A number of functions take a parameter called “arguments” of type list of dict. Diogenes expects these parameters to be expressed in the following format for functions operating on columns (choose_cols_where, remove_cols_where):

[{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1},
 {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2},
 ...
 {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N}]

and in this format for functions operating on rows (choose_rows_where, remove_rows_where, where_all_are_true):

[{'func': LAMBDA_1, 'vals': LAMBDA_ARGUMENTS_1, 'col_name': LAMBDA_COL_1},
 {'func': LAMBDA_2, 'vals': LAMBDA_ARGUMENTS_2, 'col_name': LAMBDA_COL_2},
 ...
 {'func': LAMBDA_N, 'vals': LAMBDA_ARGUMENTS_N, 'col_name': LAMBDA_COL_N}]

In either case, the user can think of arguments as a query to be matched against certain rows or columns. Some operation will then be performed on the matched rows or columns. For example, in choose_rows_where, an array will be returned that has only the rows that matched the query. In remove_cols_where, all columns that matched the query will be removed.

Each dictionary is a single directive. The value assigned to the ‘func’ key is a function that returns a boolean array signifying the rows or columns that pass a certain check. The value assigned to the ‘vals’ key is an argument to be passed to the function assigned to the ‘func’ key. For queries affecting rows, the value assigned to the ‘col_name’ key is the column over which the ‘func’ function should be applied. For example, to pick all rows for which the ‘year’ column is between 1990 and 2000, we would create the directive:

{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
 'col_name': 'year'}

To pick columns where every cell in the column is 0, we would create the directive:

{'func': diogenes.modify.col_val_eq, 'vals': 0}

Ultimately, diogenes will pick the columns or rows for which all directives in the passed list are True. For example, to pick rows for which the ‘year’ column is between 1990 and 2000, we use:

arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
            'col_name': 'year'}]

If we want to pick rows for which the ‘year’ column is between 1990 and 2000 and the ‘gender’ column is ‘F’ we use:

arguments=[{'func': diogenes.modify.row_val_between, 'vals': [1990, 2000],
            'col_name': 'year'},
           {'func': diogenes.modify.row_val_eq, 'vals': 'F',
            'col_name': 'gender'}]

Note that arguments must always be a list of dict, so even if there is only one directive it must be in a list.
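The way a list of row directives combines can be sketched in plain NumPy. The helpers below are hypothetical stand-ins written for illustration, not the diogenes implementations. They assume that each ‘func’ is called as func(M, col_name, vals) and returns one boolean per row, and that the "between" check treats both bounds as inclusive.

```python
import numpy as np

def between_sketch(M, col_name, boundary):
    # Hypothetical stand-in for a row 'func'; inclusive bounds are an assumption.
    low, high = boundary
    return (M[col_name] >= low) & (M[col_name] <= high)

def eq_sketch(M, col_name, boundary):
    # Hypothetical stand-in: True where the cell equals the given value.
    return M[col_name] == boundary

def all_directives_true(M, arguments):
    # Start with every row selected, then AND in the mask of each directive.
    mask = np.ones(M.shape[0], dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['col_name'], directive['vals'])
    return mask

M = np.array([(1985, 'F'), (1995, 'F'), (1999, 'M'), (2000, 'F')],
             dtype=[('year', int), ('gender', 'U1')])
arguments = [{'func': between_sketch, 'vals': [1990, 2000], 'col_name': 'year'},
             {'func': eq_sketch, 'vals': 'F', 'col_name': 'gender'}]

mask = all_directives_true(M, arguments)
selected = M[mask]  # rows where every directive is True
```

Only the rows with years 1995 and 2000 and gender 'F' survive, because a row must pass every directive in the list.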

diogenes.modify.modify.choose_cols_where(M, arguments)

Returns a structured array containing only columns adhering to a query

Parameters:	M (numpy.ndarray) – Structured array
	arguments (list of dict) – See module documentation
Returns:	Structured array with only specified columns
Return type:	numpy.ndarray
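A column query can be sketched the same way. The code below is an illustrative approximation, not the diogenes source: col_all_equal_sketch and choose_cols_sketch are hypothetical names, and the sketch assumes a column ‘func’ is called as func(M, vals) and returns one boolean per column.

```python
import numpy as np

def col_all_equal_sketch(M, boundary):
    # Hypothetical column 'func': one boolean per column, True when
    # every cell in that column equals boundary.
    return np.array([np.all(M[name] == boundary) for name in M.dtype.names])

def choose_cols_sketch(M, arguments):
    # AND the per-column masks of all directives, then keep matching columns.
    mask = np.ones(len(M.dtype.names), dtype=bool)
    for directive in arguments:
        mask &= directive['func'](M, directive['vals'])
    keep = [name for name, picked in zip(M.dtype.names, mask) if picked]
    return M[keep]

M = np.array([(0, 1, 0), (0, 2, 0), (0, 3, 0)],
             dtype=[('a', int), ('b', int), ('c', int)])
only_zero_cols = choose_cols_sketch(
    M, [{'func': col_all_equal_sketch, 'vals': 0}])
```

Here columns 'a' and 'c' contain only zeros, so they are the ones kept.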

diogenes.modify.modify.choose_rows_where(M, arguments)

Returns a structured array containing only rows adhering to a query

Parameters:	M (numpy.ndarray) – Structured array
	arguments (list of dict) – See module documentation
Returns:	Structured array with only specified rows
Return type:	numpy.ndarray

diogenes.modify.modify.col_fewer_than_n_nonzero(M, boundary=2)

Pick columns that have fewer than a specified number of nonzeros

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:	M (numpy.ndarray) – structured array
	boundary (int) – If the column has boundary or more nonzeros, it will not be picked
Returns:	boolean array: True if column picked, False if not.
Return type:	numpy.ndarray
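The rule described above can be approximated as follows. This is a sketch under the stated semantics, with a hypothetical function name, rather than the library's own code.

```python
import numpy as np

def fewer_than_n_nonzero_sketch(M, boundary=2):
    # One boolean per column: True when the column holds fewer than
    # `boundary` nonzero cells.
    return np.array([np.count_nonzero(M[name]) < boundary
                     for name in M.dtype.names])

M = np.array([(0, 1, 5), (0, 0, 6), (3, 0, 7)],
             dtype=[('a', int), ('b', int), ('c', int)])
picked = fewer_than_n_nonzero_sketch(M, boundary=2)
```

Columns 'a' and 'b' each have a single nonzero cell and are picked; column 'c' has three and is not.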

diogenes.modify.modify.col_has_lt_threshold_unique_values(M, threshold)

Pick columns that have fewer than a specified number of unique values

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:	M (numpy.ndarray) – structured array
	threshold (int) – If the column has threshold or more unique values, it will not be picked
Returns:	boolean array: True if column picked, False if not.
Return type:	numpy.ndarray
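A minimal sketch of the unique-value check, assuming it simply counts distinct cell values per column. The function name is hypothetical and the code is not taken from diogenes.

```python
import numpy as np

def lt_threshold_unique_sketch(M, threshold):
    # One boolean per column: True when the number of distinct values
    # in the column is strictly below threshold.
    return np.array([np.unique(M[name]).size < threshold
                     for name in M.dtype.names])

M = np.array([(1, 'x', 0.5), (1, 'y', 1.5), (2, 'x', 2.5)],
             dtype=[('a', int), ('b', 'U1'), ('c', float)])
picked = lt_threshold_unique_sketch(M, 3)
```

Such a check is one way to flag low-cardinality columns, for example before treating them as categorical.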

diogenes.modify.modify.col_random(M, boundary)

Pick random columns

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:	M (numpy.ndarray) – structured array
	boundary (int) – number of columns to pick
Returns:	boolean array: True if column picked, False if not.
Return type:	numpy.ndarray
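Random column selection could look roughly like this. The sketch assumes columns are drawn without replacement; the rng parameter is a hypothetical addition for reproducibility and is not part of the documented signature.

```python
import numpy as np

def random_cols_sketch(M, boundary, rng=None):
    # rng is a hypothetical extra parameter; the documented function
    # only takes M and boundary.
    rng = np.random.default_rng() if rng is None else rng
    n_cols = len(M.dtype.names)
    mask = np.zeros(n_cols, dtype=bool)
    # Mark `boundary` distinct column positions as picked.
    mask[rng.choice(n_cols, size=boundary, replace=False)] = True
    return mask

M = np.zeros(3, dtype=[('a', int), ('b', int), ('c', int), ('d', int)])
picked = random_cols_sketch(M, 2, rng=np.random.default_rng(0))
```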

diogenes.modify.modify.col_val_eq(M, boundary)

Pick columns where every cell equals specified value

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:	M (numpy.ndarray) – structured array
	boundary (number or str or bool) – if every cell==boundary, the column will be picked
Returns:	boolean array: True if column picked, False if not.
Return type:	numpy.ndarray

diogenes.modify.modify.col_val_eq_any(M, boundary=None)

Pick columns for which every cell has the same value

To be used as a ‘func’ argument in choose_cols_where or remove_cols_where (see module documentation)

Parameters:	M (numpy.ndarray) – structured array
	boundary (None) – ignored
Returns:	boolean array: True if column picked, False if not.
Return type:	numpy.ndarray
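A constant-column check can be sketched by comparing every cell with the first cell of its column. The function name is hypothetical, and the sketch assumes M has at least one row.

```python
import numpy as np

def all_cells_same_sketch(M, boundary=None):
    # boundary is accepted for signature compatibility and ignored.
    # Assumes M has at least one row, since it reads the first cell.
    return np.array([np.all(M[name] == M[name][0])
                     for name in M.dtype.names])

M = np.array([(7, 'x'), (7, 'y'), (7, 'x')],
             dtype=[('a', int), ('b', 'U1')])
picked = all_cells_same_sketch(M)
```

Passed as the ‘func’ of a remove_cols_where directive, a check like this would drop columns that carry no information.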

diogenes.modify.modify.combine_cols(M, lambd, col_names)

Return an array that is a function of existing columns

Parameters:	lambd (list of np.array -> np.array) – Function that takes a list of columns and produces a single column.
	col_names (list of str) – Names of columns to combine
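The contract of lambd can be illustrated with a small stand-in. combine_cols_sketch is a hypothetical name, and the sketch assumes the named columns are passed to lambd as a list, in the order given by col_names.

```python
import numpy as np

def combine_cols_sketch(M, lambd, col_names):
    # Collect the named columns, in order, and hand them to lambd as a list.
    return lambd([M[name] for name in col_names])

M = np.array([(1.0, 2.0), (3.0, 4.0)],
             dtype=[('width', float), ('height', float)])
# lambd receives [width_column, height_column] and returns one column.
area = combine_cols_sketch(M, lambda cols: cols[0] * cols[1],
                           ['width', 'height'])
```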

diogenes.modify.modify.distance_from_point(lat_origin, lng_origin, lat_col, lng_col)

Generates a column of how far each record is from the origin

Parameters:	lat_origin (number) –
	lng_origin (number) –
	lat_col (np.ndarray) –
	lng_col (np.ndarray) –
Returns:
Return type:	np.ndarray
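The formula and the units of the result are not stated here. As one plausible reading, the sketch below computes great-circle distance with the haversine formula, assuming inputs in degrees and output in kilometres. The function name, the Earth radius constant, and the choice of haversine are all assumptions, not the documented behaviour.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km_sketch(lat_origin, lng_origin, lat_col, lng_col):
    # Convert degrees to radians before applying the haversine formula.
    lat1, lng1 = np.radians(lat_origin), np.radians(lng_origin)
    lat2, lng2 = np.radians(lat_col), np.radians(lng_col)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

lat_col = np.array([0.0, 0.0, 90.0])
lng_col = np.array([0.0, 90.0, 0.0])
dist = haversine_km_sketch(0.0, 0.0, lat_col, lng_col)
```

The first record sits at the origin, and the other two are a quarter of the way around the globe from it.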

diogenes.modify.modify.generate_bin(col, num_bins)

Generates a column of categories, where each category is a bin.

Parameters:	col (np.ndarray) –
Returns:
Return type:	np.ndarray

Examples

>>> import numpy as np
>>> from diogenes.modify.modify import generate_bin
>>> M = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
>>> generate_bin(M, 3)
[0 3 0 1 2 1 2]
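One binning scheme that reproduces the output above uses equally spaced edges between the column minimum and maximum. This is a sketch of a scheme consistent with the example, with a hypothetical function name; it is not confirmed to be how generate_bin is implemented.

```python
import numpy as np

def generate_bin_sketch(col, num_bins):
    # num_bins + 1 equally spaced edges from the column minimum to its maximum.
    edges = np.linspace(col.min(), col.max(), num_bins + 1)
    # digitize returns 1-based indices, so subtract 1 to start at 0.
    # The maximum value sits on the last edge and gets index num_bins.
    return np.digitize(col, edges) - 1

col = np.array([0.1, 3.0, 0.0, 1.2, 2.5, 1.7, 2])
bins = generate_bin_sketch(col, 3)
```

Under this scheme the maximum value falls into its own category, which is why the example shows the label 3 even though num_bins is 3.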

diogenes.modify.modify.label_encode(M, force_columns=[])

Changes string cols to ints so that there is a 1-1 mapping between strings and ints

Parameters:	M (numpy.ndarray) – structured array
	force_columns (list of str) – By default, label_encode will only encode string columns. If the name of a numerical column is also present in force_columns, then that column will also be label encoded
Returns:	A tuple: the first element is a structured array with strings mapped to ints. The second element is a dictionary where keys are column names and values are arrays of the strings that belong to each class, as in: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Return type:	(numpy.ndarray, dict of str -> array)
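For a single column, the string-to-int mapping can be sketched with np.unique. The function name is hypothetical, and the sketch assumes classes are numbered in sorted order, as scikit-learn's LabelEncoder does; the real function works on a whole structured array.

```python
import numpy as np

def label_encode_col_sketch(col):
    # classes: sorted distinct values; codes: index of each cell in classes.
    classes, codes = np.unique(col, return_inverse=True)
    return codes, classes

col = np.array(['red', 'blue', 'red', 'green'])
codes, classes = label_encode_col_sketch(col)
decoded = classes[codes]  # the class array maps ints back to strings
```

Keeping the class array alongside the codes is what makes the mapping reversible.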

diogenes.modify.modify.normalize(col, mean=None, stddev=None, return_fit=False)

Generate a normalized column.

Normalize both mean and std dev.

Parameters:	col (np.ndarray) –
	mean (float or None) – Mean to use for fit. If None, will use 0
	stddev (float or None) –
	return_fit (boolean) – If True, returns tuple of fitted col, mean, and standard dev of fit. If False, only returns fitted col
Returns:
Return type:	np.ndarray or (np.array, float, float)

diogenes.modify.modify.remove_cols_where(M, arguments)

Returns a structured array containing columns not adhering to a query

Parameters:	M (numpy.ndarray) – Structured array
	arguments (list of dict) – See module documentation
Returns:	Structured array without specified columns
Return type:	numpy.ndarray

diogenes.modify.modify.remove_rows_where(M, arguments)

Returns a structured array containing rows not adhering to a query

Parameters:	M (numpy.ndarray) – Structured array
	arguments (list of dict) – See module documentation
Returns:	Structured array without specified rows
Return type:	numpy.ndarray
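Choosing and removing rows are complementary: for the same query, one keeps the matched rows and the other keeps the rest. The plain NumPy sketch below illustrates this with a hand-built mask; the inclusive bounds are an assumption.

```python
import numpy as np

M = np.array([(1985, 'F'), (1995, 'M'), (2005, 'F')],
             dtype=[('year', int), ('gender', 'U1')])

# Rows matching "year between 1990 and 2000" (inclusive by assumption).
mask = (M['year'] >= 1990) & (M['year'] <= 2000)

chosen = M[mask]      # what a choose-style operation keeps
remaining = M[~mask]  # what a remove-style operation keeps
```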

diogenes.modify.modify.replace_missing_vals(M, strategy, missing_val=nan, constant=0)

Replace values signifying missing data with some substitute

Parameters:	M (numpy.ndarray) – structured array
	strategy ({‘mean’, ‘median’, ‘most_frequent’, ‘constant’}) – method to use to replace missing data
	missing_val – value that M uses to represent missing data, e.g. numpy.nan for floats or -999 for integers
	constant (int) – If the ‘constant’ strategy is chosen, this is the value to replace missing_val with
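The four strategies can be sketched for a single column as follows. This is an illustrative approximation with a hypothetical function name; it assumes the fill value is computed from the non-missing cells only, and the real function operates on a structured array rather than one column.

```python
import numpy as np

def replace_missing_sketch(col, strategy, missing_val=np.nan, constant=0):
    # Locate missing cells; NaN needs isnan because nan != nan.
    if isinstance(missing_val, float) and np.isnan(missing_val):
        missing = np.isnan(col)
    else:
        missing = col == missing_val
    present = col[~missing]
    if strategy == 'mean':
        fill = present.mean()
    elif strategy == 'median':
        fill = np.median(present)
    elif strategy == 'most_frequent':
        values, counts = np.unique(present, return_counts=True)
        fill = values[np.argmax(counts)]
    elif strategy == 'constant':
        fill = constant
    else:
        raise ValueError('unknown strategy: %r' % (strategy,))
    result = col.copy()  # leave the input untouched
    result[missing] = fill
    return result

col = np.array([1.0, np.nan, 3.0, 3.0])
filled_mean = replace_missing_sketch(col, 'mean')
filled_mode = replace_missing_sketch(col, 'most_frequent')
filled_const = replace_missing_sketch(col, 'constant', constant=-1)

ints = np.array([4, -999, 6])
filled_ints = replace_missing_sketch(ints, 'median', missing_val=-999)
```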

diogenes.modify.modify.row_is_nan(M, col_name, boundary=None)

Picks rows for which cell is np.nan