diogenes package¶
Subpackages¶
Submodules¶
diogenes.array_emitter module¶
-
class diogenes.array_emitter.ArrayEmitter(convert_to_unix_time=False)¶
Bases: object
ArrayEmitter is a tool that accepts tables from either SQL or CSVs in the RG format, then generates NumPy structured arrays in the M format based on selection criteria applied to those tables.
RG Tables
Tables can be specified from either a CSV file (using the get_rg_from_csv method) or from a SQL query (using the get_rg_from_sql method). Imported tables must adhere to the RG format:
Table 1 – an example RG-format table

student_id  start_year  end_year  feature      value
0           2005        2006      math_gpa     2.3
0           2005        2006      english_gpa  4.0
0           2005        2006      absences     7
0           2006        2007      math_gpa     2.1
0           2006        2007      english_gpa  3.9
0           2006        2007      absences     8
1           2005        2006      math_gpa     3.4
1           2005        2006      absences     0
1           2006        2007      math_gpa     3.5
1           2007        2008      english_gpa  2.4
2           2004        2005      math_gpa     2.4
2           2005        2006      math_gpa     3.4
2           2005        2006      absences     14
2           2006        2007      absences     96

In an RG-formatted table, there are five columns:
- The unique identifier of a unit. By “unit,” we mean unit in a statistical sense, where a population consists of a number of units. In Table 1, a unit is a student, and each student is uniquely identified by a value that appears in the student_id column. Table 1 defines data for students 0, 1, and 2.
- The time at which a certain record begins to be applicable. In Table 1, start_year is this start time.
- The time at which a certain record ceases to be applicable. In Table 1, end_year is this stop time.
- The name of a feature applicable to that unit at that time. In Table 1, this is “feature”.
- The value of the feature for that unit at that time. In Table 1, this is “value”.
The values in the first column uniquely identify each unit, but there can be more than one row in the table per unit. These tables give us information in the form: “For unit u, from time t1 to time t2, feature f had value x.”
In Table 1, the values of the student_id column each correspond to one student. Each student may have multiple rows on this table corresponding to multiple features at multiple times. For example, during 2005-2006, student 0 had a math_gpa of 2.3 and an english_gpa of 4.0. During 2006-2007, student 0’s math_gpa dropped to 2.1, while his or her english_gpa dropped to 3.9.
If a record does not have a time frame, but can be considered to last “forever” (somebody’s name, for example), then the start time and end time columns can be left NULL. These records will appear in all time intervals.
If a record only has one associated time (for example, the time that a parking ticket was issued), then either start time or stop time can be left NULL, and the other can be filled in.
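To make the RG format concrete, here is a small slice of Table 1 expressed as a NumPy structured array. This is an independent sketch for illustration; diogenes builds equivalent tables internally from CSV or SQL, and the dtype names below simply mirror Table 1’s columns.

```python
import numpy as np

# A few rows of Table 1 as a NumPy structured array (illustrative only)
rg = np.array(
    [(0, 2005, 2006, 'math_gpa', 2.3),
     (0, 2005, 2006, 'english_gpa', 4.0),
     (0, 2006, 2007, 'math_gpa', 2.1)],
    dtype=[('student_id', int), ('start_year', int), ('end_year', int),
           ('feature', 'U16'), ('value', float)])

# Each row reads: "for unit u, from time t1 to t2, feature f had value x"
for unit, t1, t2, feat, val in rg:
    print('student {}: {}-{} {} = {}'.format(unit, t1, t2, feat, val))
```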
M Tables
ArrayEmitter generates M formatted tables based on RG formatted tables. For example, the RG-formatted table Table 1 might result in the following M-formatted table:
Table 2
student_id  math_gpa_AVG  english_gpa_AVG  absences_MAX
0           2.2           3.95             8
1           3.45          nan              0
2           3.4           nan              96

In an M-formatted table, each unit has a single row, and each feature has its own column. Notice that the student_ids in Table 2 correspond to the student_ids in Table 1, and the names of the columns in Table 2 correspond to the entries in the “feature” column of Table 1. The process used to determine the values in these columns is elucidated below.
Converting an RG-formatted table to an M-formatted table.
In order to decide what values appear in our M-formatted table, we:
- Optionally select aggregation methods with set_aggregation and set_default_aggregation
- Select a timeframe with emit_M
When creating the M table, we first take only the entries in the RG table that fall within the timeframe specified in emit_M, then we aggregate those entries using the user-specified aggregation method. If an aggregation method is not specified, ArrayEmitter will take the mean. For example, if we have Table 1 stored in table1.csv and run the following:
>>> ae = ArrayEmitter()
>>> ae = ae.get_rg_from_csv('table1.csv')
>>> ae = ae.set_aggregation('math_gpa', 'AVG')
>>> ae = ae.set_aggregation('absences', 'MAX')
>>> ae = ae.set_interval(2005, 2006)
>>> table2 = ae.emit_M()
we end up with Table 2
Notice that math_gpa and english_gpa are the average for 2005 and 2006 per student, while absences is the max over 2005 and 2006. Also notice that english_gpa for student 1 is nan, since the only english_gpa for student 1 is from 2007, which is outside of our range. For student 2, english_gpa is nan because student 2 has no entries in the table for english_gpa.
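The filter-then-aggregate procedure just described can be sketched in plain Python. This illustrates the semantics only; diogenes itself runs the aggregation in SQL, and `emit_m` below is our own hypothetical helper.

```python
import numpy as np

# RG rows as (unit, start, end, feature, value) tuples, as in Table 1
rg = [(0, 2005, 2006, 'math_gpa', 2.3), (0, 2006, 2007, 'math_gpa', 2.1),
      (0, 2005, 2006, 'absences', 7),   (0, 2006, 2007, 'absences', 8)]

def emit_m(rg, start, stop, aggr):
    """aggr maps feature name -> aggregation function; default is mean."""
    # keep only entries whose time frame falls within [start, stop]
    in_range = [r for r in rg if start <= r[1] and r[2] <= stop]
    grouped = {}
    for unit, _, _, feat, val in in_range:
        grouped.setdefault((unit, feat), []).append(val)
    return {key: aggr.get(key[1], np.mean)(vals)
            for key, vals in grouped.items()}

m = emit_m(rg, 2005, 2007, {'absences': max})
# math_gpa is averaged over the interval; absences takes the max
```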
Taking subsets of units
In addition to taking subsets of items in RG tables, we might also want to take subsets of units (i.e. rows in M-format tables) according to some parameter. For example, we might want to consider only students with a math_gpa at or below 3.4. In order to subset units, we use the select_rows_in_M function. For example:
>>> ae = ArrayEmitter()
>>> ae = ae.get_rg_from_csv('table1.csv')
>>> ae = ae.set_aggregation('math_gpa', 'AVG')
>>> ae = ae.set_aggregation('absences', 'MAX')
>>> ae = ae.select_rows_in_M('math_gpa_AVG <= 3.4')
>>> ae = ae.set_interval(2005, 2006)
>>> table3 = ae.emit_M()
Gives us Table 3:

student_id  math_gpa_AVG  english_gpa_AVG  absences_MAX
0           2.2           3.95             8
2           3.4           nan              96

Notice that Table 3 is identical to Table 2, except that student 1 has been omitted because his/her math_gpa_AVG is higher than 3.4.
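The effect of select_rows_in_M can be pictured as a boolean mask over an M-formatted structured array. This is a sketch of the semantics in plain NumPy; diogenes actually applies the condition as a SQL WHERE clause.

```python
import numpy as np

# The first two columns of Table 2 as a structured array
m = np.array([(0, 2.2), (1, 3.45), (2, 3.4)],
             dtype=[('student_id', int), ('math_gpa_AVG', float)])

# Keep only units satisfying the constraint math_gpa_AVG <= 3.4
subset = m[m['math_gpa_AVG'] <= 3.4]
```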
Taking labels and features from different time intervals
If you need to take labels and the rest of your features from different time intervals, set the label column with set_label_feature and set the label interval with set_label_interval.
Note on function call semantics
Most methods of ArrayEmitter return new ArrayEmitters rather than modifying the existing ArrayEmitter.
Parameters: convert_to_unix_time (boolean) – Iff True, user queries in set_interval will be translated from datetimes to unix time (seconds since the Epoch). The user may wish to set this variable if the database stores times in unix time
-
emit_M
()¶ Creates a structured array in M-format
Returns: Numpy structured array constructed using the specified queries and subsets Return type: np.ndarray
-
get_query
()¶ Returns SQL query that will be used to create the M-formatted table
-
get_rg_from_csv
(csv_file_path, parse_datetimes=[], unit_id_col=None, start_time_col=None, stop_time_col=None, feature_col=None, val_col=None)¶ Get an RG-formatted table from a CSV file.
Parameters: - csv_file_path (str) – Path of the csv file to import table from
- parse_datetimes (list of col names) – Columns that should be interpreted as datetimes
- unit_id_col (str or None) – The name of the column containing unique unit IDs. For example, in Table 1, this is ‘student_id’. If None, ArrayEmitter will pick the first otherwise unspecified column
- start_time_col (str or None) – The name of the column containing start time. In Table 1, this is ‘start_year’. If None, ArrayEmitter will pick the second otherwise unspecified column.
- end_time_col (str or None) – The name of the column containing the stop time. In Table 1, this is ‘end_year’. If None, ArrayEmitter will pick the third otherwise unspecified column.
- feature_col (str or None) – The name of the column containing the feature name. In Table 1, this is ‘feature’. If None, ArrayEmitter will pick the fourth otherwise unspecified column.
- val_col (str or None) – The name of the column containing the value for the given feature for the given user at the given time. In Table 1, this is ‘value’. If None, ArrayEmitter will pick the fifth otherwise unspecified column.
Returns: Copy of this ArrayGenerator which has rg_table specified
Return type: ArrayGenerator
Examples
>>> ae = ArrayEmitter()
>>> ae = ae.get_rg_from_csv('table_1.csv')
-
get_rg_from_sql
(conn_str, table_name, unit_id_col=None, start_time_col=None, stop_time_col=None, feature_col=None, val_col=None)¶ Gets an RG-formatted table from a SQL database
Parameters: - conn_str (str) – SQLAlchemy connection string to connect to the database and run the query.
- table_name (str) – The name of the RG-formatted table in the database
- unit_id_col (str or None) – The name of the column containing unique unit IDs. For example, in Table 1, this is ‘student_id’. If None, ArrayEmitter will pick the first otherwise unspecified column
- start_time_col (str or None) – The name of the column containing start time. In Table 1, this is ‘start_year’. If None, ArrayEmitter will pick the second otherwise unspecified column.
- end_time_col (str or None) – The name of the column containing the stop time. In Table 1, this is ‘end_year’. If None, ArrayEmitter will pick the third otherwise unspecified column.
- feature_col (str or None) – The name of the column containing the feature name. In Table 1, this is ‘feature’. If None, ArrayEmitter will pick the fourth otherwise unspecified column.
- val_col (str or None) – The name of the column containing the value for the given feature for the given user at the given time. In Table 1, this is ‘value’. If None, ArrayEmitter will pick the fifth otherwise unspecified column.
Returns: Copy of this ArrayGenerator which has rg_table specified
Return type: ArrayGenerator
Examples
>>> conn_str = ...
>>> ae = ArrayEmitter()
>>> ae = ae.get_rg_from_sql(conn_str, 'table_1')
-
select_rows_in_M
(where)¶ Specifies a subset of the units to be returned in the M-table according to some constraint.
Parameters: where (str) – A statement required to be true about the returned table, using at least one column name, constant values, parentheses and the operators: =, !=, <, >, <=, >=, AND, OR, NOT, and other things that can appear in a SQL WHERE statement
Returns: A copy of the current ArrayGenerator with the additional where condition added
Return type: ArrayGenerator
Examples
>>> ae = ArrayEmitter()
>>> ... # Populate ae with Table 1 and Table 2
>>> ae = ae.set_aggregation('math_gpa', 'mean')
>>> ae = ae.set_aggregation('absences', 'max')
>>> ae = ae.select_rows_in_M('grad_year == 2007')
>>> ae = ae.set_interval(2005, 2006)
>>> sa = ae.emit_M()
-
set_aggregation
(feature_name, method)¶ Sets the method or methods used to aggregate across dates in the RG table.
Parameters: - feature_name (str) – Name of feature for which we are aggregating
- method (str or list of strs) –
Method or methods used to aggregate the feature across year. If a str, can be one of:
- ‘AVG’
- Mean average
- ‘COUNT’
- Number of results
- ‘MAX’
- Largest result
- ‘MIN’
- Smallest result
- ‘SUM’
- Sum of results
Additionally, method can be any aggregation function supported by the database in which the RG table lives.
If a list, will create one aggregate column for each method in the list, for example: [‘AVG’, ‘MIN’, ‘MAX’]
Returns: Copy of this ArrayGenerator with aggregation set
Return type: ArrayGenerator
Examples
>>> ae = ArrayEmitter()
>>> ... # Populate ae with Table 1 and Table 2
>>> ae = ae.set_aggregation('math_gpa', 'mean')
>>> ae = ae.set_aggregation('absences', 'max')
>>> ae = ae.set_interval(2005, 2006)
>>> sa = ae.emit_M()
-
set_default_aggregation
(method)¶ Sets the default method used to aggregate across dates
ArrayEmitter will use the value of set_default_aggregation when a method has not been set for a given feature using the set_aggregation method.
When set_default_aggregation has not been called, the default aggregation method is ‘AVG’
Parameters: method (str) – Method used to aggregate features across year. Can be one of:
- ‘AVG’
- Mean average
- ‘COUNT’
- Number of results
- ‘MAX’
- Largest result
- ‘MIN’
- Smallest result
- ‘SUM’
- Sum of results
Additionally, method can be any aggregation function supported by the database in which the RG table lives.
Returns: Copy of this ArrayGenerator with default aggregation set Return type: ArrayGenerator
-
set_interval
(start_time, stop_time)¶ Sets interval used to create M-formatted table
Start times and stop times are inclusive
Parameters: - start_time (number or datetime.datetime) – Start time of log tables to include in this sa
- stop_time (number or datetime.datetime) – Stop time of log tables to include in this sa
Returns: Copy of this ArrayEmitter with interval set
Return type: ArrayEmitter
-
set_label_feature
(feature_name)¶ Sets the feature in the array which will be considered the label
Returns: Copy of this ArrayGenerator with specified label column Return type: ArrayGenerator
-
set_label_interval
(start_time, stop_time)¶ Sets interval from which to select labels
Parameters: - start_time (number or datetime.datetime) – Start time of log tables to include in this sa’s labels
- stop_time (number or datetime.datetime) – Stop time of log tables to include in this sa’s labels
Returns: Copy of this ArrayEmitter with label interval set
Return type: ArrayEmitter
-
subset_over
(label_col, interval_train_window_start, interval_train_window_end, interval_test_window_start, interval_test_window_end, interval_inc_value, label_col_aggr_of_interest='AVG', interval_expanding=False, label_interval_train_window_start=None, label_interval_train_window_end=None, label_interval_test_window_start=None, label_interval_test_window_end=None, label_interval_inc_value=None, label_interval_expanding=False, row_M_col_name=None, row_M_col_aggr_of_interest='AVG', row_M_train_window_start=None, row_M_train_window_end=None, row_M_test_window_start=None, row_M_test_window_end=None, row_M_inc_value=None, row_M_expanding=False, clfs=[{'clf': <class 'sklearn.ensemble.forest.RandomForestClassifier'>}], feature_gen_lambda=None)¶ Generates ArrayGenerators according to some subsetting directive.
There are three ways that we determine what the train and test sets are for each trial:
- The start time/stop time interval. This is the interval used to create features in the M-formatted matrix. Setting the start time/stop time of this interval is equivalent to passing values to set_interval. Variables pertaining to this interval have the interval* prefix.
- The start time/stop time interval for labels. If these values are set, then time intervals for the label are different than the time intervals for the other features. Variables pertaining to this interval have the label_interval* prefix.
- The rows of the M matrix to select, based on the value of some column in the M matrix. Setting the start and end of this interval is equivalent to passing values to select_rows_in_M. Values pertaining to this set of rows have the row_M* prefix. Taking subsets over rows of M is optional, and it will only occur if row_M_col_name is not None.
Parameters: - label_col (str) – The name of the column containing labels
- interval_train_window_start (number or datetime) – start of training interval
- interval_train_window_end (number or datetime) – (Initial) end of training interval
- interval_test_window_start (number or datetime) – start of testing interval
- interval_test_window_end (number or datetime) – End of testing interval
- interval_inc_value (datetime, timedelta, or number) – interval to increment train and test interval
- label_col_aggr_of_interest (str) – The type of aggregation which will signify the label (for example, use ‘AVG’ if the label is the ‘AVG’ of the label column in the M-formatted matrix)
- interval_expanding (boolean) – whether or not the training interval is expanding
- label_interval_train_window_start (number or datetime or None) – start of training interval for labels
- label_interval_train_window_end (number or datetime or None) – (Initial) end of training interval for labels
- label_interval_test_window_start (number or datetime or None) – start of testing interval for labels
- label_interval_test_window_end (number or datetime or None) – End of testing interval for labels
- label_interval_inc_value (datetime, timedelta, or number or None) – interval to increment train and test interval for labels
- label_interval_expanding (boolean) – whether or not the training interval for labels is expanding
- row_M_col_name (str or None) –
If not None, the name of the feature which will be used to select different training and testing sets in addition to the interval
If None, train and testing sets will use all rows given a particular time interval
- row_M_col_aggr_of_interest (str) – The name of the aggregation used to subset rows of M. (For example, use ‘AVG’ if we want to select rows based on the average of the values in the interval)
- row_M_train_window_start (? or None) – Start of train window for M rows. If None, uses interval_train_window_start
- row_M_train_window_end (? or None) – (Initial) end of train window for M rows. If None, uses interval_train_window_end
- row_M_test_window_start (? or None) – Start of test window for M rows. If None, uses interval_test_window_start
- row_M_test_window_end (? or None) – End of test window for M rows. If None, uses interval_test_window_end
- row_M_inc_value (? or None) – interval to increment train and test window for M rows. If None, uses interval_inc_value
- row_M_expanding (bool) – whether or not the training window for M rows is expanding
- clfs (list of dict) – classifiers and parameters to run with each train/test set. See documentation for diogenes.grid_search.experiment.Experiment.
- feature_gen_lambda ((np.ndarray, str, ?, ?, ?, ?) -> np.ndarray or None) –
If not None, a function to be applied to generated arrays before they are fit to classifiers. Must be a function of signature:
f(M, test_or_train, interval_start, interval_end, row_M_start, row_M_end)
Where:
- M is the generated array
- test_or_train is ‘test’ if this is a test set or ‘train’ if it’s a train set
- interval_start and interval_end define the interval
- row_M_start and row_M_end define the rows of M that are included
Returns: Experiment collecting train/test sets that have been run
Return type: diogenes.grid_search.experiment.Experiment
-
diogenes.array_emitter.
M_to_rg
(conn_str, from_table, to_table, unit_id_col, start_time_col=None, stop_time_col=None, feature_cols=None)¶ Convert a table in M-format to a table in RG-format
Data in the M-formatted from_table is appended to the RG-formatted to_table. Consequently, the user can run M_to_rg multiple times to convert multiple M-formatted tables to the same RG-formatted table
For character or text columns in from_table, entries will be encoded to integers when they are dumped into the RG-formatted table. For each of these columns, M_to_rg will create a table with a name of the form: [to_table]_encode_[from_table]_[col_name]
Parameters: - conn_str (str) – SQLAlchemy connection string to access database
- from_table (str) – Name of M-formatted table to convert to RG-formatted table
- to_table (str) –
Name of RG-formatted table that data will be inserted into. This table must either: 1. Not yet exist in the database or 2. Adhere to the schema:
(row_id SERIAL PRIMARY KEY, unit_id INT, start_time TIMESTAMP, end_time TIMESTAMP, feat TEXT, val REAL);
- unit_id_col (str) – The column in from_table which contains the unique unit ID
- start_time_col (str or None) – The column in from_table which contains the start time. If None, start times will be left NULL in the RG-formatted array
- stop_time_col (str or None) – The column in from_table which contains the stop time. If None, stop times will be left NULL in the RG-formatted array
- feature_cols (list of str or None) – The columns of from_table which will be inserted into to_table in RG-format. If None, will use all columns in from_table except for unit_id_col, start_time_col and stop_time_col
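Conceptually, M_to_rg is a “melt”: each feature column of a wide M-format row becomes its own (unit, start time, stop time, feature, value) record. The sketch below shows that transformation in Python rather than the SQL that M_to_rg actually runs; `melt_row` is a hypothetical helper, not part of diogenes.

```python
# Turn one wide M-format row into RG records. start/stop are None here,
# mirroring the NULL times used when start_time_col/stop_time_col are None.
def melt_row(row, unit_id_col, feature_cols):
    return [(row[unit_id_col], None, None, feat, row[feat])
            for feat in feature_cols]

records = melt_row({'student_id': 0, 'math_gpa': 2.2, 'absences': 8},
                   'student_id', ['math_gpa', 'absences'])
```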
diogenes.utils module¶
-
diogenes.utils.
append_cols
(M, cols, col_names)¶ Append columns to an existing structured array
Parameters: - M (numpy.ndarray) – structured array
- cols (list of numpy.ndarray) –
- col_names (list of str) – names for new columns
Returns: structured array with new columns
Return type: numpy.ndarray
-
diogenes.utils.
cast_list_of_list_to_sa
(L, col_names=None)¶ Transforms a list of lists to a numpy structured array
Parameters: - L (list of lists) – Signifies a table. Each inner list should have the same length
- col_names (list of str or None) – Names for columns. If unspecified, names will be arbitrarily chosen
Returns: Structured array
Return type: numpy.ndarray
-
diogenes.utils.
cast_np_sa_to_nd
(sa)¶ Returns a view of a numpy structured array as a single-type 1- or 2-dimensional array.
If the resulting nd array would be a column vector, returns a 1-d array instead. If the resulting array would have a single entry, returns a 0-d array instead.
All elements are converted to the most permissive type. Permissiveness is determined first by finding the most permissive type in the ordering: datetime64 < int < float < string, then by selecting the longest typelength among all columns with that type.
If the sa does not have a homogeneous datatype already, this may require copying and type conversion rather than just casting. Consequently, this operation should be avoided for heterogeneous arrays.
Based on http://wiki.scipy.org/Cookbook/Recarray.
Parameters: sa (numpy.ndarray) – The structured array to view
Return type: np.ndarray
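For the easy homogeneous case, NumPy’s own recfunctions helper performs the same structured-to-plain conversion, shown here for comparison (cast_np_sa_to_nd additionally handles the type promotion described above):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# A homogeneous (all-float64) structured array...
sa = np.array([(1.0, 2.0), (3.0, 4.0)],
              dtype=[('a', float), ('b', float)])

# ...viewed as a plain 2-d float array
nd = rfn.structured_to_unstructured(sa)
```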
-
diogenes.utils.
check_arguments
(args, required_keys, optional_keys_take_lists=False, argument_name='arguments')¶ Verifies that args adheres to the “arguments” format.
The arguments format is the format expected by “arguments” in, for example, diogenes.modify.choose_cols_where, diogenes.modify.remove_rows_where, diogenes.modify.where_all_are_true, and diogenes.grid_search.experiment.Experiment. If args does not adhere to this format, raises a ValueError
Parameters: - args (list of dict) – Arguments to verify
- required_keys (dict of str : ((? -> bool) or None)) –
A dictionary specifying which keys will be required in each dict in args. If a value in required_keys is not None, it should be a lambda that takes the argument passed to the key in args and returns a bool signifying whether or not the input is valid for that required key. For example, if every dict in args requires the key ‘func’ and the argument for that key must be callable, you could pass:
required_keys = {‘func’: lambda f: hasattr(f, ‘__call__’)}
If a key in required_keys has the value None, then the corresponding key in args will not be verified.
- optional_keys_take_lists (bool) – Iff True, will make sure that arguments for keys in args that are not required_keys have values that are lists. This is a consolation to diogenes.grid_search.Experiment
- argument_name (str) – Name of variable that was supposed to be in argument format
Returns: The verified args
Return type: list of dict
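The validation contract described above can be sketched as follows. This is our own minimal re-implementation for illustration, not diogenes’s code (in particular, it ignores optional_keys_take_lists):

```python
def check_arguments(args, required_keys):
    """Verify each dict in args has every required key, applying the
    validator (when not None) to the corresponding value."""
    for arg in args:
        for key, validate in required_keys.items():
            if key not in arg:
                raise ValueError('missing required key: {!r}'.format(key))
            if validate is not None and not validate(arg[key]):
                raise ValueError('invalid value for key: {!r}'.format(key))
    return args

# Every dict must supply a callable under 'func'
ok = check_arguments([{'func': len}],
                     {'func': lambda f: hasattr(f, '__call__')})
```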
-
diogenes.utils.
check_col
(col, argument_name='col', n_rows=None)¶ Verifies that col is a 1-dimensional array. Otherwise, throws an error
If col is not a numpy array, but is an iterable that can be converted to an array, the conversion will be performed and an error will not be thrown.
Parameters: - col – Object to check
- argument_name (str) – Name of variable that was supposed to be a 1-dimensional array
- n_rows (int or None) – If not None, number of rows that col should have
Returns: The verified (and possibly converted) col
Return type: np.ndarray
-
diogenes.utils.
check_col_names
(col_names, argument_name='col_names', n_cols=None)¶ Makes sure that col_names is a valid list of str.
If col_names is a str, will transform it into a list of str
If any of col_names is unicode, translates to ascii
Parameters: - col_names – Object to check
- argument_name (str) – Name of variable that was supposed to be in col_names format
- n_cols (None or int) – If not None, number of entries that col_names should have
Returns: transformed col_names
Return type: list of str
-
diogenes.utils.
check_consistent
(M, col=None, col_names=None, M_argument_name='M', col_argument_name='col', col_names_argument_name='col_names', n_rows=None, n_cols=None, col_names_if_M_converted=None)¶ Makes sure that input is valid and self-consistent
- Makes sure that M is a valid structured array.
- If col is provided, makes sure it’s a valid column.
- If col is provided, makes sure that M and col have the same number of rows
- If col_names is provided, makes sure that col_names is a list of str
- If col_names is provided, make sure that the col_names are in M
-
diogenes.utils.
check_sa
(M, argument_name='M', n_rows=None, n_cols=None, col_names_if_converted=None)¶ Verifies that M is a structured array. Otherwise, throws an error
If M is not a structured array, but can be converted to a structured array, this function will return the converted structured array without throwing an error.
Parameters: - M – Object to check
- argument_name (str) – Name of variable that was supposed to be a structured array
- n_rows (int or None) – If not None, number of rows that M should have
- n_cols (int or None) – If not None, number of columns that M should have
- col_names_if_converted (list of str or None) – If M is converted to a structured array from a list of lists or a homogeneous numpy array, the created structured array will use these names for columns
Returns: The verified (and possibly converted) M
Return type: numpy.ndarray
-
diogenes.utils.
convert_to_sa
(M, col_names=None)¶ Converts a list of lists or a np ndarray to a structured array
Parameters: - M (list of lists or np.ndarray or pandas.DataFrame) – The matrix M that is assumed to be the basis for the ML algorithm
- col_names (list of str or None) – Column names for the new sa. If M is already a structured array, col_names will be ignored. If M is not a structured array and col_names is None, names will be generated
Returns: structured array
Return type: np.ndarray
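In plain NumPy, the list-of-lists case looks like the sketch below. The column names are our own; diogenes generates arbitrary names when none are supplied.

```python
import numpy as np

# Rows must become tuples for a structured-array constructor
L = [[1, 'a', 2.5], [2, 'b', 3.5]]
sa = np.array([tuple(row) for row in L],
              dtype=[('id', int), ('tag', 'U8'), ('score', float)])
```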
-
diogenes.utils.
csv_to_sql
(conn, csv_path, table_name=None, parse_datetimes=[])¶ Converts a csv to a table in SQL
Parameters: - conn (sqlalchemy engine) – Connection to database
- csv_path (str) – Path to csv
- table_name (str or None) – Name of table to add to db. If None, will use the name of the csv with the .csv suffix stripped
- parse_datetimes (list of col names) – Columns that should be interpreted as datetimes
Returns: The table name
Return type: str
-
diogenes.utils.
dist_less_than
(lat_1, lon_1, lat_2, lon_2, threshold)¶ Tests whether distance between two points is less than a threshold
Parameters: - lat_1 (float) –
- lon_1 (float) –
- lat_2 (float) –
- lon_2 (float) –
- threshold (float) – max distance in kilometers
Returns: Return type: boolean
-
diogenes.utils.
distance
(lat_1, lon_1, lat_2, lon_2)¶ Calculate the great circle distance between two points on earth
In Kilometers
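A standard haversine implementation of such a helper, for reference. This is an independent sketch; diogenes’s exact formula and choice of Earth radius may differ slightly.

```python
import math

def haversine_km(lat_1, lon_1, lat_2, lon_2):
    """Great circle distance between two (lat, lon) points, in km."""
    r = 6371.0  # mean Earth radius in kilometers
    phi_1, phi_2 = math.radians(lat_1), math.radians(lat_2)
    d_phi = math.radians(lat_2 - lat_1)
    d_lam = math.radians(lon_2 - lon_1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_1) * math.cos(phi_2) * math.sin(d_lam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```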
-
diogenes.utils.
fix_pandas_datetimes
(df, dtime_cols)¶
-
diogenes.utils.
invert_dictionary
(aDict)¶ Transforms a dict so that keys become values and values become keys
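A minimal version of this transformation (it assumes values are hashable; how diogenes resolves duplicate values is not documented here):

```python
def invert_dictionary(a_dict):
    # Swap keys and values; for duplicate values, the last key wins
    return {value: key for key, value in a_dict.items()}
```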
-
diogenes.utils.
is_nd
(M)¶ Returns True iff M is a numpy.ndarray
-
diogenes.utils.
is_not_a_time
(dt)¶ True iff dt is equivalent to numpy.datetime64(‘NaT’)
Performs casting so that the comparison against “not a time” is correct
-
diogenes.utils.
is_sa
(M)¶ Returns True iff M is a structured array
-
diogenes.utils.
join
(left, right, how, left_on, right_on, suffixes=('_x', '_y'))¶ Does SQL-style join between two numpy tables
Supports equality join on an arbitrary number of columns
Approximates Pandas DataFrame.merge http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html Implements a hash join http://blogs.msdn.com/b/craigfr/archive/2006/08/10/687630.aspx
Parameters: - left (numpy.ndarray) – left structured array to join
- right (numpy.ndarray) – right structured array to join
- how ({‘inner’, ‘outer’, ‘left’, ‘right’}) – As in SQL, signifies whether rows on one table should be included when they do not have matches in the other table.
- left_on (str or list of str) – names of column(s) for left table to join on. If a list, the nth entry of left_on is joined to the nth entry of right_on
- right_on (str or list of str) – names of column(s) for right table to join on. If a list, the nth entry of left_on is joined to the nth entry of right_on
- suffixes ((str, str)) – Suffixes to add to column names when left and right share column names
Returns: joined structured array
Return type: numpy.ndarray
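The hash join the docstring references works by indexing one table on the join key and probing it with the other. A single-column inner-join sketch, using plain dicts instead of structured arrays for brevity:

```python
def hash_join(left, right, key):
    # Build phase: index the left table's rows by join key
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    # Probe phase: look up each right row and merge matching pairs
    joined = []
    for row in right:
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)
            joined.append(merged)
    return joined

left = [{'id': 1, 'x': 'a'}, {'id': 2, 'x': 'b'}]
right = [{'id': 2, 'y': 'c'}]
# inner join: only id == 2 has a partner in both tables
```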
-
diogenes.utils.
np_dtype_is_homogeneous
(A)¶ True iff dtype is nonstructured or every sub dtype is the same
-
diogenes.utils.
on_headless_server
()¶ True iff the host doesn’t appear to have a display to plot to
-
diogenes.utils.
open_csv_as_sa
(fin, delimiter=', ', header=True, col_names=None, verbose=True, parse_datetimes=[])¶ Converts a csv to a structured array
Parameters: - fin (file-like object) – file-like object containing csv
- delimiter (str) – Character used to delimit csv fields
- header (bool) – If True, assumes the first line of the csv has column names
- col_names (list of str or None) – If header is False, this list will be used for column names
Returns: structured array corresponding to the csv. If header is False and col_names is None, diogenes will assign arbitrary column names
Return type: numpy.ndarray
-
diogenes.utils.
remove_cols
(M, col_names)¶ Remove columns specified by col_names from structured array
Parameters: - M (numpy.ndarray) – structured array
- col_names (list of str) – names for columns to remove
Returns: structured array without columns
Return type: numpy.ndarray
-
diogenes.utils.
sa_from_cols
(cols, col_names=None)¶ Converts a list of columns to a structured array
-
diogenes.utils.
stack_rows
(args)¶ Returns a structured array containing all the rows in its arguments
Each argument must be a structured array with the same column names and column types. Similar to SQL UNION
-
diogenes.utils.
str_to_time
(date_text)¶ Returns the datetime.datetime representation of a string
Returns NOT_A_TIME if the string does not signify a valid datetime
-
diogenes.utils.
to_unix_time
(dt)¶ Converts a datetime.datetime to seconds since epoch
-
diogenes.utils.
transpose_dict_of_lists
(dol)¶ Transforms a dictionary of lists into a list of dictionaries
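One way to implement the described transformation (a sketch; it assumes all the lists have equal length):

```python
def transpose_dict_of_lists(dol):
    # {'a': [1, 2], 'b': [3, 4]} -> [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
    keys = list(dol)
    return [dict(zip(keys, values)) for values in zip(*dol.values())]
```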
-
diogenes.utils.
utf_to_ascii
(s)¶ Converts a unicode string to an ascii string.
If the argument is not a unicode string, returns the argument.
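A common way to implement such a helper (a sketch; whether diogenes drops or substitutes non-ASCII characters is an assumption here):

```python
def utf_to_ascii(s):
    # Encode with non-ASCII characters dropped; pass other types through
    if isinstance(s, str):
        return s.encode('ascii', 'ignore').decode('ascii')
    return s
```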