pyxplr package

Submodules

pyxplr.explore_feature_map module

pyxplr.explore_feature_map.explore_feature_map(df, features=[])

Returns a cumulative faceted plot of pairwise feature relationships. The plot is an NxN grid of mini-charts, where N is the number of features. The main diagonal shows each feature's distribution. Pairwise Pearson correlations are shown above the main diagonal. Pairwise joint feature distributions are shown below the main diagonal.

NOTE: Non-numeric features will be skipped. The passed features must not include any missing data; otherwise an error will be raised.

Parameters:
  • df (pandas.DataFrame) – The target dataframe to explore
  • features (Array-like) – An array of strings representing feature names to include in the plot. Empty array means all features (Default = [])
Returns:

The Altair chart is returned as the result.

Return type:

altair.Chart

Raises:
  • TypeError – Invalid data frame.
  • ValueError – Raised in any of these cases: invalid features specification; no numeric features present in the dataset; the dataframe includes missing data; the features specification includes a non-existent feature; the features specification includes a non-numeric feature.

Notes

The function will only work with numeric features. Non-numeric features will be omitted.

The current implementation has a performance limitation imposed by Altair: large datasets may take some time to render.

Examples

>>> import pandas as pd
>>> from pyxplr.explore_feature_map import explore_feature_map
>>> df = pd.DataFrame({'col1': [1, 2, 4, 3, -1, 10],
...                    'col2': [3, 1, 5, -2, 3, -1],
...                    'col3': [8, 1, 2, 3, 11, 10]})
>>> explore_feature_map(df)
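
The numbers shown above the main diagonal can be approximated outside the plot. The following is a minimal sketch using plain pandas, not the package's own implementation; it assumes that selecting numeric columns and calling `DataFrame.corr` is a fair stand-in for what the chart displays.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 4, 3, -1, 10],
                   'col2': [3, 1, 5, -2, 3, -1],
                   'col3': [8, 1, 2, 3, 11, 10]})

# The function rejects missing data, so check for it up front.
has_missing = df.isna().any().any()

# Keep numeric columns only, mirroring the documented behaviour.
numeric = df.select_dtypes(include="number")

# Pairwise Pearson correlation matrix: N x N, symmetric, ones on the diagonal.
corr = numeric.corr(method="pearson")
print(corr.round(2))
```

Each off-diagonal entry of `corr` corresponds to one feature pair in the grid.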

pyxplr.explore_missing module

Created on February 28, 2020. @author: Braden Tam. Implementation of the explore_missing function in the pyxplr package.

pyxplr.explore_missing.explore_missing(df, num_rows=0, df_type='location')

explore_missing identifies missing observations within df. It returns one of two tables, depending on df_type: (location) a table showing the exact location in the dataframe where data is missing, or (count) a table showing how many observations are missing and what proportion of the data is missing for each feature.

Parameters:
  • df (pandas.DataFrame) – The target dataframe to explore
  • num_rows (integer) – The number of rows above and below the missing value to output (Default = 0)
  • df_type (str) – The desired type of output, either 'location' or 'count' (Default = 'location')
Returns:

type – The resultant dataframe

Return type:

pandas.DataFrame

Raises:
  • ValueError – Raised in any of these cases: num_rows must be a positive integer; num_rows must be of type int; there are no missing values in the dataframe.
  • TypeError – Data must be a pandas DataFrame
  • NameError – Type must be either “count” or “location”

Examples

>>> import pandas as pd
>>> from pyxplr.explore_missing import explore_missing
>>> test = pd.DataFrame({'col1': [1, 2, None, 3, 4],
...                      'col2': [2, 3, 4, 5, 6]})
>>> explore_missing(test, num_rows = 1)
>>> explore_missing(test, df_type = "count")
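
To make the two output types concrete, here is a rough sketch of equivalent logic in plain pandas. It is an illustration under assumptions, not the package's code: the column names `missing` and `proportion` are hypothetical, and the actual layout of the returned tables may differ.

```python
import pandas as pd

test = pd.DataFrame({'col1': [1, 2, None, 3, 4],
                     'col2': [2, 3, 4, 5, 6]})

# "count"-style summary: missing observations and their proportion per feature.
counts = test.isna().sum()
summary = pd.DataFrame({"missing": counts,
                        "proportion": counts / len(test)})

# "location"-style view: every row with a missing value, plus
# num_rows neighbouring rows above and below it.
num_rows = 1
missing_pos = [i for i, flag in enumerate(test.isna().any(axis=1)) if flag]
keep = sorted({j
               for i in missing_pos
               for j in range(max(i - num_rows, 0),
                              min(i + num_rows + 1, len(test)))})
location = test.iloc[keep]

print(summary)
print(location)
```

In this sketch the window is clipped at the first and last row so that it never reaches outside the dataframe.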

pyxplr.explore_outliers module

pyxplr.explore_outliers.explore_outliers(df, std_range)

Explores outliers in each feature of the dataset based on the given standard deviation range. Before the calculation, rows containing NA values are dropped and only numeric columns are considered.

Parameters:
  • df (pandas.DataFrame) – Target dataframe to explore
  • std_range (integer) – Number of standard deviations used to find outliers
Returns:

DataFrame – Dataframe containing the number of outliers for each numeric feature.

Return type:

pandas.DataFrame

Raises:

TypeError – The input is not a pandas.DataFrame.

Notes

Does not consider non-numeric features.

Examples

>>> import pandas as pd
>>> from pyxplr.explore_outliers import explore_outliers
>>> df = pd.DataFrame({'col1': [1, 2, 1.00, 3, -1, 100],
...                    'col2': [3, 1, 5, -2, 3, -1]})
>>> explore_outliers(df, 2)
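
One plausible reading of the rule described above is "count the values that lie more than std_range standard deviations from the column mean". The sketch below implements that reading in plain pandas. It is an assumption rather than the package's code; in particular, it uses the pandas default sample standard deviation (ddof=1), which the package may or may not share.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1.00, 3, -1, 100],
                   'col2': [3, 1, 5, -2, 3, -1]})
std_range = 2

# Drop rows with NA values and keep numeric columns only.
numeric = df.dropna().select_dtypes(include="number")

# Absolute distance of each value from its column mean.
deviation = (numeric - numeric.mean()).abs()

# Count, per column, the values further away than std_range standard deviations.
outlier_counts = (deviation > std_range * numeric.std()).sum()
print(outlier_counts)
```

With this data only the value 100 in `col1` exceeds the threshold, so `col1` has one outlier and `col2` has none.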

pyxplr.explore_summary module

pyxplr.explore_summary.explore_summary(df)

Prints the names of the categorical columns and the numeric columns, along with basic summary statistics for the numeric columns of the provided data: mean, variance, the 0.25, 0.5, and 0.75 quantiles, min, and max.

Parameters: df (pandas.DataFrame) – The target dataframe to explore
Returns: Dataframe with summary details on each numeric feature
Return type: pandas.DataFrame
Raises: Error – Description

Examples

>>> import pandas as pd
>>> from pyxplr.explore_summary import explore_summary
>>> df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
...                    "B": ["apple", "banana", "orange",
...                          "strawberry", "blueberry"],
...                    "C": ["2", "1", "3", "4", "6"],
...                    "D": [14, 3, 17, 2, 6]})
>>> explore_summary(df)
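
The statistics listed above can be reproduced with standard pandas calls. The following sketch is an approximation for illustration only: splitting columns with `select_dtypes` and the row layout of `result` are assumptions, and the real function's output format may differ.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                   "B": ["apple", "banana", "orange",
                         "strawberry", "blueberry"],
                   "C": ["2", "1", "3", "4", "6"],
                   "D": [14, 3, 17, 2, 6]})

# Split the column names by dtype; strings such as "2" count as categorical.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# Basic statistics for the numeric columns.
summary = df[numeric_cols].agg(["mean", "var", "min", "max"])
quantiles = df[numeric_cols].quantile([0.25, 0.5, 0.75])

# One row per statistic, one column per numeric feature.
result = pd.concat([summary, quantiles])
print(result)
```

Note that column `C` holds digits stored as strings, so it is treated as categorical here unless it is converted first, for example with `pd.to_numeric`.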

Module contents