To see the whole process, please watch the following Dr Spotfire session:
The Python Code
Below is the code and the configuration for generating correlation scores using the pandas package. Pandas has 3 inbuilt correlation scoring methods: Spearman, Kendall, and Pearson. You will need to register a new data function, select Python as the type and then paste in this code. See this video for a guide on how to set up data functions in Spotfire (particularly around 10:30 minutes in):
This code compares every column with every other column in the data table passed from Spotfire into the data function, unless a subset of columns is supplied through the optional selected_features input:
""" [Exploratory] Column Correlation October 2021 Version: 1.2.0 datascience@tibco.com Calculates the correlation coefficients between columns of data. Multiple correlation methods are available which are: Pearson, Spearman and Kendall. Inputs ---------- df : data table Data table containing the columns to be tested for correlation correlation_method: String Correlation method to be used. Must be one of: Spearman, Kendall or Pearson encode_strings: Boolean (Optional) Whether columns of strings should be encoded to allow for correlation calculation. If False, or omitted, then string columns are ignored selected_features : column (Optional) Column containing the names of columns in your data table to be tested for correlation i.e. to select a subset of columns instead of testing all columns in the data Outputs ------- output_corr_df : data table Data tabel containing all correlation scores in the format of {Column 1, Column 2, Correlation Score, Correlation Method} Packages Required ------- pandas scikit-learn """ import pandas as pd from sklearn.preprocessing import LabelEncoder ## check correlation method is valid valid_correlation_methods = ['pearson','kendall','spearman'] if any([method == str(correlation_method).lower() for method in valid_correlation_methods]) == False: raise ValueError("Invalid correlation method specified. Please use one of: Pearson, Kendall or Spearman") ## Filter data table passed from Spotfire to only columns of interest based ## upon another column input i.e. 
from feature selection marking ## If nothing is passed in default to performing correlation on all columns if 'selected_features' in globals() and selected_features is not None: if len(selected_features.unique()) >= 2: feature_names = selected_features.unique() ## filter data frame down to only columns of interest features_df = df.loc[:,feature_names] elif len(selected_features.unique()) == 0: features_df = df.copy() feature_names = features_df.columns else: raise ValueError("Not enough features selected (" + str(len(selected_features.unique())) + " were supplied). The minimum is 2.") else: features_df = df.copy() feature_names = features_df.columns ## Handle string columns if required if 'encode_strings' in globals() and encode_strings is not None: if encode_strings == True: label_encoder = LabelEncoder() features_df = features_df.apply(lambda x: label_encoder.fit_transform(x) if x.dtype == 'object' else x) ## Check we have some columns to compare - if not, and empty data frame is returned if len(feature_names) >= 2: ## run correlation output_corr_df = features_df.corr(method=correlation_method.lower()) output_corr_df['Column1'] = output_corr_df.index output_corr_df = output_corr_df.melt(id_vars=['Column1'], var_name='Column2', value_name='Correlation Score') ## add in correlation method column for reference output_corr_df["Correlation Method"] = correlation_method ## remove same column comparison output_corr_df = output_corr_df[output_corr_df.Column1 != output_corr_df.Column2] else: ## return empty data frame output_corr_df = output_corr_df.append(pd.DataFrame({"Column1": [""], "Column2": "", "Correlation Score": 0, "Correlation Method": correlation_method })) # Copyright (c) 2021. TIBCO Software inc.
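As a rough illustration of why the choice of correlation method matters, the sketch below runs two of the pandas methods on a small made-up data set in which y increases monotonically, but not linearly, with x. The data and variable names are assumptions for illustration only, and Kendall is left out here because pandas relies on SciPy to compute it.

```python
import pandas as pd

# Hypothetical sample data: y grows monotonically but non-linearly with x
sample_df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
sample_df["y"] = sample_df["x"] ** 3

# Compute the x-y correlation with two of the built-in pandas methods
scores = {
    method: sample_df.corr(method=method).loc["x", "y"]
    for method in ["pearson", "spearman"]
}

# Spearman works on ranks, so a perfectly monotonic relationship scores 1,
# while Pearson measures linear association and comes out below 1 here
print(scores)
```

Rank-based methods such as Spearman and Kendall are therefore usually the safer choice when a relationship is expected to be monotonic rather than strictly linear.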
Data Function Inputs
For this data function we have specified 4 inputs, which are described below:
- df - the table containing the columns to be analyzed for correlation
- encode_strings - (Optional input) true/false on whether to encode categorical and string columns to include these columns in the correlation analysis
- selected_features - (Optional input) a column from a table that has the names of the columns we want to include in the correlation analysis.
- correlation_method - a String value that states what correlation method to use. Must be one of: Spearman, Kendall, or Pearson
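To try the script outside Spotfire, the inputs above can be simulated with ordinary pandas objects. The sketch below is only an assumption about how the values look on the Python side: the table is taken to arrive as a DataFrame and the column input as a Series (which the script's use of .unique() suggests), and the sample column names are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the values Spotfire would pass in
df = pd.DataFrame({
    "temperature": [20.1, 21.5, 19.8, 23.0],
    "pressure": [1.01, 1.03, 1.00, 1.05],
    "humidity": [40, 38, 42, 35],
    "site": ["A", "B", "A", "C"],
})
correlation_method = "Pearson"
encode_strings = False

# Assumed shape of a column input: a Series that may repeat column names
selected_features = pd.Series(["temperature", "pressure", "temperature"])

# Same filtering idea as the script: keep only the uniquely named columns
feature_names = selected_features.unique()
features_df = df.loc[:, feature_names]

print(list(features_df.columns))
```

With these stand-ins defined first, the rest of the script can be run and debugged in any Python environment before it is registered as a data function.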
Data Function Outputs
There is a single output, which is our table of correlation scores that a user can then use to decide which features to keep:
- output_corr_df - a table listing each pair of compared columns, their correlation score (from -1 to 1), and the correlation method used
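The long format of this output comes from reshaping the square correlation matrix with melt. The following minimal sketch reproduces that reshaping step on a made-up three-column table; the sample data and the names sample_df, corr_matrix and long_df are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with three numeric columns
sample_df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

# Square matrix: one row and one column per input column
corr_matrix = sample_df.corr(method="pearson")

# Reshape to one row per column pair
corr_matrix["Column1"] = corr_matrix.index
long_df = corr_matrix.melt(
    id_vars=["Column1"], var_name="Column2", value_name="Correlation Score"
)
long_df["Correlation Method"] = "Pearson"

# Drop the comparisons of a column with itself
long_df = long_df[long_df.Column1 != long_df.Column2]

print(long_df)
```

Each pair appears twice (a with b, and b with a), which keeps the table symmetric and convenient for a heat map with Column1 on one axis and Column2 on the other.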
Optional Inputs in Python Data Functions
Note the line of code in the above script:
if 'selected_features' in globals() and selected_features is not None
This allows us to make inputs optional, as it checks whether the name of your input variable exists in Python's globals. If a user doesn't pass this optional input, an if statement like the one above is a way to detect that. See more Python tips and tricks like this in this article.
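The same check can be wrapped in a small helper when a data function has several optional inputs. This is a sketch only: input_was_supplied is a hypothetical helper name, not part of Spotfire or pandas, and it assumes that an unset optional input is either undefined or None.

```python
def input_was_supplied(name, namespace):
    """Return True if `name` is defined in `namespace` and is not None."""
    return name in namespace and namespace[name] is not None


# In a data function the namespace would be globals(); plain dictionaries
# are used here so the three possible cases are easy to see
print(input_was_supplied("selected_features", {}))
print(input_was_supplied("selected_features", {"selected_features": None}))
print(input_was_supplied("selected_features", {"selected_features": ["col_a", "col_b"]}))
```

Inside the script the call would then read input_was_supplied('selected_features', globals()), which behaves like the inline condition quoted above.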
Making the Correlation Assessment Dynamic
We can make the correlation assessment react to the selection of columns in another chart by simply altering the settings for the selected_features input so that it is limited by filtering or marking. In the example below, the input is limited by the marking on the feature selection bar charts, as shown in the previous article on performing feature selection in Spotfire using Python, and in the Dr Spotfire session covering this topic:
Now the columns passed into the Python function are limited to those marked in another visualization. In this way, we can control which columns we assess for correlation before proceeding to use the data further. The final display in the Dr Spotfire session is shown below:
In this example, we are using a drop-down control in a text area to set the correlation method to pass to the Python data function, and then an action control button to trigger the data function running. The correlation results are shown using a table, and a heatmap to make the assessment and identification of hotspots easy and intuitive.
Additional Notes
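In the pandas versions available when this script was written, the corr function automatically removed non-numerical columns from the comparison. From pandas 2.0 onwards its numeric_only argument defaults to False, so non-numerical columns have to be dropped or encoded before calling it. Either way, categorical and string columns can be included in the analysis by encoding them with some form of encoder function. In the code example above, a label encoder is used. However, this is just a simple example: a label encoder assigns integer codes whose order carries no real meaning, and other encoders, such as count, binary or hash encoders, are likely to provide a better encoding.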
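As one example of an alternative, a count encoder can be sketched with pandas alone. The code below is a minimal illustration rather than a recommended implementation: count_encode is a hypothetical helper, the sample data is made up, and dedicated encoder libraries offer more complete versions of this idea.

```python
import pandas as pd

# Hypothetical sample data with one string column and one numeric column
sample_df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "red", "blue"],
    "size": [10, 12, 11, 15, 9, 13],
})


def count_encode(column):
    """Replace each category with the number of times it occurs in the column."""
    return column.map(column.value_counts())


# Encode non-numeric columns and leave numeric columns untouched
encoded_df = sample_df.apply(
    lambda col: col if pd.api.types.is_numeric_dtype(col) else count_encode(col)
)

print(encoded_df)
print(encoded_df.corr(method="spearman"))
```

Unlike label encoding, the encoded value here has a meaning (how common the category is), although categories that occur equally often become indistinguishable.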