How to create a data function to copy and preserve contents of a column table

YB1 · September 9

Hi,

I am fairly new to Spotfire data functions and I am trying to start simple.

I am running the Kmeans clustering from spotfire (tools menu). The results sometimes get changed, so I would like to copy the results after calculation by copying the values explicitely rather than creating a column that would only refer to the values.

TBH I am strugling as I am not sure how to pass the input column (or should it be the full table) then use pandas to create and fill a new column in same table.

I guess once I manage that I will be able to do more on my own or using existinf data functions as inspiration...

I hope some kind Spotfire guru can give me a hand here...

Thanks

Gaia Paolini · September 9

what version of Spotfire are you using and do you have use/knowledge of Python ?

YB1 · September 9

Hi Gaia, thanks for answering,

I am currently using 12.0.3.

I have some basics of python and pandas and of course GPTs can help, but the main issue I have is interfacing with spotfire, knowing how to obtain and write columns in spotfire in a compatible way with pandas. I am sure it should be fairly simple, but I am missing the starting bricks... I did a few tries, also using AI GPTs but failed miserably so far. I am more of scientist than a software programmer, so spend no more than 10% of my time on code, which doesn't help build the comfort.

YB1 · September 9

Also, I wouldn't bother if the Kmeans would not appear to get corrupted every 5 minutes, leading to having to recompute/update. This might have to do with the setup. I am using Spotfire in Schlumberger DELFI environment, with ingestion from OSDU through Dataiku... May be the Kmeans results would be more stable outside of this setup...

Gaia Paolini · September 10

I do not understand what you mean by K-Means getting corrupted. Maybe, if your data changes all the time, the clusters do change, or K-Means is not the right method to capture the structure in your data?

I asked Spotfire Copilot to come up with a K-Means Python data function (with column normalization, search for optimal number of clusters, and results in a separate table). It did a good job, with some debugging needed. I saved this example in Spotfire 14.0 which I hope you can open. Script below. (There may be a warning about cyclic dependencies, you can ignore it).
First you need to have (or create) a column containing the id of each row. I created a calculated column called "idColumn" using the RowId() expression function.
The data function accepts the input data table (you can input all columns, as it will use the numeric columns only), the name of the id column, and a min/max number of clusters.
If you want to have a pre-determined number of clusters K, just set min=max=your desired K.
The output is a separate table, which can be column-matched to the original one via this id column. I joined the original columns back to this table to visualize the results (see screenshot).
If you want to calculate different clusters, you can change the output table of the data function.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np


# Isolate numeric columns and normalize
numeric_cols = inputData.select_dtypes(include=np.number)
scaler = StandardScaler()
normalized_data = scaler.fit_transform(numeric_cols)


# Determine the optimal number of clusters
range_n_clusters = list(range(minClusters, maxClusters+1))
silhouette_avg = []
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(normalized_data)
    cluster_labels = kmeans.labels_
    silhouette_avg.append(silhouette_score(normalized_data, cluster_labels))

# Select the optimal number of clusters
optimal_clusters = range_n_clusters[silhouette_avg.index(max(silhouette_avg))]

# Apply KMeans with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_clusters, random_state=0).fit(normalized_data)
inputData['Cluster'] = kmeans.labels_

# Prepare the silhouette score curve data
curve_data = pd.DataFrame({'Clusters': range_n_clusters, 'SilhouetteScore': silhouette_avg})

# Outputs
outputData = inputData[[idColumnName, 'Cluster']].copy()
optimalCurve = curve_data

k_means.dxp

YB1 · September 10

This is how my time series look like :

After Kmeans, they should look the same with just K-means categories colors on top. However, after a while, the time series look like that (i use column names a a color control as I have 2 columns on the same plot):

So this is what I mean with Kmeans values becoming corrupted. That is whay I have been trying to generate a column that is not dependent on Kmeans process but contains Kmeans results. My data doesn't really change but is read from Spotfire Library and needs to be associated with a token value to be loaded (DataIku + SLB DELFI environment)....

So it could be that refresh of the data triggers a mess on the Kmeans process... If I just re run it it looks fine,

but doing it 20x a day is not.... Thanks for your support

David Boot-Olazabal · September 10

Hi YB1,

Not sure why your data seems weird after a while.
You mentioned you had to add in a token, to connect to DataIku + SLB DELFI. Could it be that this token becomes invalid after a while and you need to refresh it?

To go back to your original request, copying data to another data table so it becomes independent of the back end, could you do with this iron python script?

Kind regards,

David

YB1 · September 10

Is, it could be the token. refreshing it could be as annoying as refreshing the KMean.

Regarding your second question, that was actually my first question. This is exactly what I want to do: Copy the values using python so that the results of Kmeans get preserved, token or not...

YB1 · September 10

David, Thanks, will give it a try..

Sign In

How to create a data function to copy and preserve contents of a column table

Recommended Posts

YB1

Link to comment

Share on other sites

Gaia Paolini

Link to comment

Share on other sites

YB1

Link to comment

Share on other sites

YB1

Link to comment

Share on other sites

Gaia Paolini

Link to comment

Share on other sites

YB1

Link to comment

Share on other sites

David Boot-Olazabal

Link to comment

Share on other sites

YB1

Link to comment

Share on other sites

YB1

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Industries