Analyze any Text Data in Spotfire


    This article explains how to use the NLP Python Toolkit - available from the Community Exchange.

    What is NLP and Why Should Businesses Care?

    Natural Language Processing (NLP) is a branch of Artificial Intelligence that focuses on the ability of machines to understand text. Text can refer to online documents, transcribed speech, real-time logs, and more. According to an article by Businesswire, the global NLP market size is expected to grow from 11.6 billion USD in 2020 to 35.1 billion USD by 2026, at a Compound Annual Growth Rate (CAGR) of 20.3%. Popular use cases for NLP include market intelligence, root cause analysis, automatic summarization, language translation, chatbots, and more. Moreover, there has been an influx of open-source libraries and research papers in NLP, especially over the last decade, which allows analysts and data scientists to easily adopt the latest advancements (image below from a John Snow Labs article).

    (Image: growth of open-source NLP libraries and research papers, from John Snow Labs)

    Introducing spotfire-dsml: Advancing Data Science and Machine Learning in Spotfire

    spotfire-dsml is a Python package introduced at The Analytics Forum (TAF) in 2023 to redefine the way we approach data science within the Spotfire platform. It empowers data scientists and analysts by significantly reducing the time to value for creating analytics-rich applications within Spotfire. The spotfire-dsml module includes the following packages: NLP, Time Series, Geoanalytics, Explainability, and more.

    The NLP module includes the following five data functions: Preprocessing and N-gram Keywords; Tagging; Supervised Machine Learning; Topic Modeling; and Unsupervised Machine Learning and Open-Source Embeddings.

    The DSML Toolkit documentation and Spotfire dxp examples can be found in this exchange component.

    More details about the DSML Toolkit can be found in this community article.

     

    Why Use the NLP Toolkit in Spotfire

    The NLP Python Toolkit for Spotfire is a versatile toolkit that performs a range of exploratory text analytics on any text. It includes two Python data functions that use the popular NLP libraries NLTK and spaCy and their pre-trained language models. The data functions inside the toolkit can be used via Spotfire's no-code/low-code data function flyout interface and/or through the step-by-step instructions in the Spotfire DXP. The toolkit also shows how to visualize results to highlight quick insights from the text. Below is a quick video of configuring one of the data functions from the data function flyout: you simply need to specify the text column and set a few parameters.

    See the animation below.

    (Animation: configuring a data function from the flyout)

     

    What functionalities are in the NLP Toolkit?

    'NLP Python Toolkit - Features' preprocesses or cleans the text and engineers the text into n-grams. It includes the option to remove numbers and special characters (these are with respect to the English language), and it removes stop words and performs text normalization. Stop words are commonly used words in a language that do not provide much semantic meaning for a given task (e.g. 'the', 'not', 'we'). Text normalization standardizes words using either stemming or lemmatization. Stemming crops words to remove suffixes or prefixes, while lemmatization morphs words into their base form. For example, stemming would convert the word 'studies' into 'studi', while lemmatization would convert 'studies' into 'study'. The data function also performs binary and multiclass classification if a test dataset is provided; this is a new 2024 feature that allows for machine learning on top of preprocessed text data for predictive analytics. The supported models include linear regression and support vector machine.
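
    As a rough illustration (not the toolkit's own code), here is a minimal Python sketch of stop-word removal, stemming, and lemmatization using NLTK; the sample sentence and token handling are simplified assumptions.

```python
# Minimal sketch of the preprocessing described above, using NLTK directly.
# The toolkit's data function wraps comparable logic; this example is illustrative only.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

text = "The chef studies new menus and recipes"
tokens = [t.lower() for t in text.split()]

# Drop stop words such as 'the' and 'and'
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t not in stop_words]

# Stemming crops suffixes ('studies' -> 'studi'); lemmatization maps to base forms ('studies' -> 'study')
print([PorterStemmer().stem(t) for t in content_tokens])
print([WordNetLemmatizer().lemmatize(t) for t in content_tokens])
```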

    The next step is to engineer the text into a numerical, structured format (i.e. text vectorization). This is done with an N-gram matrix. N-grams are word combinations or, more generally, contiguous sequences of text. In the phrase 'The dog ran away', you get the following n-grams: 'the', 'dog', 'ran', 'away', 'the dog', 'dog ran', 'ran away', 'the dog ran', etc. It is common to go up to 4-word N-grams at most (i.e. unigrams, bigrams, trigrams, quadgrams). First, the top N-grams across all the text observations are retrieved. Then, each text observation receives a score for each N-gram feature; this score is either a frequency or a TF-IDF (term frequency-inverse document frequency) value. Both are simple measures, and we suggest using the latter to better quantify N-gram importance. Altogether, this results in an N-gram matrix (rows are text observations, columns are N-gram features, and values are scores). From this N-gram matrix, the scores are used to identify the top N-gram keywords for each text observation.
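
    To make the idea concrete, here is a minimal sketch of building an N-gram matrix with scikit-learn's TfidfVectorizer; the toolkit may use different libraries and parameters, and the documents, ngram_range, and max_features values below are assumptions for illustration.

```python
# Minimal sketch of N-gram vectorization with TF-IDF scoring (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog ran away",
    "the cat ran home",
    "the dog and the cat played outside",
]

# Unigrams up to quadgrams, keeping only the top features across all documents
vectorizer = TfidfVectorizer(ngram_range=(1, 4), max_features=50)
ngram_matrix = vectorizer.fit_transform(docs)        # rows = text observations, columns = n-gram features
features = vectorizer.get_feature_names_out()

# Top-scoring n-gram keywords per document
for row in ngram_matrix.toarray():
    top = sorted(zip(features, row), key=lambda pair: -pair[1])[:3]
    print([term for term, score in top if score > 0])
```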


     

    'NLP Python Toolkit - Entities and Sentiment' performs the following: Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and Sentiment Analysis. The term 'tagging' refers to models picking apart a word or group of words and ascribing a label/tag to it. NER tags predefined entities from text (geographical regions, people's names, languages or nationalities, etc.). POS tags each word with its grammatical tag (i.e. noun, adjective, etc.). Sentiment analysis assigns an overall polarity score (what we refer to as sentiment, i.e. how negative or positive something is on a scale from -1 to 1) and a subjectivity metric to each text observation. It also tags any phrases that the model identifies as polar, so each document has exactly one 'document polarity score' and can have zero or more phrases each tagged with a 'phrase polarity score'. The sentiment analysis is done through a library called TextBlob, which is supported in spaCy. The underlying NER, POS, and sentiment models work on raw text data and are pre-trained models from the spaCy library. spaCy is easy to use and has pre-trained models available for a variety of languages and tasks (others include token vectors, word vectors, stemming/lemmatization, dependency parsing, etc.). In this data function, you can specify the type and size of the language model to use, so it is not specific to English text, and it does not require preprocessing (in fact, preprocessing is not recommended here, since converting 'not happy' to 'happy' would distort the sentiment scoring).
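
    For illustration, here is a minimal sketch of NER, POS tagging, and sentiment scoring using spaCy and TextBlob directly; the data function wraps similar pre-trained models, and the model name and example sentence here are assumptions.

```python
# Minimal sketch of tagging and sentiment with spaCy and TextBlob.
# Requires: pip install spacy textblob && python -m spacy download en_core_web_sm
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
review = "The Italian restaurant in Philadelphia had a great happy hour."
doc = nlp(review)

# Named Entity Recognition: predefined labels such as NORP, GPE, PERSON
print([(ent.text, ent.label_) for ent in doc.ents])

# Part-of-Speech tagging: one grammatical tag per token
print([(token.text, token.pos_) for token in doc])

# Document-level polarity (-1 to 1) and subjectivity (0 to 1)
blob = TextBlob(review)
print(blob.sentiment.polarity, blob.sentiment.subjectivity)
```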


    Topic Modeling and Unsupervised ML with Embeddings (New Features, 2024)

    The Topic Modeling function uses the BERTopic library to perform topic modeling on text data, identifying themes and topics within the dataset and providing detailed information about topics and documents.
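
    As a rough sketch of what this function builds on, the BERTopic library can be called directly as below; the tiny document list and the min_topic_size setting are illustrative assumptions, since in practice you would pass a full text column with many documents.

```python
# Minimal sketch of topic modeling with BERTopic (pip install bertopic).
from bertopic import BERTopic

# In practice, docs would be the full text column from Spotfire; a real corpus
# should be much larger than this toy example.
docs = [
    "great happy hour and craft beer selection",
    "cheap drinks and friendly bartenders",
    "the plumber fixed our leaking sink quickly",
    "fast and affordable plumbing service",
    "amazing vegan pasta and tiramisu",
    "delicious pizza and homemade pasta",
    "long happy hour with good cocktails",
    "the electrician rewired the kitchen in a day",
]

topic_model = BERTopic(min_topic_size=2)     # small value only because the corpus is tiny
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())          # one row per topic, with size and top keywords
print(topic_model.get_document_info(docs))   # topic assignment per document
```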
     
    The Unsupervised ML and Open-Source Embeddings function performs advanced text analysis using the Sentence Transformers and BM25 Python libraries for embeddings, retrieval, and clustering. This is an unsupervised learning pipeline for text data. For optimal results, we suggest running the NLP preprocessing function first to clean the text.
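
    A minimal sketch of that kind of pipeline, calling Sentence Transformers, rank_bm25, and scikit-learn directly, is shown below; the model name, example documents, query, and cluster count are assumptions for illustration and may differ from the data function's own defaults.

```python
# Minimal sketch of embeddings, BM25 retrieval, and clustering.
# Requires: pip install sentence-transformers rank-bm25 scikit-learn
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
from sklearn.cluster import KMeans

docs = [
    "great cocktails and a long happy hour",
    "the plumber was on time and reasonably priced",
    "delicious vegan pasta, will come back",
    "cheap drinks but slow bar service",
]

# Dense sentence embeddings, then unsupervised clustering
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(list(zip(docs, clusters)))

# Sparse BM25 retrieval for a keyword query
bm25 = BM25Okapi([d.split() for d in docs])
print(bm25.get_top_n("happy hour drinks".split(), docs, n=2))
```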

    Sample Use Case - Yelp Dataset

    Here is a quick overview of a sample NLP use case. The data is a subset of the Yelp dataset centered on businesses around New Jersey, Philadelphia, and Delaware. There are three different text categories in this dataset: Business Categories, Reviews, and Tips. Interestingly, these text categories differ in both length and semantic meaning.


     

    Yelp - Search Functionality

    First, we wanted a dashboard page that acts as a search engine or aggregates the text into organized hierarchies. We ran the 'NLP Python Toolkit - Features' data function on both the Tips and Categories text columns. The top N-grams from the Categories help filter the businesses into restaurants, services, etc., and clicking on any one further breaks them down into types (e.g. Italian restaurants, plumbing services). The N-gram keywords can be used in the filter panel to search for specific terms relevant to businesses: 'accept cash', 'vegetarian'. We ran 'NLP Python Toolkit - Entities and Sentiment' to retrieve the NER results on the Reviews text column. Looking at the 'Nationalities or Religious or Political Groups' label, we see that the tagged entities and corresponding text reviews filter across food cuisines.


     

    Yelp - Evaluate Businesses

    The next page is about evaluating businesses. It shows the Sentiment and POS results in the Reviews text column. Overall, it is used to show which businesses have positive and negative reviews, and the corresponding phrases and POS tags help explain the model's scores. This can potentially be a less biased metric compared to each user's 1-5 scale rating.


     

    Yelp - Business Recommendations

    Now that we have explored and evaluated our business data, we want to optimize recommendations to our customers. Each review corresponds to a single user and a single business. We ran the 'NLP Python Toolkit - Features' data function on the Reviews column. The N-gram matrix calculated and stored is a numerical representation of all the reviews (each review is a vector, or row, in the matrix). We used the K-Means Cluster Python Data Function for Spotfire to cluster the reviews. This groups 'similar' reviews together in a cluster; 'similar' is vague, but here it is determined by distance between TF-IDF N-gram scores, meaning reviews in a cluster are grouped together because they share higher scores for certain N-gram features. We can then view any cluster and visualize its reviews by looking at the text directly and at the N-gram keywords in that cluster. Below, exploring a single cluster shows that its reviews discuss happy hours, bars, and drinks. Lastly, we can filter down to the reviews with positive sentiment scores and/or higher ratings (4 or 5 stars) in that cluster, and recommend the corresponding businesses to the users who wrote this group of reviews.
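
    For reference, here is a minimal sketch of this clustering step outside Spotfire, using scikit-learn; the K-Means Cluster data function wraps comparable logic, and the sample reviews and cluster count are assumptions for illustration.

```python
# Minimal sketch: cluster reviews by TF-IDF n-gram vectors, then inspect
# the highest-scoring n-grams in each cluster centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "great happy hour with cheap drinks at the bar",
    "happy hour specials and a good craft beer list",
    "the pasta was cold and the service was slow",
    "slow service and a bland, cold pizza",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
features = vectorizer.get_feature_names_out()

# N-grams with the largest centroid weights explain what each cluster's reviews share
for c, centroid in enumerate(kmeans.cluster_centers_):
    top_ids = np.argsort(centroid)[::-1][:5]
    print(f"Cluster {c}:", [features[i] for i in top_ids])
```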


     

    The DSML Toolkit documentation and Spotfire dxp examples can be found in this exchange component.

    More details about the DSML Toolkit can be found in this community article.

    Author:


    Sweta Kotha is a Data Scientist at Spotfire. Her experience spans data science, natural language processing, and biostatistics. She likes trying out new technologies and methods to address analytics challenges and is interested in effectively communicating with data.

     

     

