Using Python and R in Data Science


TIBCO Data Science is a collaborative platform designed to make advanced analytics accessible to everyone: it provides an intuitive way to create drag-and-drop workflows and offers a variety of out-of-the-box machine learning algorithms. It also includes built-in Python and R executors, so we can flexibly extend workflows with powerful user-defined functions written in Python or R. In this blog, we will learn how to use the embedded Jupyter Notebooks in TIBCO Data Science to train a LightGBM model, and how to write R code to train a decision tree model, both based on the iris dataset.

     

Part 1: Using Python for Building a LightGBM Model in TIBCO Data Science

    The main steps for using Python in TIBCO Data Science are:

    1. Create a Jupyter Notebook in the workspace

    2. Open the Jupyter Notebook and fill in the Python code

    3. Create a workflow in the workspace and add a Python Execute operator

Let's go through the details of each step.

    Step 1:  Create a Jupyter Notebook in the workspace

    After creating a workspace, we can easily add a Jupyter Notebook in the work files section.

     

[Screenshot: adding a Jupyter Notebook in the work files section]

    Step 2: Open the Jupyter Notebook and fill in the Python code

     

From the notebook content section, we can start importing Python packages or installing any packages we need (e.g., typing "!pip install lightgbm" in a cell and running it to install the LightGBM package).

     

[Screenshot: installing the LightGBM package in the notebook]

     

    Then we can import the dataset for analysis. Here we need to use the iris data table from a PostgreSQL database, so we should define the data source name, table name, schema name, and database name for targeting and importing the data.

Note that we have set "use_input_substitution" to True and "execution_label" to 1; these settings allow the notebook to read its input data from the workflow later.
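Outside the platform's generated helper, the same kind of read can be sketched with plain pandas and SQLAlchemy. The connection URL, schema, and table name below are placeholders for illustration, not the platform's actual import API.

```python
# A rough pandas/SQLAlchemy equivalent of the generated import cell.
# The connection URL, schema, and table name are placeholders for
# illustration -- not TIBCO Data Science's actual helper API.
import pandas as pd
from sqlalchemy import create_engine

def load_table(engine_url, table, schema=None):
    """Read an entire database table into a DataFrame."""
    engine = create_engine(engine_url)
    return pd.read_sql_table(table, engine, schema=schema)

# For the iris table in this walkthrough the call might look like:
# load_table("postgresql://user:password@db-host:5432/demo_db",
#            "iris", schema="public")
```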

     

[Screenshot: data import code with the "use_input_substitution" and "execution_label" settings]

     

There is a useful trick for importing the dataset without typing any code. First, we need to find the required dataset under "Data Sources" (reached via the top-left menu icon) and associate it with our workspace.

     

[Screenshot: associating the dataset with the workspace from "Data Sources"]

     

    After making this association, we can go to the notebook and find the function "Import Dataset into Notebook" under the data menu.

     

[Screenshot: the "Import Dataset into Notebook" function under the data menu]

     

Clicking the "Import" button automatically generates the data import code in the notebook with default settings, so you may need to manually modify parameters such as "use_input_substitution" and "execution_label".

     

[Screenshot: auto-generated data import code in the notebook]

     

Then we can split the dataset into training and testing sets and build a LightGBM classifier. The flower's sepal length, sepal width, petal length, and petal width have strong explanatory power for predicting the iris species, so we can obtain a good classifier with all default settings. The evaluation shows that the testing data are predicted accurately.

     

[Screenshot: training and evaluating the LightGBM classifier]

     

At the end of the notebook, we can define the output and save it as a data table in our database. Here we show an example of saving the testing data together with the predictions generated by our LightGBM model.
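Outside the platform, this save step can be approximated with pandas' `to_sql`; the engine URL, output table name, and prediction column name below are placeholders for illustration.

```python
# Approximating the save step with pandas' `to_sql`. The engine URL,
# table name, and "predicted_species" column are placeholder choices.
import pandas as pd
from sqlalchemy import create_engine

def save_predictions(test_df, predictions, engine_url, table):
    """Append a prediction column to the test data and store it as a table."""
    out = test_df.copy()
    out["predicted_species"] = predictions
    engine = create_engine(engine_url)
    out.to_sql(table, engine, index=False, if_exists="replace")
```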

     

[Screenshot: saving the prediction results to a data table]

     

    So far we have learned how to create a Jupyter Notebook and build a model in Python. In the next step, we will integrate the notebook with our workflow.

    Step 3: Create a workflow in the workspace and add a Python Execute operator

Here we have created a simple workflow that imports the iris dataset, standardizes the column types (this step only changes the data column types; you may omit it in your own project), and passes the data to the Python Execute operator.

[Screenshot: workflow with the Python Execute operator]

     

When editing the Python Execute operator, we need to select the desired notebook as well as the substituted input. Since we set "execution_label" to 1 earlier, here we need to configure the input in the "Substitute Input 1" section.

     

[Screenshot: configuring the "Substitute Input 1" section of the Python Execute operator]

Now we can run the workflow and see the output of the Python Jupyter Notebook. We can also verify that the results have been saved to the new data table we defined.

     

[Screenshot: workflow output and the saved results table]

     

    Part 2: Using R for Building a Decision Tree Model in TIBCO Data Science

    We can also run R code in a workflow by using the R Execute operator.

    Here we have created a simple workflow of importing the iris dataset and then connecting it with the R Execute operator. 

[Screenshot: workflow with the R Execute operator]

    Then we can start writing R code by clicking the "Define Clause" button on the editing page. 

     

[Screenshot: the "Define Clause" button on the editing page]

     

Since we have seen that the iris data works well for classification, here we build a decision tree model, using the full dataset as training data, to better understand variable importance.
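For comparison only, the analogous fit can be sketched in Python with scikit-learn (the workflow itself runs R code inside the R Execute operator; this is not that code).

```python
# A Python analog of the R step: fit a decision tree on the full iris
# data and inspect variable importance. Shown for comparison only; the
# blog's workflow uses R inside the R Execute operator.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Variable importance, analogous to what the R operator reports.
for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```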

     

[Screenshot: R code for building the decision tree model]

     

After running the entire workflow, we can view the detailed modeling results, including variable importance, from the R Execute operator.

     

[Screenshot: modeling results with variable importance]

    Conclusion

    Through these simple examples of using Python and R to build machine learning models on the iris dataset in TIBCO Data Science, we hope you can understand how convenient it is to integrate code in your visual workflows. In this way, you can extend the analytics capabilities of your workflows, and use the combination of highly scalable data prep operators with advanced open-source functions in Python and R.

