Generate extra rows for each month between dates in two columns

Paul Hickford 2 · April 2

Hi, I have a data set that contains various columns, including a START and END date.

I want to expand each date range with a row per month in the date period, so for example a set that starts looking like this:

Start	End	Person	Project
01-Jan-24	01-Apr-24	name1	project1
01-Feb-24	01-Apr-24	name1	project2
01-Jan-24	01-Mar-24	name2	project3
01-Feb-24	01-Apr-24	name2	project4

Would be expanded to more rows like this where there is a row for the 1st of the Month for every month a task is ongoing (in my test data the tasks all end on the 1st of the month but this isn't always the case). In this example the first 3 rows are from the 1st input row, then the next 2 are from the second input row and so on:

Start	End	Person	Project
01-Jan-24	01-Apr-24	name1	project1
01-Feb-24	01-Apr-24	name1	project1
01-Mar-24	01-Apr-24	name1	project1
01-Feb-24	01-Apr-24	name1	project2
01-Mar-24	01-Apr-24	name1	project2
01-Jan-24	01-Mar-24	name2	project3
01-Feb-24	01-Mar-24	name2	project3
01-Feb-24	01-Apr-24	name2	project4
01-Mar-24	01-Apr-24	name2	project4

It's similar to this example I've found in some old documentation https://support.tibco.com/s/article/How-to-create-a-continuous-date-range-given-start-and-end-dates-in-TIBCO-Spotfire-with-Data-Functions but I've not used R within Spotfire (or much at all) so am struggling to implement this.
Is anyone able to point me in the right direction?

Paul Hickford 2 · April 2

As a possibly clearer (or possibly not) indication of what I'm trying to do, the Sheet1 on left is the format of my original data set, the Sheet2 on right is an example I have manually prepared, but would like to automate on a large set. Also an example of the kind of output my users are looking for, tracking various outputs over time from a set of data loaded with start/end dates.

David Boot-Olazabal · April 2

Hi Paul,

Are you able to solve the above using a modified SQL Query?

The article you are referring to not only proposes 2 solutions involving TERR, but also a 'SQL Query' solution. If that is also applicable to your use case, you can start with a query that is explained here.

Kind regards,

David

Paul Hickford 2 · April 2

Yes, I'm looking at doing this in the database using SQL, but the person responsible for the database in question is currently reluctant to deploy the view that I had prepared. The article you linked too may be useful, that's not a function I'm aware of so will investigate. I don't want to spread the values out across the rows, but want to repeat them.

I'd hoped to be able to adapt the examples in the article, but these seem to sum the values whereas I need a defined start and end date.

Edited April 2 by Paul Hickford 2

Paul Hickford 2 · April 2

Hmm, if I try and run the SQL suggested in Spotfire I just get a wall of errors, but running the same query directly in database either seems to run forever in a test set limited to a single user, or gets an ORA-01436 CONNECT BY loop in user data if I remove the dbms_random.string line.

I'm on the verge of giving up and just accepting that Spotfire can't do this even though it feels like it should be trivial, and it seems to be a simple request in a lot of other software. I can make the transformation quite easily in Knime but was hoping to remove the need for any manual data manipulations.

Gaia Paolini · April 2

You can do it with a data function, as you are generating a table that has more rows than the initial.
Here I have done it with Python.
You need to have the Python packages pandas and dateutils available. Probably dateutils needs installing with Menu > Tools > Python tools > Package management.

import pandas as pd
from dateutil.relativedelta import relativedelta

# Expand as many rows as difference in months
def months_delta(a,b):
    r = relativedelta(b, a)
    return (r.years * 12) + r.months

#Add a running month to the start date
def running_month(a,m):
    aa = a+relativedelta(months=m-1)
    return aa.strftime("%d-%b-%Y")

df['TASK_START'] = pd.to_datetime(df['TASK_START']) 
df['TASK_END'] = pd.to_datetime(df['TASK_END']) 

df['__DELTA']= df.apply(lambda x: months_delta(x['TASK_START'], x['TASK_END']), axis=1)
df2=df.reindex(df.index.repeat(df['__DELTA']))

df2['__MONTH_INDEX'] = df2.groupby(['USER','PROJECT']).cumcount()+1

df2['TASK_START']= df2.apply(lambda x: running_month(x['TASK_START'], x['__MONTH_INDEX']), axis=1)
df2['TASK_END'] = df2['TASK_END'].dt.strftime("%d-%b-%Y") 

df2.drop(columns=['__MONTH_INDEX','__DELTA'],inplace=True)

Andre Kooy · April 3

Hi Paul

Have you tried to implement the SQL in an information link, in stead of having it run in the database?

Kind regards, Andre

Andre Kooy · April 3

Hi Paul,

It can also be done just in Spotfire.

You have your original file (data table 1). Add data table 2 that has all months of the year and a dummy column with value "X".

Now create data table 3 that does a pivot on data table 1, that has only the project names as identifiers (so just 1 column). Then add a transformation for a calculated column called Dummy with value "X" and add it to data table 3. This gives you a table with all months for all projects.

Then create data table 4 with the following steps:

1) load data table 1

2) join data table 3, joining on project name.

3) add the following calculated column If([Month]>[End],"Exclude","Include")

Now filter to the included rows only.

I added the dxp I created as an example

Extra rows.dxp

Andre Kooy · April 3

Paul Hickford 2 · April 15

Thanks all for the suggestions and apologies for not responding sooner, I will try and have a proper look at these tomorrow when I'm back in the office.

Paul Hickford 2 · April 17

Many thanks all for the suggestions. I had a look at the two approaches of Python and using the separate tables in the Data Canvas. I got the Python version working, but I the Data Canvas approach seems to have much better performance; I set the various dummy columns and filters to be used in the necessary transformations and then removed so the end users won't need to deal with any extra columns. It will break in a hundred years or so when my list of months runs out, but will deal with that later.

Sign In

Generate extra rows for each month between dates in two columns

Recommended Posts

Paul Hickford 2

Link to comment

Share on other sites

Paul Hickford 2

Link to comment

Share on other sites

David Boot-Olazabal

Link to comment

Share on other sites

Paul Hickford 2

Link to comment

Share on other sites

Paul Hickford 2

Link to comment

Share on other sites

Gaia Paolini

Link to comment

Share on other sites

Andre Kooy

Link to comment

Share on other sites

Andre Kooy

Link to comment

Share on other sites

Andre Kooy

Link to comment

Share on other sites

Paul Hickford 2

Link to comment

Share on other sites

Paul Hickford 2

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Industries