Using RXReplace to Remove Duplicates in String

S H · January 21, 2019

Hello,

I have a data set with a country column that contains several comma separated lists. I want to do a uniqueconcatenate across this column, but because the column can already contain multiple values (i.e. a single record might have "Canada, United States") I am not getting the output I want. My uniqueconcatenation comes out as:

"Canada, Canada, India, Canada, Israel, Canada, United States, India, United States, United States"

I know that an RXReplace could help me here, but I am beyond a beginner at these. Cribbing from the internet, the best I got was:

RXReplace(UniqueConcatenate([COUNTRY]),"([^,]+)(,[ ]*\1)+","\1","")

Which only gets me halfway there with an output of "Canada, India, Canada, Israel, Canada, United States, India, United States, United States".

Any thoughts on how this could be made to work

(And at the same time, any favourite resources for learning regular expressions)

Thank you in advance!

Gaia Paolini · January 22, 2019

Hi

I am not sure I understand the desired result, it would be great if you couldclarify with a sample dataset showing the desired result.

I am interpreting it this way:

- you have data in which each row is a different concatenation of countries. Example:

COUNTRY

A,A,B

B,B,C,A

D,A,D,A

C,B,B,B,B

-you want a new column, that for each row contains the sameunique concatenation of all the countries in all rows.

You could achieve it by creating a calculated column like this: (type all in one row)

TERR_String("output

S H · January 22, 2019

Hello,Yes, you have the right idea. In this case I need this in a calculated value, but the expression would be the same as doing it in a calculated column I believe. The only difference between your sample data and the real data is that for a given record a country will only appear once. So if the data set is:COUNTRY

A, B

A

B, C

B, D, E

D

D, EMy desired output of the calculated value is A, B, C, D, E. Simply doing a uniqueconcatenate doesn't work since it looks at the whole string and not subsets.Ideally I would prefer to use regular expressions if possible, as our company does not have a license that permits TERR to be used in webplayer.Appreciate your help!

Gaia Paolini · January 22, 2019

Hi

I think you googled what I have now googled. From what I see, the problems are 1) the uniqueConcatenate adds a space which is non-avoidable, 2) the proposed solution does not work if the repeated countries are not adjacent.

Here is something that worked for the example, although it is a bit of a pastiche. Unfortunately I do not have the resources to clean it or rationalise it further.

RXReplace(Trim(RXReplace(Concatenate(",",RXReplace(UniqueConcatenate([COUNTRY]),"\s+","","g")),"(.)(=.*\1)"," ","g")),"\s+",",","g)

The inner coreRXReplace(UniqueConcatenate([COUNTRY]),"\s+","","g) does the concatenation, then removes every inner space.

The basic idea is to apply a positive lookahead (.)(=.*\1)

The rest is garnish to replace spurious spaces and commas

Gaia

S H · January 22, 2019

I expect you are correct and we landed on the same reference page. And yes, the issue with that solution is that the concatenated list is not sorted and so the solution only removes adjacent duplicate values.Unfortunately when I entered the below solution, the output I get is:Input: RXReplace(Trim(RXReplace(Concatenate(",",RXReplace(UniqueConcatenate([COUNTRY]),"\s+","","g")),"(.)(=.*\1)"," ","g")),"\s+",",","g")

Output: "r,l,C,I,,Uni,dS,ates"In the above, UniqueConcatenate([COUNTRY]) returns the string "Canada, Canada, India, Canada, Israel, Canada, United States, India, United States, United States".Did the above work for you Could SF versions be at play here (I am running v7.7)Thank you again for your help!

Gaia Paolini · January 22, 2019

hi

It is the sample data I am using, with more than one character per word. Without the right data, it is difficult to provide a solution.

Gaia

S H · January 22, 2019

Apologies if I wasn't clear - that list of countries is the data set in question. Any function that works on that string in place of UniqueConcatenate([COUNTRY]) would solve the general problem.

Gaia Paolini · January 22, 2019

Hi

this might work - it does not add commas but spaces.

RXReplace(RXReplace(UniqueConcatenate([COUNTRY]),","," ",g"),"([A-Za-z]+)\b\s+(=.*\b\1)","","g")

Gaia

S H · January 22, 2019

This may just do the trick. Thank you for all your help with this!

Sign In

Using RXReplace to Remove Duplicates in String

Recommended Posts

S H

Link to comment

Share on other sites

Gaia Paolini

Link to comment

Share on other sites

S H

Link to comment

Share on other sites

Gaia Paolini

Link to comment

Share on other sites

S H

Link to comment

Share on other sites

Gaia Paolini

Link to comment

Share on other sites

S H

Link to comment

Share on other sites

Gaia Paolini

Link to comment

Share on other sites

S H

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Industries