Formulae: Regex - Extracting Subset of String?

shaji_o · October 9, 2012 7:01PM

I've got a field that contains Product_Strategy_Message_Size_Creative Type, including the underscores. I need to extract the Message part, i.e. the string that sits between the 2nd "_" and 3rd "_". Have you got a formula that would do that for me?

tjbate · October 10, 2012 7:29AM

Shaji - Omniscope text manipulation options go well beyond spreadsheet functions. Some Omniscope text string manipulation functions expose Regular Expressions (RegEx), an advanced string processing standard. The good news is that RegEx is immensely flexible and powerful, enabling you to process and re-write complex text string fields in Omniscope formulae fields to clarify patterns enabling better filtering and visualisation. The bad news is that most people who are not developers do not have a lot of fluency writing RegEx, but we can help:

The formula that will do what you want looks like this:

REPLACEREGEX([Text Field], ".*?_.*?_(.*?)_.*", "$1",true)

The comments below document this formula and expression:

The Omniscope REPLACEREGEX function says to replace string values in a given field (first parameter of the function) by looking for a specific pattern (known as regular expression), if it is not found then we mark the record in the column as empty (see last parameter of the function). The REPLACEREGEX (as opposed to SEARCHREGEX) function is used when you want to re-write the input text string as a formula result.

The 4 REPLACEREGEX function parameters do the following:

[Text Field]: Specifies which Text field you want to search.
".*?_.*?_(.*?)_.*": The string inside the double quotes is known as a regular expression. This defines the text string pattern we are looking for in a given cell. In this case, the specified pattern translates as follows:
1. .*?: Has any number of characters (0 or more, excluding new lines, but may include spaces).
2. _: Followed by an underscore.
3. .*?: Followed by any characters (0 or more, excluding new lines, but may include spaces).
4. _:Followed by an underscore.
_: Followed by an underscore.
.*?: Followed by any number of characters (0 or more, excluding new lines, but may include spaces).
"$1": The text we want to use as a formula result to replace all the values in the Text Field. In this case, we use what is called a back-reference ($1). The back-reference in this case is the reference to all the text followed by the second underscore all the way up to the third underscore.
true: The true value means that if the exact specified pattern is not found in a row in the Text field, then simply show the cell for that record as null. If false is used, then the original text will repeated as formulae result.

tjbate · October 10, 2012 9:31AM

This is a general solution to this type of text string processing problem, and one that can be repeatedly applied. In the case of the perfectly-consistent pattern in your text string data (where each value is separated/delimited by an underscore) there are also simpler options.

For example, you could add an Expand Columns operations block to your data flow, and specify "_" as the separator to divide the values concatenated in [Product_Strategy_Message_Size_Creative Type] by. This will create 4 new columns, one of which will isolate the [Message] value you want, the others can be deleted.

Welcome!

Categories

Ideas Parade

Tagged