Visokio website     Downloads     Video tutorials     KnowledgeBase  
Formulae: Regex - Extracting Subset of String? - Visokio Forums
Formulae: Regex - Extracting Subset of String?
  •     shaji_o October 9, 2012 7:01PM
    I've got a field that contains Product_Strategy_Message_Size_Creative Type, including the underscores. I need to extract the Message part, i.e. the string that sits between the 2nd "_" and 3rd "_". Have you got a formula that would do that for me?
  • 2 Comments
  •     tjbate October 10, 2012 7:29AM
    Shaji - Omniscope text manipulation options go well beyond spreadsheet functions. Some Omniscope text string manipulation functions expose Regular Expressions (RegEx), an advanced string processing standard. The good news is that RegEx is immensely flexible and powerful, enabling you to process and re-write complex text string fields in Omniscope formulae fields to clarify patterns enabling better filtering and visualisation. The bad news is that most people who are not developers do not have a lot of fluency writing RegEx, but we can help:

    The formula that will do what you want looks like this:

    REPLACEREGEX([Text Field], ".*?_.*?_(.*?)_.*", "$1",true)

    The comments below document this formula and expression:

    The Omniscope REPLACEREGEX function says to replace string values in a given field (first parameter of the function) by looking for a specific pattern (known as regular expression), if it is not found then we mark the record in the column as empty (see last parameter of the function). The REPLACEREGEX (as opposed to SEARCHREGEX) function is used when you want to re-write the input text string as a formula result.

    The 4 REPLACEREGEX function parameters do the following:
    1. [Text Field]: Specifies which Text field you want to search.
    2. ".*?_.*?_(.*?)_.*": The string inside the double quotes is known as a regular expression. This defines the text string pattern we are looking for in a given cell. In this case, the specified pattern translates as follows:
      1. .*?: Has any number of characters (0 or more, excluding new lines, but may include spaces).
      2. _: Followed by an underscore.
      3. .*?: Followed by any characters (0 or more, excluding new lines, but may include spaces).
      4. _:Followed by an underscore.
      5. (.*?): Is followed by any characters (0 or more, excluding new lines, but may include spaces). Note this part of the string is surrounded in parentheses so that we can refer to all the text found within this section, in other words group all the text and refer to it using an identifier. Surrounding certain pattern in parentheses in Regular expressions allows us to use what is called a back-reference ($) to refer to a specified pattern of characters using the dollar sign followed by the unique identifier.In this case, we only have one set of parentheses. If we had more than one, the identifier for each (pattern) is increased by 1... $2, $3 etc.
    3. _: Followed by an underscore.
    4. .*?: Followed by any number of characters (0 or more, excluding new lines, but may include spaces).
    5. "$1": The text we want to use as a formula result to replace all the values in the Text Field. In this case, we use what is called a back-reference ($1). The back-reference in this case is the reference to all the text followed by the second underscore all the way up to the third underscore.
    6. true: The true value means that if the exact specified pattern is not found in a row in the Text field, then simply show the cell for that record as null. If false is used, then the original text will repeated as formulae result.

  •     tjbate October 10, 2012 9:31AM
    This is a general solution to this type of text string processing problem, and one that can be repeatedly applied. In the case of the perfectly-consistent pattern in your text string data (where each value is separated/delimited by an underscore) there are also simpler options.

    For example, you could add an Expand Columns operations block to your data flow, and specify "_" as the separator to divide the values concatenated in [Product_Strategy_Message_Size_Creative Type] by. This will create 4 new columns, one of which will isolate the [Message] value you want, the others can be deleted.

Welcome!

It looks like you're new here. If you want to get involved, click one of these buttons!

Sign In Apply for Membership