Split column of text into column of lists in Pandas dataframe with no unambiguous split sequence

  pandas, python, regex

I have a dataframe that contains a column of text. Each entry gives a numeric code followed by a colon and a text description. An entry may include one or many code descriptors, each separated by a comma and a space.

import pandas as pd
myDF = pd.DataFrame({'origtext':['012: some text','012: some text, 123: other text','012: some text, 234: text, strings and numbers']})

The dataframe looks like:

                                         origtext
0                                  012: some text
1                 012: some text, 123: other text
2  012: some text, 234: text, strings and numbers

I need to convert the text in the ‘origtext’ column to lists where each element of the list consists of the numeric code, colon and text descriptor.

My first approach was to use .split() to split the text at ', ' such as:

myDF['textlist'] = myDF['origtext'].str.split(', ')

to produce…

                                           textlist  
0                                  [012: some text]  
1                 [012: some text, 123: other text]  
2  [012: some text, 234: text, strings and numbers]  

In my real-world dataframe, that worked well for the majority of rows, but there were a few cases where the text description itself contained ', '. In the example above, this means the bottom list contains 3 elements (rather than 2) and the final element does not begin with 'nnn: '. This makes a plain .split(', ') unsuitable.
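To make the problem concrete, here is a small sketch that counts the elements produced by the plain split on the example data. The names `naive` and `counts` are only illustrative.

```python
import pandas as pd

myDF = pd.DataFrame({'origtext': ['012: some text',
                                  '012: some text, 123: other text',
                                  '012: some text, 234: text, strings and numbers']})

# Split on every ', ' regardless of what follows it
naive = myDF['origtext'].str.split(', ')

# Number of list elements per row; the last row should have 2 but gets 3
counts = naive.str.len()
print(counts.tolist())
print(naive.iloc[2])
```

The third row is broken into '234: text' and 'strings and numbers', which is the unwanted behaviour.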

Is there a way to use a matched group in a regular expression to identify something like ', 123:' and replace it with 'xxxxx123:' and then split based on 'xxxxx'? I’ve been able to replace the matched group with a string but I haven’t been able to work out how to add some text to the matched group whilst keeping the matched text intact.
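One way this marker idea might be sketched is with a capture group and a backreference in the replacement string, so the matched code is kept while the ', ' in front of it is swapped for the marker. This assumes every code is a run of digits followed by ': ', and that neither 'xxxxx' nor a ', digits: ' sequence ever occurs inside a description; the name `marked` is illustrative.

```python
import pandas as pd

myDF = pd.DataFrame({'origtext': ['012: some text',
                                  '012: some text, 123: other text',
                                  '012: some text, 234: text, strings and numbers']})

# Capture the code and colon in group 1, then re-insert it with \1
# so only the preceding ', ' is replaced by the marker 'xxxxx'
marked = myDF['origtext'].str.replace(r', (\d+: )', r'xxxxx\1', regex=True)

# Split on the marker, which (by assumption) never appears in the real text
myDF['textlist'] = marked.str.split('xxxxx')

print(myDF['textlist'].tolist())
```

The `\1` in the replacement refers to whatever the first parenthesised group matched, which is what keeps the matched text intact.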

Or is there another method to achieve the desired outcome?
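A possible alternative, sketched here under the same assumption (each code is digits followed by ': ', and that pattern never appears inside a description), is to split directly on a regular expression with a lookahead, so no marker string is needed. This relies on pandas treating a multi-character split pattern as a regular expression by default.

```python
import pandas as pd

myDF = pd.DataFrame({'origtext': ['012: some text',
                                  '012: some text, 123: other text',
                                  '012: some text, 234: text, strings and numbers']})

# Split on ', ' only when it is immediately followed by digits and ': '.
# The lookahead (?=...) checks for the code without consuming it,
# so the code stays at the start of the next list element.
myDF['textlist'] = myDF['origtext'].str.split(r', (?=\d+: )')

print(myDF['textlist'].tolist())
```

If the codes are always exactly three digits, `\d{3}` in place of `\d+` would make the pattern stricter.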

Source: Python Questions
