Merge if one column's contents are the same as, or the longest substring of, a column in another dataframe in Python

  dataframe, pandas, python-3.x, string

Given two dataframes df1 and df2 as follows:

df1:

   id                                             address         full_name
0   1                   PSC 1263, Box 4382\nAPO AA 25610     Gregory Smith
1   2  9898 Daniel Trafficway\nLake Kellyhaven, WA 12826  Arthur Patterson
2   3                         USNS Jackson\nFPO AA 35094  Stephanie Castro
3   4               2860 Cook Rapid\nEast John, MS 58005      Tiffany Long
4   5         7672 Padilla Road\nSouth Jessica, HI 23553       Amanda Todd

df2:

  first_name department  salary
0    Tiffany   strategy    2000
1     Castro  marketing    1200
2  Stephanie    finance    1010
3    Stephan         IT    1500

I would like to merge df2 into df1 when the string in messy_name is identical to full_name or is the longest matching substring of it. For example, Stephanie and Stephan are both substrings of Stephanie Castro in df1, but Stephanie is the longer match, so Stephan is ignored. If df2 contained only Stephan and no Stephanie, then Stephan would match Stephanie Castro instead.

     messy_name department  salary
0  Tiffany Long   strategy    2000
1        Arthur  marketing    1200
2     Stephanie    finance    1010
3       Stephan         IT    1500
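
The matching rule described above can be sketched in plain Python for a single full_name. This is only an illustration under assumptions: the helper name longest_matching_name is hypothetical, and the candidate list is copied from the messy_name column shown above.

```python
def longest_matching_name(full_name, candidates):
    """Return the longest candidate that is a substring of full_name, or None."""
    # Keep only the candidates that occur somewhere inside full_name.
    matches = [c for c in candidates if c in full_name]
    # Among the matches, the longest one wins; no match gives None.
    return max(matches, key=len) if matches else None


# Candidates copied from the messy_name column above.
candidates = ["Tiffany Long", "Arthur", "Stephanie", "Stephan"]

best = longest_matching_name("Stephanie Castro", candidates)
fallback = longest_matching_name("Stephanie Castro", ["Stephan"])
missing = longest_matching_name("Gregory Smith", candidates)
```

Here best is "Stephanie" because it is longer than "Stephan", fallback is "Stephan" because no longer candidate exists, and missing is None.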

The expected result:

[image: expected result table, not preserved]

Example code for reference:

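import re
# Build one alternation pattern from the (escaped) names in df2, then extract the first match from full_name
pat = "|".join([re.escape(x) for x in df2.messy_name])
df1.insert(0, 'messy_name', df1['full_name'].str.extract("(" + pat + ')', expand=False))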

How can I obtain the expected result in pandas? Thanks in advance for your help.
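
One possible approach, offered as a sketch rather than a definitive answer: sort the names in df2 longest-first before building the regex, so that the alternation tries Stephanie before Stephan, then left-merge on the extracted name. The frames below are rebuilt from the question's data (address column omitted), and the left join onto df1 is an assumption, since the expected-result image is not available. One caveat: a regex returns the leftmost match, so longest-first ordering only picks the longest name among those starting at that leftmost position.

```python
import re

import pandas as pd

# Data rebuilt from the question (address column left out for brevity).
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "full_name": ["Gregory Smith", "Arthur Patterson", "Stephanie Castro",
                  "Tiffany Long", "Amanda Todd"],
})
df2 = pd.DataFrame({
    "messy_name": ["Tiffany Long", "Arthur", "Stephanie", "Stephan"],
    "department": ["strategy", "marketing", "finance", "IT"],
    "salary": [2000, 1200, 1010, 1500],
})

# Regex alternation takes the first alternative that matches at a position,
# so put longer names first to prefer "Stephanie" over "Stephan".
names = sorted(df2["messy_name"], key=len, reverse=True)
pat = "(" + "|".join(re.escape(n) for n in names) + ")"

# Extract the matched name for each full_name (NaN where nothing matches).
matched = df1.assign(messy_name=df1["full_name"].str.extract(pat, expand=False))

# Assumption: keep every row of df1 and attach df2 columns where a name matched.
result = matched.merge(df2, on="messy_name", how="left")
print(result)
```

Rows of df1 with no matching name keep NaN in department and salary. If names can start at different positions inside full_name, applying a plain substring check per row and taking the longest hit would be the safer option.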

Source: Python-3x Questions
