How to extract dataframe rows whose values match across other columns, by condition?

  pandas, python

I have a dataframe as follows:

import pandas as pd

# Values
a=["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2", "003C", "003P2", "003P1", "003C", "003P1", "003P2", "003C", "003P1", "004C", "004P2", "001C", "001P1"]
b=["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11", "chr11", "chr11", "chr11", "chr1", "chr1", "chr1", "chr1", "chr1", "chr11", "chr11", "chr9", "chr9"]
c=[48399,145653,244695,244695,1163940,1163940,1163940,5986513,5986513,5986513,248650751,248650751,248650751,125895,125895,2587895,2587895,14587952,14587952]
d=["C", "G", "C", "C", "C", "C", "C", "G", "G", "G", "T", "T", "T", "T", "T", "C", "C", "T", "T"]
e=["A", "T", "A", "A", "G", "G", "G", "A", "A", "A", "A", "A", "A", "A", "A", "G", "G", "C", "C"]
#Make dataframe
df = pd.DataFrame({'Sample':a, 'CHROM':b, 'POS':c, 'REF':d, 'ALT':e})

df

    Sample  CHROM   POS         REF  ALT
0   003C    chr18   48399       C    A
1   003P1   chr20   145653      G    T
2   003P1   chr8    244695      C    A
3   003P1   chr8    244695      C    A
4   004C    chr11   1163940     C    G
5   004P1   chr11   1163940     C    G
6   004P2   chr11   1163940     C    G
7   003C    chr11   5986513     G    A
8   003P2   chr11   5986513     G    A
9   003P1   chr11   5986513     G    A
10  003C    chr1    248650751   T    A
11  003P1   chr1    248650751   T    A
12  003P2   chr1    248650751   T    A
13  003C    chr1    125895      T    A
14  003P1   chr1    125895      T    A
15  004C    chr11   2587895     C    G
16  004P2   chr11   2587895     C    G
17  001C    chr9    14587952    T    C
18  001P1   chr9    14587952    T    C

I want to extract the rows where a C sample in df['Sample'] shares the same 'CHROM', 'POS', 'REF' and 'ALT' values with its P1, its P2, or both P1 and P2.
For example, 003C has a corresponding 003P1 and/or 003P2 with all matching 'CHROM', 'POS', 'REF' and 'ALT' values at index 7, 8, 9, at index 10, 11, 12, and at index 13, 14. I want to extract all such rows.

The expected output is:

    Sample  CHROM   POS       REF   ALT
0   003C    chr1    125895     T    A
1   003P1   chr1    125895     T    A
2   004C    chr11   1163940    C    G
3   004P1   chr11   1163940    C    G
4   004P2   chr11   1163940    C    G
5   004C    chr11   2587895    C    G
6   004P2   chr11   2587895    C    G
7   003C    chr11   5986513    G    A
8   003P2   chr11   5986513    G    A
9   003P1   chr11   5986513    G    A
10  001C    chr9    14587952   T    C
11  001P1   chr9    14587952   T    C
12  003C    chr1    248650751  T    A
13  003P1   chr1    248650751  T    A
14  003P2   chr1    248650751  T    A

I tried following code:

df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
df = df[df.groupby(['CHROM', 'POS', 'REF', 'ALT', 'INT'])['POS'].transform('size').eq(3)]

But it only pulls the rows common to all three (C, P1 and P2), not the rows where C matches just P1 or just P2.
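
One possible approach, as a rough sketch rather than a definitive answer: instead of requiring a group size of 3, keep every group that contains a C row and at least one P row. This assumes the control suffix is exactly "C" and the paired suffixes start with "P"; the helper columns is_c and is_p are names made up for illustration. The final sort by 'POS' is only there to mirror the ordering of the expected output.

```python
import pandas as pd

a = ["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2", "003C", "003P2", "003P1",
     "003C", "003P1", "003P2", "003C", "003P1", "004C", "004P2", "001C", "001P1"]
b = ["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11", "chr11", "chr11", "chr11",
     "chr1", "chr1", "chr1", "chr1", "chr1", "chr11", "chr11", "chr9", "chr9"]
c = [48399, 145653, 244695, 244695, 1163940, 1163940, 1163940, 5986513, 5986513, 5986513,
     248650751, 248650751, 248650751, 125895, 125895, 2587895, 2587895, 14587952, 14587952]
d = ["C", "G", "C", "C", "C", "C", "C", "G", "G", "G", "T", "T", "T", "T", "T", "C", "C", "T", "T"]
e = ["A", "T", "A", "A", "G", "G", "G", "A", "A", "A", "A", "A", "A", "A", "A", "G", "G", "C", "C"]
df = pd.DataFrame({'Sample': a, 'CHROM': b, 'POS': c, 'REF': d, 'ALT': e})

# Split Sample into the numeric ID (INT) and the suffix (STR), e.g. "003" and "P1"
tmp = df.copy()
tmp[['INT', 'STR']] = tmp['Sample'].str.extract(r'(\d+)(.*)')

# Hypothetical helper flags: is the row a C sample, or a P sample (assumed to start with "P")?
tmp['is_c'] = tmp['STR'].eq('C')
tmp['is_p'] = tmp['STR'].str.startswith('P')

# A group is one variant (CHROM, POS, REF, ALT) within one numeric ID
grouped = tmp.groupby(['CHROM', 'POS', 'REF', 'ALT', 'INT'])

# Keep rows whose group has a C row AND at least one P row
mask = grouped['is_c'].transform('any') & grouped['is_p'].transform('any')

# Stable sort keeps the original row order within each position
result = df[mask].sort_values('POS', kind='stable').reset_index(drop=True)
print(result)
```

With the sample data this drops index 0 and 1 (no partner) and index 2, 3 (two P1 rows but no C), and keeps the remaining 15 rows.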

Any help is appreciated. Thanks.

Source: Python Questions
