How to separate strings from a column in pandas?

[Question title]: How to separate strings from a column in pandas?
[Posted]: 2022-12-28 01:08:42
[Question description]:

I have 2 columns:

A	B
1	ABCSD
2	SSNFs
3 CVY KIP	NaN
4 MSSSQ	NaN
5	ABCSD
6 MMS LLS	NaN
7	QQLL

This is only an example; the actual file contains 1000+ rows with cases like these. I want to move all the letters out of column A and into column B. Expected output:

A	B
1	ABCSD
2	SSNFs
3	CVY KIP
4	MSSSQ
5	ABCSD
6	MMS LLS
7	QQLL

So far I have tried the following, which works, but I am looking for a better way:


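import numpy as np

# Everything after the first space-separated token of A, as a list
df['B2'] = df['A'].str.split(' ').str[1:]

def try_join(l):
    # Join the list items back into one string; non-list values (NaN) stay NaN
    try:
        return ' '.join(map(str, l))
    except TypeError:
        return np.nan
df['B2'] = [try_join(l) for l in df['B2']]

# Empty strings (rows with no text in A) become NaN
df = df.replace('', np.nan)
append=df['B2']
# Keep existing B values and fill the gaps from B2
df['B']=df['B'].combine_first(append)
# Keep only the leading token in A
df['A']=[str(x).split(' ')[0] for x in df['A']]
df.drop(['B2'],axis=1,inplace=True)
df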

[Comments]:

What have you tried so far?
Edited; you can see my approach now.

Tags: python pandas string

[Solution 1]:

You could try the following.

Either use str.extractall with two named capture groups (generic: (?P<name>...)) called A and B. The first captures the digit(s) at the start, the second captures the rest of the string. (You can easily adjust these patterns if your actual strings are less straightforward.) Finally, drop the added index level (1) with df.droplevel.
Or use str.split with n=1 and expand=True, and rename the resulting columns (0 and 1) to A and B.
Either option can be placed inside df.update with overwrite=True to get the desired outcome.

import pandas as pd
import numpy as np

data = {'A': {0: '1', 1: '2', 2: '3 CVY KIP', 3: '4 MSSSQ', 
              4: '5', 5: '6 MMS LLS', 6: '7'}, 
        'B': {0: 'ABCSD', 1: 'SSNFs', 2: np.nan, 3: np.nan, 
              4: 'ABCSD', 5: np.nan, 6: 'QQLL'}
        }

df = pd.DataFrame(data)

df.update(df.A.str.extractall(r'(?P<A>^\d+)\s(?P<B>.*)').droplevel(1), 
          overwrite=True)

# or in this case probably easier:
# df.update(df.A.str.split(pat=' ', n=1, expand=True)
#          .rename(columns={0:'A',1:'B'}),overwrite=True)

df['A'] = df.A.astype(int)

print(df)

   A        B
0  1    ABCSD
1  2    SSNFs
2  3  CVY KIP
3  4    MSSSQ
4  5    ABCSD
5  6  MMS LLS
6  7     QQLL
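
Why rows such as 1, 2 and 7 keep their original B value may not be obvious: DataFrame.update copies only the non-missing cells of the frame it is given, so a missing value on the right-hand side never overwrites anything. A minimal sketch of the split variant on a shortened copy of the data (the variable name parts is illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's data (assumption: A holds strings).
df = pd.DataFrame({
    "A": ["1", "2", "3 CVY KIP", "4 MSSSQ"],
    "B": ["ABCSD", "SSNFs", np.nan, np.nan],
})

# Split on the first space only; rows without a space get a
# missing value in the second column.
parts = (
    df["A"].str.split(" ", n=1, expand=True)
    .rename(columns={0: "A", 1: "B"})
)

# update() works in place and skips the missing cells of `parts`,
# so the existing B values in rows 0 and 1 are left untouched.
df.update(parts)

print(df)
```

Note that update modifies df in place and returns None, so its result should not be assigned back to df.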


[Solution 2]:

You can split on ' ' as it seems that the numeric value is always at the beginning and the text is after a space.

split = df.A.str.split(' ', n=1)
df.loc[df.B.isnull(), 'B'] = split.str[1]
df.loc[:, 'A'] = split.str[0]
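
To see what the intermediate split Series holds, here is a small self-contained sketch of the same idea on a shortened copy of the data. It swaps the boolean-mask assignment for fillna, which should be equivalent here, and the final integer conversion is an addition, not part of the answer:

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's data (assumption: A holds strings).
df = pd.DataFrame({
    "A": ["1", "3 CVY KIP", "6 MMS LLS", "7"],
    "B": ["ABCSD", np.nan, np.nan, "QQLL"],
})

# Without expand=True each element becomes a list,
# e.g. ["1"] or ["3", "CVY KIP"].
split = df["A"].str.split(" ", n=1)

# .str[1] is NaN for one-element lists, so only the rows where B is
# missing and A carried extra text are filled.
df["B"] = df["B"].fillna(split.str[1])

# The first list element is always the number.
df["A"] = split.str[0].astype(int)

print(df)
```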


[Solution 3]:

You could use str.split() if the number always appears first:

df['A'].str.split(n=1,expand=True).set_axis(df.columns,axis=1).combine_first(df)

Alternatively, use str.extract with two named groups:

df['A'].str.extract(r'(?P<A>\d+) (?P<B>[A-Za-z ]+)').combine_first(df)

Output:

   A        B
0  1    ABCSD
1  2    SSNFs
2  3  CVY KIP
3  4    MSSSQ
4  5    ABCSD
5  6  MMS LLS
6  7     QQLL
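
Both one-liners return a new DataFrame rather than changing df, so the result has to be assigned to something. A minimal sketch of the str.extract variant on a shortened copy of the data, with the intermediate frame kept so the fallback is visible (the names extracted and result are illustrative, and the integer conversion is an addition):

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's data (assumption: A holds strings).
df = pd.DataFrame({
    "A": ["1", "3 CVY KIP", "7"],
    "B": ["ABCSD", np.nan, "QQLL"],
})

# Rows that do not look like "<digits> <letters>" produce NaN
# in both named groups.
extracted = df["A"].str.extract(r"(?P<A>\d+) (?P<B>[A-Za-z ]+)")

# combine_first keeps the extracted values and falls back to the
# original df wherever the extraction is NaN.
result = extracted.combine_first(df)
result["A"] = result["A"].astype(int)

print(result)
```

The character class [A-Za-z ]+ only matches letters and spaces, so text containing digits or punctuation would need a broader pattern.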
