【Question title】:PANDAS: find an exact word and the word before it in a string column, and add the result as a new column in Python (pandas)
【Posted】:2019-08-08 22:09:42
【Question description】:

Find a target word and the word before it in col_a, and put the matched strings into new columns col_b_PY and col_c_LG.

I tried the code below to achieve this, but I could not get the expected output. Any help is appreciated.
Here is my attempt with regular expressions:

df['col_b_PY'] = df.col_a.str.contains(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY")

df.col_a.str.extract(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY",expand=True)

The DataFrame looks like this:

col_a

Python PY is a general-purpose language LG

Programming language LG in Python PY 

Its easier LG to understand  PY

The syntax of the language LG is clean PY 

Expected output:

col_a                                       col_b_PY       col_c_LG
Python PY is a general-purpose language LG  Python PY      language LG
Programming language LG in Python PY        Python PY      language LG
Its easier LG to understand  PY             understand PY  easier LG
The syntax of the language LG is clean PY   clean PY       language LG

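【Comments】:

  • Maybe: df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b") and df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")
  • Thank you very much, @Wiktor Stribizew! I had spent a lot of time trying to figure out the answer.
  • I added an answer with an explanation. Note that extract needs a capturing group to actually extract a string; it only extracts the captured substring.
  • When col_a is "Python PY is a general purpose PY language LG", PY occurs twice. I need to capture both "Python PY" and "purpose PY", but the regex pattern captures only the first occurrence. Desired output: Python PY purpose PY
  • OK, that is easy to fix with extractall; see my updated answer.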

Tags: regex python-3.x pandas


【Solution 1】:

You can use

df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")

Alternatively, extract all matches and join them with a space:

df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
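What "all matches joined with a space" means for a single cell can be checked with Python's standard `re` module. This is only a sketch on one string taken from the thread's comments, not the pandas call itself; the variable names are illustrative.

```python
import re

# The row from the comments in which PY occurs twice.
text = "Python PY is a general purpose PY language LG"

# With exactly one capturing group, re.findall returns the captured
# substring of every non-overlapping match, scanning left to right.
matches = re.findall(r"([a-zA-Z'-]+\s+PY)\b", text)

# Join the matches with a single space, as the extractall variant does per row.
joined = " ".join(matches)

print(matches)
print(joined)
```

Both occurrences are collected here, whereas a single search stops at the first one.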

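Note that you need a capturing group in the regex pattern so that extract can actually extract the text. From the documentation:

Extract capture groups in the regex pat as columns in a DataFrame.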
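The difference between the question's pattern and this answer's pattern can be seen with the standard `re` module, whose syntax these pandas patterns follow. The sketch below is illustrative (the variable names are not from the original post): it counts the capturing groups in each pattern and reads the captured text from one sample row. When a pattern has no capturing group, `str.extract` raises a `ValueError` instead of returning anything.

```python
import re

# Pattern from the question: (?:...) is a non-capturing group,
# so the pattern defines nothing that could be extracted.
question_pattern = re.compile(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY")

# Pattern from this answer: (...) is a capturing group.
answer_pattern = re.compile(r"([a-zA-Z'-]+\s+PY)\b")

question_groups = question_pattern.groups  # number of capturing groups
answer_groups = answer_pattern.groups

# group(1) is the text matched by the first capturing group.
match = answer_pattern.search("Programming language LG in Python PY")
captured = match.group(1)

print(question_groups, answer_groups, captured)
```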

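Note that the trailing \b word boundary is required so that PY / LG is matched as a whole word and not as the beginning of a longer word.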
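A small check of what the word boundary changes, using the standard `re` module. The input string is a made-up example, not data from the question.

```python
import re

# Hypothetical input in which "PY" is only the start of a longer word.
text = "I like PYTHON"

# Without \b the pattern accepts the "PY" at the start of "PYTHON".
without_boundary = re.search(r"([a-zA-Z'-]+\s+PY)", text)

# With \b a word boundary must follow "PY"; "T" is a word character,
# so there is no boundary and nothing matches.
with_boundary = re.search(r"([a-zA-Z'-]+\s+PY)\b", text)

print(without_boundary.group(1))
print(with_boundary)
```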

Also, if you want the match to start only with a letter, you can change the patterns to

r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^

[a-zA-Z] matches a single letter, and [a-zA-Z'-]* matches 0 or more letters, apostrophes, or hyphens.
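To see where the two variants differ, the sketch below runs both on a made-up string in which the token before PY consists only of hyphens. The strings and variable names are illustrative assumptions, not data from the question.

```python
import re

loose = re.compile(r"([a-zA-Z'-]+\s+PY)\b")            # any mix of letters, ' and -
strict = re.compile(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b")   # must begin with a letter

# Hypothetical input: the "word" before PY is just two hyphens.
text = "see -- PY"

loose_match = loose.search(text)
strict_match = strict.search(text)

# A normal word with an apostrophe is still accepted by the strict pattern.
apostrophe_match = strict.search("it's PY")

print(loose_match.group(1))
print(strict_match)
print(apostrophe_match.group(1))
```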

Python 3.7 and Pandas 0.24.2:

import pandas as pd

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)

df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
             'Programming language LG in Python PY',
             'Its easier LG to understand  PY',
             'The syntax of the language LG is clean PY',
             'Python PY is a general purpose PY language LG']
    })
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)

Output:

                                           col_a              col_b_PY     col_c_LG
0     Python PY is a general-purpose language LG             Python PY  language LG
1           Programming language LG in Python PY             Python PY  language LG
2                Its easier LG to understand  PY        understand  PY    easier LG
3      The syntax of the language LG is clean PY              clean PY  language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG
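A possible alternative to the extractall / unstack / apply chain is `Series.str.findall` followed by `Series.str.join`. This is a sketch of a different technique, not the method used in this answer. One difference is worth knowing: a row with no match ends up as an empty string here, whereas the extractall chain leaves such a row as NaN because it is absent from the extractall result. The third row below is a made-up example added to show that case.

```python
import pandas as pd

df = pd.DataFrame({
    "col_a": [
        "Python PY is a general purpose PY language LG",
        "Its easier LG to understand  PY",
        "no markers here",  # hypothetical row without PY or LG
    ]
})

PY_PATTERN = r"([a-zA-Z'-]+\s+PY)\b"
LG_PATTERN = r"([a-zA-Z'-]+\s+LG)\b"

# findall returns a list of captured substrings per row;
# str.join then concatenates each list with a single space.
df["col_b_PY"] = df["col_a"].str.findall(PY_PATTERN).str.join(" ")
df["col_c_LG"] = df["col_a"].str.findall(LG_PATTERN).str.join(" ")

print(df)
```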

【Comments】:

  • Change the single quotes to double quotes?
【Solution 2】:

Try this:

df['col_c_LG'],df['col_c_PY']=df['col_a'].str.extract(r"(\w+\s+LG)"),df['col_a'].str.extract(r"(\w+\s+PY)")
df
Out[474]: 
                                        col_a       ...              col_c_PY
0  Python PY is a general-purpose language LG       ...             Python PY
1       Programming language LG in Python PY        ...             Python PY
2             Its easier LG to understand  PY       ...        understand  PY
3   The syntax of the language LG is clean PY       ...              clean PY
[4 rows x 3 columns]
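One behavioural difference between `\w+` and the `[a-zA-Z'-]+` class from Solution 1 is easy to miss: `\w` matches letters, digits, and underscores, but not hyphens or apostrophes. The sketch below uses the standard `re` module on a made-up string in which a hyphenated word directly precedes PY.

```python
import re

# Hypothetical input: a hyphenated word directly before PY.
text = "a general-purpose PY"

# \w does not match "-", so only the part after the hyphen is captured.
word_class = re.search(r"(\w+\s+PY)", text)

# The explicit class includes "-" and "'", so the whole word is captured.
explicit_class = re.search(r"([a-zA-Z'-]+\s+PY)\b", text)

print(word_class.group(1))
print(explicit_class.group(1))
```

On the four sample rows in the question both patterns give the same result, because no hyphenated word sits directly before PY or LG there.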

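【Comments】:

  • Thank you very much, @Wen-Ben! You came up with a new solution and a complete answer.
  • When col_a is "Python PY is a general purpose PY language LG", PY occurs twice. I need to capture both "Python PY" and "purpose PY", but the regex pattern captures only the first occurrence. Desired output: Python PY purpose PY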