Remove verb abbreviation and punctuation from string edges with python [closed]: answers

【Question title】: Remove verb abbreviation and punctuation from string edges with python [closed]
【Posted】: 2018-08-31 03:18:18
【Question description】:

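How can I remove noise from the edges of a word (or a sequence of words)? By noise I mean: 's, 're, ., ?, ,, ; and so on. In other words, punctuation and abbreviations. It should only be removed from the left and right edges; noise inside a word should be kept.

Examples: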

Apple.            Apple
Donald Trump's    Trump
They're           They
I'm               I
¿Hablas espanol?  Hablas espanol
$12               12
H4ck3r            H4ck3r
What's up         What's up

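So basically, remove apostrophes, verb abbreviations, and punctuation, but only at the string edges (right/left). It seems strip does not work with whole-substring matches, and I could not find a suitable re method that applies only to the edges.

【Comments on the question】:

You need to define the problem fully. What principle or class makes $ punctuation? How would you handle the contraction of "I would have": I'd've?

Tags: python regex nltk text-processing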

【Solution 1】:

How about

import re

strings = ['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"]

rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in strings for m in [rx.search(string)] if m]
print(filtered)

which yields

['Apple', 'Trump', 'They', 'I', 'Hablas', '12', 'H4ck3r']

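Rather than eating characters from the left or the right, it simply takes the first match of word characters (i.e. [a-zA-Z0-9_], plus, for str patterns in Python 3, Unicode letters and digits).

To apply it "in the wild", you can split the sentence first, like so: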
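As a small check of what \w covers, here is a minimal sketch assuming Python 3 and str (not bytes) patterns. The sample string is made up for illustration. By default \w also matches non-ASCII letters, while the re.ASCII flag restricts it to [a-zA-Z0-9_].

```python
import re

word = "¿Español?"  # illustrative input, not from the question

# Default (Unicode) matching for str patterns: ñ counts as a word character.
unicode_match = re.search(r'\b\w+\b', word).group(0)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the match stops before ñ.
ascii_match = re.search(r'\w+', word, flags=re.ASCII).group(0)

print(unicode_match)  # Español
print(ascii_match)    # Espa
```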

sentence = "Apple. Trump's They're I'm ¿Hablas $12 H4ck3r"

rx = re.compile(r'\b\w+\b')
filtered = [m.group(0) for string in sentence.split() for m in [rx.search(string)] if m]
print(filtered)

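This obviously yields the same list as above.

【Comments】:

Wouldn't [r.findall(s)[0] for s in strings] using just \w+ do the same thing, only a bit shorter?
@ctwheels: It may be shorter to write, but with re.findall() you first look for all matches, then use the first one and throw the rest away. re.search() does not look for all matches in the first place. I have not timed it though, it would be interesting.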
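Taking the first word works for single tokens, but it would turn "What's up" into "What". For multi-word strings, one possible sketch strips noise at the two edges only. The helper name strip_edges and the list of contraction suffixes are assumptions made for illustration, not part of the answer above. Note that it keeps the whole word sequence, so "Donald Trump's" becomes "Donald Trump".

```python
import re

# Assumed set of contraction suffixes ('s, 're, 'm, 've, 'll, 'd); extend as needed.
_CONTRACTIONS = r"(?:['’](?:s|re|m|ve|ll|d))*"
_LEFT = re.compile(r"^\W+")                                # leading non-word characters
_RIGHT = re.compile(_CONTRACTIONS + r"\W*$", re.IGNORECASE)  # trailing suffixes + punctuation

def strip_edges(text):
    """Remove punctuation and contraction suffixes from both edges only."""
    text = _LEFT.sub("", text)
    # count=1: only the leftmost match that reaches the end of the string is removed.
    return _RIGHT.sub("", text, count=1)

for sample in ["Apple.", "Donald Trump's", "They're", "I'm",
               "¿Hablas espanol?", "$12", "H4ck3r", "What's up"]:
    print(repr(sample), "->", repr(strip_edges(sample)))
```

Because the right-hand pattern is anchored with $, an apostrophe inside the string (as in "What's up") is left alone.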
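The timing question is left open in the comment above. A minimal sketch for measuring it with the standard timeit module might look like this; the function names are made up for illustration, and no numbers are shown because they depend on the machine and Python version.

```python
import re
import timeit

strings = ['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"]
rx = re.compile(r'\w+')

def with_search():
    # Stops at the first match; strings without any match are skipped.
    return [m.group(0) for s in strings for m in [rx.search(s)] if m]

def with_findall():
    # Collects every match, then keeps only the first.
    # Raises IndexError if a string contains no word characters at all.
    return [rx.findall(s)[0] for s in strings]

print("search: ", timeit.timeit(with_search, number=10_000))
print("findall:", timeit.timeit(with_findall, number=10_000))
```

Both functions return the same list for these inputs; they differ only in how much work they do per string and in how they behave when nothing matches.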

【Solution 2】:

Using pandas:

import pandas as pd
s = pd.Series(['Apple.', "Trump's", "They're", "I'm", "¿Hablas", "$12", "H4ck3r"])

s.str.extract(r'(\w+)')

Output:

0     Apple
1     Trump
2      They
3         I
4    Hablas
5        12
6    H4ck3r
Name: 0, dtype: object

【Comments】:

What is the benefit that justifies the pandas overhead?