【Posted】: 2020-08-02 14:26:56
【Question】:
What I want to do
I want to remove dashes, but not hyphens, from sentences as part of NLP preprocessing.
Input
samples = [
'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
'He is afraid of two things — spiders and senior prom.' #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
Expected output
#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']
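The sentences above come from the following two articles on hyphens and dashes.
Problem
- The first step, which is meant to remove the '-' symbol, fails. I also can't understand why the second and third sentences end up joined into one string, with no single quotes ('') separating them.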
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
- I don't know how to write code that distinguishes hyphens from dashes.
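One way to tell the two apart is by code point. The sketch below is only an illustration: the names HYPHENS, DASHES and classify are hypothetical, and which characters count as a "hyphen" or a "dash" is an assumption that should be adjusted to the real corpus. Note that the Unicode general category does not help here, since the hyphen-minus and the em dash are both in category Pd.

```python
import unicodedata

# Assumed grouping (not from the question): adjust to the characters
# that actually occur in your text.
HYPHENS = {"\u002d", "\u2010", "\u2011"}          # hyphen-minus, hyphen, non-breaking hyphen
DASHES = {"\u2012", "\u2013", "\u2014", "\u2015"}  # figure dash, en dash, em dash, horizontal bar


def classify(ch):
    """Label a single character as 'hyphen', 'dash' or 'other'."""
    if ch in HYPHENS:
        return "hyphen"
    if ch in DASHES:
        return "dash"
    return "other"


# Both characters share the general category "Pd" (dash punctuation),
# so the category alone cannot separate them.
print(unicodedata.category("-"), unicodedata.category("\u2014"))
print(classify("-"), classify("\u2014"), classify("a"))
```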
Current code
samples = [
'A former employee of the accused company, — — —, offered a statement off the record.', #dash
'He is afraid of two things—spiders and senior prom.' #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)
    text = ' '.join(ret)
    samples[i] = text
print(samples)
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
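Two things in the code above appear to explain this output. First, there is no comma after the second string in samples, and Python concatenates adjacent string literals, so the list holds two items rather than three. Second, ignore_symbol contains the ASCII hyphen-minus (code point 45), while the text contains the em dash (code point 8212), so the comparison never matches. The following is a small demonstration with made-up variable names (merged, separate), not a fix for the original code:

```python
# Adjacent string literals are concatenated by Python,
# so a missing comma silently merges two list items.
merged = [
    'He is afraid of two things.'   # <- no comma after this literal
    'Fifty-six bottles of pop.'
]
separate = [
    'He is afraid of two things.',
    'Fifty-six bottles of pop.',
]
print(len(merged))    # 1
print(len(separate))  # 2

# The filter compares each token with '-' (U+002D), but the text uses
# the em dash (U+2014), so the comparison is never true.
print('\u2014' == '-')  # False
```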
for i in range(len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)
#output
[['A former employee of the accused company',
'— — —',
'offered a statement off the record.'],
['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
'fifty-six bottles of pop.']]
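A possible way to reach the expected output is to split on commas and dashes in one step and drop the empty pieces. This is a minimal sketch, not a definitive solution: SEPARATORS and split_clauses are hypothetical names, and it assumes that "dash" means the en dash (U+2013) or the em dash (U+2014) only. The hyphen-minus is left out of the pattern on purpose, so words such as Fifty-six stay intact.

```python
import re

# Assumption: commas, en dashes (U+2013) and em dashes (U+2014) separate
# clauses; the ASCII hyphen-minus (U+002D) is not a separator.
SEPARATORS = re.compile("[,\u2013\u2014]")


def split_clauses(sentence):
    """Split on commas and dashes, strip whitespace, drop empty pieces."""
    parts = SEPARATORS.split(sentence)
    return [p.strip() for p in parts if p.strip()]


samples = [
    "A former employee of the accused company, \u2014 \u2014 \u2014, offered a statement off the record.",
    "He is afraid of two things\u2014spiders and senior prom.",   # comma present here
    "Fifty-six bottles of pop on the wall, fifty-six bottles of pop.",
]

for s in samples:
    print(split_clauses(s))
```

With the three sample sentences this gives one two-element list per sentence, matching the expected output in the question. Real text may contain other dash-like characters, which would need to be added to the pattern.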
Development environment
Python 3.7.0
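【Discussion】:
- "distinguish hyphen and dash": can you describe in words the difference between a hyphen and a dash? Are you familiar with regular expressions?
- In your sample text, the hyphen is Unicode code point 45 and the dash is Unicode code point 8212. Was your example copied and pasted from your real text? Try: for c in s: print(c,ord(c)).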
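The check suggested in the comment can be expanded slightly so that it also prints the Unicode name of each punctuation character. This is only a sketch; describe_non_alnum is a hypothetical helper name and the sample string is invented for illustration.

```python
import unicodedata


def describe_non_alnum(text):
    """Return (char, code point, Unicode name) for each distinct
    character that is neither alphanumeric nor whitespace."""
    seen = []
    for ch in text:
        if not ch.isalnum() and not ch.isspace() and ch not in seen:
            seen.append(ch)
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in seen]


sample = "Fifty-six bottles \u2014 spiders."
for ch, code, name in describe_non_alnum(sample):
    print(repr(ch), code, name)
# '-' 45 HYPHEN-MINUS
# '\u2014' (em dash) 8212 EM DASH
# '.' 46 FULL STOP
```

Running this on the real input shows which dash-like characters actually occur before deciding what to strip.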
Tags: python python-3.x nlp character processing