【Question title】:Preprocessing to remove dashes, but not hyphens, from sentences
【Posted】:2020-08-02 14:26:56
【Question description】:

What I want to do

During NLP preprocessing, I want to remove dashes from sentences while leaving hyphens in place.

Input

samples = [
    'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
    'He is afraid of two things — spiders and senior prom.' #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]

Expected output

#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']

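The sentences above are taken from two articles about hyphens and dashes.

Problem

  1. My first step, removing the '-' symbol, fails, and I don't understand why the second and third sentences end up joined in a single string instead of being separate quoted strings.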
#output
['A former employee of the accused company, — — —, offered a statement off the record.', 
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
  2. I don't know how to write code that distinguishes hyphens from dashes.

Current code

samples = [
    'A former employee of the accused company, — — —, offered a statement off the record.', #dash
    'He is afraid of two things—spiders and senior prom.' #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]

ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0 
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)

    text = ' '.join(ret)
    samples[i] = text
print(samples)

#output
['A former employee of the accused company, — — —, offered a statement off the record.', 
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']

for i in range (len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)

#output
[['A former employee of the accused company',
  '— — —',
  'offered a statement off the record.'],
 ['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
  'fifty-six bottles of pop.']]

Development environment

Python 3.7.0

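【Question comments】:

  • "distinguish hyphen and dash": can you describe in words the difference between a hyphen and a dash? Are you familiar with regular expressions?
  • In your sample text, the hyphen is Unicode code point 45 and the dash is Unicode code point 8212. Were your examples copied and pasted from your real text? You can check with: for c in s: print(c, ord(c))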
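
The check suggested in the comment above can be sketched as a small helper. This is only an illustration: the name describe_chars is hypothetical, and limiting the report to non-alphanumeric, non-space characters is an assumption made to keep the output short.

```python
import unicodedata

def describe_chars(text):
    """Return (character, code point, Unicode name) for each punctuation-like character."""
    return [
        (ch, ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isalnum() and not ch.isspace()
    ]

# "\u2014" is the em dash; "-" is the ASCII hyphen-minus.
for ch, code, name in describe_chars("two things \u2014 fifty-six"):
    print(code, name)
# 8212 EM DASH
# 45 HYPHEN-MINUS
```

Running this over real input shows which characters are actually present, which matters because a visually similar dash may be a different code point.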

Tags: python python-3.x nlp character processing


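【Solution 1】:

If you are looking for a non-regex solution: the Unicode code point of the em dash is 8212 (U+2014), so you can replace each dash with ',', split on ',', and keep only the non-blank pieces: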

>>> samples = [
    'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
    'He is afraid of two things — spiders and senior prom.', #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
>>> output = [[
               sentence.strip() for sentence in elem.replace(chr(8212), ',').split(',') 
               if sentence.strip()
              ] for elem in samples]
>>> output
[['A former employee of the accused company',
  'offered a statement off the record.'],
 ['He is afraid of two things', 'spiders and senior prom.'],
 ['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']]

【Comments】:

    【Solution 2】:

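    Try splitting with a regular expression via re.split; Python's str.split() is too limited for this. You then need to pass the Unicode escape of the character you want to split on. For the dash in the question that is the em dash, U+2014 (the hyphen is U+002D, and note that '\002D' is not a valid way to write it: Python reads it as the octal escape \002 followed by the letter D).

    Something like:

    import re
    re.split('[\u2014]', text)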
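
A fuller version of the re.split idea might look like the sketch below. It is not the answerer's code: the helper name split_on_commas_and_dashes is hypothetical, and treating commas and em dashes together as one separator class is an assumption chosen to match the expected output in the question.

```python
import re

EM_DASH = "\u2014"  # the dash used in the question; the hyphen-minus is "\u002D"

def split_on_commas_and_dashes(text):
    """Split on runs of commas and em dashes, dropping empty pieces."""
    parts = re.split("[," + EM_DASH + "]+", text)
    return [part.strip() for part in parts if part.strip()]

samples = [
    "A former employee of the accused company, \u2014\u2014\u2014, offered a statement off the record.",
    "He is afraid of two things \u2014 spiders and senior prom.",
    "Fifty-six bottles of pop on the wall, fifty-six bottles of pop.",
]
for sample in samples:
    print(split_on_commas_and_dashes(sample))
```

Because the hyphen-minus is not in the character class, "Fifty-six" stays intact.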
    

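    【Comments】:

      【Solution 3】:

      First, the second and third sentences were merged because no comma separates the two string literals. In Python, tmp = 'a''b' is equivalent to tmp = 'ab', which is why samples contains only 2 strings (the 2nd and 3rd were merged into one).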
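
The effect of the missing comma can be reproduced in isolation. The list names below are made up for the demonstration.

```python
# Adjacent string literals are concatenated at compile time,
# so a missing comma silently merges two list items into one.
without_comma = [
    'first sentence.',
    'second sentence.'   # no trailing comma here
    'third sentence.'
]
with_comma = [
    'first sentence.',
    'second sentence.',
    'third sentence.'
]

print(len(without_comma))  # 2
print(without_comma[1])    # second sentence.third sentence.
print(len(with_comma))     # 3
```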

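      As for your question: the function remove_dash_preserve_hyphen below removes every dash from its str_sentence argument and returns the cleaned string. Applying it to each string in the samples list produces the cleaned list samples_without_dash.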

      samples = [
          'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
          'He is afraid of two things — spiders and senior prom.',#**(COMMA HERE)** #dash
          'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
      ]
      
      def remove_dash_preserve_hyphen(str_sentence, dash_signatures=('—',)):
          for dash_sig in dash_signatures:
              str_sentence = str_sentence.replace(dash_sig, '')
          return str_sentence
      
      samples_without_dash = [remove_dash_preserve_hyphen(sentence) for sentence in samples]
      

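      The dash in question is the em dash, Unicode U+2014. Your samples may contain other kinds of dashes that you also do not want. You need to keep track of them and pass a list of all the unwanted dash types in the dash_signatures argument when calling remove_dash_preserve_hyphen.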
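
One way to track other dash types is sketched below. The particular set in DASHES_TO_REMOVE is an assumption (which dashes count as unwanted depends on your data), and both helper names are hypothetical. The second helper relies on the Unicode general category "Pd" (dash punctuation), which also contains the ASCII hyphen-minus, so that character is excluded explicitly.

```python
import unicodedata

# Assumption: these are the dash-like characters to strip; adjust for your corpus.
DASHES_TO_REMOVE = (
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
    "\u2014",  # EM DASH
    "\u2015",  # HORIZONTAL BAR
)

def remove_dashes(sentence, dashes=DASHES_TO_REMOVE):
    """Delete every listed dash character; the hyphen-minus is left alone."""
    for dash in dashes:
        sentence = sentence.replace(dash, "")
    return sentence

def find_dash_candidates(text):
    """List dash-punctuation characters (category Pd) other than the hyphen-minus."""
    return sorted({ch for ch in text
                   if unicodedata.category(ch) == "Pd" and ch != "-"})

text = "two things \u2014 spiders, pages 3\u20135, fifty-six"
print([unicodedata.name(ch) for ch in find_dash_candidates(text)])
print(remove_dashes(text))
```

Running find_dash_candidates over a corpus first gives a list of the dash characters that actually occur, which can then be passed as the dashes argument.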

      【Comments】:
