Pyparsing takes too long for Wikipedia custom preprocessing

【Question title】: Pyparsing takes too long for Wikipedia custom preprocessing
【Posted】: 2021-09-05 17:36:46
【Question】:

I am trying to customize the corpus processing of Gensim's WikiCorpus by setting the tokenizer_func parameter to a custom tokenize function:

# Set tokenizer to our custom tokenizer
wiki = WikiCorpus(input, tokenizer_func=tokenize)

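But PyParsing takes too long to process the text (for example, after running for a day it had not processed even a single article). In my case, I want to clean the Wikipedia corpus as usual, except that any word matching a word list I have (which may contain numbers, underscores, or ampersands) should be left unchanged.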

Assume a variable-length word list phrase_list that includes: 81, McDonald's, Twenty One, Happy 10, Sam's car, Ham & Eggs

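Here is some sample input to clean:

Here is some sample text to convert into cleaned text! & 8 * tooth fairy 81 is a number, and McDonald's is a fast food chain. How about twenty one? That's also a number. Here are some tenses: run ran running.

Not sure how I feel about Happy 10, but Sam's car is nice. In-N-Out is a classic of course, sometimes people write In N Out though. 7-11 anyone? Or is it 7 11? Ham & Eggs is also quite the interesting book-I guess-- 5nonalpha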

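And the desired output:

here is some sample text to convert into cleaned text tooth fairy 81 is number and mcdonalds is fast_food chain how about twenty_one thats also number here are some tenses run ran running not sure how feel about happy_10 but sams_car is nice in_n_out is classic of course sometimes people write in_n_out though 7_11 anyone or is it 7_11 ham_&_eggs is also quite the interesting book guess nonalpha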

Note that this sample text is processed very quickly, but the actual Wikipedia corpus is not (I am following this tutorial, with my own customizations: https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html)

Here is the custom tokenizer I wrote using PyParsing (along with some helper functions):

from typing import List
from pyparsing import *
from gensim.utils import to_unicode
from gensim.corpora import WikiCorpus
import string
import re

TOKEN_MIN_LEN = 2
TOKEN_MAX_LEN = 30

def pre_phrase_tokenize_processing(sentence):
    """
    Helper: For cleaning sentence before phrases have been combined into single underscored tokens. Apply to entire sentence.
    Removes hyphens, and punctuation except ampersands.
    """
    # replace all hyphens with spaces since some phrases use them; just consider as multiword so they can be combined with underscore later
    sentence = sentence.translate(str.maketrans('-', ' '))
    
    # remove all punctuation except ampersands, since some phrases use them
    remove_punct = string.punctuation.replace("&", "")
    sentence = sentence.translate(str.maketrans('', '', remove_punct))
    
    return sentence

def turn_phrases_into_tokens(phrases, sentence):
    """
    Helper: Turns all individual phrases in a sentence into single underscored tokens according to a provided phrase dictionary.
    """
    regex = re.compile("|".join([r"\b{}\b".format(phrase) for phrase in phrases]))
    sentence = regex.sub(lambda m: phrases[m.group(0)], sentence)
    return sentence

#@traceParseAction
def post_phrase_tokenize_processing(toks):
    """
    For cleaning sentence after phrases have been combined into single underscored tokens. Apply to each non-phrase word.
    Removes numeric characters and punctuation. Note toks is a list passed by pyparsing.
    """
    # remove numeric characters, since only non-brand words are passed in
    word = re.sub(r'\d+', '', toks[0])
    
    # remove all punctuation (including &), since only non-brand words are passed in
    word = word.translate(str.maketrans('', '', string.punctuation))
    return word

# our phrase dictionary - actual list may contain many more phrases
phrases = {"81": "81", "mcdonalds": "mcdonalds", "twenty one": "twenty_one", "happy 10": "happy_10", "sams car": "sams_car", "ham & eggs": "ham_&_eggs"}

def tokenize(content: str, token_min_len=TOKEN_MIN_LEN, token_max_len=TOKEN_MAX_LEN, lower=True) -> List[str]:
    """Overrides original tokenize method in wikicorpus.py
    Tokenize a piece of text from Wikipedia.

    Parameters
    ----------
    content : str
        String without markup (see :func:`~gensim.corpora.wikicorpus.filter_wiki`).
    token_min_len : int
        Minimal token length.
    token_max_len : int
        Maximal token length.
    lower : bool
         Convert `content` to lower case?
    Returns
    -------
    list of str
        List of tokens from `content`.
    """
    content = to_unicode(content, encoding='utf8', errors='ignore')
    if lower:
        content = content.lower()
    
    content = pre_phrase_tokenize_processing(content)
    
    # Combine any phrases into single tokens
    content = turn_phrases_into_tokens(phrases, content)
    
    
    # Match either one of our phrases, or any other nonwhitespace word (in which case we process)
    phrase_list = list(phrases.values())
    parser = Combine(
        OneOrMore(
            oneOf(phrase_list, asKeyword=True)
            | Word(alphas)
            | Word(printables).setParseAction(post_phrase_tokenize_processing)
        ),
        joinString=' ',
        adjacent=False
    )
    content = parser.transformString(content)
    
    return [
        to_unicode(token) for token in content.split()
        if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
    ]

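Also, for reference, this is the actual code I use to call the tokenizer when cleaning the Wikipedia corpus (rather than the sample text). More details can be found in the same tutorial linked above: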

def make_corpus(in_f, out_f):
    """Convert Wikipedia xml dump file to text corpus"""

    output = open(out_f, 'w')
    
    # Set tokenizer to our custom tokenizer
    wiki = WikiCorpus(in_f, tokenizer_func=tokenize)

    i = 0
    for text in wiki.get_texts():
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        i = i + 1
        if (i % 100 == 0):
            print('Processed ' + str(i) + ' articles')
    output.close()
    print('Processing complete!')

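So far, I am fairly sure the problem lies with PyParsing (the part where I call parser = Combine(...)): instead of matching every non-whitespace word, I should match only the words that need cleaning. But I am a bit stuck on how to do that, since I do not have much experience with this library. I also had a problem where the spaces between words were removed when the words were recombined, which is why I had to call Combine with joinString=' ', so any suggestions would be much appreciated!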

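【Comments】:

Please clean up your sample code. It refers to the undefined names to_unicode, phrases, phrase_list and post_phrase_tokenize_processing.
Sorry, I was trying to keep it concise but left out important information! The code should now have more context, and I have also linked the tutorial my custom code is based on.
How many items are in your actual phrase list? I think that may be the bottleneck.
Ah, I have around 330 phrases or so. Do you have any suggestions for how I could handle such a long list?
See the edit added to my answer.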

Tags: parsing nlp data-cleaning wikipedia pyparsing

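【Solution 1】:

After commenting out the content-cleaning calls to pre_phrase_tokenize_processing and turn_phrases_into_tokens (since there was not enough code to run them), and commenting out the parse action post_phrase_tokenize_processing, it ran in about 1 second on my system.

What exactly is going on in post_phrase_tokenize_processing?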

Try running with a very small input text and profile the parse action to see what to do next.
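
One way to do that profiling is with the standard-library cProfile module. The following is only a minimal sketch: clean_word is a stand-in for the body of post_phrase_tokenize_processing that takes a plain string rather than the pyparsing token list, and clean_text and profile_call are hypothetical helper names introduced for illustration.

```python
import cProfile
import io
import pstats
import re
import string


def clean_word(word):
    """Stand-in for the parse action: strip digits, then all punctuation."""
    word = re.sub(r"\d+", "", word)
    return word.translate(str.maketrans("", "", string.punctuation))


def clean_text(text):
    """Apply clean_word to every whitespace-separated word."""
    return " ".join(clean_word(w) for w in text.split())


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


sample = "tooth fairy 81 is a number & 5nonalpha"
cleaned, report = profile_call(clean_text, sample)
print(cleaned)
print(report)
```

The same profile_call helper could be pointed at the real tokenize function with a short piece of article text, so that the report shows whether the time goes into the parse action, the parser construction, or the matching itself.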

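Also, you can do some rough instrumentation of that parse action by wrapping it in traceParseAction, a diagnostic decorator provided by pyparsing. You can add it as a decorator on post_phrase_tokenize_processing, or apply it inline in your call to setParseAction: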

...
| Word(printables).setParseAction(traceParseAction(post_phrase_tokenize_processing))
...
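
If the trace output is too verbose for a large input, a hand-rolled call counter gives a rougher but quieter picture of how often the parse action fires. This is a sketch, not part of pyparsing: count_calls is a hypothetical decorator, and the loop at the end only simulates pyparsing passing a one-element token list per matched word. The wrapper deliberately keeps the same one-argument signature as the original function so that it can stand in for it.

```python
import functools
import re
import string


def count_calls(func):
    """Wrap a one-argument parse action; wrapper.calls counts invocations."""
    @functools.wraps(func)
    def wrapper(toks):
        wrapper.calls += 1
        return func(toks)
    wrapper.calls = 0
    return wrapper


@count_calls
def post_phrase_tokenize_processing(toks):
    # Same cleaning as in the question: drop digits, then all punctuation.
    word = re.sub(r"\d+", "", toks[0])
    return word.translate(str.maketrans("", "", string.punctuation))


# Simulate one parse-action call per whitespace-separated word.
for word in "7 11 anyone & 5nonalpha".split():
    post_phrase_tokenize_processing([word])

print(post_phrase_tokenize_processing.calls)
```

A count close to the number of words in the input would suggest that nearly every word is routed through the parse action.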

EDIT: You can use a Regex to get the same behavior as the updated oneOf.

Here is a solution using a plain regular expression:

import re
phrases = "ab abc def a".split()
phrases_re = re.compile(r"\b(" + '|'.join(re.escape(w) for w in phrases) + r")\b" )
print(phrases_re.findall("abcd bc abc a bc def"))
['abc', 'a', 'def']

You can turn this into a pyparsing Regex with:

import pyparsing as pp
phrases_expr = pp.Regex(phrases_re)
print(phrases_expr.searchString("abcd bc abc a bc def"))
[['abc'], ['a'], ['def']]
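
Taking the regex idea one step further, the whole tokenizer can be sketched with the re module alone. Note that this swaps pyparsing out for a single precompiled pattern, and that the pattern and translation tables are built once at import time, whereas the code in the question rebuilds the regex and the parser on every call to tokenize. The names PHRASES, TOKEN_RE and _replace are assumptions made for this example, the phrase dictionary is just the small sample from the question, and no timing has been measured on a real dump.

```python
import re
import string

# Small sample of the phrase dictionary from the question.
PHRASES = {
    "81": "81",
    "mcdonalds": "mcdonalds",
    "twenty one": "twenty_one",
    "happy 10": "happy_10",
    "sams car": "sams_car",
    "ham & eggs": "ham_&_eggs",
}

# Built once at import time instead of once per call.
HYPHEN_TO_SPACE = str.maketrans("-", " ")
KEEP_AMPERSAND = str.maketrans("", "", string.punctuation.replace("&", ""))
DROP_PUNCT = str.maketrans("", "", string.punctuation)
DIGITS_RE = re.compile(r"\d+")

# Longest phrases first, so a longer phrase wins over a shorter prefix.
_alternatives = "|".join(
    re.escape(p) for p in sorted(PHRASES, key=len, reverse=True)
)
# Either a whole phrase, or any other run of non-whitespace characters.
TOKEN_RE = re.compile(r"(?P<phrase>\b(?:" + _alternatives + r")\b)|(?P<word>\S+)")


def _replace(match):
    """Map phrases to their underscored form; clean every other word."""
    phrase = match.group("phrase")
    if phrase is not None:
        return PHRASES[phrase]
    word = DIGITS_RE.sub("", match.group("word"))
    return word.translate(DROP_PUNCT)


def tokenize(content, token_min_len=2, token_max_len=30, lower=True):
    if lower:
        content = content.lower()
    # Hyphens become spaces; all punctuation except "&" is removed.
    content = content.translate(HYPHEN_TO_SPACE).translate(KEEP_AMPERSAND)
    content = TOKEN_RE.sub(_replace, content)
    return [
        tok for tok in content.split()
        if token_min_len <= len(tok) <= token_max_len and not tok.startswith("_")
    ]


print(tokenize("Happy 10 and Sam's car, Ham & Eggs, 5nonalpha 81!"))
```

Because the function keeps the same parameters as the tokenizer in the question, it could in principle be passed to WikiCorpus through tokenizer_func in the same way.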

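【Discussion】:

Thanks for the reply! I updated my post to include post_phrase_tokenize_processing. I had tried the traceParseAction you mentioned earlier and realized that it matches essentially every word, which I think is the time sink. Also note that this sample text is processed very quickly, but it does not scale to the actual Wikipedia corpus (I let it run for a day and it did not process even a single article).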