【Question title】:Looking for a database of n-grams taken from Wikipedia
【Posted】:2011-01-20 11:29:39
【Question description】:

I'm effectively trying to solve the same problem as this question:

Finding related words (specifically physical objects) to a specific word

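minus the requirement that the words represent physical objects. The answers and the edited question seem to indicate that a good start is to build a list of n-gram frequencies using Wikipedia text as the corpus. Before I start downloading the mammoth Wikipedia dump, does anyone know whether such a list already exists?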
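
As a minimal sketch of what such an n-gram frequency list involves: the snippet below counts n-grams in a small sample string. The function name `ngram_counts`, the regex tokenizer, and the sample text are assumptions made for illustration only; a real pipeline over Wikipedia would need proper tokenization and markup cleaning.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count n-grams of lowercased word tokens in text."""
    # Assumption: a simple regex tokenizer stands in for a real one (e.g. NLTK).
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of length n over the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical sample text, not taken from any corpus.
sample = "the cat sat on the mat and the cat slept"
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(3))
```

Words that frequently share n-grams with a target word are then candidates for "related words".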

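PS: If the original poster of the previous question sees this, I'd love to know how you went about solving the problem, as your results look excellent :-)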

【Question comments】:

    Tags: nlp semantics wikipedia


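    【Answer 1】:

    Google has a publicly available terabyte-scale n-gram database (up to 5-grams).
    You can order it on 6 DVDs or find a torrent that hosts it.

    【Comments】:

    • Yes, I've considered this dataset; it's even bigger than the Wikipedia dump!
    • Not available for commercial use.
    • Has anyone found a torrent for it?
    【Answer 2】: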

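    You can find the June 2008 Wikipedia n-grams here. In addition, it also has headwords and tagged sentences. I tried to create my own n-grams, but ran out of memory (32 GB) on the bigrams (the current English Wikipedia is massive). It took about 8 hours to extract the XML, 5 hours for the unigrams and 8 hours for the bigrams.

    The linked n-grams also have the benefit of having been cleaned somewhat, since MediaWiki and Wikipedia have a lot of junk in between the text.

    Here's my Python code: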

    from nltk.tokenize import sent_tokenize
    from nltk.tokenize import word_tokenize
    from nltk.tokenize import wordpunct_tokenize
    from datetime import datetime
    from collections import deque
    from collections import defaultdict
    from collections import OrderedDict
    import operator
    import os
    
    # Loop through all the English Wikipedia Article files and store their path and filename in a list. 4 minutes.
    article_dir = r'D:\Downloads\Wikipedia\articles'    # named article_dir to avoid shadowing the built-in dir()
    l = [os.path.join(root, name) for root, _, files in os.walk(article_dir) for name in files]
    
    t1 = datetime.now()
    
    # For each article (file) loop through all the words and generate unigrams. 1175MB memory use spotted.
    # 12 minutes to first output. 4200000: 4:37:24.586706 was last output.
    c = 1
    d1s = defaultdict(int)
    for file in l:
        try:
            with open(file, encoding="utf8") as f_in:
                content = f_in.read()
        except UnicodeDecodeError:
            # Fall back to latin-1, which can decode any byte sequence.
            with open(file, encoding="latin-1") as f_in:
                content = f_in.read()
        words = wordpunct_tokenize(content)    # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
        # Take all the words from the sentence and count them.
        for word in words:
            d1s[word] += 1
        c = c + 1
        if c % 200000 == 0:
            t2 = datetime.now()
            print(str(c) + ': ' + str(t2 - t1))
    
    t2 = datetime.now()
    print('After unigram: ' + str(t2 - t1))
    
    t1 = datetime.now()
    # Sort the defaultdict in descending order and write the unigrams to a file.
    # 0:00:27.740082 was output. 3285Mb memory. 165Mb output file.
    d1ss = OrderedDict(sorted(d1s.items(), key=operator.itemgetter(1), reverse=True))
    with open("D:\\Downloads\\Wikipedia\\en_ngram1.txt", mode="w", encoding="utf-8") as f_out:
        for k, v in d1ss.items():
            f_out.write(k + '┼' + str(v) + "\n")
    t2 = datetime.now()
    print('After unigram write: ' + str(t2 - t1))
    
    # Determine the lowest 1gram count we are interested in.
    low_count = 20 - 1
    d1s = {}
    # Get all the 1gram counts as a dict.
    for word, count in d1ss.items():
        # Stop adding 1gram counts once we drop to the lowest 1gram count or below
        # (the dict is sorted in descending order, so every later count is lower too).
        if count <= low_count:
            break
        # Add the count to the dict.
        d1s[word] = count
    
    t1 = datetime.now()
    
    # For each article (file) loop through all the sentences and generate 2grams. 13GB memory use spotted.
    # 17 minutes to first output. 4200000: 4:37:24.586706 was last output.
    c = 1
    d2s = defaultdict(int)
    for file in l:
        try:
            with open(file, encoding="utf8") as f_in:
                content = f_in.read()
        except UnicodeDecodeError:
            # Fall back to latin-1, which can decode any byte sequence.
            with open(file, encoding="latin-1") as f_in:
                content = f_in.read()
        # Extract the sentences in the file content.         
        sentences = deque()
        sentences.extend(sent_tokenize(content))            
        # Get all the words for one sentence.
        for sentence in sentences:        
            words = wordpunct_tokenize(sentence)    # word_tokenize handles 'n ʼn and ʼn as a single word. wordpunct_tokenize does not.
            # Take all the words from the sentence with high 1gram count that are next to each other and count them.
            # Pair each word with its successor; zip stops at the last word, so no index check is needed.
            for word, word2 in zip(words, words[1:]):
                if word in d1s and word2 in d1s:
                    d2s[word + ' ' + word2] += 1
        c = c + 1
        if c % 200000 == 0:
            t2 = datetime.now()
            print(str(c) + ': ' + str(t2 - t1))
    
    t2 = datetime.now()
    print('After bigram: ' + str(t2 - t1))
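
The bigram pass above keeps every count in a single dict, which is where memory becomes the limit. One possible way to bound memory, sketched here as an assumption rather than as what was actually run, is to spill partial counts to sorted temporary files and merge them afterwards. All names (`spill`, `read_run`, `count_bigrams`, `max_keys`) and the tab separator are hypothetical, and the unigram-threshold filter from the code above is left out for brevity.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby

SEP = "\t"  # assumption: tokens never contain a tab or a newline

def spill(counts, tmp_dir):
    """Write one batch of counts to a temp file, sorted by n-gram; return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".txt", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for gram in sorted(counts):
            f.write(f"{gram}{SEP}{counts[gram]}\n")
    return path

def read_run(path):
    """Yield (gram, count) pairs from one sorted run file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            gram, count = line.rstrip("\n").rsplit(SEP, 1)
            yield gram, int(count)

def count_bigrams(sentences, tmp_dir, max_keys=1_000_000):
    """Yield (bigram, count) pairs, holding roughly max_keys distinct keys in memory."""
    counts, runs = Counter(), []
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            counts[w1 + " " + w2] += 1
        # Checked once per sentence, so the dict can briefly exceed max_keys.
        if len(counts) >= max_keys:
            runs.append(spill(counts, tmp_dir))
            counts = Counter()
    if counts:
        runs.append(spill(counts, tmp_dir))
    # Merge the sorted runs lazily and sum the counts of equal bigrams.
    merged = heapq.merge(*(read_run(p) for p in runs))
    for gram, group in groupby(merged, key=lambda pair: pair[0]):
        yield gram, sum(c for _, c in group)

# Tiny usage example with already tokenized, made-up sentences.
with tempfile.TemporaryDirectory() as tmp:
    sents = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
    result = dict(count_bigrams(sents, tmp, max_keys=2))
print(result)
```

The trade-off is extra disk I/O in exchange for a memory footprint set by `max_keys` instead of by the number of distinct bigrams in the corpus.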
    

    【Comments】:
