【Question title】: String matching from a Unicode text file in Python?
【Posted】: 2014-09-18 00:10:10
【Question】:
import re, codecs
import string
import sys
stopwords=codecs.open('stopwords_harkat1.txt','r','utf_8')
lines=codecs.open('Corpus_v2.txt','r','utf_8')
for line in lines:
    line = line.rstrip().lstrip()
    #print line
    tokens = line.split('\t')
    token=tokens[4]

    if token in stopwords:
            print token

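This code runs without errors, but it never matches strings that come from the two different files. Can anyone help me?

I also tried the match method, but that did not work either.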

【Question comments】:

    Tags: python python-2.7 unicode unicode-string python-unicode


    【Solution 1】:

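    You need to load the contents of the file, not just open it. As written, `stopwords` is a file object, so `token in stopwords` iterates over raw lines that still end in a newline and exhausts the file on the first lookup; no token ever matches.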
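
To see why the membership test on an open file fails silently, here is a minimal sketch. The file name `stopwords_demo.txt` and its contents are made up for illustration, and `io.open` is used in place of `codecs.open` only so the snippet runs unchanged on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical stop-word file (one word per line) in a temp directory.
path = os.path.join(tempfile.mkdtemp(), 'stopwords_demo.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'the\nand\nof\n')

handle = io.open(path, 'r', encoding='utf-8')

# `in` on a file object iterates over raw lines, newline included,
# so a bare token never equals any of them.
found_first = u'and' in handle

# The search above read the file to its end, so a later lookup sees
# nothing, even for a string that equals a raw line exactly.
found_second = u'and\n' in handle
handle.close()

# Loading the contents into a set gives exact, repeatable lookups.
with io.open(path, 'r', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)
found_loaded = u'and' in stopwords

print('%s %s %s' % (found_first, found_second, found_loaded))
```

Only the last lookup succeeds, which matches the symptom in the question: no error, and no output.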

    Replace the following line:

    stopwords = codecs.open('stopwords_harkat1.txt','r','utf_8')
    

    with:

    with codecs.open('stopwords_harkat1.txt','r','utf_8') as f:
        # Assumes one stop word per line.
        stopwords = set(line.strip() for line in f)
    
        # If a line can hold several words, use this instead:
        # stopwords = set(word for line in f for word in line.split())
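
Putting the fix together with the loop from the question, the whole script might look roughly like the sketch below. The helper names `load_stopwords` and `find_stopword_tokens`, the demo file names, and the sample words are all hypothetical; the real `Corpus_v2.txt` is assumed to be tab-separated with the token in the fifth column, as in the question. `io.open` stands in for `codecs.open` so the code runs on Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def load_stopwords(path):
    """Return the stop words in `path` as a set, one word per line."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f if line.strip())


def find_stopword_tokens(corpus_path, stopwords, column=4):
    """Return the tokens in the given tab-separated column that are stop words."""
    matches = []
    with io.open(corpus_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Strip only the line ending so empty leading fields keep their place.
            fields = line.rstrip(u'\r\n').split(u'\t')
            # Skip short lines instead of raising IndexError.
            if len(fields) > column:
                token = fields[column].strip()
                if token in stopwords:
                    matches.append(token)
    return matches


# Made-up demo data standing in for the files from the question.
tmp = tempfile.mkdtemp()
stop_path = os.path.join(tmp, 'stopwords_demo.txt')
corpus_path = os.path.join(tmp, 'corpus_demo.txt')
with io.open(stop_path, 'w', encoding='utf-8') as f:
    f.write(u'caf\u00e9\nna\u00efve\n')
with io.open(corpus_path, 'w', encoding='utf-8') as f:
    f.write(u'1\ta\tb\tc\tcaf\u00e9\n'
            u'2\ta\tb\tc\tbook\n'
            u'3\ta\tb\tc\tna\u00efve\n')

matches = find_stopword_tokens(corpus_path, load_stopwords(stop_path))
```

If the first stop word still fails to match, one possible cause is a byte order mark at the start of the file; opening it with the `utf-8-sig` encoding strips it.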
    

    【Comments】:

    • I tried it, but I got this error: Traceback (most recent call last): File "C:\Users\Desktop\remove stop words\remove\remove.py", line 7, in <module> with open(codecs.open('stopwords_harkat1.txt','r','utf_8'))as f: TypeError: coercing to Unicode: need string or buffer, instance found
    • @msm, the extra open( at the front was a typo. I have updated the answer. Please check it.