How do I fix the error "cannot use a string pattern on a bytes-like object"? Answers

【Question title】: How do I fix the error "cannot use a string pattern on a bytes-like object"?
【Posted】: 2019-09-25 02:30:42
【Question】:

I am following a tutorial to read a PDF file and convert it to text, but I keep getting an error. Here is my Python code:

import PyPDF2 
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
 
if text != "":
   text = text
 
else:
   text = textract.process(fileurl, method='tesseract', language='eng')
 
 
tokens = word_tokenize(text)
 
punctuations = ['(',')',';',':','[',']',',']
 
stop_words = stopwords.words('english')
 
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

The error I keep getting is:

tokens = word_tokenize(text)

TypeError: cannot use a string pattern on a bytes-like object

How can I fix this error?

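【Question comments】:

Which version of Python are you using?
Possible duplicate of "TypeError: can't use a string pattern on a bytes-like object in re.findall()"
Check the duplicate. word_tokenize uses regex on the back end, so that solution applies to you as well.
@MyNameIsCaleb I looked at the answer you referenced, but I don't know how to apply it to my case.
Try this: tokens = word_tokenize(text.decode("utf-8"))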

Tags: python

【Solution 1】:

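You are passing bytes, but you need a string, because word_tokenize uses regex on the back end and a string pattern cannot be matched against a bytes object.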
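
The mismatch can be reproduced with the standard `re` module alone, without NLTK or a PDF. This is only a minimal sketch: the pattern and the sample data are made up for illustration, and the bytes value stands in for what the `textract.process` fallback in the question is assumed to return.

```python
import re

# Illustrative values, not taken from the question:
pattern = r"\w+"       # a str pattern, similar in kind to what a tokenizer compiles
data = b"hello world"  # bytes, standing in for the textract fallback result

try:
    re.findall(pattern, data)  # str pattern applied to bytes
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Decoding first gives the regex a str to work on.
words = re.findall(pattern, data.decode("utf-8"))
print(words)
```

The same rule applies inside word_tokenize: its patterns are strings, so its input has to be a string as well.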

Change the tokenizing line to:

tokens = word_tokenize(text.decode("utf-8"))
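
One caveat with decoding unconditionally: in the question's code, `text` is already a str when PyPDF2 extracts something, and is bytes only when the textract fallback runs (assuming `textract.process` returns bytes, as its documentation describes). In Python 3, calling `.decode` on a str raises AttributeError. A small helper that decodes only when needed may be safer. The name `to_text` and the sample values are hypothetical, chosen for this sketch.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it only when it is bytes-like.

    Hypothetical helper for illustration; UTF-8 is an assumed encoding.
    """
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


# Both branches of the question's code end up as str:
from_pypdf2 = to_text("text extracted by PyPDF2")     # already str, unchanged
from_textract = to_text(b"text recovered by OCR")     # bytes, decoded

print(type(from_pypdf2).__name__, type(from_textract).__name__)

# In the question's script the call would then read:
#     tokens = word_tokenize(to_text(text))
```

With this in place the tokenizing line works regardless of which branch produced `text`.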

【Comments】: