I'm trying to return the complete sentence by searching for one word in a PDF file with Python, but I only get the page number: Answers

【Question title】: I'm trying to return the complete sentence by searching with one word in pdf file with python but getting page number
【Posted】: 2021-06-10 11:35:20
【Question description】:

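I am trying to search for a single word in a PDF file with Python and return the complete sentence that contains it, but all I get is the page number.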

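For example, suppose there is a sentence like this:

"This person committed money laundering" (the sentence is in the lower part of page 6). I want to retrieve that specific sentence, the one containing "laundering".

The code is as follows: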

import PyPDF2
import re

pattern = "laundering"
fileName = "result.pdf"

object = PyPDF2.PdfFileReader(fileName)
numPages = object.getNumPages()

for i in range(0, numPages):
    pageObj = object.getPage(i)
    text = pageObj.extractText()
    text = text.lower()
   
    for match in re.finditer(pattern, text):
        print(f'Page no: {i} | Match: {match}')

The output is:

Page no: 6 | Match: <re.Match object; span=(1688, 1698), match='laundering'>
Page no: 30 | Match: <re.Match object; span=(1452, 1462), match='laundering'>
Page no: 54 | Match: <re.Match object; span=(1690, 1700), match='laundering'>
Page no: 78 | Match: <re.Match object; span=(1652, 1662), match='laundering'>
Page no: 101 | Match: <re.Match object; span=(469, 479), match='laundering'>
Page no: 125 | Match: <re.Match object; span=(1657, 1667), match='laundering'>

I expect output like this:

'Complete sentence', page no 6
'Complete sentence', page no 30
''
''
'Complete sentence', page no 125

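【Comments】:

It looks like match.span holds a tuple with the start and end character positions of the word "laundering". To get the sentence containing that word, you need to find where the sentence begins and ends. That means finding the previous sentence terminator and the next one. Note that a sentence can end with a period, a question mark, an exclamation mark, and so on.
It's worse than that. A period can occur inside a number, and a period or question mark inside a URL. PDF is not a word-processing format. Unless the PDF is structured, it is nearly impossible to do this correctly. Not to mention that the word may be split up (for example by kerning) and so may not be stored as a single run of text in the PDF.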
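
The boundary search described in the first comment can be sketched roughly as follows. This is only an illustration under a naive assumption: a sentence ends at the nearest ".", "!" or "?", so the caveats from the second comment (numbers, URLs, fragmented PDF text) still apply. The helper name sentence_around and the sample text are made up for this example.

```python
import re

TERMINATORS = ".!?"  # assumed set of sentence-ending characters

def sentence_around(text, start, end):
    """Return the sentence in `text` that contains the slice text[start:end]."""
    # Index just past the nearest terminator before the match (0 if there is none).
    left = max(text.rfind(ch, 0, start) for ch in TERMINATORS) + 1
    # Index of the nearest terminator at or after the end of the match.
    hits = [i for i in (text.find(ch, end) for ch in TERMINATORS) if i != -1]
    right = min(hits) + 1 if hits else len(text)
    return text[left:right].strip()

text = "Nothing here. This person did money laundering! Unrelated text."
match = re.search("laundering", text)
print(sentence_around(text, *match.span()))
```

If no terminator is found on one side, the sketch falls back to the start or end of the text, which is a design choice rather than the only reasonable behaviour.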

Tags: python python-3.x pdf nlp pypdf2

【Solution 1】:

re.finditer(pattern, text) returns an iterator of Match objects. To access the text that was actually matched, use match.group(0), which returns the whole match as a str.
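
As a small illustration of the difference between the Match object and the matched text, the snippet below prints both the string and its offsets. The sample sentence is invented for this example.

```python
import re

text = "he was charged with money laundering in 2019."
for match in re.finditer("laundering", text):
    print(match.group(0))              # the matched text as a str
    print(match.span())                # (start, end) character offsets as a tuple
    print(match.start(), match.end())  # the same offsets, individually
```

Printing the Match object itself, as the question's code does, shows its repr (span plus matched text) rather than a plain string.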

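Since you want to extract the whole sentence rather than just the pattern, you need to modify the regular expression so that it also captures the words before and after the pattern.

I would do it like this: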

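import re

tx = '''This is a test1!
this is a test2.
test1.
This is a test3'''

pattern = 'test1'
# [^!?.]* takes the non-terminator characters on either side of the pattern;
# the final [!?.] requires the sentence to end with a terminator.
# re.escape keeps special characters in the pattern from being read as regex syntax.
for m in re.finditer(f"[^!?.]*{re.escape(pattern)}[^!?.]*[!?.]", tx):
    print(m.group(0))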

Output:

This is a test1!

test1.

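This regular expression captures the run of non-terminator characters (anything other than !, ? or .) before the pattern, plus the text after it up to a terminating punctuation mark. The blank line in the output appears because the second match begins with the newline that follows the previous sentence.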
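
Applied to the question, the same idea could be combined with per-page text extraction, roughly as sketched below. This is an assumption-laden sketch, not a definitive solution: find_sentences and load_page_texts are hypothetical helper names, the sentence rule is the same naive "up to the next ., ! or ?" rule, and load_page_texts assumes the pypdf package with its PdfReader class (recent PyPDF2 releases expose the same name). The demo runs on plain strings so that no PDF file is needed.

```python
import re

def find_sentences(page_texts, word):
    """Yield (sentence, page_index) for every sentence containing `word`."""
    # A sentence here is a run of non-terminator characters around the word,
    # ended by '.', '!' or '?', or by the end of the page text.
    regex = re.compile(rf"[^.!?]*{re.escape(word)}[^.!?]*(?:[.!?]|$)", re.IGNORECASE)
    for page_index, page_text in enumerate(page_texts):
        for m in regex.finditer(page_text):
            # Collapse line breaks and repeated spaces left by PDF extraction.
            yield " ".join(m.group(0).split()), page_index

def load_page_texts(path):
    """Hypothetical helper: extract the text of each page with pypdf."""
    from pypdf import PdfReader  # third-party package, assumed to be installed
    return [page.extract_text() or "" for page in PdfReader(path).pages]

# Stand-in for load_page_texts("result.pdf"), so the sketch runs without a PDF.
pages = ["Intro text. Nothing relevant.",
         "The person did money\nlaundering. Next sentence."]
for sentence, page_no in find_sentences(pages, "laundering"):
    print(f"{sentence!r}, page no {page_no}")
```

The page index is zero-based, like the loop variable i in the question, so add 1 if human-readable page numbers are wanted. Sentences that continue across a page break are not handled.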

【Comments】: