
【Question title】: How to use Python 3 to extract text between certain HTML tags? [duplicate]
【Posted】: 2020-03-17 22:07:08
【Question description】:

I am trying to scrape a web page that contains company names. The names sit inside option tags, in this format:

<option value="15589" id="optExhibitor15589" title="N571  Company One, Inc">N571 Company One, Inc</option>
<option value="16441" id="optExhibitor16441" title="N873  Company Two">Company Two</option>
<option value="14863" id="optExhibitor14863" title="N219  Company Three">N219 Company Three</option>

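I tried using .readline() to split the file into a list of lines, but I don't know how to extract the text between title= and ">.

There are hundreds of these names, and I want to produce a list of the company names.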

【Comments on the question】:

Can you use lxml or BeautifulSoup?
This is web scraping, not screen scraping.

Tags: python web-scraping

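【Solution 1】:

You could use Scrapy or another scraping library, but since you already have the content you need, plain string methods may be enough to pull out the values: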

a = '<option value="15589" id="optExhibitor15589" title="N571  Company One, Inc">N571 Company One, Inc</option>'
beginning = a.find('title=')  # index where 'title=' starts
end = a.find('">')  # index of the quote that closes the title attribute
# Skip the 6 characters of 'title=' and keep both surrounding quotes
print(a[beginning+6:end+1])

This gives the following output (the surrounding quotes are part of the slice): "N571  Company One, Inc"
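
For hundreds of tags, it may be easier to let an HTML parser find the attributes than to slice each line by hand. The following is a minimal sketch using only the standard library's html.parser module. The class name OptionTitleParser and the variable page_source are made up for illustration, and the sample markup is copied from the question. It assumes every name of interest lives in the title attribute of an option tag, and it collapses the double space inside each title.

```python
from html.parser import HTMLParser


class OptionTitleParser(HTMLParser):
    """Collects the title attribute of every <option> start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "option":
            title = dict(attrs).get("title")
            if title is not None:
                # Collapse runs of whitespace into a single space
                self.titles.append(" ".join(title.split()))


# Sample markup taken from the question (assumed to be representative)
page_source = """
<option value="15589" id="optExhibitor15589" title="N571  Company One, Inc">N571 Company One, Inc</option>
<option value="16441" id="optExhibitor16441" title="N873  Company Two">Company Two</option>
<option value="14863" id="optExhibitor14863" title="N219  Company Three">N219 Company Three</option>
"""

parser = OptionTitleParser()
parser.feed(page_source)
print(parser.titles)
# ['N571 Company One, Inc', 'N873 Company Two', 'N219 Company Three']
```

To run this on a saved page, read the whole file into a string and pass it to parser.feed() in place of page_source. Third-party libraries such as lxml or BeautifulSoup, mentioned in the comments on the question, can do the same job with less code.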

【Comments】: