Repeat text extraction with Python: answers

【Question title】: Repeat text extraction with Python
【Posted】: 2015-02-21 13:35:55
【Question】:

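I have the following code, which I want to use to extract the text between <font color='#FF0000'> and </font>. It works, but it extracts only one unit (the first), whereas I want to extract every unit of text between these tags. I tried wrapping the code in a bash loop to do this, but it did not work.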

import os

directory_path = 'C:\\My_folder\\tmp'

for files in os.listdir(directory_path):
    print(files)

    path_for_files = os.path.join(directory_path, files)

    # Read the whole file into memory.
    with open(path_for_files, mode='r', encoding='utf-8') as f:
        text = f.read()

    starting_tag = '<font color='
    ending_tag = '</font>'

    # find() returns only the first occurrence, so only one unit is extracted.
    ground = text[text.find(starting_tag):text.find(ending_tag)]

    results_dir = 'C:\\My_folder\\tmp'
    results_file = files[:-4] + 'txt'

    path_for_files = os.path.join(results_dir, results_file)

    with open(path_for_files, mode='w', encoding='UTF-8') as f:
        f.write(ground)

【Comments】:

I think that if you want more than one, you should use something like find_all.
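For readers without an HTML parser installed, the "find all" idea can be approximated with the standard library. The following is only a minimal sketch: the pattern `FONT_RE` and the helper `extract_all` are names invented here, and the regular expression assumes the tags look exactly like the one in the question (a single `color` attribute, no nested `<font>` elements). Regular expressions are fragile on real-world HTML, so a proper parser is usually the safer choice.

```python
import re

# Assumed pattern: a <font> tag whose only attribute is color="#FF0000",
# written with either single or double quotes. Hypothetical, not from the thread.
FONT_RE = re.compile(
    r"<font\s+color=(['\"])#FF0000\1\s*>(.*?)</font>",
    re.IGNORECASE | re.DOTALL,
)


def extract_all(text):
    """Return the text of every matching <font> element, in document order."""
    return [match.group(2).strip() for match in FONT_RE.finditer(text)]


if __name__ == "__main__":
    sample = "foo <font color='#FF0000'> foobar </font> bar <font color='#FF0000'>baz</font>"
    print(extract_all(sample))
```

Unlike `str.find`, which stops at the first hit, `finditer` walks through every non-overlapping match, which is what the question asks for.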

Tags: python xml bash loops text-extraction

【Solution 1】:

You can use Beautiful Soup's CSS selectors.

>>> from bs4 import BeautifulSoup
>>> s = "foo <font color='#FF0000'> foobar </font> bar"
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
...
 foobar

【Discussion】:

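Thanks for the suggestion, but I have trouble using BeautifulSoup. It is the same old problem: "ImportError: No module named BeautifulSoup", and none of the suggested solutions has worked for me.
You need to import BeautifulSoup. If it is not installed yet, install it.
Yes, I know. I did install it, but somehow I cannot import it. I have read various suggestions, but none of them worked for me. I now suspect the problem is that I have three Python versions installed on my computer. I have never had this problem with other packages.
Well, I managed to run BeautifulSoup on Cygwin, but I get an error: AttributeError: 'str' object has no attribute 'text'
I am using BeautifulSoup-3.2.1, the only version that runs on my machine.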
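If importing BeautifulSoup keeps failing, one possible workaround is to avoid third-party packages altogether. The sketch below uses only `html.parser` from the Python 3 standard library; the class `FontTextCollector` and the function `font_texts` are hypothetical names introduced for illustration, and the code assumes the goal is the text inside every `<font>` element whose `color` attribute equals `#FF0000` (compared case-insensitively).

```python
from html.parser import HTMLParser


class FontTextCollector(HTMLParser):
    """Collect the text inside every <font> element with a given color."""

    def __init__(self, color="#FF0000"):
        super().__init__(convert_charrefs=True)
        self.color = color.lower()
        self.depth = 0        # greater than 0 while inside a matching <font>
        self.results = []     # one string per matching element
        self._buffer = []     # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            return
        if self.depth:
            # A <font> nested inside a matching one: track it so that its
            # closing tag does not end the outer element too early.
            self.depth += 1
        elif (dict(attrs).get("color") or "").lower() == self.color:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer))

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def font_texts(html, color="#FF0000"):
    """Return the text of all matching <font> elements in document order."""
    parser = FontTextCollector(color)
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    print(font_texts("foo <font color='#FF0000'> foobar </font> bar"))
    # [' foobar ']
```

The parser keeps surrounding whitespace, as `i.text` does in the Beautiful Soup example; call `.strip()` on each item if that is not wanted.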

【Solution 2】:

You can also use lxml.html:

>>> import lxml.html as PARSER
>>> s = "<html><body>foo <font color='#FF0000'> foobar </font> bar</body></html>"
>>> root = PARSER.fromstring(s)
>>> for i in root.getiterator("font"):
...   try: i.attrib["color"]
...   except:pass

【Discussion】:

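Is 's' here an HTML file? Would this work if it were replaced with a directory containing a bunch of HTML or XML files? Also, what your script does is extract "#FF0000", whereas I want to extract the highlighted text between the color tags: text text text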
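"s" is the content of an HTML file. We have to apply a "for" loop over the HTML/XML files in the directory, using os.listdir("/tmp/target_html/") and a file read method. And yes, I missed the text of the "font" tag:
>>> root = PARSER.fromstring(s)
>>> for i in root.getiterator("font"):
...     try:
...         if i.attrib["color"] == "#FF0000":
...             print(i.text)
...     except:
...         pass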
Thanks for your reply. I am still very new to Python. Would you mind telling me how I should combine your suggestion, or @Avinash Raj's, with my script?
You can use either code, but test it on a few test cases (valid and invalid) before relying on it. Alternatively, share sample code with test cases so I can look at it and suggest a solution. vivekbsable@gmail.com/vivek.igp (Skype ID)
import lxml.html as PARSER

def getFontTagText(content):
    """Input: Html Content. Output: List of Font tag text list."""
    font_text = []
    root = PARSER.fromstring(content)
    for i in root.getiterator("font"):
        try:
            if i.attrib["color"] == "#FF0000":
                font_text.append(i.text)
        except:
            pass
    return font_text
I think my question was unclear and I confused you. I will try posting it as a separate question. Thanks.
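The open question in this thread is how to combine an extractor with the directory loop from the original script. One way this might look is sketched below. The function `extract_to_text_files` and its parameters are assumptions made for illustration; `extract` can be any callable that takes the file content and returns a list of strings, such as `getFontTagText` from the comment above. The suffix filter is also an assumption, added so that the `.txt` results are not read back in when the source and results directories are the same, as they are in the question.

```python
import os


def extract_to_text_files(source_dir, results_dir, extract,
                          suffixes=(".html", ".htm", ".xml")):
    """Run `extract` on each matching file and write one result per line.

    Returns the list of result file paths that were written.
    """
    os.makedirs(results_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(source_dir)):
        # Skip anything that is not an input file (for example old .txt results).
        if not name.lower().endswith(suffixes):
            continue
        with open(os.path.join(source_dir, name), encoding="utf-8") as fh:
            content = fh.read()
        # Drop empty entries: lxml reports text=None for an empty element.
        pieces = [piece.strip() for piece in extract(content) if piece]
        out_path = os.path.join(results_dir, os.path.splitext(name)[0] + ".txt")
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write("\n".join(pieces))
        written.append(out_path)
    return written
```

A call such as `extract_to_text_files('C:\\My_folder\\tmp', 'C:\\My_folder\\tmp', getFontTagText)` would then process every HTML or XML file in the folder from the question. Using `os.path.splitext` instead of `files[:-4]` avoids depending on the length of the file extension.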