How to extract only certain text from similar elements using BeautifulSoup and Python

【Question title】: How can I extract only certain text from similar elements using BeautifulSoup and Python
【Posted】: 2016-06-03 23:40:39
【Question】:

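The website I'm talking about: http://www.animenewsnetwork.com/encyclopedia/anime.php?id=160

You can't scrape this site with plain HTTP requests; it doesn't allow that. So I use Selenium. Now, my question:

I've been trying to get the text from the "Genres" field. As you can see, it is displayed on the page like this: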

Genres: adventure, comedy, science fiction

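The problem with scraping it is that each genre is wrapped in a link, so when I scrape the data I can't get just the text. The output also includes the classes and links attached to those genres.

My current code: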

driver.get('http://www.animenewsnetwork.com/encyclopedia/anime.php?id=160')

elem = driver.find_element_by_xpath("//*")
source_codeANN = elem.get_attribute("outerHTML")
soup2 = BeautifulSoup(source_codeANN, 'html.parser')
Genre = soup2.find_all('div',{'id':'infotype-30'})
print Genre

【Comments】:

Tags: python python-2.7 selenium web-scraping beautifulsoup

【Solution 1】:

Please try this:

driver.get("http://www.animenewsnetwork.com/encyclopedia/anime.php?id=160")
elem = driver.find_element_by_id("infotype-30")
print elem.text

【Comments】:

Thanks, but this appears to be Java. I don't work in Java.
That is definitely not Python.
Sorry, please check it now.

【Solution 2】:

If you have the following HTML:

<div id="infotype-30" class="encyc-info-type br same-width-as-main" style="width: auto;">
    <strong>Genres:</strong> 
    <span><a href="/encyclopedia/search/genreresults?w=series&amp;a=AA&amp;a=OC&amp;a=TA&amp;a=MA&amp;g=adventure/A&amp;o=rating" class="discreet">adventure</a></span>,
    <span><a href="/encyclopedia/search/genreresults?w=series&amp;a=AA&amp;a=OC&amp;a=TA&amp;a=MA&amp;g=comedy&amp;o=rating" class="discreet">comedy</a></span>,
    <span><a href="/encyclopedia/search/genreresults?w=series&amp;a=AA&amp;a=OC&amp;a=TA&amp;a=MA&amp;g=science%20fiction&amp;o=rating" class="discreet">science fiction</a></span>
</div>

You can get the text of the genre links like this:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.animenewsnetwork.com/encyclopedia/anime.php?id=160')
elem = driver.find_element_by_xpath("//*")
source_codeANN = elem.get_attribute("outerHTML")
soup2 = BeautifulSoup(source_codeANN, 'html.parser')
genre_div = soup2.find('div', id='infotype-30')
genres = [ a.text for a in genre_div.find_all('a') ]
print genres
# [u'adventure', u'comedy', u'science fiction']
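
The same extraction can be tried offline, without a browser, by feeding the markup straight to BeautifulSoup. The following is only a minimal sketch in Python 3: the `html` string is a simplified copy of the snippet above (the `href` values are shortened, which is an assumption made for readability), and it uses a CSS selector via `select()` in place of `find()` plus `find_all()`.

```python
from bs4 import BeautifulSoup

# Simplified copy of the markup shown above; the hrefs are shortened
# for readability (an assumption, not the site's real URLs).
html = """
<div id="infotype-30" class="encyc-info-type br same-width-as-main">
    <strong>Genres:</strong>
    <span><a href="/encyclopedia/search/genreresults?g=adventure" class="discreet">adventure</a></span>,
    <span><a href="/encyclopedia/search/genreresults?g=comedy" class="discreet">comedy</a></span>,
    <span><a href="/encyclopedia/search/genreresults?g=science%20fiction" class="discreet">science fiction</a></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <a> inside the <div> with id="infotype-30".
# get_text() returns only the link text, not the tag or its attributes.
genres = [a.get_text(strip=True) for a in soup.select("div#infotype-30 a")]

print(genres)             # ['adventure', 'comedy', 'science fiction']
print(", ".join(genres))  # adventure, comedy, science fiction
```

Printing the `div` itself, as in the question, shows the tags and attributes because the whole element is serialized; asking each `a` element for its text is what drops the classes and links.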

【Comments】:

【Solution 3】:

I'd suggest locating all following siblings of the strong element that has the text Genres: and joining their text:

", ".join(elm.text for elm in driver.find_elements_by_xpath("//strong[. = 'Genres:']/following-sibling::*"))

Demo:

>>> from selenium import webdriver
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://www.animenewsnetwork.com/encyclopedia/anime.php?id=160")  
>>> ", ".join(elm.text for elm in driver.find_elements_by_xpath("//strong[. = 'Genres:']/following-sibling::*"))
u'adventure, comedy, science fiction'
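
To make the `following-sibling::*` step in this answer more concrete, here is a rough standard-library sketch that walks the same logic by hand. It is an illustration only, not how Selenium evaluates XPath: the `snippet` string is a simplified, well-formed copy of the genre markup (with hypothetical shortened `href` values), and `xml.etree.ElementTree` stands in for the browser DOM.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the genre markup; the hrefs are
# hypothetical placeholders chosen for this sketch.
snippet = """
<div id="infotype-30">
    <strong>Genres:</strong>
    <span><a href="/g/adventure">adventure</a></span>,
    <span><a href="/g/comedy">comedy</a></span>,
    <span><a href="/g/science-fiction">science fiction</a></span>
</div>
"""

div = ET.fromstring(snippet)
children = list(div)  # element children only: <strong>, <span>, <span>, <span>

# Step 1: locate the <strong> whose text is "Genres:"
# (the XPath predicate [. = 'Genres:']).
label_index = next(
    i for i, child in enumerate(children)
    if child.tag == "strong" and (child.text or "").strip() == "Genres:"
)

# Step 2: take every element that comes after it under the same parent
# (the following-sibling::* axis).
following = children[label_index + 1:]

# Step 3: join the text inside each sibling. itertext() covers the text
# within the element but not its tail, so the literal commas between
# the <span> elements are not picked up.
genre_text = ", ".join("".join(el.itertext()).strip() for el in following)

print(genre_text)  # adventure, comedy, science fiction
```

Because `*` matches elements only, the bare commas between the `span` elements are never selected, which is why the answer adds its own `", "` separator when joining.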

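【Comments】:

It did the job, thanks. But could you explain what is happening here?
@user2408212 Sure. First of all, the main problem you had is that you actually need to get the .text of the elements you find. Here we use Selenium alone, locating the elements with an XPath expression. First, we find the strong element with the Genres: text and get all of its following siblings. Then we use a string join to concatenate the text of the elements we just found. Hope that helps.
Ah... I get it now. Thanks, Alex. And thank you very much for the explanation as well.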