How to scrape all the tags so that the scraped data keeps the HTML structure

【Question title】: How to scrape all the tags so that the scraped data keeps the HTML structure
【Posted】: 2015-11-26 20:02:54
【Question】:

<html>my news article</html>
<title>scraping</title>
<p>the world of so many articles</p>
<p>has been placed in this blocknotes</p>
<p>and i really wanna scraped that html structure just as it is</p>
<p>with all the tags in the scraped data</p>

How do I scrape all the tags?

I want the scraped data to look like...........

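【Comments】:

Do you want to get the text inside each tag?
I mean, what are you trying to achieve? Maybe post a sample output?
I want the scraped data to look like......
the world of so many articles

has been placed in this blocknotes

and i really wanna scraped that html structure just as it is

with all the tags in the scraped data
But in what form? All the tags together in one string? Each tag as a separate item in a list?
I want all the tags kept in the data, so that I can use the article on a website with its structure preserved.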

Tags: php web-scraping

【Solution 1】:

This Python script may help:

from lxml import html

HTML = """<html>
<title>scraping</title>
<p>the world of so many articles</p>
<p>has been placed in this blocknotes</p>
<p>and i really wanna scraped that html structure just as it is</p>
<p>with all the tags in the scrapped data</p>
</html>"""

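# Parse the document, then rebuild each paragraph from its text content
tree = html.fromstring(HTML)
# print() call form so the script also runs on Python 3
print(' '.join("<p>{}</p>".format(x) for x in tree.xpath('//p/text()')))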

Output:

<p>the world of so many articles</p> <p>has been placed in this blocknotes</p> <p>and i really wanna scraped that html structure just as it is</p> <p>with all the tags in the scrapped data</p>

【Comments】:

Please note that I'm using PHP cURL and DOMXPath, not Python.