【Question title】: Web scraping with Python Beautiful Soup
【Posted】: 2018-06-18 07:42:25
【Question description】:

Here is the HTML code:

<html>
<head></head>
<body>
<pre style="word-wrap: break-word; white-space: pre-wrap;">
"{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
</pre>
</body>
</html>

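I need to scrape only the data I want. For example, I only need the author's name from it.

【Discussion】:

    Tags: python-3.x web-scraping beautifulsoup


    【Solution 1】:

    Strip the tags and convert the JSON string to a Python dict: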

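    import json
    from bs4 import BeautifulSoup

    # html holds the page source shown in the question
    soup = BeautifulSoup(html, "html.parser")
    # drop surrounding whitespace, then the stray outer double quotes
    text = soup.get_text().strip().strip('"')
    d = json.loads(text)
    print(d['Author'])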
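
A self-contained version of the same idea may be easier to try out. This is only a sketch: the sample page is copied from the question, and the helper name pre_json_to_dict is made up for illustration. It reads the first <pre> tag instead of the whole page text, which assumes the JSON always sits in that tag.

```python
import json
from bs4 import BeautifulSoup

# Sample page copied from the question.
HTML = """<html>
<head></head>
<body>
<pre style="word-wrap: break-word; white-space: pre-wrap;">
"{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
</pre>
</body>
</html>"""


def pre_json_to_dict(html):
    """Parse the JSON object held in the first <pre> tag (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    pre = soup.find("pre")
    if pre is None:
        raise ValueError("no <pre> tag found")
    # Drop surrounding whitespace, then the stray outer double quotes.
    text = pre.get_text().strip().strip('"')
    return json.loads(text)


book = pre_json_to_dict(HTML)
print(book["Author"])
```

Using strip() rather than fixed slice offsets means the code does not depend on exactly how many newline characters surround the quoted JSON.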
    

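    【Comments】:

      【Solution 2】:

      @vijay, print(json.loads(soup.find("pre").string[2:-2])["Author"]) will do the job. See the code below, executed in a Python interactive session. The u'...' prefixes and <type 'dict'> in the output come from Python 2; Python 3 shows plain strings and <class 'dict'>.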

      >>> import json
      >>> import requests
      >>> from bs4 import BeautifulSoup
      >>>
      >>> html_text = """<html>
      ... <head></head>
      ... <body>
      ... <pre style="word-wrap: break-word; white-space: pre-wrap;">
      ... "{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
      ... </pre>
      ... </body>
      ... </html>"""
      >>>
      >>> soup = BeautifulSoup(html_text, "html.parser")
      >>> print(soup.prettify())
      <html>
       <head>
       </head>
       <body>
        <pre style="word-wrap: break-word; white-space: pre-wrap;">
      "{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
      </pre>
       </body>
      </html>
      >>>
      >>> print(soup.find("pre"))
      <pre style="word-wrap: break-word; white-space: pre-wrap;">
      "{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
      </pre>
      >>>
      >>> print(soup.find("pre").string)
      
      "{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}"
      
      >>> print(soup.find("pre").string[2:-2])
      {"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"}
      >>>
      >>> d = json.loads(soup.find("pre").string[2:-2])
      >>> type(d)
      <type 'dict'>
      >>>
      >>> d
      {u'Author': u'Chetan Bhagat', u'Year': u'2016', u'Title': u'One Indian Girl'}
      >>>
      >>> d["Author"]
      u'Chetan Bhagat'
      >>>
      >>> d["Year"]
      u'2016'
      >>>
      >>> d["Title"]
      u'One Indian Girl'
      >>>
      >>> # Place all in the list
      ...
      >>> l = [d["Title"], d["Year"], d["Author"]]
      >>> l
      [u'One Indian Girl', u'2016', u'Chetan Bhagat']
      >>>
      

      » Get the data in a list without referring to the dictionary keys as above.

      >>> final_data = [str(a.strip().split(":")[1])  for  a in soup.find("pre").string[2:-3].replace('\"', '').split(",")]
      >>>
      >>> final_data
      ['One Indian Girl', '2016', 'Chetan Bhagat']
      >>>
      

      Let's walk through the one-line approach above step by step to see how the list is built (update).

      >>> data = soup.find("pre").string[2:-3]
      >>> data
      u'{"Title":"One Indian Girl","Year":"2016","Author":"Chetan Bhagat"'
      >>>
      >>> data = data.replace('\"', '')
      >>> data
      u'{Title:One Indian Girl,Year:2016,Author:Chetan Bhagat'
      >>>
      >>> arr = data.split(",")
      >>> arr
      [u'{Title:One Indian Girl', u'Year:2016', u'Author:Chetan Bhagat']
      >>>
      >>> final_data = [str(a.strip().split(":")[1])  for  a in arr]
      >>> final_data
      ['One Indian Girl', '2016', 'Chetan Bhagat']
      >>>
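
The split-based steps above work for this record, but they depend on no value containing a colon or a comma. The sketch below uses an invented title (the ": A Novel" suffix is hypothetical, not from the question) to compare the split result with the values returned by json.loads.

```python
import json

# Hypothetical record: the title contains a colon.
raw = '{"Title":"One Indian Girl: A Novel","Year":"2016","Author":"Chetan Bhagat"}'

# Split-based extraction, following the same steps as above:
# drop the closing brace, remove the quotes, split on commas, then on colons.
stripped = raw[:-1].replace('"', '')
by_split = [part.strip().split(":")[1] for part in stripped.split(",")]

# JSON-based extraction keeps each value intact.
by_json = list(json.loads(raw).values())

print(by_split)
print(by_json)
```

The split version silently cuts the title at the colon, and a comma inside a value would break it into an extra piece, so parsing the JSON is the safer route when the values are not known in advance.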
      

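      【Comments】:

      • What if I need to get all the answers in a Python list, like [One Indian Girl, 2016, Chetan Bhagat]? Only the answers, not the headings. @Rishikesh Agrawani
      • @vijay, just use l = [d["Title"], d["Year"], d["Author"]]. I have updated the answer. See the end.
      • I don't want to mention the names of the headings. I have similar HTML pages with more than 50 headings, so I don't want to list each one. Is there any other way, @Rishikesh Agrawani?
      • @vijay, can you explain what a heading is? Please paste an example so that I can understand.
      • Sorry. A heading is a key in the dictionary, like "Title" in "Title": "One Indian Girl". The same goes for all the others. I have more than 50 key-value pairs. @Rishikesh Agrawani
      【Solution 3】:

      This is what I wanted.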

      import json
      from bs4 import BeautifulSoup as soup  # alias implied by the call below

      exampleSoup = soup(page_html, 'html.parser')
      text = exampleSoup.get_text().strip().strip('"')
      elems = json.loads(text)
      Details = list(elems.values())
      for i in Details:
          print(i)
      

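      elems gives us the dictionary.

      Details holds the values taken from the dictionary's key-value pairs.

      The for loop is used to get each element separately.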
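
For several similar pages with many keys, the same steps can be wrapped in a function and applied to each page. The following is a sketch under assumptions: values_from_page and make_page are hypothetical helpers, and the second record is invented for illustration. It relies on dicts keeping insertion order, which holds on Python 3.7 and later.

```python
import json
from bs4 import BeautifulSoup


def values_from_page(page_html):
    """Return every value of the JSON object in the page, in document order."""
    text = BeautifulSoup(page_html, "html.parser").get_text().strip().strip('"')
    return list(json.loads(text).values())


def make_page(record):
    """Wrap a dict the way the question's page does (hypothetical test helper)."""
    return '<html><body><pre>\n"' + json.dumps(record) + '"\n</pre></body></html>'


pages = [
    make_page({"Title": "One Indian Girl", "Year": "2016", "Author": "Chetan Bhagat"}),
    # Invented record with an extra key, to show that no key is named anywhere.
    make_page({"Title": "Example Book", "Year": "2020", "Author": "Jane Doe", "Pages": "300"}),
]

rows = [values_from_page(page) for page in pages]
for row in rows:
    print(row)
```

Because no key is spelled out, the function returns however many values each page contains, whether that is 3 or more than 50.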

      【Comments】:
