【Question Title】:How do I loop through a list of URLs to print <p> tags with BeautifulSoup?
【Posted】:2019-09-03 15:48:05
【Question】:

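I just discovered BeautifulSoup 4. I have a lot of links, and I want to print the <p> tags of multiple websites at once, but I don't know how to do that because I'm a beginner. I couldn't find anything on Stack Overflow that works for me either.
Something like this doesn't work: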

from bs4 import BeautifulSoup
import requests
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
url = ["http://fc.lc/api?api=9053290fd05b5e5eb091b550078fa1e30935c92c&url=https://wow-ht.ml?s=https://cutlinks.pro/api?api=e6a8809e51daedcf30d9d6270fd0bfeba73c1dcb&url=https://google.com=text&format=text", "http://fc.lc/api?api=9053290fd05b5e5eb091b550078fa1e30935c92c&url=https://wow-ht.ml?s=https://cutlinks.pro/api?api=e6a8809e51daedcf30d9d6270fd0bfeba73c1dcb&url=https://example.com&format=text&format=text"]

# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print( soup.find('p').text )

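The error I get (I didn't expect this to work; pointing me to a possible duplicate about this error won't help me, so please read the question in the title first):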

Traceback (most recent call last):
  File "C:\Users\Gebruiker\Desktop\apitoshortened.py", line 10, in <module>
    r = requests.get(url, headers=headers)
  File "C:\Users\Gebruiker\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Gebruiker\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Gebruiker\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Gebruiker\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Users\Gebruiker\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '['http://fc.lc/api?api=9053290fd05b5e5eb091b550078fa1e30935c92c&url=https://wow-ht.ml?s=https://cutlinks.pro/api?api=e6a8809e51daedcf30d9d6270fd0bfeba73c1dcb&url=https://google.com=text&format=text', 'http://fc.lc/api?api=9053290fd05b5e5eb091b550078fa1e30935c92c&url=https://wow-ht.ml?s=https://cutlinks.pro/api?api=e6a8809e51daedcf30d9d6270fd0bfeba73c1dcb&url=https://example.com&format=text&format=text']'

I really didn't expect this to be so difficult. Any help would be appreciated!

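【Comments】:

  • @Trenton_M Did you read what I wrote? I already fixed that. My problem is that I can't print the <p> of multiple URLs; it has nothing to do with the <script> tag, which I never mentioned!
  • Use a for-loop to work through the list of links (or any other list).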

Tags: python python-3.x beautifulsoup


【Solution 1】:

If you have a list, use a for loop:

for item in url:
    r = requests.get(item, headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    print(soup.find('p').text)

By the way: your URLs don't return any HTML, only some text with a link, so the code can't find a <p>.

To see the returned text:

for item in url:
    r = requests.get(item, headers=headers)
    print(r.text)    

Result:

https://fc.lc/C4FNiXbY
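
Because the response can be plain text rather than HTML, one option is to fall back to the raw body when no <p> is present. The following is only a minimal sketch under some assumptions: `paragraph_or_text` is a hypothetical helper name, it uses Python's built-in `html.parser` rather than `lxml` (different parsers can treat bare text differently), and the strings passed in at the bottom are offline stand-ins for `r.text`.

```python
import warnings

from bs4 import BeautifulSoup


def paragraph_or_text(body):
    """Return the text of the first <p> in body, or the stripped body if there is none."""
    with warnings.catch_warnings():
        # bs4 may warn when the input looks like a URL instead of markup;
        # that is expected here, so ignore UserWarning inside this block only.
        warnings.simplefilter("ignore", UserWarning)
        soup = BeautifulSoup(body, "html.parser")
    p = soup.find("p")
    if p is not None:
        return p.get_text(strip=True)
    return body.strip()


# Offline samples standing in for r.text inside the for loop.
print(paragraph_or_text("<html><body><p>Hello</p></body></html>"))
print(paragraph_or_text("https://fc.lc/C4FNiXbY\n"))
```

Inside the loop this would be called as `print(paragraph_or_text(r.text))`.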

【Comments】:

    【Solution 2】:

    Use a for loop, then check whether a <p> tag exists. If it does, print its text.

    from bs4 import BeautifulSoup
    import requests
    import warnings
    
    warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
    urls = ["http://fc.lc/api?api=9053290fd05b5e5eb091b550078fa1e30935c92c&url=https://wow-ht.ml?s=https://cutlinks.pro/api?api=e6a8809e51daedcf30d9d6270fd0bfeba73c1dcb&url=https://google.com=text&format=text", "http://fc.lc/api?api=9053290fd05b5e5eb091b550078fa1e30935c92c&url=https://wow-ht.ml?s=https://cutlinks.pro/api?api=e6a8809e51daedcf30d9d6270fd0bfeba73c1dcb&url=https://example.com&format=text&format=text"]
    
    # add header (defined once, outside the loop)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
    for url in urls:
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.content, "lxml")
        # look up the first <p> once and print it only if it exists
        p = soup.find('p')
        if p:
            print(p.text)
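
If the goal is to print every <p> on each page rather than only the first one, `find_all` can replace `find`. The sketch below is one possible arrangement, not the answerer's code: the names `paragraphs` and `print_paragraphs`, the shortened `HEADERS` value, the 10-second timeout, and the choice of `html.parser` are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup

# Shortened User-Agent used only for this sketch.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def paragraphs(html):
    """Return the text of every <p> in html (an empty list if there are none)."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]


def print_paragraphs(urls, timeout=10):
    """Fetch each URL in turn and print its <p> texts, skipping failed requests."""
    with requests.Session() as session:
        session.headers.update(HEADERS)
        for url in urls:
            try:
                r = session.get(url, timeout=timeout)
                r.raise_for_status()
            except requests.RequestException as exc:
                # Report the failure and move on to the next URL.
                print(f"{url}: request failed ({exc})")
                continue
            for text in paragraphs(r.text):
                print(text)
```

Calling `print_paragraphs(urls)` with the list from the question would then process the links one by one; a URL that fails does not stop the rest of the loop.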
    

    【Comments】:
