【Question title】: How to get all links from pages?
【Posted】: 2020-02-02 20:20:27
【Question】:

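I need to parse all links from several pages. I wrote a simple script that uses async methods.

At the moment it returns an empty list, links. I expect all links from the pages to be collected into links and printed to the console.

My script does not produce any error messages.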

import asyncio
import aiohttp
from bs4 import BeautifulSoup


links = []
host = 'https://avito.ru/saransk'
search_words = [
    'asus',
    'lenovo',
    'xiaomi',
    'apple',
    'ipad',
]


def get_data(html_text):
    paths = []
    soup = BeautifulSoup(html_text, 'lxml')
    link_obj = soup.find_all('a')

    for path in link_obj:
        paths.append(path['href'])

    links.extend(paths)

    return links


async def get_html(search_word):
    async with aiohttp.ClientSession() as session:
        resp = await session.get(host + '?q=' + search_word)   
        assert resp.status == 200
        # print(await resp.text())
        resp2 = await get_data(resp.text())
        print('----------', resp2)


def main():
    ioloop = asyncio.get_event_loop()
    tasks = [ioloop.create_task(get_html(word)) for word in search_words]
    ioloop.run_until_complete(asyncio.wait(tasks))
    ioloop.close()
    print(links)


main()

I use Python 3.8 with the following requirements:

aiohttp==3.6.2
  - async-timeout [required: >=3.0,<4.0, installed: 3.0.1]
  - attrs [required: >=17.3.0, installed: 19.3.0]
  - chardet [required: >=2.0,<4.0, installed: 3.0.4]
  - multidict [required: >=4.5,<5.0, installed: 4.7.4]
  - yarl [required: >=1.0,<2.0, installed: 1.4.2]
    - idna [required: >=2.0, installed: 2.8]
    - multidict [required: >=4.0, installed: 4.7.4]
bs4==0.0.1
  - beautifulsoup4 [required: Any, installed: 4.8.2]
    - soupsieve [required: >=1.2, installed: 1.9.5]
fake-useragent==0.1.11
lxml==4.5.0
requests==2.22.0
  - certifi [required: >=2017.4.17, installed: 2019.11.28]
  - chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
  - idna [required: >=2.5,<2.9, installed: 2.8]
  - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.8]
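
A note on the missing error messages: a likely reason the script above fails silently is that asyncio.wait() does not re-raise exceptions from the tasks it waits on. They stay stored on the task objects until something reads them. The sketch below uses only the standard library to show the difference from asyncio.gather(), which does propagate the first exception. The failing_job coroutine and its message are hypothetical stand-ins, not part of the original script.

```python
import asyncio


async def failing_job(name):
    # Hypothetical stand-in for a coroutine that contains a bug.
    raise TypeError(f"bug in job {name!r}")


async def run_with_wait():
    tasks = [asyncio.ensure_future(failing_job(n)) for n in ("a", "b")]
    # asyncio.wait() returns (done, pending) and does not re-raise anything.
    done, _pending = await asyncio.wait(tasks)
    # The exceptions have to be read from each finished task explicitly.
    return [type(task.exception()).__name__ for task in done]


async def run_with_gather():
    # asyncio.gather() re-raises the first exception in the awaiting code.
    try:
        await asyncio.gather(failing_job("a"), failing_job("b"))
    except TypeError as exc:
        return f"raised: {exc}"
    return "no error"


wait_result = asyncio.run(run_with_wait())
gather_result = asyncio.run(run_with_gather())
print(wait_result)
print(gather_result)
```

With asyncio.wait() the caller only sees the failures if it inspects the finished tasks, which the script in the question never does.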

【Question discussion】:

    Tags: python python-3.x beautifulsoup async-await


    【Solution 1】:

    Try this.

    from simplified_scrapy.request import req
    from simplified_scrapy.simplified_doc import SimplifiedDoc
    url = 'https://avito.ru/saransk?q=asus'
    html = req.get(url) 
    doc = SimplifiedDoc(html)
    print(doc.listA(url=url))
    

    Here is an example that uses the simplified_scrapy framework.

    from simplified_scrapy.spider import Spider, SimplifiedDoc
    class MySpider(Spider):
      name = 'avito.ru'
      allowed_domains = ['avito.ru']
      # concurrencyPer1s=1
      refresh_urls = True # For debug. If refresh_urls = True, start_urls will be crawled again.
      def __init__(self):
        host = 'https://avito.ru/saransk'
        search_words = ['asus', 'lenovo', 'xiaomi', 'apple', 'ipad']
        self.start_urls = [host+'?q='+w for w in search_words] # Initialize variable start_urls
        Spider.__init__(self,self.name) #necessary
    
      def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        print (doc.listA(url=url['url']))
        # return {"Urls": doc.listA(url=url['url']), "Data": None} # Return data to framework
        return True
    
    from simplified_scrapy.simplified_main import SimplifiedMain
    SimplifiedMain.startThread(MySpider()) # Start crawling
    

    More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo
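
If you would rather stay with aiohttp and BeautifulSoup, the following is a minimal sketch of how the script from the question might be repaired. In the original, resp.text() is a coroutine that is never awaited, while get_data is a plain function that is awaited by mistake. The names extract_links and fetch_links are assumptions made for this sketch, and html.parser is used in place of lxml only to avoid an extra dependency. Whether the site answers automated requests with a 200 status is not verified here.

```python
import asyncio
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup

HOST = 'https://avito.ru/saransk'
SEARCH_WORDS = ['asus', 'lenovo', 'xiaomi', 'apple', 'ipad']


def extract_links(html_text, base_url):
    """Return an absolute URL for every <a> tag that has an href."""
    soup = BeautifulSoup(html_text, 'html.parser')
    # href=True skips anchors without an href, which would raise KeyError.
    return [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]


async def fetch_links(session, search_word):
    async with session.get(HOST, params={'q': search_word}) as resp:
        resp.raise_for_status()
        html_text = await resp.text()  # text() is a coroutine, so await it
    # extract_links is a regular function, so it is called without await.
    return extract_links(html_text, str(resp.url))


async def main():
    # One shared session for all requests instead of one per search word.
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(
            *(fetch_links(session, word) for word in SEARCH_WORDS)
        )
    # Flatten the per-page lists into a single list of links.
    return [link for page in pages for link in page]


# To run against the live site:
# links = asyncio.run(main())
# print(links)
```

Returning the links from each coroutine and collecting them with asyncio.gather() avoids the shared global list, and any exception raised while fetching or parsing reaches the caller instead of staying hidden inside a task.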

    【Discussion】:
