【Question title】: How to scrape for a particular search item?
【Posted】: 2021-08-04 15:09:45
【Description】:

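I am trying to scrape this site to get a specific search item available under the "By category" search option. However, I am not getting a response from the site that I can scrape; it looks like a background call is being made.

I have tried setting the value in the request (for example, '10'), but it does not work.

I would like to know how to set the category so that I get the desired page.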

import requests
from bs4 import BeautifulSoup
import csv

#filename='comp1.csv'
#f=open(filename,'w')
#company=csv.writer(f)


# The variable holds the response body (HTML text), not a URL
html=requests.get("http://www.businessdirectoryoman.com/search.php").text
soup=BeautifulSoup(html,'lxml')
links=soup.find_all('div',class_='BdCoTitle')
for i in links:
    print(i.text)

<div class="searchcoll">
                <div class="searchcol1">By category
                <select name="searh_category" id="searh_category" style="width: 250px" ;="" onchange="getdat(this.value)">
<option value="0" selected="selected">All Categories</option>
<option value="10">Abrasives</option>
<option value="11">Access Controls &amp; Attendance Systems</option>
<option value="15">Access Platforms</option>
<option value="20">Accommodation &amp; Office Rentals </option>
<option value="0" selected="selected">Accountancy Training</option>
<option value="30">Accountants &amp; Auditors</option>
<option value="40">Acrylic Products</option>
<option value="50">Acu Cure</option>
<option value="60">Adhesives</option>
</select>
</div>
<div class="searchcol2">Or by company name:
   <input name="search1" type="text" id="search1" size="55" placeholder="Search for Company">

   </div>
</div>
Using Python and Beautiful Soup.

【Comments】:

  • Please mention in the question what you have tried so far. Also, please read How to Ask.

Tags: python web-scraping beautifulsoup


【Solution 1】:

If you want to scrape the value and text of each option from the select you provided, you can try this:

from bs4 import BeautifulSoup

html="""
<div class="searchcoll">
                <div class="searchcol1">By category
                <select name="searh_category" id="searh_category" style="width: 250px" ;="" onchange="getdat(this.value)">
<option value="0" selected="selected">All Categories</option>
<option value="10">Abrasives</option>
<option value="11">Access Controls &amp; Attendance Systems</option>
<option value="15">Access Platforms</option>
<option value="20">Accommodation &amp; Office Rentals </option>
<option value="0" selected="selected">Accountancy Training</option>
<option value="30">Accountants &amp; Auditors</option>
<option value="40">Acrylic Products</option>
<option value="50">Acu Cure</option>
<option value="60">Adhesives</option>
</select>
</div>
<div class="searchcol2">Or by company name:
   <input name="search1" type="text" id="search1" size="55" placeholder="Search for Company">

   </div>
</div>
"""
soup=BeautifulSoup(html,"lxml")

select_option=soup.find("select",{"id":"searh_category"}).find_all("option")
for option in select_option:
    print(option["value"],option.text)
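
Building on the loop above, the option list could also be turned into a lookup from category label to submitted value, so that a label such as "Abrasives" can be resolved to "10" before making a request. This is only a minimal sketch: `category_lookup` and the shortened `sample` markup are illustrative and not part of the original answer, and it uses the built-in "html.parser" backend instead of lxml.

```python
from bs4 import BeautifulSoup


def category_lookup(html):
    """Map each category label to the value its <option> would submit."""
    soup = BeautifulSoup(html, "html.parser")
    select = soup.find("select", {"id": "searh_category"})
    if select is None:
        # The markup does not contain the category dropdown
        return {}
    lookup = {}
    for option in select.find_all("option"):
        label = option.get_text(strip=True)
        value = option.get("value")
        # Skip options that carry no value attribute
        if value is not None:
            lookup[label] = value
    return lookup


# Shortened copy of the dropdown from the question (illustrative data)
sample = """
<select name="searh_category" id="searh_category">
<option value="0" selected="selected">All Categories</option>
<option value="10">Abrasives</option>
<option value="11">Access Controls &amp; Attendance Systems</option>
</select>
"""

categories = category_lookup(sample)
print(categories["Abrasives"])
```

If two options share the same label, the later one overwrites the earlier one in this sketch.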

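【Comments】:

  • You say they load that data with ajax, but it is not loaded with ajax. You can research a bit more there and get it from their HTML. I expected you to do at least that, since you did not provide any code. This is as much of an answer as I can give. @Demonslayer :)

【Solution 2】:

You can use the following example to get information from all pages under one category: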

import requests
from bs4 import BeautifulSoup


api_url = "http://www.businessdirectoryoman.com/search.php?page={}"
data = {"searh_category": "10"}  # <-- category 10 == Abrasives

page = 1

with requests.session() as s:
    while True:
        soup = BeautifulSoup(
            s.post(api_url.format(page), data=data).content, "html.parser"
        )

        titles = soup.select(".BdCoTitle")

        if not titles:
            break

        for title in titles:
            print(title.get_text(strip=True))

        page += 1

Prints:

Abrasives Manufacturing Co (SAOG)
Al Hassan Group of Companies - Al Hassan Technical & Construction Supplies LLC
Arabian Engineering Services LLC  - Arabian Engg Services (Tyres, Batteries & Allied Prod Div)
Das Investment & Transport Co LLC
Gulf Services & Industrial Supplies Co LLC - Industrial Tools & Equipment
International Automobiles & Parts Co LLC (Parts/Tyres, Battries & Allied Products Div) - International Automobiles & Parts Co LLC (Parts/Tyres, Batteries & Allied Products Div)
Middle East Fuji Khimji’s Co LLC
Suhail Bahwan Group (Holding) LLC - Bahwan Building Materials LLC
Technical Building Materials
Technical Trading Co LLC
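
The question's commented-out lines suggest the results were meant to end up in a CSV file. As a rough sketch of how the paging loop above might be separated from the HTTP call and its results written with the csv module: `collect_titles`, `write_titles`, `fake_fetch`, and the `max_pages` cap are hypothetical names introduced here, and `fake_fetch` stands in for a function that would wrap the `s.post(...)` request and return the text of each `.BdCoTitle` element.

```python
import csv
import io


def collect_titles(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty.

    max_pages is an assumed safety cap so the loop cannot run forever
    if the server keeps returning results.
    """
    titles = []
    for page in range(1, max_pages + 1):
        page_titles = fetch_page(page)
        if not page_titles:
            break
        titles.extend(page_titles)
    return titles


def write_titles(titles, handle):
    """Write one company title per row to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["company"])
    for title in titles:
        writer.writerow([title])


# Stand-in for the real request: page number -> titles on that page
fake_pages = {
    1: ["Abrasives Manufacturing Co (SAOG)", "Technical Trading Co LLC"],
    2: ["Technical Building Materials"],
}


def fake_fetch(page):
    return fake_pages.get(page, [])


titles = collect_titles(fake_fetch)
buffer = io.StringIO()
write_titles(titles, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, the handle could come from `open("comp1.csv", "w", newline="", encoding="utf-8")`, reusing the file name from the question.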

【Comments】:
