【Question title】: How to use the br tag in web scraping to get better output?
【Posted】: 2021-10-01 20:56:23
【Question】:

I am trying to scrape this site (Link). I want to scrape one particular section; its HTML is below:

<div style="padding:20px;">
  <h1>
    ABDULLA SALEM CONTRACTING EST
  </h1>
  <strong>
   <a href="directory/umm-al-quwain/umm-al-quwain/building-contractors.html" title="Building 
   Contractors in Umm Al Quwain">
      Building Contractors
   </a>
</strong>
  <br> P.O. Box: 200
  <br> Location: Umm Al Quwain
  <br> Phone: 06-7655445
</div>

import requests
import re
import csv
from bs4 import BeautifulSoup


def comp_links():
    # Fetch the category page and collect the links to the individual company pages.
    url = requests.get("https://www.uae-business-directory.com/directory/umm-al-quwain/umm-al-quwain/building-contractors.html").text
    soup = BeautifulSoup(url, 'lxml')
    links = soup.find_all('a', attrs={'href': re.compile("^directory/umm-al-quwain/umm-al-quwain/building-contractors/")})
    return links


def comp_details(z):
    filename = 'comp.csv'
    f = open(filename, 'w')
    music = csv.writer(f)

    a = []

    def email_format():
        # Rebuild the e-mail address from the src of the image inside the details div.
        if 'E-Mail' in details.text:
            mail = details.img['src']
            email = mail.replace('typo3temp/GB/', '').replace('%40', '@').split('_')[0]
            return email

    for i in z:
        comp = requests.get('https://www.uae-business-directory.com/' + i['href']).text
        soup_comp = BeautifulSoup(comp, 'lxml')
        details = soup_comp.find('div', class_='details')
        for i in details:
            print(i.text)
            music.writerow([i.get_text(), email_format()])  # Writing to CSV


z = comp_links()
comp_details(z)

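The output looks like this:

ABDULLA SALEM CONTRACTING ESTBuilding ContractorsP.O. Box: 200Location: Umm Al QuwainPhone: 06-7655445

How can I get it like this instead:

  • ABDULLA SALEM CONTRACTING EST
  • Building Contractors
  • P.O. Box: 200
  • Location: Umm Al Quwain
  • Phone: 06-7655445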

【Comments】:

    Tags: python web-scraping beautifulsoup scrapy web-scraping-language


    【Solution 1】:

    Try:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.uae-business-directory.com/directory/umm-al-quwain/umm-al-quwain/building-contractors/abdulla-salem-contracting-est.html"
    
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(soup.h1.parent.get_text(strip=True, separator="\n"))
    

    Prints:

    ABDULLA SALEM CONTRACTING EST
    Building Contractors
    P.O. Box: 200
    Location: Umm Al Quwain
    Phone: 06-7655445
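
Since the question also writes the result to a CSV file, the newline-separated text above can be split into fields before writing. The following is only a sketch under stated assumptions: it assumes the first line is always the company name and the second the category, and that every remaining line has the form "Label: value". The helper name `parse_details` and the column names `Name` and `Category` are illustrative and do not come from the original post.

```python
import csv
import io

# Text as printed by get_text(strip=True, separator="\n") above.
text = """ABDULLA SALEM CONTRACTING EST
Building Contractors
P.O. Box: 200
Location: Umm Al Quwain
Phone: 06-7655445"""


def parse_details(text):
    """Split the block into a name, a category and 'Label: value' fields.

    Hypothetical helper: assumes line 1 is the name and line 2 the category.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    record = {"Name": lines[0], "Category": lines[1]}
    for line in lines[2:]:
        # partition() splits on the first colon only, so a value keeps any later colons.
        label, sep, value = line.partition(":")
        if sep:
            record[label.strip()] = value.strip()
    return record


record = parse_details(text)

# An in-memory buffer keeps the sketch self-contained; for a real file,
# use open("comp.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

When scraping many pages, the set of labels may differ from page to page, so it is probably safer to fix the `fieldnames` list up front rather than derive it from a single record.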
    

    【Comments】:

      【Solution 2】:

      Since the question is tagged scrapy, you can try this:

      details = response.css(".details ::text").getall()
      

      This collects every text node inside the div with class details into the list details. On inspection, details has the following structure:

      ['\n',
       '\n',
       '<!--\ngoogle_ad_client = "ca-pub-7955553446826172";\ngoogle_ad_slot = "2007388357";\ngoogle_ad_width = 300;\ngoogle_ad_height = 600;\n//-->\n',
       '\n',
       '\n',
       '\n',
       '\n',
       'ABDULLA SALEM CONTRACTING EST',
       'Building Contractors',
       'P.O. Box: 200',
       'Location: Umm Al Quwain',
       'Phone: 06-7655445']
      

      You can use details[-5:] to get just the last five items. It returns:

      ['ABDULLA SALEM CONTRACTING EST',
       'Building Contractors',
       'P.O. Box: 200',
       'Location: Umm Al Quwain',
       'Phone: 06-7655445']
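
Slicing with `details[-5:]` assumes every company page has exactly five lines of interest; a page with an extra field (the question's code checks for an 'E-Mail' entry, for example) would lose a line. As a hedged alternative, the list can be filtered by content instead of by position. The helper name `clean_text_nodes` is hypothetical, and the rule "skip anything starting with `<!--`" is an assumption based only on the sample output shown above.

```python
def clean_text_nodes(nodes):
    """Drop whitespace-only strings and the commented-out ad script.

    Hypothetical helper; the "<!--" check is based on the sample list only.
    """
    cleaned = []
    for node in nodes:
        text = node.strip()
        if not text or text.startswith("<!--"):
            continue
        cleaned.append(text)
    return cleaned


# The list returned by response.css(".details ::text").getall() in the answer.
details = [
    '\n',
    '\n',
    '<!--\ngoogle_ad_client = "ca-pub-7955553446826172";\ngoogle_ad_slot = "2007388357";\ngoogle_ad_width = 300;\ngoogle_ad_height = 600;\n//-->\n',
    '\n',
    '\n',
    '\n',
    '\n',
    'ABDULLA SALEM CONTRACTING EST',
    'Building Contractors',
    'P.O. Box: 200',
    'Location: Umm Al Quwain',
    'Phone: 06-7655445',
]

print(clean_text_nodes(details))
```

For this sample the filtered list is the same as `details[-5:]`, but the filter does not depend on how many fields a page happens to have.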
      

      【Comments】:
