【Question title】: Scraping text in h3 and div tags using BeautifulSoup, Python
【Posted】: 2018-04-06 14:49:31
【Question】:

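I have no experience with Python, BeautifulSoup, Selenium, etc., but I am eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is marked up as follows (one row of data).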

<div class="box effect">
<div class="row">
<div class="col-lg-10">
    <h3>HEADING</h3>
        <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
        <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
        <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
        <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    <div class="space">&nbsp;</div>

<div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> &nbsp;more info</a></div>

</div>
<div class="col-lg-2">

</div>
</div>
</div>

The output I need is Heading,NAME,MOBILE,NUMBER,XYZ_ADDRESS

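I found that this data has no id or class and appears on the site as plain text. I am trying BeautifulSoup and Python Selenium separately for this, and I am stuck with both approaches, because I have not found any tutorial that shows how to extract text from these h3 and div tags.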

My code using BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup
import requests
import csv

MAX = 2

'''with open("lg.csv", "a") as f:
  w=csv.writer(f)'''
##for i in range(1,MAX+1)
url="http://www.example_site.com"

page=requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")

for h in soup.find_all('h3'):
    print(h.get('h3'))

My Selenium code:

import csv
from selenium import webdriver
MAX_PAGE_NUM = 2
driver = webdriver.Firefox()
for i in range(1, MAX_PAGE_NUM+1):
  url = "http://www.example_site.com"
  driver.get(url)
  name = driver.find_elements_by_xpath('//div[@class = "col-lg-10"]/h3')
  #contact = driver.find_elements_by_xpath('//span[@class="item-price"]')
#  phone = 
#  mobile = 
#  address =
#  print(len(buyers))
#  num_page_items = len(buyers)
#  with open('res.csv','a') as f:
#    for i in range(num_page_items):
#      f.write(buyers[i].text + "," + prices[i].text + "\n")
  print (name)          
driver.close()

【Comments】:

    Tags: python html selenium beautifulsoup web-crawler


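    【Solution 1】:

    You can use a CSS selector to find the data you need. In your case, div > h3 ~ div finds every div element that follows an h3 element as a sibling, where that h3 is a direct child of a div element.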

    import bs4
    
    page= """
    <div class="box effect">
    <div class="row">
    <div class="col-lg-10">
        <h3>HEADING</h3>
        <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
        <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
        <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
        <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    </div>
    </div>
    </div>
    """
    
    soup = bs4.BeautifulSoup(page, 'lxml')
    
    # find all div elements that are siblings following an h3 element,
    # where that h3 is a direct child of a div element
    selector = 'div > h3 ~ div'
    
    # find elements that contain the data we want
    found = soup.select(selector)
    
    # Extract the text; strip() also removes the non-breaking spaces (&nbsp;)
    data = [x.text.strip() for x in found]
    
    for x in data:
        print(x)
    

    Edit: to scrape the text in the heading:

    heading = soup.find('h3') 
    heading_data = heading.text
    print(heading_data)
    

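    Edit: alternatively, you can get the heading and the other div elements in one pass with a selector such as div.col-lg-10 > *. This finds every element that is a direct child of a div element with the class col-lg-10.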

    soup = bs4.BeautifulSoup(page, 'lxml')
    
    # find all direct children of a div element of class col-lg-10
    selector = 'div.col-lg-10 > *'
    
    # find elements that contain the data we want
    found = soup.select(selector)
    
    # Extract the text; strip() also removes the non-breaking spaces (&nbsp;)
    data = [x.text.strip() for x in found]
    
    for x in data:
        print(x)
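
Since the question asks for one CSV row per record, the selector approach above could be extended along these lines. This is a minimal sketch and not part of the original answer: the helper names extract_rows and rows_to_csv are hypothetical, and it assumes that every record sits in its own div.col-lg-10 block and that each field div has an <i> icon as a direct child (which is what filters out the spacer and the "more info" button in the question's markup).

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup modelled on the question (one record), including the
# spacer div and the "more info" button that should NOT end up in the row.
SAMPLE_HTML = """
<div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
    <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
    <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    <div class="space">&nbsp;</div>
    <div><a class="btn" href="#"><i class="fa search-plus"></i> &nbsp;more info</a></div>
</div>
"""


def extract_rows(html):
    """Return one list per record: [heading, field1, field2, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.col-lg-10"):
        heading = block.find("h3")
        if heading is None:
            continue  # not a record block
        row = [heading.get_text(strip=True)]
        # Only look at direct child divs of the block.
        for div in block.find_all("div", recursive=False):
            # Assumption: a field div has an <i> icon as a direct child.
            if div.find("i", recursive=False) is not None:
                row.append(div.get_text(strip=True))
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text using the csv module."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE_HTML)), end="")
```

To write to a file instead of a string, the same rows can be passed to csv.writer on a file opened with newline="", for example open("lg.csv", "a", newline="").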
    

    【Comments】:

    • Thanks, very useful. But what does the ~ in selector = 'div > h3 ~ div' mean? Also, in selector = 'div.col-lg-10 > *', what if col-lg-10 contained a space? How would I express that? Thanks!
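
On the two points raised in this comment, in standard CSS selector syntax: ~ is the general sibling combinator, so h3 ~ div matches every div that comes after an h3 under the same parent (h3 + div would match only the div immediately after it). A class attribute containing a space, such as class="box effect", is really two separate classes, and these are chained with dots: div.box.effect. The snippet below is a small illustration on made-up markup, not taken from the original thread.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div class="box effect">
  <p>before</p>
  <h3>HEADING</h3>
  <div>NAME</div>
  <span>skip</span>
  <div>MOBILE</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "~" : every div sibling that follows the h3 (not only the next one).
following = [d.get_text() for d in soup.select("div > h3 ~ div")]

# "+" : only the div that immediately follows the h3.
adjacent = [d.get_text() for d in soup.select("div > h3 + div")]

# Two classes on one element are written as chained class selectors.
boxes = soup.select("div.box.effect")

print(following)
print(adjacent)
print(len(boxes))
```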
    【Solution 2】:

    This seems to work well:

        # -*- coding: utf-8 -*-
        # by Faguiro
        # run using Python 3.8.6 on Linux
        import requests
        from bs4 import BeautifulSoup
    
        # insert your site here
        url = input("Enter the url-->")
    
        # fetch the page with requests
        r = requests.get(url)
        content = r.content
    
        # parse the HTML
        soup = BeautifulSoup(content, "html.parser")
    
        # find every h3 tag in the soup
        headings = soup.find_all("h3")
    
        # print(headings)  # <--- raw result
    
        # print the stripped text of each heading, the Pythonic way
        for heading in headings:
            print(heading.text.strip())
    

    Dependencies: in a terminal (Linux):

    sudo apt-get install python3-bs4

    【Comments】:

      【Solution 3】:

      Try this:

      from bs4 import BeautifulSoup
      import requests
      
      url = "http://www.example_site.com"
      
      page = requests.get(url)
      # pass the response body, not the Response object, to BeautifulSoup
      soup = BeautifulSoup(page.content, "html.parser")
      
      # print all the text on the page
      print(soup.text)
      

      【Comments】:
