【Question title】: Parse out text after an href Beautifulsoup
【Posted】: 2021-01-10 02:41:43
【Question description】:

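I'm not very good with BeautifulSoup. This is a few questions rolled into one:

I just want to get these three columns into a pandas DataFrame.

*Below is the code that fetches the soup data from the URL (there will be null values):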

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
import re

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
    r = s.get(url, headers=req_headers)
soup = BeautifulSoup(r.content, 'lxml')
soup

This is the HTML I want to parse (each page has a bunch of divs like this):

<div class="col-md-12 data">
                            <div class="col-md-6">
                                <a href="/business-directory/company-profiles.S-A_FLUXO_-_COMERCIO_E_ASSESSORIA_INTERNACION_AL.02f1cc56465eb3286f769daad5262d91.html">
                                        S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL</a>
                                </div>
                            <div class="col-md-4">
                                <div class="show-mobile">Country:</div>
                                Recife,
                                Pernambuco,
                                <br>
                                Brazil</div>
                            <div class="col-md-2 last">
                                <div class="show-mobile">Sales Revenue ($M):</div>
                                250.620749M</div>
                        </div>

This is what I have so far:

#sales rev
sales_revenue = soup.find_all("div", {"class": "col-md-2 last"})

#location
country = soup.find_all("div", {"class": "col-md-4"})

# thought something like this would work for country, but it doesn't
classToIgnore = ["col-sm-4", "col-xs-4"]
classes = "col-md-4"
for a in soup:
    a = soup.find_all("div", class_= lambda c: classes in c and classToIgnore not in c)

#company name
for div in soup.find_all('div',class_="col-md-6"):
    x = div.find_all("a", href=re.compile("business-directory"))
    print(x)

The result should look like this:

revenue         location                     company
$250620749      Recife, Pernambuco, Brazil   S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL

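Problems

- Sales revenue sort of works, but not well: it picks up a lot of other information.

- Location does not work well.

- The company name is hard to grab because it is the text after the href. I can get the href, but I'm not sure how to get the text that follows the URL.

Any ideas?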

【Question comments】:

    Tags: python pandas beautifulsoup


    【Solution 1】:

    To save the table found on the page, you can use the following example:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    
    # remove unnecessary information:
    for t in soup.select('.show-mobile'):
        t.extract()
    
    all_data = []
    for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                        soup.select('#companyResults .col-md-4')[1:],
                        soup.select('#companyResults .col-md-2')[1:]):
        all_data.append({
            'Name': c1.get_text(strip=True),
            'Location': ' '.join(c2.get_text(strip=True).split()),
            'Revenue': c3.get_text(strip=True)
        })
        
    df = pd.DataFrame(all_data)
    print(df)
    df.to_csv('data.csv')
    

    Prints:

                                                                       Name                                    Location      Revenue
    0                      S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL                   Recife, Pernambuco,Brazil  250.620749M
    1                                                    POINT SHOES EIRELI                    Franca, Sao Paulo,Brazil             
    2                                 Cooperativa Triticola Caçapavana Ltda   Caçapava Do Sul, Rio Grande Do Sul,Brazil  142.786551M
    3                                 CRT2 REPRESENTACOES EMPRESARIAIS LTDA                     Curitiba, Parana,Brazil             
    4                                            Mercantil Palmeirense Ltda                 Sao Paulo, Sao Paulo,Brazil             
    5                                      GVD IMPORTACAO E EXPORTACAO LTDA         Campo Bom, Rio Grande Do Sul,Brazil             
    6                          COOPERATIVA TRITICOLA DE GETULIO VARGAS LTDA           Estacao, Rio Grande Do Sul,Brazil   75.176735M
    7                                             Golden Distribuidora Ltda              Vitoria, Espirito Santo,Brazil             
    8                                    JTF COMERCIO E REPRESENTACOES LTDA                 Colider, Mato Grosso,Brazil             
    9                                                MARINHO VESTUARIO LTDA                       Eusebio, Ceara,Brazil             
    10  COTIA FOODS COMERCIO E REPRESENTACAO LTDA - EM RECUPERACAO JUDICIAL                     Cotia, Sao Paulo,Brazil             
    11                                                 FOKUS LOGISTICA LTDA          Aparecida De Goiania, Goias,Brazil             
    12                 R. SHIBUYA TENDENCIA MARKETING E REPRESENTACOES LTDA       Rio De Janeiro, Rio De Janeiro,Brazil             
    13                                      TIM COMERCIO DE EMBALAGENS LTDA         Belo Horizonte, Minas Gerais,Brazil             
    14                                                       PEDRAFORT LTDA            Sete Lagoas, Minas Gerais,Brazil             
    15                   FARMA-RAPIDA MEDICAMENTOS E MATERIAIS ESPECIAIS SA           Natal, Rio Grande Do Norte,Brazil   48.913861M
    16                                        PORTOFINO REPRESENTACOES LTDA             Botuvera, Santa Catarina,Brazil             
    17                               NOROEST REPRESENTACOES COMERCIAIS LTDA                       Jaru, Rondonia,Brazil             
    18                  LOGIMED DISTRIBUIDORA SOCIEDADE EMPRESARIA LIMITADA                 Sao Paulo, Sao Paulo,Brazil             
    19                                            Filon Confecções - EIRELI                 São Paulo, Sao Paulo,Brazil             
    20                               LEMES & LIMA COMERCIO E LOGISTICA LTDA                       Goiania, Goias,Brazil             
    21                                              CERAMICA JACARANDA LTDA     Ribeirao Das Neves, Minas Gerais,Brazil             
    22                             NORDICAL REPRESENTANTE DE ALIMENTOS LTDA  Jaboatao Dos Guararapes, Pernambuco,Brazil             
    23                QUESALON REPRESENTACAO DE PRODUTOS FARMACEUTICOS LTDA                    Alhandra, Paraiba,Brazil             
    24                                   ATACK REPRESENTACAO COMERCIAL LTDA              Vitoria, Espirito Santo,Brazil             
    25                      LESTE BRASILEIRA IMPORTADORA E EXPORTADORA LTDA            Cariacica, Espirito Santo,Brazil             
    26                                        JUCELITO BORDIGNON & CIA LTDA          Sao Sepe, Rio Grande Do Sul,Brazil             
    27                      CASAS DA LAVOURA REPRESENTACOES COMERCIAIS LTDA                       Goiania, Goias,Brazil             
    28                                              UNISOAP COSMETICOS LTDA              Praia Grande, Sao Paulo,Brazil             
    29                                                 MOTIVA MAQUINAS LTDA                      Salvador, Bahia,Brazil             
    30                                                   BC COSMETICOS LTDA                 Sao Paulo, Sao Paulo,Brazil             
    31                                     ORGANIZACOES ALMEIDA SOARES LTDA         Belo Horizonte, Minas Gerais,Brazil             
    32                            Refinitiv Brasil Servicos Economicos Ltda                 Sao Paulo, Sao Paulo,Brazil             
    33                                              JBC REPRESENTACOES LTDA                   Conchal, Sao Paulo,Brazil             
    34                            P & P RIO DISTRIBUIDORA DE ALIMENTOS LTDA       Rio De Janeiro, Rio De Janeiro,Brazil             
    35            FORMATTO TELHAS E TELHADOS REPRESENTACAO COMERCIAL EIRELI            Jaguaruna, Santa Catarina,Brazil             
    36                   MACLENY - DISTRIBUIDORA DE PRODUTOS DE BELEZA LTDA                 Sao Paulo, Sao Paulo,Brazil             
    37                                    ELG REPRESENTACAO E COMERCIO LTDA       Jaragua Do Sul, Santa Catarina,Brazil             
    38                      ELFA PRODUTOS FARMACEUTICOS E HOSPITALARES LTDA                    Cabedelo, Paraiba,Brazil             
    39                      COMERCIO E EXPORTACAO DE CEREAIS MUNARETTO LTDA           Bom Sucesso Do Sul, Parana,Brazil             
    40                                    RGE DISTRIBUIDORA DE BEBIDAS LTDA          Montes Claros, Minas Gerais,Brazil             
    41                                A.S. REPRESENTACAO DE EMBALAGENS LTDA                 Sao Paulo, Sao Paulo,Brazil             
    42                                                   ON LINE TRADING SA     Novo Hamburgo, Rio Grande Do Sul,Brazil    21.79624M
    43                          AMX COMERCIO E SERVICOS DE AUTOMOTORES LTDA             Itaborai, Rio De Janeiro,Brazil             
    44                                      SOL EMBALAGENS PLASTICAS EIRELI                      Camacari, Bahia,Brazil             
    45    MJB COMERCIO DE EQUIPAMENTOS ELETRONICOS E GESTAO DE PESSOAL LTDA                  Cuiaba, Mato Grosso,Brazil             
    46                                           EBANOS REPRESENTACOES LTDA    Estancia Velha, Rio Grande Do Sul,Brazil             
    47     TENXE SERVICOS DE REPRESENTACAO COMERCIAL E TELEATENDIMENTO LTDA                     Curitiba, Parana,Brazil             
    48                                       BRASILVEST REPRESENTACOES LTDA               Gaspar, Santa Catarina,Brazil             
    49                                   EURO MED INDUSTRIA E COMERCIO LTDA                 Timbauba, Pernambuco,Brazil             
    

    The script also saves the table to data.csv.
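
    The same extraction can also be done row by row, which shows that the company name is just the text inside the <a> tag. The sketch below runs offline against the HTML fragment from the question. parse_rows is a hypothetical helper, not part of the answer above, and it assumes every result row is wrapped in a div carrying the class "data", as in that fragment. It also passes " " as the separator to get_text, so "Brazil" stays separated from the preceding comma.

```python
from bs4 import BeautifulSoup

# One result row, copied from the question.
SAMPLE_HTML = """
<div class="col-md-12 data">
  <div class="col-md-6">
    <a href="/business-directory/company-profiles.S-A_FLUXO_-_COMERCIO_E_ASSESSORIA_INTERNACION_AL.02f1cc56465eb3286f769daad5262d91.html">
        S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL</a>
  </div>
  <div class="col-md-4">
    <div class="show-mobile">Country:</div>
    Recife,
    Pernambuco,
    <br>
    Brazil</div>
  <div class="col-md-2 last">
    <div class="show-mobile">Sales Revenue ($M):</div>
    250.620749M</div>
</div>
"""


def parse_rows(html):
    """Return one dict per result row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop the mobile-only labels such as "Country:" before reading any text.
    for label in soup.select(".show-mobile"):
        label.extract()

    rows = []
    # Assumption: each result row is a <div> whose classes include "data".
    for row in soup.select("div.data"):
        link = row.select_one(".col-md-6 a")
        location = row.select_one(".col-md-4")
        revenue = row.select_one(".col-md-2")
        if link is None or location is None or revenue is None:
            continue  # skip anything that is not a complete result row
        rows.append({
            # The company name is the text inside the <a> tag.
            "Name": link.get_text(strip=True),
            # split() + join() collapses the newlines and indentation.
            "Location": " ".join(location.get_text(" ", strip=True).split()),
            "Revenue": revenue.get_text(strip=True),
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

    Reading the three cells from the same row keeps names, locations, and revenues aligned even if a cell is missing somewhere, which zip over three separate lists cannot guarantee.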


    EDIT: To scrape multiple pages, use the following example:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
    url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html'
    
    all_data = []
    for page in range(1, 3):  # <-- raise the upper bound to fetch more pages
        print('Page {}...'.format(page))
        # requests builds the ?page=N query string from the params dict
        response = requests.get(url, headers=headers, params={'page': page})
        soup = BeautifulSoup(response.content, 'html.parser')
    
        # remove unnecessary information:
        for t in soup.select('.show-mobile'):
            t.extract()
    
        for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                            soup.select('#companyResults .col-md-4')[1:],
                            soup.select('#companyResults .col-md-2')[1:]):
            all_data.append({
                'Name': c1.get_text(strip=True),
                'Location': ' '.join(c2.get_text(strip=True).split()),
                'Revenue': c3.get_text(strip=True)
            })
        
    df = pd.DataFrame(all_data)
    print(df)
    df.to_csv('data.csv')
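
    The Revenue column produced above holds strings such as 250.620749M, or an empty string when the page lists no figure, while the question asked for a value like $250620749. A minimal sketch of that conversion follows. revenue_to_dollars is a hypothetical helper, and it assumes the trailing "M" means millions of US dollars, as the "Sales Revenue ($M)" label in the question's HTML suggests.

```python
from decimal import Decimal, InvalidOperation


def revenue_to_dollars(text):
    """Convert a string such as '250.620749M' to whole dollars, or None if blank.

    Hypothetical helper: assumes a trailing 'M' means millions of US dollars.
    """
    cleaned = text.strip().replace(",", "")
    if cleaned.upper().endswith("M"):
        cleaned = cleaned[:-1]
    if not cleaned:
        return None  # no revenue listed for this company
    try:
        # Decimal avoids binary floating-point rounding in the multiplication.
        return int(Decimal(cleaned) * 1_000_000)
    except InvalidOperation:
        return None  # text that is not a number


print(revenue_to_dollars("250.620749M"))
print(revenue_to_dollars(""))
```

    With the DataFrame from the answer, this could be applied as df['Revenue'].map(revenue_to_dollars); rows without a revenue figure then hold a missing value instead of an empty string.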
    

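    【Comments】:

    • You're awesome! Every time I have a Beautiful Soup question, you come up with an amazing solution. This works great. One more question: say I have a lot of pages. Can I just write a for loop that wraps all of this for each page? The URL has a "page=x" at the end.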