【Question Title】: Python Multi Page Web Scraping Text Only
【Posted】: 2020-03-10 00:17:49
【Question Description】:

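I am new to Python and am currently learning web scraping. The task is to scrape the first 5 pages of Inspiron questions from the Dell Community. I have code that runs and returns the information I need. However, I cannot get just the text: my current code returns text plus HTML. I have tried placing .text at various points in the code, but doing so only produces errors.

The most common error is: "AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"

Below is my code: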

from urllib.request import urlopen
from bs4 import BeautifulSoup
import os, csv
from time import sleep



pages = ['https://www.dell.com/community/Inspiron/bd-p/Inspiron',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/2',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/3',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/4',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/5'
    
    ]
import requests
data = []

for page in pages:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    rows = soup.select('tbody tr')
    
    for row in rows:
        d = dict()
        d['title'] = soup.find_all ('a', attrs = {'class': 'page-link lia-link-navigation lia-custom-event'})
        d['author'] = soup.find_all ('span', attrs = {'class': 'login-bold'})
        d['time'] = soup.find_all ('span', attrs = {'class': 'local-time'})
        d['kudos'] = soup.find_all ('div', attrs = {'class': 'lia-component-messages-column-message-kudos-count'})
        d['messages'] = soup.find_all ('div', attrs = {'class': 'lia-component-messages-column-message-replies-count'})
        d['views'] = soup.find_all ('div', attrs = {'class': 'lia-component-messages-column-topic-views-count'})
        d['solved'] = soup.find_all ('td', attrs = {'aria-label': 'triangletop lia-data-cell-secondary lia-data-cell-icon'})
        d['latest']= soup.find_all ('span', attrs = {'cssclass': 'lia-info-area-item'})
        data.append(d)
    
    sleep(10)
print(data[0])

Any help is greatly appreciated. Thanks!

【Question Discussion】:

Tags: python python-3.x web-scraping beautifulsoup


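【Solution 1】:

find_all returns a list of HTML elements (a ResultSet), not a single element. To print the text of each element, iterate over each list created by find_all and read the .text attribute of each individual entry. Note that .text is an attribute, not a method, so it is used without parentheses. For example:

titles = soup.find_all('a', attrs={'class': 'page-link lia-link-navigation lia-custom-event'})
for title in titles:
    print(title.text)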
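
To make the difference concrete, here is a minimal sketch that runs against a small inline HTML snippet instead of the Dell page. The markup and the class name `topic` are made up for illustration. It shows that find() returns a single element whose .text can be read directly, while find_all() returns a list-like ResultSet that has to be iterated.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page.
html = """
<ul>
  <li><a class="topic" href="/t/1"> Battery not charging </a></li>
  <li><a class="topic" href="/t/2"> Fan noise </a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None), so .text works directly.
first = soup.find("a", attrs={"class": "topic"})
first_title = first.text.strip()

# find_all() returns a ResultSet (a list subclass); read .text per element.
all_titles = [a.text.strip() for a in soup.find_all("a", attrs={"class": "topic"})]

# Reading .text on the ResultSet itself raises the AttributeError from the question.
try:
    soup.find_all("a", attrs={"class": "topic"}).text
    raised = False
except AttributeError:
    raised = True

print(first_title)
print(all_titles)
print(raised)
```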

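【Discussion】:

    【Solution 2】:

    As Joseph mentioned, find_all returns a list of HTML elements; iterate over each element in those lists and read the .text attribute of each item.

    Below I use list comprehensions to loop over the results and read .text. strip() removes leading and trailing whitespace such as \t and \n.

    Final code: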

    from time import sleep

    import requests
    from bs4 import BeautifulSoup

    pages = [
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/2',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/3',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/4',
        'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/5',
    ]
    data = []

    for page in pages:
        r = requests.get(page)
        soup = BeautifulSoup(r.content, 'html.parser')
        rows = soup.select('tbody tr')

        for row in rows:
            # Note: these searches run on the whole page (soup), not on `row`,
            # so every dict holds all matches found on the current page.
            d = dict()
            d['title'] = [i.text.strip() for i in soup.find_all('a', attrs={'class': 'page-link lia-link-navigation lia-custom-event'})]
            d['author'] = [i.text.strip() for i in soup.find_all('span', attrs={'class': 'login-bold'})]
            d['time'] = [i.text.strip() for i in soup.find_all('span', attrs={'class': 'local-time'})]
            d['kudos'] = [i.text.strip() for i in soup.find_all('div', attrs={'class': 'lia-component-messages-column-message-kudos-count'})]
            d['messages'] = [i.text.strip() for i in soup.find_all('div', attrs={'class': 'lia-component-messages-column-message-replies-count'})]
            d['views'] = [i.text.strip() for i in soup.find_all('div', attrs={'class': 'lia-component-messages-column-topic-views-count'})]
            d['solved'] = [i.text.strip() for i in soup.find_all('td', attrs={'aria-label': 'triangletop lia-data-cell-secondary lia-data-cell-icon'})]
            d['latest'] = [i.text.strip() for i in soup.find_all('span', attrs={'cssclass': 'lia-info-area-item'})]
            data.append(d)

        sleep(10)  # pause between page requests
    print(data[0])
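
Because the loop above searches `soup`, each dictionary ends up holding every match on the page. If the goal is one record per thread with plain strings, one option is to search inside each row instead. The following is only a sketch under assumptions: the inline table is hypothetical (only two class names are borrowed from the question), it assumes each `tr` holds exactly one thread, and `text_or_none` is a helper invented here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table; only the class names are borrowed from the question.
html = """
<table><tbody>
  <tr>
    <td><a class="page-link" href="/t/1"> Battery not charging </a></td>
    <td><span class="login-bold">alice</span></td>
  </tr>
  <tr>
    <td><a class="page-link" href="/t/2"> Fan noise </a></td>
  </tr>
</tbody></table>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match under parent, or None."""
    node = parent.select_one(selector)
    return node.text.strip() if node is not None else None


soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.select("tbody tr"):
    # Searching from `row` keeps each record limited to its own thread.
    records.append({
        "title": text_or_none(row, "a.page-link"),
        "author": text_or_none(row, "span.login-bold"),
    })

for record in records:
    print(record)
```

Returning None for a missing cell keeps rows without an author (or header rows) from raising an AttributeError; whether the real page's rows line up this way would need to be checked against its actual markup.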
    

    Edit: include this in your code to save the list of dictionaries as a CSV file.

    import pandas as pd

    df = pd.DataFrame(data)
    print(df.head())     # confirm that the data is correct
    df.to_csv('name.csv', index=False)
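
As an alternative that avoids pandas, the standard-library csv module can write a list of flat dictionaries directly. This is a sketch, assuming each record holds plain strings rather than lists; the sample records and the filename `threads.csv` are placeholders.

```python
import csv

# Placeholder records; in practice these would come from the scraping loop.
records = [
    {"title": "Battery not charging", "author": "alice", "views": "120"},
    {"title": "Fan noise", "author": "bob", "views": "45"},
]

fieldnames = ["title", "author", "views"]

# newline="" lets the csv module control line endings itself.
with open("threads.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)

# Read the file back to confirm the round trip.
with open("threads.csv", newline="", encoding="utf-8") as f:
    rows_read = list(csv.DictReader(f))

print(rows_read)
```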
    

    【Discussion】:

    • How do I convert this to CSV?