【Question title】: beautifulsoup for-if loop extract
【Posted】: 2020-01-20 13:39:50
【Question】:

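I want to use for/if loops to extract data from the website below. The code below already extracts the per-article data with a for/if loop, but I want to extend it so that a loop also extracts the company name, the percentage of satisfied reviewers, and the overall rating (these are always the same).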

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

overall = []
satisfied = []
company = []

arbeitsatmosphare = []
vorgesetztenverhalten = []
kollegenzusammenhalt = []

lurl='https://www.kununu.com/de/volkswagenconsulting/kommentare'
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'{lurl}/{page}'
        print(url)
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:

            for key in [{'label': 'Arbeitsatmosphäre', 'list': arbeitsatmosphare},
                        {'label': 'Vorgesetztenverhalten', 'list': vorgesetztenverhalten},
                        {'label': 'Kollegenzusammenhalt', 'list': kollegenzusammenhalt}]:
                span = article.find('span', text=re.compile(key['label']))
                #print(span)
                if span and span.find_next('span'):
                    key['list'].append(span.find_next('span').text.strip())
                else:
                    key['list'].append('N/A')



# THIS PART IS NOT WORKING

            div = soup.find(class_="company-profile-container")
            for key2 in [{'label2': 'company-name', 'list': company},
                             {'label2': 'review-recommend-value', 'list': satisfied},
                            {'label2': 'review-rating-value', 'list': overall}]:
                span2 = div.find('span', text=re.compile(key2['label2']))
                #print(span2)
                if span2 and span2.find('span'):
                    key2['list'].append(span2.find('span').text.strip())
                else:
                    key2['list'].append('N/A')
        page += 1
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

    #print(overall)
    df = pd.DataFrame({'Arbeitsatmosphäre': arbeitsatmosphare,
                       'Vorgesetztenverhalten': vorgesetztenverhalten,
                       'Kollegenzusammenhalt': kollegenzusammenhalt,
                       'company': company,
                       'satisfied': satisfied,
                       'overall':overall
                       })

print(df)

I used the code above as a template, but my part does not seem to work. I can't find the problem; can you help?

【Comments】:

  • What does "not working" mean? Do you get an error, or no results?
  • Empty lists; span2 is never found.
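
The symptom in the last comment is consistent with how `find` treats its text filter: it is compared against the string inside a tag, not against the tag's class attribute, so a pattern such as `company-name` would only match if those characters appeared in the visible text. The sketch below illustrates the difference on a small, made-up HTML fragment (the markup and the name "Example GmbH" are assumptions, not the real kununu page). It uses `string=`, which is the current name for the `text=` argument used in the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names in the question.
html = """
<div class="company-profile-container">
  <span class="company-name">Example GmbH</span>
  <span class="review-rating-value">4,27</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="company-profile-container")

# string= is compared with the text inside the tag, so a class name does not match.
by_text = div.find("span", string=re.compile("company-name"))

# class_= is compared with the tag's class attribute.
by_class = div.find("span", class_="company-name")

print(by_text)                        # None
print(by_class.get_text(strip=True))  # Example GmbH
```

If the real page uses these class names, filtering with `class_=` (or a CSS selector via `select_one`) is the likely fix for the second loop.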

Tags: python loops for-loop if-statement beautifulsoup


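【Answer 1】:

If the company name, satisfaction percentage, and overall rating are the same for every row, there is no need to append them to lists inside the for loop. Just fetch the values once at the end and repeat them, for example with the list * operator: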

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

arbeitsatmosphare = []
vorgesetztenverhalten = []
kollegenzusammenhalt= []

lurl='https://www.kununu.com/de/volkswagenconsulting/kommentare'
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }
    page = 1
    while True:
        print(f"Processing page {page}..")
        url = f'{lurl}/{page}'
        print(url)
        response = session.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        articles = soup.find_all('article')
        print("Number of articles: " + str(len(articles)))
        for article in articles:

            for key in [{'label': 'Arbeitsatmosphäre', 'list': arbeitsatmosphare},
                        {'label': 'Vorgesetztenverhalten', 'list': vorgesetztenverhalten},
                        {'label': 'Kollegenzusammenhalt', 'list': kollegenzusammenhalt}]:
                span = article.find('span', text=re.compile(key['label']))
                if span and span.find_next('span'):
                    key['list'].append(span.find_next('span').text.strip())
                else:
                    key['list'].append('N/A')

        page += 1
        pagination = soup.find_all('div', {'class': 'paginationControl'})
        if not pagination:
            break

    company = soup.select_one('.company-name').get_text(strip=True)
    satisfied = soup.select_one('.review-recommend-value').get_text(strip=True)
    overall = soup.select_one('.review-rating-value').get_text(strip=True)

    df = pd.DataFrame({'Arbeitsatmosphäre': arbeitsatmosphare,
                       'Vorgesetztenverhalten': vorgesetztenverhalten,
                       'Kollegenzusammenhalt': kollegenzusammenhalt,
                       'company': [company] * len(arbeitsatmosphare),
                       'satisfied': [satisfied] * len(arbeitsatmosphare),
                       'overall':[overall] * len(arbeitsatmosphare)
                       })

print(df)

Prints:

   Arbeitsatmosphäre Vorgesetztenverhalten Kollegenzusammenhalt                company satisfied overall
0               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
1               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
2               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
3               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
4               2,00                  1,00                 3,00  Volkswagen Consulting       86%    4,27
5               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
6               5,00                  5,00                 5,00  Volkswagen Consulting       86%    4,27
7               5,00                  5,00                 4,00  Volkswagen Consulting       86%    4,27
....and so on.
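
The repetition step in the answer can be seen in isolation with plain Python. This is a minimal sketch with made-up values (`ratings` and "Example GmbH" are placeholders, not scraped data); it only shows how `[value] * n` produces a constant column whose length matches the per-review column.

```python
# Hypothetical per-review values, as they might be collected in the loop.
ratings = ["5,00", "2,00", "4,00"]

# A page-level value that is the same for every review.
company = "Example GmbH"

# [value] * n builds a list holding the same value n times,
# so the constant column has one entry per review.
company_column = [company] * len(ratings)

# Pair the columns up row by row to check that the lengths line up.
rows = list(zip(ratings, company_column))
print(rows)
# [('5,00', 'Example GmbH'), ('2,00', 'Example GmbH'), ('4,00', 'Example GmbH')]
```

Matching lengths matter because `pd.DataFrame` raises an error when the lists in a dict have different lengths. That would happen in the question's approach as soon as one list is appended to more or less often than the others.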
