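【Question title】: Web scraping script is returning duplicate values
【Posted】: 2019-10-02 08:29:54
【Question】:

For some reason my web scraping script is returning duplicate results. I have tried a lot of alternatives but just can't get it to work. Can anyone help?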

import requests
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import csv

soup = [ ]
pages = [ ]

csv_file = open('444.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Practice', 'Practice Manager'])

for i in range(35899, 35909):
   url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
   pages.append(url)

for item in pages:
   page = requests.get(item)
   soup.append(bs(page.text, 'lxml'))

business = []
for items in soup:
   h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
   for i in h1Obj:
      tagArray = i.findChildren()
   for tag in tagArray:
      if isinstance(tag,Tag) and tag.name in 'h1':
         business.append(tag.text)
      else:
         print('no-business')

names = []
for items in soup:
   h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
   for i in h4Obj:
      tagArray = i.findChildren()
      for tag in tagArray:
         if isinstance(tag,Tag) and tag.name in 'h4':
            names.append(tag.text)
         else:
            print('no-name')

print(business, names)
csv_writer.writerow([business, names])
csv_file.close()

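It is currently returning all duplicate values.

What it needs to do is return one 'business' and one 'names' value per URL call. If there is no 'business' or 'name', it needs to return the value 'no-business' or 'no-name'.

Can anyone help me?

【Comments】:

  • Do you only need the practice manager for each practice?
  • Basically yes, but I also need to state which practice they are the manager of. Some have multiple managers and some have none at all, so it needs to say 'no-name' for those.
  • So only practice managers, and if there are multiple, return multiple?
  • Yes, and I need the practice name (business name) too, so I know where they are from.

Tags: python-3.x web-scraping beautifulsoup python-requests tags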


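【Answer 1】:

I don't know if this is the best way, but I used a set instead of a list to remove the duplicates, and converted the set to a list just before saving the file, like this: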

import requests
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import csv

soup = [ ]
pages = [ ]

csv_file = open('444.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Practice', 'Practice Manager'])

for i in range(35899, 35909):
   url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
   pages.append(url)

for item in pages:
   page = requests.get(item)
   soup.append(bs(page.text, 'lxml'))

business = set()
for items in soup:
   h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
   for i in h1Obj:
      tagArray = i.findChildren()
   for tag in tagArray:
      if isinstance(tag,Tag) and tag.name in 'h1':
         business.add(tag.text)
      else:
         print('no-business')


names = set()
for items in soup:
   h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
   for i in h4Obj:
      tagArray = i.findChildren()
      for tag in tagArray:
         if isinstance(tag,Tag) and tag.name in 'h4':
            names.add(tag.text)
         else:
            print('no-name')

print(business, names)
csv_writer.writerow([list(business), list(names)])
csv_file.close()
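
One thing to keep in mind with the set approach is that a set does not keep insertion order, so the values may come out in a different order from the pages they were scraped from. A minimal sketch of an order-preserving alternative, assuming only the standard library; the helper name `dedupe_keep_order` and the sample list are made up for illustration:

```python
def dedupe_keep_order(values):
    """Drop repeated values while keeping the first occurrence of each."""
    # dict keys are unique and (since Python 3.7) keep insertion order.
    return list(dict.fromkeys(values))


# Hypothetical sample data standing in for the scraped names.
names = ["Di Palfrey", "Di Palfrey", "Ms Kim Stanyer", "Di Palfrey"]
unique_names = dedupe_keep_order(names)
print(unique_names)
```

Note that any kind of de-duplication only hides the repeats. It also breaks the pairing between a practice and its manager, because the two collections no longer line up by page.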

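【Comments】:

  • This works a treat, thank you so much for your help. I just need to get the else statements working. Do you have any idea how I can get it to return 'no-name' or 'no-business' when nothing is found under a given URL?
【Answer 2】:

You could use the ids, as below, to generate an initial list of lists. You could write each row to the csv instead of appending it to a final list.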

import requests
from bs4 import BeautifulSoup as bs

results = []
with requests.Session() as s:

    for i in range(35899, 35909):
        r = s.get('https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i))
        soup = bs(r.content, 'lxml')
        row = [item.text for item in soup.select('.staff-title:has(em:contains("Practice Manager")) [id]')]
        if not row: row = ['no practice manager']
        practice = soup.select_one('.gp').text if soup.select_one(':has(#org-title)')  else 'No practice name'
        row.insert(0, practice)
        results.append(row)
print(results)

Not sure how you would like multiple names to be listed.

import requests
from bs4 import BeautifulSoup as bs
import csv

with open('output.csv', 'w', newline='') as csvfile:
    w = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

    with requests.Session() as s:

        for i in range(35899, 35909):
            r = s.get('https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i))
            soup = bs(r.content, 'lxml')
            row = [item.text for item in soup.select('.staff-title:has(em:contains("Practice Manager")) [id]')]
            if not row: row = ['no practice manager']
            practice = soup.select_one('.gp').text if soup.select_one(':has(#org-title)')  else 'No practice name'
            row.insert(0, practice)
            w.writerow(row)
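
If one row per practice with a fixed number of columns is preferred, one option is to join multiple managers into a single cell. This is only a sketch of that idea: the helper `build_row`, the "; " separator and the sample practices are assumptions for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def build_row(practice, managers, placeholder="no practice manager"):
    """Return a two-column row; several managers share one cell."""
    joined = "; ".join(managers) if managers else placeholder
    return [practice, joined]


# In-memory buffer used here instead of a real file (illustration only).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Practice", "Practice Manager"])
writer.writerow(build_row("Example Surgery", ["A Smith", "B Jones"]))
writer.writerow(build_row("Another Surgery", []))
csv_text = buffer.getvalue()
print(csv_text)
```

With this layout every row has exactly two columns, which tends to be easier to load into a spreadsheet than rows of varying length.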

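【Comments】:

  • No worries. It looks like all staff with titles are staff = [(item.text, item.next_sibling.next_sibling.text) for item in soup.select('[id^=staff]')]
【Answer 3】:

It looks like the problem stems from the fact that some of these pages contain no information at all, and you get a "profile hidden" error. I modified your code slightly to cover the first 5 pages. Apart from saving to a file, it looks like this: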

import requests
from bs4 import BeautifulSoup as bs
from bs4.element import Tag

pages = [ ]

for i in range(35899, 35904):
   url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
   pages.append(url)

soup = [ ]
for item in pages:
   page = requests.get(item)
   soup.append(bs(page.text, 'lxml'))

business = []
for items in soup:
       h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
       for i in h1Obj:
          tagArray = i.findChildren()
       for tag in tagArray:
          if isinstance(tag,Tag) and tag.name in 'h1':
             business.append(tag.text)


names = []
for items in soup:    
  h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
  for i in h4Obj:
      tagArray = i.findChildren()
  for tag in tagArray:
     if isinstance(tag,Tag) and tag.name in 'h4':
        names.append(tag.text)


for bus, name in zip(business,names):
    print(bus,'---',name)

The output is:

Bilbrook Medical Centre --- Di Palfrey
Caversham Group Practice --- Di Palfrey
Caversham Group Practice --- Di Palfrey
The Moorcroft Medical Ctr --- Ms Kim Stanyer 
Brotton Surgery --- Mrs Gina Bayliss

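Note that only the 2nd and 3rd entries are duplicates; that is caused (somehow, not sure why) by the "hidden profile" on the third page. So if you modify the main blocks of the code to: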

business = []
for items in soup:
    if "ProfileHiddenError.aspx" in str(items):
        business.append('Profile Hidden')
    else:
        h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
        for i in h1Obj:
            tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag, Tag) and tag.name in 'h1':
                business.append(tag.text)


names = []
for items in soup:
    if "ProfileHiddenError.aspx" in str(items):
        names.append('Profile Hidden')
    elif "Practice Manager" not in str(items):
        names.append('No Practice Manager Specified')
    else:
        h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
        for i in h4Obj:
            tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag, Tag) and tag.name in 'h4':
                names.append(tag.text)


for bus, name in zip(business, names):
    print(bus, '---', name)

The output this time is:

Bilbrook Medical Centre --- Di Palfrey
Caversham Group Practice --- No Practice Manager Specified
Profile Hidden --- Profile Hidden
The Moorcroft Medical Ctr --- Ms Kim Stanyer 
Brotton Surgery --- Mrs Gina Bayliss
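
A likely explanation for the repeated entries is visible in the loops themselves: `for tag in tagArray` sits outside `for i in ...`, so when a page yields no matching panel, `tagArray` still holds the children from the previous page and they get appended again. The sketch below reproduces that effect with plain lists, so it runs without any network access. The function names and the sample data are hypothetical stand-ins: each inner list plays the role of one matched panel's child texts.

```python
def collect_buggy(pages):
    """Mimic the original loops: the second loop is not nested."""
    found = []
    for panels in pages:
        for panel in panels:
            children = panel
        # If `panels` was empty, `children` is still the previous page's value.
        for text in children:
            found.append(text)
    return found


def collect_fixed(pages, default):
    """Emit exactly one value per page, falling back to `default`."""
    found = []
    for panels in pages:
        texts = [text for panel in panels for text in panel]
        found.append("; ".join(texts) if texts else default)
    return found


# Page 1 has a manager, pages 2 and 3 have no match, page 4 has one.
pages = [[["Di Palfrey"]], [], [], [["Ms Kim Stanyer"]]]
print(collect_buggy(pages))
print(collect_fixed(pages, "no-name"))
```

Building a fresh list for each page and checking whether it is empty keeps the results aligned with the URLs, and gives a natural place to insert 'no-name' or 'no-business'.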

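Hope this helps you solve the problem.

【Comments】:

  • Thanks for your help. It seems to have cleaned things up a lot, but it is now returning incorrect data: Caversham Group Practice should return 'no-name', because there is no manager there at all. Any suggestions? I have been talking to a rubber duck for the last few hours trying to work out what is wrong :-( slowly going insane, haha.
  • @MissPepper You are right! I have modified the 'names' block above to fix that. Let's see if this works...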