【Question title】: Python - Incomplete Data (Web Scraping)
【Posted】: 2016-05-18 00:08:11
【Question】:

Here is my code:

from bs4 import BeautifulSoup
import urllib2
import re
import sys


main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"
test_url = urllib2.urlopen(main_url)
readHtml = test_url.read()
test_url.close()


soup = BeautifulSoup(readHtml, "html.parser")

url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})

count = 1

fobj = open('D:\Scrapping\parveen_again2.xml', 'w')
for getting in url:
   url = getting.find('a')
   if url.has_attr('href'):
          urls = url['href']       
          test_url = urllib2.urlopen(urls, timeout=36)
          readHtml = test_url.read()
          test_url.close()

          soup1 = BeautifulSoup(readHtml, "html.parser")

          title = soup1.find('title')
          title = title.get_text('+')
          title = title.split("|")

          author = soup1.find('div',attrs={"class":"entry-meta"}).find('span',attrs={"class":"categories-links"})


          author = author.findAll('a')

          fobj.write("<add><doc>\n")
          fobj.write("<field name=\"id\">sukhansara.com_pg1Author"+author[0].string.encode('utf8')+"Count"+str(count)+"</field>\n")
          fobj.write("<field name=\"title\">"+title[0].encode('utf8')+"</field>\n")
          fobj.write("<field name=\"content\">")

          count += 1


          poetry = soup1.find('div',attrs={"class":"entry-content"}).findAll('div')

          x=1
          check = True

          while check:
                 if poetry[x+1].string.encode('utf8') != author[0].string.encode('utf8'):
                        fobj.write(poetry[x].string.encode('utf8')+"|")
                        x+=1
                 else:
                        check = False
          fobj.write(poetry[x].string.encode('utf8'))

          fobj.write("</field>\n")
          fobj.write("<field name=\"group\">ur_poetry</field>\n")
          fobj.write("<field name=\"author\">"+author[0].string.encode('utf8')+"</field>\n")
          fobj.write("<field name=\"url\">"+urls+"</field>\n")
          fobj.write("<add><doc>\n\n")



fobj.close()

print "Done printing"

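Sometimes I get 24 poems from 24 URLs, sometimes 81, even though there are nearly 100 URLs. Every time I reach 81, this error occurs:

AttributeError: 'NoneType' object has no attribute 'encode'

Or sometimes a timeout error is raised instead. What am I doing wrong?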
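
The AttributeError is consistent with documented BeautifulSoup behaviour: `.string` is None when a tag contains more than one child (for example a verse line with nested markup), so `.string.encode(...)` fails on such a div. The while loop also reads `poetry[x+1]` without a bounds check. A minimal sketch of a more defensive join, assuming the texts have already been extracted into a list that may contain None; `join_verses` is a hypothetical helper, not part of the code above:

```python
def join_verses(texts, author_name, sep=u"|"):
    """Join verse lines until the author's name is reached.

    texts: list of unicode strings or None (e.g. values of tag.string).
    author_name: the marker line that ends the poem (an assumption
    taken from how the loop above detects the end of a poem).
    """
    verses = []
    for text in texts:
        if text is None:
            # tag.string is None for tags with several children; skip it
            continue
        text = text.strip()
        if text == author_name:
            break  # the author line marks the end of the poem
        if text:
            verses.append(text)
    return sep.join(verses)


lines = [u"line one", None, u"  line two ", u"Parveen Shakir", u"footer"]
print(join_verses(lines, u"Parveen Shakir"))
```

Skipping None entries drops the text of those tags. Building the list with `get_text(strip=True)` on each div instead of `.string` would probably avoid the None values altogether, since `get_text()` concatenates all nested text.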

【Question discussion】:

    Tags: python web-scraping beautifulsoup urllib2


    【Solution 1】:

    Switching to requests and keeping a single session open should make it work:

    from bs4 import BeautifulSoup
    import requests
    
    with requests.Session() as session:
        main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"
    
        readHtml = session.get(main_url).content
        soup = BeautifulSoup(readHtml, "html.parser")
    
        # ...
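
    A persistent session reuses the connection, but it does not by itself rule out an occasional timeout. A minimal sketch of a retry wrapper, assuming the fetch is passed in as a callable; `fetch_with_retries`, `attempts` and `delay` are hypothetical names for illustration, not part of requests:

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to the client's own errors
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # brief pause before the next try
    # every attempt failed: surface the last error to the caller
    raise last_error
```

    Inside the `with` block this could be called as `fetch_with_retries(lambda u: session.get(u, timeout=36).content, urls)`, keeping the 36-second timeout from the question. Catching `requests.exceptions.RequestException` rather than `Exception` would be the narrower choice.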
    

    【Discussion】:
