【Question Title】: Is there a better approach to using BeautifulSoup in my Python web crawler code?
【Posted】: 2016-08-01 11:53:30
【Question】:

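I am trying to scrape information from the URLs found in a page and save it to a text file.

I got a lot of help from the question "How to get the right source code with Python from the URLs using my web crawler?", and I tried to finish the code from that question using what I have learned about BeautifulSoup.

The code does what I need, but when I look at it, it seems messy. Can anyone help me tidy it up, especially the BeautifulSoup part? For example, the infoLists part and the saveInfo part. Thanks!

Here is my code: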

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

url = 'http://bbs.skykiwi.com/forum.php?mod=forumdisplay&fid=55&typeid=470&sortid=231&filter=typeid&pageNum=1&page=1'

#To get the source code from the url
def getsourse(url):
    header = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows  NT 10.0; WOW64; Trident/8.0; Touch)'}
    html = requests.get(url, headers=header)
    return html.content

#To get all the links in current page
def getallLinksinPage(sourceCode):
    soup = BeautifulSoup(sourceCode)
    return [a["href"] for a in soup.select("#threadlist a.xst")]

#To save the info in the info.txt file
def saveinfo(infoLists):
    f = open('info.txt', 'a')
    for each in infoLists:
        f.writelines('Job Title: ' + str(each['title'].encode('utf-8')) + '\n')
        f.writelines('Company Name: ' + str(each['companyName'].encode('utf-8')) + '\n')
        f.writelines('Company Address: ' + str(each['address'].encode('utf-8')) + '\n')
        f.writelines('Job Position: ' + str(each['position'].encode('utf-8')) + '\n')
        f.writelines('Salary: ' + str(each['salary'].encode('utf-8')) + '\n')
        f.writelines('Full/Part time: ' + str(each['jobType'].encode('utf-8')) + '\n')
        f.writelines('Company Tel: ' + str(each['tel'].encode('utf-8')) + '\n')
        f.writelines('Company Email: ' + str(each['email'].encode('utf-8')) + '\n')
        f.writelines('WorkTime: ' + str(each['workTime'].encode('utf-8')) + '\n\n')
    f.close()

sourceCode = getsourse(url) # source code of the url page
allLinksinPage = getallLinksinPage(sourceCode) #a List of the urls in current page
linkNum=1
infoLists=[]
for eachLink in allLinksinPage:
    print('Now downloading link '+str(linkNum))
    url = 'http://bbs.skykiwi.com/'
    realUrl=urljoin(url, eachLink)
    html = getsourse(realUrl)
    soup= BeautifulSoup(html)
    infoList={} #To save the following info,such as title companyName etc
    infoList['title']=soup.find(attrs={'id':'thread_subject'}).string
    infoList2=[] #To temporarily save info except 'title'
    #FROM HERE IT GETS MESSY...
    for line in soup.find_all(attrs={'class':'typeoption'}): # first locate the bigClass
        for td in line.find_all('td'):  # then locate all the 'td's
            infoList2.append(td.string)
        try:
            for eachInfo in infoList2:
                infoList['companyName'] = infoList2[0]
                infoList['address'] = infoList2[1]
                infoList['position'] = infoList2[2]
                infoList['salary'] = infoList2[3]
                infoList['jobType'] = infoList2[4]
                infoList['tel'] = infoList2[5]
                infoList['email'] = infoList2[6]
                infoList['workTime'] = infoList2[7]
        finally:
            linkNum += 1 # To print link number
    infoLists.append(infoList)

saveinfo(infoLists)

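【Comments】:

  • This sounds like a question for Code Review. I would look at questions like this one and then ask there.
  • Thanks for the reminder! @jDo I'll post similar questions there next time!

Tags: python python-2.7 beautifulsoup web-crawler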


【Solution 1】:

Using zip() and a list comprehension would significantly improve readability:

headers = ['companyName', 'address', 'position', 'salary', 'jobType', 'tel', 'email', 'workTime']

infoLists = [dict(zip(headers, [item.string for item in line.find_all('td')[:8]])) 
             for line in soup.select(".typeoption")]
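To make the idea concrete, here is a minimal, self-contained sketch that applies the same zip() pairing to a single thread page and also tidies the saving step. It rests on several assumptions: SAMPLE_HTML is made-up markup standing in for a real thread page, the names parse_thread, format_info, save_info and FIELDS are hypothetical and not from the question, and "html.parser" is passed explicitly only so the example does not depend on an extra parser being installed. Unlike the one-liner above, it keeps the title and builds one dict per page. It uses get_text() rather than .string because .string returns None when a cell contains more than one child node.

```python
import io

from bs4 import BeautifulSoup

# (key, label) pairs, in the order the cells are assumed to appear on the page.
FIELDS = [
    ("companyName", "Company Name"),
    ("address", "Company Address"),
    ("position", "Job Position"),
    ("salary", "Salary"),
    ("jobType", "Full/Part time"),
    ("tel", "Company Tel"),
    ("email", "Company Email"),
    ("workTime", "WorkTime"),
]
KEYS = [key for key, _ in FIELDS]

# Made-up markup standing in for one thread page; the real layout may differ.
SAMPLE_HTML = u"""
<span id="thread_subject">Kitchen hand wanted</span>
<table class="typeoption">
  <tr><th>Company</th><td>Kiwi Cafe</td></tr>
  <tr><th>Address</th><td>1 Queen St</td></tr>
  <tr><th>Position</th><td>Kitchen hand</td></tr>
  <tr><th>Salary</th><td>$18/h</td></tr>
  <tr><th>Type</th><td>Part time</td></tr>
  <tr><th>Tel</th><td>021 000 000</td></tr>
  <tr><th>Email</th><td>jobs@example.com</td></tr>
  <tr><th>Hours</th><td>Mon-Fri</td></tr>
</table>
"""


def parse_thread(html):
    """Return one dict for a thread page: the title plus the first eight cells."""
    soup = BeautifulSoup(html, "html.parser")
    info = {"title": soup.find(id="thread_subject").get_text(strip=True)}
    cells = soup.select(".typeoption td")[:len(KEYS)]
    # zip() pairs each key with the cell in the same position.
    info.update(zip(KEYS, (td.get_text(strip=True) for td in cells)))
    return info


def format_info(info):
    """Build the text block for one job, one 'Label: value' line per field."""
    rows = [("title", "Job Title")] + FIELDS
    lines = [u"{0}: {1}".format(label, info.get(key, u"")) for key, label in rows]
    return u"\n".join(lines) + u"\n\n"


def save_info(info_list, path="info.txt"):
    """Append every job to a UTF-8 text file, closing the file automatically."""
    with io.open(path, "a", encoding="utf-8") as out:
        for info in info_list:
            out.write(format_info(info))
```

In the crawler loop this would be used roughly as infoLists.append(parse_thread(getsourse(realUrl))), followed by save_info(infoLists) at the end. Keeping the keys and labels in one list means that adding or reordering a field only touches FIELDS, and opening the file with an explicit encoding removes the need for the repeated str(... .encode('utf-8')) calls.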
