【Question title】: Tips for modifying the python web scraping code using functions
【Posted】: 2015-12-01 02:57:14
【Question】:

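I am trying to write a Python script with BeautifulSoup that crawls the page http://tbc-python.fossee.in/completed-books/ and collects the necessary data from it. Basically, it has to extract all the page loading errors, SyntaxErrors, NameErrors, AttributeErrors, etc. that appear in the chapters of all the books into a text file, errors.txt. There are about 273 books. The script I wrote does the job well, and I have good bandwidth. But the code takes a very long time to go through all the books. Please help me optimize the Python script with the necessary tweaks, perhaps by using functions. Thanks.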

import urllib2, urllib
from bs4 import BeautifulSoup
website = "http://tbc-python.fossee.in/completed-books/"
soup = BeautifulSoup(urllib2.urlopen(website))
errors = open('errors.txt','w')

# Completed books webpage has data stored in table format
BookTable = soup.find('table', {'class': 'table table-bordered table-hover'})
for BookCount, BookRow in enumerate(BookTable.find_all('tr'), start = 1):
    # Grab  book names
    BookCol = BookRow.find_all('td')
    BookName = BookCol[1].a.string.strip()
    print "%d: %s" % (BookCount, BookName)  
    # Open each book
    BookSrc = BeautifulSoup(urllib2.urlopen('http://tbc-python.fossee.in%s' %(BookCol[1].a.get("href"))))
    ChapTable = BookSrc.find('table', {'class': 'table table-bordered table-hover'})

    # Check if each chapter page opens, if not store book & chapter name in errors.txt
    for ChapRow in ChapTable.find_all('tr'):
        ChapCol = ChapRow.find_all('td')
        ChapName = (ChapCol[0].a.string.strip()).encode('ascii', 'ignore') # ignores error : 'ascii' codec can't encode character u'\xef'
        ChapLink = 'http://tbc-python.fossee.in%s' %(ChapCol[0].a.get("href"))

        try:
            ChapSrc = BeautifulSoup(urllib2.urlopen(ChapLink))
        except:
            print '\t%s\n\tPage error' %(ChapName)
            errors.write("Page;%s;%s;%s;%s\n" %(BookCount, BookName, ChapName, ChapLink))
            continue

        # Check for errors in chapters and store the errors in errors.txt
        EgError = ChapSrc.find_all('div', {'class': 'output_subarea output_text output_error'})
        if EgError:
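            # Count only the outputs whose text looks like a Python error.
            # The original test ('ipython-input' or 'Error' in text) was always true,
            # because a non-empty string literal on the left of `or` is truthy.
            e = 0
            for i in EgError:
                text = i.pre.get_text()
                if 'ipython-input' in text or 'Error' in text:
                    errors.write("Example;%s;%s;%s;%s\n" %(BookCount, BookName, ChapName, ChapLink))
                    e += 1
            print '\t%s\n\tExample errors: %d' %(ChapName, e)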

errors.close()

【Comments】:

    Tags: python beautifulsoup web-crawler execution-time


    【Solution 1】:

    You may want to look at multiprocessing and split up the workload.

    If you only use 1 connection at a time, your connection speed doesn't really matter.
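
As a rough illustration of splitting the workload, the sketch below runs one check per link on a pool of workers. It is only a sketch under assumptions: `check_all`, `check_one`, `workers` and `fake_check` are hypothetical names, not part of the question's code. It uses the thread-backed `ThreadPool` from the `multiprocessing` package rather than separate processes, since the work here is mostly waiting on the network, and it is written to run on Python 3.

```python
from multiprocessing.pool import ThreadPool


def check_all(links, check_one, workers=8):
    """Run check_one(link) for every link on a pool of worker threads.

    Returns a list of (link, result) pairs in the same order as `links`.
    """
    pool = ThreadPool(workers)
    try:
        # map() hands the links out to the workers and keeps the input order.
        results = pool.map(check_one, links)
    finally:
        pool.close()
        pool.join()
    return list(zip(links, results))


if __name__ == '__main__':
    # Stand-in for a real fetch-and-parse function; no network access here.
    def fake_check(link):
        return None if link.endswith('ok') else 'Page fetch error'

    for link, error in check_all(['/chapter/1/ok', '/chapter/2/bad'], fake_check):
        print(link, error)
```

In the scraper, `check_one` would be whatever function fetches and inspects a single chapter page. The number of workers is a guess and should probably stay small, to avoid putting too much load on the site.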

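    【Comments】:

    • @OneOfOne. I am using 1 connection at a time. Any other suggestions? Thanks.
    • @ThirumaleshHS I don't see any way to make it faster without splitting the work up, but maybe someone else will. Good luck.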
    【Solution 2】:

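    I tried to break the code up and express it using functions. Any suggestions for improving the code further? Also, how can I dump the errors fetched from the website into a new HTML file, as a table containing the details of the books and chapters that have errors?

    Here is the updated code: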

    import urllib2, sys
    from bs4 import BeautifulSoup
    
    def get_details(link, index):
        """
        This function takes in two arguments and returns a list which contains details of 
        books and/or chapters like:
        * name of the book or chapter
        * link of the book or chapter
    
        Getting details from book or chapter is set by index value
        * index = 1 --> gets details of the book
        * index = 0 --> gets details of the chapter
        """
        details_list = []
    
        src = BeautifulSoup(urllib2.urlopen(link))
        table = src.find('table')
        for row in table.find_all('tr'):
            column = row.find_all('td')  
            name, link = column[index].a.string, column[index].a.get("href")
            details_list.append([name, link])
    
        return details_list
    
    
    def get_chapter_errors(chap_link):
        """
        This function takes in chapter link from chapter_details_list as argument and returns 
        * Number of example errors(SyntaxErrors, NameErrors, ValueErrors, etc) present in the chapter
                     OR
        * HTTPError while loading the chapter
        """
        try:
            chp_src = BeautifulSoup(urllib2.urlopen(chap_link))
            example_errors = chp_src.find_all('div', {'class': 'output_subarea output_text output_error'})
            error = len(example_errors)
            if not example_errors:
                error = None 
    
        except urllib2.HTTPError as e:
            print e
            error = "Page fetch error"
    
        return error
    
    
    def main():
        log_dict = {}
        book_dict = {}
    
        url = sys.argv[1] # accept url as argument
        book_details_list = get_details(url, index=1)
        for book_name, book_link in book_details_list:
            chapter_details_list = get_details('http://tbc-python.fossee.in%s' % book_link, index=0)
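            # NOTE: str.strip() removes a set of characters from both ends, not a prefix,
            # so this only works while the id has none of those characters (e.g. a numeric id).
            _id = book_link.strip('/book-details')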
            book_dict = {'name': book_name,
                         'url': 'http://tbc-python.fossee.in%s' % book_link,
                         'id': _id,
                         'chapters': []
                        }
    
            for chap_name, chap_link in chapter_details_list:
                error = get_chapter_errors('http://tbc-python.fossee.in%s' % chap_link)
                book_dict.get('chapters').append({'name': chap_name, 
                                                  'url': 'http://tbc-python.fossee.in%s' % chap_link, 
                                                  'errors': error
                                                 })
    
            log_dict.update({_id: book_dict})
    
            print log_dict
            print "\n\n\n\n"
    
    
    if __name__ == '__main__':
        main()
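
For the HTML table part of the question, one possible approach is sketched below. It assumes the `log_dict` layout built in `main()` above (book entries with 'name' and 'chapters', chapter entries with 'name', 'url' and 'errors'). `render_error_table` is a hypothetical helper name, and the sketch targets Python 3, where `html.escape` is available, rather than the Python 2 used in the code above.

```python
from html import escape  # Python 3 standard library


def render_error_table(log_dict):
    """Return an HTML page with one table row per chapter that has errors."""
    rows = []
    for book in log_dict.values():
        for chapter in book['chapters']:
            if chapter['errors'] is None:
                continue  # skip chapters without errors
            rows.append(
                '<tr><td>{book}</td><td><a href="{url}">{chap}</a></td>'
                '<td>{err}</td></tr>'.format(
                    book=escape(book['name']),
                    url=escape(chapter['url'], quote=True),
                    chap=escape(chapter['name']),
                    err=escape(str(chapter['errors'])),
                ))
    return (
        '<html><body><table border="1">\n'
        '<tr><th>Book</th><th>Chapter</th><th>Errors</th></tr>\n'
        + '\n'.join(rows) +
        '\n</table></body></html>\n'
    )


if __name__ == '__main__':
    # Made-up sample data in the same shape as log_dict.
    sample = {
        '1': {'name': 'Sample Book', 'url': '/book-details/1/', 'id': '1',
              'chapters': [
                  {'name': 'Chapter 1', 'url': '/chapter/1/', 'errors': 3},
                  {'name': 'Chapter 2', 'url': '/chapter/2/', 'errors': None},
              ]},
    }
    print(render_error_table(sample))
```

Writing the returned string to a file opened in 'w' mode would give the HTML report. Escaping the names matters because book and chapter titles can contain characters such as `&` or `<`.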
    

    【Comments】:
