【Question title】: How to add this data to the database in ScraperWiki
【Posted】: 2014-05-07 08:51:20
【Question】:
import scraperwiki
import urllib2, lxml.etree
url = 'http://eci.nic.in/eci_main/statisticalreports/SE_1998/StatisticalReport-DEL98.pdf'
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)
# how many pages in PDF
pages = list(root)
print "There are",len(pages),"pages"
#from page 86 to 107
for page in pages[86:107]:
    for el in page:
        data = {}
        if el.tag == "text":
            if int(el.attrib['left']) < 215: data = { 'Rank': el.text }
            elif int(el.attrib['left']) < 230: data['Name'] = el.text 
            elif int(el.attrib['left']) < 592: data['Sex'] = el.text
            elif int(el.attrib['left']) < 624: data['Party'] = el.text
            elif int(el.attrib['left']) < 750: data['Votes'] = el.text
            elif int(el.attrib['left']) < 801: data['Percentage'] = el.text
            print data

Now I want to know how to save this data in the ScraperWiki database. I have tried a few commands, such as

scraperwiki.sqlite.save(unique_keys=[], table_name='ecidata1998', data=data)

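but when I check the dataset they do not give me the desired result. Is something wrong with the code or with the last statement? Please help. I am new to Python programming and to ScraperWiki.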

【Question discussion】:

    Tags: python pdf screen-scraping scraperwiki


    【Answer 1】:

    There are a few problems with your code.

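    First, the conditions you use to pull the different fields out of the PDF need to be stricter and more precise. For example, if int(el.attrib['left']) < 215 matches any text whose left position is less than 215 pixels, which also applies to other content on the PDF pages you are looking at, such as the text "Constituency".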
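
To make the difference concrete, here is a minimal sketch comparing a loose threshold with a bounded range. The XML fragment is made up for illustration (it only imitates the shape of pdftoxml output, and the positions are assumptions), and it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree so it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like pdftoxml output; all values are made up.
SAMPLE_PAGE = """
<page number="87">
  <text top="100" left="60" width="90" height="12">Constituency</text>
  <text top="140" left="210" width="8" height="12">1</text>
  <text top="140" left="222" width="120" height="12">. EXAMPLE NAME</text>
</page>
""".strip()

page = ET.fromstring(SAMPLE_PAGE)


def texts_matching(page, predicate):
    """Return the text of every <text> element whose 'left' satisfies predicate."""
    return [el.text for el in page
            if el.tag == "text" and predicate(int(el.attrib["left"]))]


# Loose test: anything left of 215 pixels, which also catches the header text.
loose = texts_matching(page, lambda left: left < 215)
# Bounded test: only the narrow band where the rank column sits.
strict = texts_matching(page, lambda left: 206 < left <= 214)

print(loose)   # ['Constituency', '1']
print(strict)  # ['1']
```

The exact bounds have to be read off the real XML for each column; the numbers here only mirror the ones used in the answer's code.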

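    Second, you need a way to check when you have all the data for a row and can move on to the next one. (You could try extracting the data row by row, but I found it easier to take the data from each field in turn and start a new row once the current row has all of its data.)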
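
The idea of filling one row field by field and starting a new row once it is complete can be tried out without the PDF. This is only a sketch on hand-made data: the function name group_into_rows and the (field, value) input format are assumptions for illustration, not part of ScraperWiki.

```python
FIELDS = ("Rank", "Name", "Sex", "Party", "Votes", "Percentage")


def group_into_rows(cells):
    """Turn a flat stream of (field, value) pairs into complete row dicts.

    A row is emitted only once every field in FIELDS has a non-empty value.
    Empty values and unknown fields are skipped, so they never overwrite
    or pollute a row. Values are expected to be strings, as scraped text is.
    """
    rows = []
    current = dict.fromkeys(FIELDS)
    for field, value in cells:
        if field in current and value:
            current[field] = value
        if all(current.values()):
            rows.append(current)
            current = dict.fromkeys(FIELDS)
    return rows


# Made-up cells in reading order, including a stray header and an empty cell.
cells = [
    ("Header", "Constituency"),
    ("Rank", None),
    ("Rank", "1"), ("Name", "CANDIDATE A"), ("Sex", "M"),
    ("Party", "XYZ"), ("Votes", "1000"), ("Percentage", "60.00%"),
    ("Rank", "2"), ("Name", "CANDIDATE B"), ("Sex", "F"),
    ("Party", "ABC"), ("Votes", "500"),
]

rows = group_into_rows(cells)
print(len(rows))        # 1
print(rows[0]["Name"])  # CANDIDATE A
```

The second candidate never reaches the output because its Percentage is missing, which is the behaviour you want: incomplete rows are not saved.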

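    (As for why scraperwiki.sqlite.save is not working, it is probably because there are several rows of null values in there, but your data is incorrect anyway.)

    This worked for me: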

    import scraperwiki
    import urllib2
    import lxml.etree
    
    
    def create_blank_row():
        """ Create an empty candidate data dictionary. """
        return {'Rank': None,
                'Name': None,
                'Sex': None,
                'Party': None,
                'Votes': None,
                'Percentage': None}
    
    
    def row_is_filled(dictionary):
        """ Return True if all values of dictionary are filled; False if not. """
        for item in dictionary.values():
            if not item:
                return False
        return True
    
    
    def main():
        url = ('http://eci.nic.in/eci_main/statisticalreports'
               '/SE_1998/StatisticalReport-DEL98.pdf')
        pdfdata = urllib2.urlopen(url).read()
        xmldata = scraperwiki.pdftoxml(pdfdata)
        root = lxml.etree.fromstring(xmldata)
    
        # how many pages in PDF
        pages = list(root)
        print "There are", len(pages), "pages"
    
        output_data = []
        candidate_data = create_blank_row()
        #from page 86 to 107
        for page in pages[86:107]:
            for el in page:
                if el.tag == "text":
                    if 206 < int(el.attrib['left']) <= 214:
                        # There are some None values here which we want to ignore.
                        if el.text:
                            candidate_data['Rank'] = el.text
    
                    if int(el.attrib['left']) == 222 and el.text:
                        # Also removes ". " from start of names.
                        candidate_data['Name'] = el.text[2:]
    
                    if int(el.attrib['left']) == 591:
                        candidate_data['Sex'] = el.text
    
                    if int(el.attrib['left']) == 622:
                        candidate_data['Party'] = el.text
    
                    if 725 < int(el.attrib['left']) <= 753:
                        candidate_data['Votes'] = el.text
    
                    if 790 < int(el.attrib['left']) < 801:
                        candidate_data['Percentage'] = el.text
    
                if row_is_filled(candidate_data):
                    output_data.append(candidate_data)
                    candidate_data = create_blank_row()
    
        # Collect candidate data into a list, then add it to the SQL database.
        # Calls to the SQL write function are slow, so make as few as possible.
        scraperwiki.sqlite.save(unique_keys=['Rank', 'Name', 'Sex', 'Party',
                                             'Votes'],
                                table_name='ecidata1998',
                                data=output_data)
    
    if __name__ == '__main__':
        main()
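
If it helps to see what the unique keys and the single batched write are for, here is a rough sketch using the standard library's sqlite3 module. It is not ScraperWiki's implementation: it assumes the save behaves like an insert-or-replace keyed on the unique columns, and the helper save_rows and the sample rows are made up for illustration.

```python
import sqlite3


def save_rows(conn, rows):
    """Write all rows in one batch; a row with an existing key replaces it."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS ecidata1998 ('
        '"Rank" TEXT, "Name" TEXT, "Sex" TEXT, "Party" TEXT, '
        '"Votes" TEXT, "Percentage" TEXT, '
        'UNIQUE ("Rank", "Name", "Sex", "Party", "Votes"))'
    )
    # One executemany call instead of one write per row.
    conn.executemany(
        'INSERT OR REPLACE INTO ecidata1998 '
        'VALUES (:Rank, :Name, :Sex, :Party, :Votes, :Percentage)',
        rows,
    )
    conn.commit()


# Made-up rows with the same shape as the scraper's output.
rows = [
    {"Rank": "1", "Name": "CANDIDATE A", "Sex": "M",
     "Party": "XYZ", "Votes": "1000", "Percentage": "60.00%"},
    {"Rank": "2", "Name": "CANDIDATE B", "Sex": "F",
     "Party": "ABC", "Votes": "500", "Percentage": "30.00%"},
]

conn = sqlite3.connect(":memory:")
save_rows(conn, rows)
save_rows(conn, rows)  # running the scraper again does not duplicate rows

count = conn.execute("SELECT COUNT(*) FROM ecidata1998").fetchone()[0]
print(count)  # 2
```

With an empty list of unique keys there is nothing to tell one row from another, so choosing keys that identify a candidate is what keeps repeated runs from piling up duplicates in this sketch.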
    
