【Posted】: 2015-01-11 06:07:50
【Question】:
I am trying to extract the dates from <li> tags and store them in an Excel file.
<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>
Code:
import urllib2
import os
from datetime import datetime
import re
os.environ["LANG"]="en_US.UTF-8"
from bs4 import BeautifulSoup
page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
li =soup.find_all("li")
count = 0
while count < len(li):
    soup = BeautifulSoup(li[count])
    date_string, rest = soup.li.text.split(':', 1)
    print datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
    count+=1
Error:
Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\trytest.py", line 13, in <module>
    soup =BeautifulSoup(li[count])
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
[Finished in 4.0s with exit code 1]
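A likely reading of the traceback: each element returned by find_all("li") is already a bs4 Tag, not a string or file. When a Tag is passed back into BeautifulSoup(), the constructor looks for a .read attribute; on a Tag, an unknown attribute is treated as a search for a child tag of that name and comes back as None, so calling it raises the TypeError. The sketch below reads the text of each Tag directly instead of re-wrapping it. It assumes Python 3 and the built-in "html.parser" backend; the names parse_event_date and dates_from_html are hypothetical helpers, not part of the original script. Items without a leading full date (navigation links, entries with only a year) are skipped rather than crashing the loop.

```python
from datetime import datetime


def parse_event_date(item_text):
    """Return the leading 'Month D, YYYY' date of a list item as dd/mm/YYYY, or None."""
    date_string, sep, _rest = item_text.partition(":")
    if not sep:
        return None  # no colon, so this is not a "date: description" entry
    try:
        parsed = datetime.strptime(date_string.strip(), "%B %d, %Y")
    except ValueError:
        return None  # the text before the colon is not a full date
    return parsed.strftime("%d/%m/%Y")


def dates_from_html(html):
    """Collect the formatted dates of every <li> in an HTML string."""
    # Imported lazily so parse_event_date can be used without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    dates = []
    for li in soup.find_all("li"):
        # li is already a Tag: read its text directly, do not re-wrap it.
        formatted = parse_event_date(li.get_text())
        if formatted is not None:
            dates.append(formatted)
    return dates
```

dates_from_html expects the page source as a string, fetched however the script already does it.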
Also, I don't know how to write each extracted text to Excel, so no code for that part is included. Related question: Web crawler to extract in between the list
【Discussion】:
Tags: python parsing web-scraping beautifulsoup web-crawler
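For the Excel part, one simple option is to write a CSV file, which Excel opens directly. This is a minimal sketch using only the standard library, assuming Python 3; the function name write_rows and the column headers are hypothetical choices, and a native .xlsx file would need a third-party library instead.

```python
import csv


def write_rows(path, rows):
    """Write (date, description) pairs to a CSV file that Excel can open."""
    # newline="" lets the csv module control line endings,
    # which avoids blank rows between records on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "description"])  # header row
        writer.writerows(rows)
```

For example, write_rows("stampedes.csv", [("13/01/1991", "At least 40 people")]) produces a two-column file. If the descriptions contain non-ASCII characters and Excel shows them garbled, switching the encoding to "utf-8-sig" may help.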