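Below are two examples and a possible solution:
- Example 1 shows a working case.
- Example 2 shows a non-working case that produces the error you reported.
- The solution shows one possible way to handle it.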
Example 1: the HTML contains the expected div
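from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 on Python 3

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry-content"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
# find() returns the outer div; findAll() then collects the class-less divs inside it.
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
# Raw string keeps the backslash literal; the with block closes the file.
with open(r'.\parveen_urls.txt', 'w', encoding='utf8') as fobj:
    for getting in url:
        fobj.write(getting.string)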
Example 2: the HTML does not contain the expected div
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 on Python 3

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
# The error is raised on the next line: the first find() matches nothing
# and returns None, and calling findAll() on None raises
# AttributeError: 'NoneType' object has no attribute 'findAll'
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
# Never reached, because the line above raises.
with open(r'.\parveen_urls2.txt', 'w', encoding='utf8') as fobj:
    for getting in url:
        fobj.write(getting.string)
Possible solution:
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 on Python 3

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
url = soup.find('div', attrs={"class": "entry-content"})
# Handle documents that do not have the expected HTML structure:
# only call findAll() when find() actually returned a tag.
if url is not None:
    url = url.findAll('div', attrs={"class": None})
    with open(r'.\parveen_urls2.txt', 'w', encoding='utf8') as fobj:
        for getting in url:
            fobj.write(getting.string)
else:
    print("The HTML source does not comply with the expected structure")
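
If you need the same check in several places, the guard can be wrapped in a small function. This is only a sketch under a few assumptions: it uses BeautifulSoup 4 (the bs4 package) with the built-in html.parser, and the names extract_urls and container_class are made up for illustration. It returns an empty list instead of raising when the outer div is missing, and it filters on div.get("class") rather than passing attrs={"class": None}.

```python
from bs4 import BeautifulSoup


def extract_urls(html, container_class="entry-content"):
    """Return the text of each class-less <div> inside the container div.

    Returns an empty list when the container is missing, so the caller
    never ends up calling a method on None.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": container_class})
    if container is None:
        return []
    return [
        div.get_text(strip=True)
        for div in container.find_all("div")
        if div.get("class") is None  # keep only divs without a class attribute
    ]


good = ('<body><div class="entry-content"><div>http://teste.com</div>'
        '<div>http://teste2.com</div></div></body>')
bad = good.replace("entry-content", "entry")

print(extract_urls(good))  # ['http://teste.com', 'http://teste2.com']
print(extract_urls(bad))   # []
```

The caller can then decide what an empty result means (skip the page, log it, and so on) without wrapping every lookup in its own if/else.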