【Posted】:2018-11-08 12:53:49
【Question】:
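I have been trying to extract some data from this URL. However, I cannot scrape the "See if you can identify the pest" section. There is a class named "collapsefaq-content" that BeautifulSoup cannot find.
I want to scrape all of the tagged data under this class.
Here is my code: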
import urllib.request
import csv
import pandas as pd
from bs4 import BeautifulSoup
import html5lib
import lxml

page_url = 'http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases'
page = urllib.request.urlopen(page_url)
soup = BeautifulSoup(page, 'html.parser')
file_name = "alpit.csv"

# Each list starts with its CSV column label.
main_url = []
see_if_you_can = []
see_if_you_can.append("Identify")
legal = []
legal.append('Legal Stuff')
specimen = []
specimen.append("Specimen")
insect_name = []
insect_name.append("Name of insect")
disease_name = []
disease_name.append("Name")
disease_list = []
disease_list.append("URL")
origin = []
origin.append('Origin')

for insectName in soup.find_all('li', attrs={'class': 'flex-item'}):
    if(str(insectName.a.attrs['href']).startswith('/')):
        # to go in the link and extract data
        main_url.append('http://www.agriculture.gov.au' +
                        insectName.a.attrs['href'])
        print(insectName.text.strip())  # disease name
        for name in insectName.find_all('img'):
            print('http://www.agriculture.gov.au' +
                  name.attrs['src'])  # disease link
            disease_list.append('http://www.agriculture.gov.au' +
                                name.attrs['src'])

for disease in main_url:
    if(True):
        # disease = 'http://www.agriculture.gov.au'+disease
        inner_page = urllib.request.urlopen(disease)
        soup_list = BeautifulSoup(inner_page, 'lxml')
        for detail in soup_list.find_all('strong'):
            if(detail.text == 'Origin: '):
                origin.append(detail.next_sibling.strip())
                print(detail.next_sibling.strip())
        for name in soup_list.find_all('div', class_='pest-header-content'):
            print(name.h2.text)
            insect_name.append(name.h2.text)
        # This loop body never runs: find_all returns no matches here.
        for textin in soup_list.find_all('div', class_="collapsefaq-content"):
            print("*******")
            print(textin.text)

# print('alpit')
# print(len(disease_list))
# print(len(origin))
df = pd.DataFrame([insect_name, disease_list, origin, see_if_you_can, legal, specimen])
df = df.transpose()
df.to_csv(file_name, index=False, header=None)
# with open('alpit.csv','w') as myfile:
#     wr = csv.writer(myfile)
#     for val in disease_list:
#         wr.writerow([val])
#     for val in origin:
#         wr.writerow([val])
Not even the "*******" line is printed. Can anyone tell me what I am doing wrong here?
【Comments】:
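-
That class does not exist in the HTML; it is probably added later by JS logic.
-
What do you want to parse from there? The content you are looking for is not dynamically generated.
-
@Alpit Anand, please don't post the full script; post only the relevant part.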
Tags: python web-scraping beautifulsoup
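
The first comment suggests that the class may be missing from the HTML the server actually returns. One way to check this is to count the tags carrying that class in the raw markup before involving BeautifulSoup at all. The following is a minimal sketch using only the standard library; `ClassCounter` and `count_class` are hypothetical names chosen for illustration, and the sample markup is made up rather than taken from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value and self.class_name in value.split():
                self.count += 1


def count_class(html, class_name):
    """Return how many tags in the raw HTML string carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up markup standing in for a downloaded page (an assumption, not the real site).
sample = (
    '<div class="pest-header-content"><h2>Name</h2></div>'
    '<div class="faq">text</div>'
)
print(count_class(sample, "pest-header-content"))
print(count_class(sample, "collapsefaq-content"))
```

To try it on a real page, pass the decoded response body, for example `urllib.request.urlopen(url).read().decode('utf-8', errors='replace')`, as `html`. A count of zero means the class is not in the served markup: either the static HTML uses a different class name or the class is added in the browser. In both cases `find_all` on that class returns nothing, so the loop body never runs, which would match the missing "*******" output.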