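【Posted】: 2021-09-25 03:56:24
【Question】:
I'm trying to scrape a website for a story. I managed to scrape the tables for the first URL, but I don't know how to iterate to the next URL.
Here is my code for a single URL: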
from bs4 import BeautifulSoup as bs
import pandas as pd
from selenium import webdriver

u = 'https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739'
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Open the announcement page
driver = webdriver.Chrome('chromedriver', options=options)
driver.get(u)  # get() returns None, so there is nothing to assign
html = driver.page_source
soup = bs(html, 'html.parser')
iframe = soup.find('iframe')['src']
driver.get(iframe)  # load the iframe document itself; the tables live there
iframehtml = driver.page_source
soupiframe = bs(iframehtml, 'html.parser')
# Extract every <table> in the iframe document (read_html returns a list of DataFrames)
df = pd.read_html(iframehtml)
table1 = df[1]
table2 = df[2]
table3 = df[3]
# Pivot each two-column field/value table into wide form (one column per field)
t1 = table1.set_index([0, table1.groupby(0).cumcount()])[1].unstack(0)
t1['Remarks'] = table2.iloc[1]
t3 = table3.set_index([0, table3.groupby(0).cumcount()])[1].unstack(0)
# Join the reshaped tables side by side
frame = [t1,t3]
merge = pd.concat(frame,axis=1,join="outer",ignore_index=False)
merge
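The cleanup step above is compact, so here is a minimal sketch of what that `set_index(...)[1].unstack(0)` line does, run on a small made-up table instead of the real page. The field names and values are invented for illustration; only the reshaping idiom comes from the script.

```python
import pandas as pd

# Toy stand-in for one table returned by pd.read_html(): column 0 holds the
# field names, column 1 holds the values. The contents are made up.
table = pd.DataFrame({
    0: ["Company Name", "Stock Name", "Type", "Type"],
    1: ["ACME BHD", "ACME", "A", "B"],
})

# Number repeated field names 0, 1, 2, ... so that every
# (field, occurrence) pair is unique.
occurrence = table.groupby(0).cumcount()

# Index by (field, occurrence), keep the value column, then move the
# field names out of the index and into the columns.
wide = table.set_index([0, occurrence])[1].unstack(0)

print(wide)
```

Each distinct field becomes a column, and a field that appears more than once spills into extra rows. The `cumcount()` key is what keeps the index unique when field names repeat; without it, `unstack` would have duplicate index entries to deal with.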
Now I don't know how to iterate over two or more URLs in this script:
u = {
    'https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739',
    'https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393738',
}
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Open the announcement page
driver = webdriver.Chrome('chromedriver', options=options)
driver.get(u)  # fails here: get() takes a single URL string, not a collection of URLs
html = driver.page_source
soup = bs(html, 'html.parser')
iframe = soup.find('iframe')['src']
driver.get(iframe)  # load the iframe document itself; the tables live there
iframehtml = driver.page_source
soupiframe = bs(iframehtml, 'html.parser')
# Extract every <table> in the iframe document (read_html returns a list of DataFrames)
df = pd.read_html(iframehtml)
table1 = df[1]
table2 = df[2]
table3 = df[3]
# Pivot each two-column field/value table into wide form (one column per field)
t1 = table1.set_index([0, table1.groupby(0).cumcount()])[1].unstack(0)
t1['Remarks'] = table2.iloc[1]
t3 = table3.set_index([0, table3.groupby(0).cumcount()])[1].unstack(0)
# Join the reshaped tables side by side
frame = [t1,t3]
merge = pd.concat(frame,axis=1,join="outer",ignore_index=False)
merge
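One possible way to structure the iteration, sketched under assumptions: the per-page steps are moved into a function that takes one URL, and a small loop calls it once per URL. The names `scrape_each` and `fake_scrape` are hypothetical, and `fake_scrape` is only a stand-in so the sketch runs without a browser; the third URL is deliberately bogus to show how a failing page is handled.

```python
from typing import Any, Callable, Iterable, List, Tuple


def scrape_each(
    urls: Iterable[str],
    scrape_one: Callable[[str], Any],
) -> Tuple[List[Tuple[str, Any]], List[Tuple[str, Exception]]]:
    """Call scrape_one(url) once per URL, keeping results and failures apart."""
    results = []
    failures = []
    for url in urls:
        try:
            results.append((url, scrape_one(url)))
        except Exception as exc:  # one bad page should not stop the whole run
            failures.append((url, exc))
    return results, failures


def fake_scrape(url: str) -> dict:
    """Hypothetical stand-in for the real per-page scraping logic."""
    ann_id = url.split("ann_id=")[1]
    if not ann_id.isdigit():
        raise ValueError(f"unexpected ann_id in {url}")
    return {"ann_id": int(ann_id)}


BASE = (
    "https://www.bursamalaysia.com/market_information/announcements/"
    "company_announcement/announcement_details?ann_id="
)
# A list (not a set) keeps the URLs in a predictable order.
urls = [BASE + "393739", BASE + "393738", BASE + "oops"]

results, failures = scrape_each(urls, fake_scrape)
print([row["ann_id"] for _, row in results])
print([url for url, _ in failures])
```

In the real script, the function passed as `scrape_one` would wrap the steps from the question (`driver.get(url)`, reading the iframe, `pd.read_html`, the reshaping) and return the merged frame for that page. The per-URL frames could then be stacked with `pd.concat([frame for _, frame in results], ignore_index=True)`, reusing a single `driver` for every page.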
【Discussion】:
Tags: python pandas selenium beautifulsoup