How to iterate scraping of all the tables in a list of URLs?

[Question title]: How to iterate scraping of all the tables in a list of URLs?
[Posted]: 2021-09-25 03:56:24
[Question description]:

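I am trying to scrape tables from a website. I managed to scrape the tables for the first URL, but I don't know how to iterate to the next URL.

Here is my code for one URL: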

import io

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver

u = 'https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# open the announcement page, then the iframe that holds the tables
# (assumes chromedriver can be found, e.g. on PATH)
driver = webdriver.Chrome(options=options)
driver.get(u)
html = driver.page_source
soup = bs(html, 'html.parser')
iframe = soup.find('iframe')['src']
driver.get(iframe)
iframehtml = driver.page_source
driver.quit()

# extract every table from the iframe document
df = pd.read_html(io.StringIO(iframehtml))
table1 = df[1]
table2 = df[2]
table3 = df[3]

# clean up: turn the label/value rows into columns
t1 = table1.set_index([0, table1.groupby(0).cumcount()])[1].unstack(0)
t1['Remarks'] = table2.iloc[1]
t3 = table3.set_index([0, table3.groupby(0).cumcount()])[1].unstack(0)

# join the cleaned tables side by side
frame = [t1, t3]
merge = pd.concat(frame, axis=1, join="outer", ignore_index=False)
merge
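
For readers unfamiliar with the cleanup step, here is a minimal sketch of what `set_index([0, groupby(0).cumcount()])[1].unstack(0)` does. The `raw` frame and its values are made up for illustration; they only mimic the two-column label/value layout that the script seems to expect from `pd.read_html`.

```python
import pandas as pd

# Made-up label/value table (assumption: the real tables have this two-column shape).
raw = pd.DataFrame({
    0: ["Date of change", "Name", "Name"],
    1: ["11/11/2011", "First Person", "Second Person"],
})

# cumcount() numbers each repeat of a label (0, 1, ...), so duplicated labels
# end up on separate rows instead of colliding when the labels become columns.
occurrence = raw.groupby(0).cumcount()

# Index by (label, occurrence), keep the value column, then pivot labels into columns.
wide = raw.set_index([0, occurrence])[1].unstack(0)
print(wide)
```

Each distinct label becomes a column, and a label that appears twice produces a second row, with missing values in the columns that have no second entry.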

The output of my single-URL script is:

Now, I don't know how to iterate over two or more URLs in this script:

u = {'https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739','https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393738'}

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

#openurl
driver = webdriver.Chrome('chromedriver',options=options)
web = driver.get(u)
html = driver.page_source
soup = bs(html, 'html.parser')
iframe = soup.find('iframe')['src']
openiframe = driver.get(iframe)
iframehtml = driver.page_source
soupiframe = bs(iframehtml, 'html.parser')

#extracting table
df = pd.read_html(iframehtml)
table1 = df[1]
table2 = df[2]
table3 = df[3]

#cleanup table
t1 = table1.set_index([0, table1.groupby(0).cumcount()])[1].unstack(0)
t1['Remarks'] = table2.iloc[1]
t3 = table3.set_index([0, table3.groupby(0).cumcount()])[1].unstack(0)

#join all table
frame = [t1,t3]
merge = pd.concat(frame,axis=1,join="outer",ignore_index=False)
merge
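
`driver.get` takes a single URL string, so passing a whole set of URLs to it does not work. A common pattern is to move the single-URL steps into a function, call it once per URL, and concatenate the resulting frames. The sketch below shows only that loop structure: `scrape_all` and `fake_scrape_one` are hypothetical names, and `fake_scrape_one` returns placeholder data so the example runs without a browser. In a real script it would hold the Selenium and pandas steps from the question.

```python
import pandas as pd


def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for every URL and stack the one-row frames."""
    frames = [scrape_one(url) for url in urls]
    return pd.concat(frames, ignore_index=True)


def fake_scrape_one(url):
    # Hypothetical stand-in: a real version would open the page, read the
    # tables, and return the merged one-row DataFrame for this URL.
    ann_id = url.rsplit("=", 1)[-1]
    return pd.DataFrame({"ann_id": [ann_id], "Type of change": ["placeholder"]})


# A list (rather than a set) keeps the URLs in a predictable order.
urls = [
    "https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739",
    "https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393738",
]

result = scrape_all(urls, fake_scrape_one)
print(result)
```

With `ignore_index=True`, each URL contributes one row and the rows are renumbered from 0.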

The output I want should look like this:

[Question comments]:

Tags: python pandas selenium beautifulsoup

[Solution 1]:

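import io

import httpx
import pandas as pd
import trio

# The ann_id values from the URLs in the question.
keys = [393738, 393739]


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"
}


# Each worker appends one single-row DataFrame per key.
allin = []


async def worker(channel):
    # Consume keys until the sending side of the channel is closed.
    async with channel:
        async for key_ in channel:
            async with httpx.AsyncClient(timeout=None) as client:
                client.headers.update(headers)
                params = {
                    "e": key_
                }
                # Request the document over plain HTTP, without a browser.
                r = await client.get('https://disclosure.bursamalaysia.com/FileAccess/viewHtml', params=params)
                # index_col=0 makes the first (label) column of each table its index.
                tables = pd.read_html(
                    io.StringIO(r.text), index_col=0)
                # Transpose the label/value tables into single rows and join them.
                df = tables[1].T.join(tables[-1].T)
                df['Remarks'] = tables[2].iloc[1].name
                allin.append(df)


async def main():
    async with trio.open_nursery() as nurse:

        # Unbuffered channel: each send waits until a worker is ready to receive.
        sender, receiver = trio.open_memory_channel(0)

        async with receiver:
            # Start three workers, each with its own clone of the receiving end.
            for _ in range(3):
                nurse.start_soon(worker, receiver.clone())

            async with sender:
                for k in keys:
                    await sender.send(k)

    # The nursery block exits only after every worker has finished.
    finaldf = pd.concat(allin, ignore_index=True)
    print(finaldf)
    # finaldf.to_csv('data.csv', index=False)


if __name__ == "__main__":
    trio.run(main)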
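
The script above spreads the keys across three workers so that requests can overlap instead of running one after another. A minimal sketch of the same bounded-concurrency idea follows, with the standard library's `asyncio` swapped in for trio so that it runs without third-party packages. `fetch_fake` is a hypothetical stand-in for the HTTP request and `limit=3` simply mirrors the three workers; none of this is the answer's own code.

```python
import asyncio


async def fetch_fake(key):
    # Hypothetical stand-in for an HTTP request; yields control like real I/O would.
    await asyncio.sleep(0)
    return {"key": key}


async def fetch_all(keys, fetch, limit=3):
    # The semaphore caps how many fetches are in flight at once.
    sem = asyncio.Semaphore(limit)

    async def bounded(key):
        async with sem:
            return await fetch(key)

    # gather() returns results in the same order as the input keys.
    return await asyncio.gather(*(bounded(k) for k in keys))


results = asyncio.run(fetch_all([393738, 393739], fetch_fake))
print(results)
```

Because `gather` keeps the input order, the results line up with the keys, whereas the trio workers append to `allin` in whatever order the responses arrive.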

Output of the trio script:

0 Date of change Type of change  ...     Reference No                                            Remarks
0     11/11/2011    Resignation  ...  CC-111111-50017  Resigned as Chief Executive Officer of the Com...  
1     31/12/2011         Others  ...  CC-110907-47379  It was Mr Yen Wen Hwa's desire to retire and t...  

[2 rows x 17 columns]
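
Each row in that output comes from the `.T.join(...)` line in the script. Because `index_col=0` turns the label column into the index, transposing a table gives a single row whose columns are the labels, and `join` places two such rows side by side. A small sketch with made-up values (the frame names and contents are assumptions, not data from the site):

```python
import pandas as pd

# Made-up label/value tables, indexed by label as index_col=0 would leave them.
details = pd.DataFrame(
    {1: ["11/11/2011", "Resignation"]},
    index=["Date of change", "Type of change"],
)
reference = pd.DataFrame(
    {1: ["Example Company", "XX-000000-00000"]},
    index=["Company Name", "Reference No"],
)

# .T turns each table into one row (index label 1); join aligns the rows on that label.
row = details.T.join(reference.T)
print(row)
```

Concatenating one such row per key with `pd.concat(..., ignore_index=True)` then gives one line per announcement.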

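[Comments]:

@Prophet What do you mean?
@Prophet Who on earth told you this is JS?
What, @Prophet?? I think the whole code is in Python, but it still needs comments about that library.
@BhavyaParikh Don't worry, I'm used to seeing comments like that from people who don't know what an asynchronous program is.
Thank you for this code snippet, which might provide some limited, immediate help. A proper explanation would greatly improve its long-term value by showing why this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please edit your answer to add some explanation, including the assumptions you've made.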