【Question title】: How to iterate over a list of URLs and scrape all the tables?
【Posted】: 2021-09-25 03:56:24
【Question】:

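I am trying to scrape tables from a website. I managed to scrape the tables for the first URL, but I don't know how to iterate to the next URL.

This is the script for one of my URLs: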

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd

u = 'https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

#openurl
driver = webdriver.Chrome('chromedriver',options=options)
web = driver.get(u)
html = driver.page_source
soup = bs(html, 'html.parser')
iframe = soup.find('iframe')['src']
openiframe = driver.get(iframe)
iframehtml = driver.page_source
soupiframe = bs(iframehtml, 'html.parser')

#extracting table
df = pd.read_html(iframehtml)
table1 = df[1]
table2 = df[2]
table3 = df[3]

#cleanup table
t1 = table1.set_index([0, table1.groupby(0).cumcount()])[1].unstack(0)
t1['Remarks'] = table2.iloc[1]
t3 = table3.set_index([0, table3.groupby(0).cumcount()])[1].unstack(0)

#join all table
frame = [t1,t3]
merge = pd.concat(frame,axis=1,join="outer",ignore_index=False)
merge

My output is:

Now, I don't know how to iterate over 2 or more URLs in this script:

u = {'https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739','https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393738'}

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

#openurl
driver = webdriver.Chrome('chromedriver',options=options)
web = driver.get(u)
html = driver.page_source
soup = bs(html, 'html.parser')
iframe = soup.find('iframe')['src']
openiframe = driver.get(iframe)
iframehtml = driver.page_source
soupiframe = bs(iframehtml, 'html.parser')

#extracting table
df = pd.read_html(iframehtml)
table1 = df[1]
table2 = df[2]
table3 = df[3]

#cleanup table
t1 = table1.set_index([0, table1.groupby(0).cumcount()])[1].unstack(0)
t1['Remarks'] = table2.iloc[1]
t3 = table3.set_index([0, table3.groupby(0).cumcount()])[1].unstack(0)

#join all table
frame = [t1,t3]
merge = pd.concat(frame,axis=1,join="outer",ignore_index=False)
merge

The output should look like this:

【Comments】:

    Tags: python pandas selenium beautifulsoup


    【Solution 1】:
    import trio
    import httpx
    import pandas as pd
    
    keys = [393738, 393739]
    
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"
    }
    
    
    allin = []
    
    
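    async def worker(channel):
        # Each worker pulls announcement ids from the shared channel until it is closed.
        async with channel:
            async for key_ in channel:
                async with httpx.AsyncClient(timeout=None) as client:
                    client.headers.update(headers)
                    params = {
                        "e": key_
                    }
                    r = await client.get('https://disclosure.bursamalaysia.com/FileAccess/viewHtml', params=params)
                    # Parse every table on the page; the first column becomes the index.
                    tables = pd.read_html(
                        r.text, index_col=0)
                    # Transpose the two detail tables into single rows and join them side by side.
                    df = tables[1].T.join(tables[-1].T)
                    df['Remarks'] = tables[2].iloc[1].name
                    allin.append(df)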
    
    
    async def main():
        async with trio.open_nursery() as nurse:
    
            sender, receiver = trio.open_memory_channel(0)
    
            async with receiver:
                for _ in range(3):
                    nurse.start_soon(worker, receiver.clone())
    
                async with sender:
                    for k in keys:
                        await sender.send(k)
    
        finaldf = pd.concat(allin, ignore_index=True)
        print(finaldf)
        # finaldf.to_csv('data.csv', index=False)
    
    
    if __name__ == "__main__":
        trio.run(main)
    

    Output:

    0 Date of change Type of change  ...     Reference No                                            Remarks
    0     11/11/2011    Resignation  ...  CC-111111-50017  Resigned as Chief Executive Officer of the Com...  
    1     31/12/2011         Others  ...  CC-110907-47379  It was Mr Yen Wen Hwa's desire to retire and t...  
    
    [2 rows x 17 columns]
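
The answer above skips Selenium by requesting the iframe document directly from the `viewHtml` endpoint. The sketch below shows one way to get there from the URLs in the question, and how a plain loop over a list (rather than a set, which has no guaranteed order) can collect one result per URL. It assumes that the `e` parameter equals the `ann_id` of the announcement page, which is inferred from the keys used in the answer. `scrape_many` and `scrape_one` are hypothetical names: `scrape_one` stands for whatever per-page logic you use (for example the `pd.read_html` steps), and here it is replaced by a placeholder so the sketch runs offline.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Endpoint taken from the answer above.
VIEW_HTML = "https://disclosure.bursamalaysia.com/FileAccess/viewHtml"


def iframe_url(announcement_url):
    """Build the viewHtml URL for an announcement page URL (assumes e == ann_id)."""
    query = parse_qs(urlsplit(announcement_url).query)
    ann_id = query["ann_id"][0]  # raises KeyError if the URL has no ann_id
    return f"{VIEW_HTML}?{urlencode({'e': ann_id})}"


def scrape_many(urls, scrape_one):
    """Call scrape_one on each mapped URL in order; keep going if one page fails."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(scrape_one(iframe_url(url)))
        except Exception as exc:
            failures.append((url, exc))
    return results, failures


urls = [
    "https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393739",
    "https://www.bursamalaysia.com/market_information/announcements/company_announcement/announcement_details?ann_id=393738",
]

# Placeholder scrape_one that simply returns the URL it was asked to fetch.
results, failures = scrape_many(urls, lambda target: target)
print(results)
print(failures)
```

With a real `scrape_one` that returns a one-row DataFrame, the collected `results` could then be combined with `pd.concat(results, ignore_index=True)`, as the answer does with `allin`.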
    

    【Discussion】:

    • @Prophet What do you mean?
    • @Prophet Who on earth told you this is JS?
    • What, @Prophet?? I think the whole code is in Python, but it still needs comments on that library.
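    • @BhavyaParikh Don't worry, I'm used to seeing comments like this when people don't know what an async program is.
    • Thanks for this code snippet, which might provide some limited, immediate help. A proper explanation would greatly improve its long-term value by showing why this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please edit your answer to add some explanation, including the assumptions you've made.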
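
As a partial explanation of the kind requested in the last comment: the answer starts three workers that all read from one shared channel, sends each announcement id into that channel, and lets whichever worker is free pick it up. The sketch below shows the same worker-pool pattern using the standard library's `asyncio.Queue` in place of trio's memory channel, so it is a swapped-in technique and not the answer's code. `fake_fetch` is a hypothetical stand-in for the HTTP request and `pd.read_html` step, so the example runs without network access.

```python
import asyncio


async def fake_fetch(key):
    # Hypothetical stand-in for the HTTP request and table parsing.
    await asyncio.sleep(0)
    return {"key": key, "Remarks": f"row for {key}"}


async def worker(queue, results):
    # Pull keys until cancelled; each worker handles one key at a time.
    while True:
        key = await queue.get()
        try:
            results.append(await fake_fetch(key))
        finally:
            queue.task_done()


async def scrape_all(keys, n_workers=3):
    queue = asyncio.Queue()
    results = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    for key in keys:
        queue.put_nowait(key)
    await queue.join()  # wait until every key has been processed
    for task in workers:
        task.cancel()  # the workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


rows = asyncio.run(scrape_all([393738, 393739]))
print(sorted(row["key"] for row in rows))
```

Because the workers run concurrently, the order of `rows` is not guaranteed to match the order of the keys, which is also true of `allin` in the answer.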