【Question title】: When getting data through web scraping, the previous data is overwritten by the new data
【Posted】: 2022-01-25 12:52:06
【Question】:

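I am scraping the NSE website to get its corporate announcements. The problem is that the URL I use in this code returns only 20 items at a time, so many of the hundreds of announcements published each day are missed.

I want to fix this so that I get all the previous announcements as well as the new ones. Here is my code: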

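import logging
import requests
import pandas as pd
from datetime import date
from datetime import datetime

# logger is used in the except block at the end of the script
logger = logging.getLogger(__name__)
today = date.today()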

__request_headers = {
    'Host':'www.nseindia.com', 
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 
    'Accept-Language':'en-US,en;q=0.5', 
    'Accept-Encoding':'gzip, deflate, br',
    'DNT':'1', 
    'Connection':'keep-alive', 
    'Upgrade-Insecure-Requests':'1',
    'Pragma':'no-cache',
    'Cache-Control':'no-cache',    
}


try:
    nse_url = 'https://www.nseindia.com/'
    url = 'https://www.nseindia.com/api/corporate-announcements?index=equities'
    resp = requests.get(url=nse_url, headers=__request_headers)
    if resp.ok:
        req_cookies = dict(nsit=resp.cookies['nsit'], nseappid=resp.cookies['nseappid'], ak_bmsc=resp.cookies['ak_bmsc'])
        tresp = requests.get(url=url, headers=__request_headers, cookies=req_cookies)
        result = tresp.json()
        result = pd.DataFrame(result)
        result.drop(['difference', 'dt','exchdisstime','csvName','old_new','orgid','seq_id','sm_isin','bflag','symbol','sort_date'], axis = 1, inplace = True)
        result.rename(columns = {'an_dt':'DateandTime', 'attchmntFile':'Source','attchmntText':'Topic','desc':'Type','smIndustry':'Sector','sm_name':'Company Name'}, inplace = True)
        result[['Date','Time']] = result.DateandTime.str.split(expand=True)
        result.to_csv( ( str(today.day) +'-'+str(today.month) +'-'+'CA.csv'), index=True)
        print(result)
        res_data = result["NIFTY"]["data"] if "NIFTY" in result and "data" in result["NIFTY"] else []
        if res_data is not None and len(res_data) > 0:
            __top_list = res_data
            print(__top_list)
except OSError as err:
    logger.error('Unable to fetch data')

【Comments】:

    Tags: python pandas web-scraping


    【Solution 1】:

    You can build your request with the one-day URL and pass today's date (or any date range you want).

    import requests
    import pandas as pd
    from datetime import datetime

    # A session keeps the cookies set by the home page and sends them with the API call
    s = requests.Session()
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
    url = 'https://www.nseindia.com/'
    step = s.get(url, headers=headers)  # initial visit, made only to collect cookies

    # Dates are passed as dd-mm-YYYY in the from_date and to_date parameters
    today = datetime.now().strftime('%d-%m-%Y')
    api_url = f'https://www.nseindia.com/api/corporate-announcements?index=equities&from_date={today}&to_date={today}'

    resp = s.get(api_url, headers=headers).json()

    df = pd.DataFrame(resp)
    df.to_csv('nseindia.csv', index=False)

    print('Saved to nseindia.csv')
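
For a range other than today, the same URL can be built from two dates. The sketch below is one possible way to do that, assuming the endpoint accepts the `from_date` and `to_date` parameters in `dd-mm-YYYY` form exactly as in the code above. The helper names `build_api_url` and `daily_urls` are hypothetical and not part of any library.

```python
from datetime import date, timedelta

BASE_URL = "https://www.nseindia.com/api/corporate-announcements"
DATE_FORMAT = "%d-%m-%Y"  # same format as the strftime call in the answer


def build_api_url(from_date, to_date, index="equities"):
    """Return the announcements URL for an inclusive date range."""
    if from_date > to_date:
        raise ValueError("from_date must not be after to_date")
    return (
        f"{BASE_URL}?index={index}"
        f"&from_date={from_date.strftime(DATE_FORMAT)}"
        f"&to_date={to_date.strftime(DATE_FORMAT)}"
    )


def daily_urls(start, end):
    """Return one single-day URL per calendar day from start to end, inclusive."""
    day_count = (end - start).days + 1
    return [
        build_api_url(start + timedelta(days=offset), start + timedelta(days=offset))
        for offset in range(day_count)
    ]
```

Each URL from `daily_urls(date(2022, 1, 24), date(2022, 1, 26))` could then be fetched with the session `s` from the answer, one request per day. Whether a single multi-day request returns every row is not confirmed here, so requesting day by day is only a cautious default.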
    

    【Comments】:

    • Thank you for your help, sir, it worked.
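
Because `to_csv` rewrites the file on every run, rows saved earlier are lost if the script is run more than once. A minimal sketch of one way to keep them is shown below. It assumes the JSON response is a list of records and that the `seq_id` field (a column the question's code drops) identifies an announcement uniquely. That uniqueness is an assumption, and `merge_announcements` is a hypothetical helper that uses only the standard library.

```python
import csv
import os


def merge_announcements(csv_path, new_rows, key="seq_id"):
    """Add rows whose key is not yet in csv_path; return how many were added."""
    existing = []
    if os.path.exists(csv_path):
        with open(csv_path, newline="", encoding="utf-8") as fh:
            existing = list(csv.DictReader(fh))

    # Values read back from a CSV file are strings, so compare keys as strings
    seen = {row[key] for row in existing}
    added = 0
    for row in new_rows:
        row_key = str(row[key])
        if row_key not in seen:
            seen.add(row_key)
            existing.append(row)
            added += 1

    if existing:
        # Union of all column names, in first-seen order
        fieldnames = list(dict.fromkeys(k for row in existing for k in row))
        with open(csv_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(existing)
    return added
```

With the answer's variables, `merge_announcements('nseindia.csv', resp)` would take the place of the `df.to_csv(...)` call, so repeated runs during the day accumulate rows instead of replacing them.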