【Question title】: Add a column to the scraped tables
【Posted】: 2021-04-11 16:00:20
【Question】:

I am trying to add a column containing the unique date to each table scraped from the weather site https://www.wunderground.com/history/daily/us/dc/washington/KDCA

I started with this code:

from selenium import webdriver
import datetime
import pandas as pd
import time

driver = webdriver.Chrome('/Users/razanalthawwadi/Desktop/chromedriver')
link = 'https://www.wunderground.com/history/daily/us/va/arlington-county/KDCA/Date/'

def list_dates(start, end):
    """Create a list of dates between the 'start' date and the 'end' date (inclusive)."""
    # create datetime objects for the start and end dates
    start = datetime.datetime.strptime(start, '%Y-%m-%d')
    end = datetime.datetime.strptime(end, '%Y-%m-%d')
    # generate the list of dates between the start and end dates
    step = datetime.timedelta(days=1)
    dates = []
    while start <= end:
        dates.append(start.date())
        start += step
    # return the list of dates in string format
    return [str(date) for date in dates]

dates = list_dates('2017-01-01', '2017-12-31')
we = []

datess = []
for i in dates:
    # print(i)
    datess.append(i)
    page = str(link) + str(i)
    driver.get(page)
    time.sleep(3)
    html = driver.page_source
    df = pd.read_html(html)
    we.append(df[1])

I tried using this loop, but it writes the same date for all tables:

for i in dates:

    wel.insert(loc=0, column='jj', value=i) 

【Comments】:

    Tags: python web-scraping data-science


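    【Solution 1】:

    Just assign i to the dataframe as a date column inside the loop.

    Note: this has not been tested against all URLs. Also, wunderground used to have an API, so it may be worth looking into that instead of incurring the overhead of a browser.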

    from selenium import webdriver
    import datetime
    import pandas as pd
    import time
    
    driver = webdriver.Chrome('/Users/razanalthawwadi/Desktop/chromedriver')
    link='https://www.wunderground.com/history/daily/us/va/arlington-county/KDCA/Date/'
    
    def list_dates(start,end):
        """Create a list of dates between the 'start' date and the 'end' date (inclusive)."""
        # create datetime object for the start and end dates
        start = datetime.datetime.strptime(start, '%Y-%m-%d')
        end = datetime.datetime.strptime(end, '%Y-%m-%d')
        # generates list of dates between start and end dates
        step = datetime.timedelta(days=1)
        dates = []
        
        while start <= end:
            dates.append(start.date())
            start += step
        # return the list of dates in string format
        return [str(date) for date in dates]
    
    dates = list_dates('2017-01-01','2017-12-31')
     
    tables = []  # one DataFrame per date
    for i in dates:
        page = f'{link}{i}'
        driver.get(page)
        # time.sleep(3)
        html = driver.page_source
        df = pd.read_html(html)[1]
        # drop rows that are entirely empty
        df.dropna(axis=0, how='all', inplace=True)
        # the scalar date is broadcast to every row of this table
        df['date'] = i
        tables.append(df)
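
    The idea above, tagging each table with the loop variable and keeping it, can be tried without a browser. A minimal sketch, assuming the scraped tables are already in hand; `tag_and_combine` is a hypothetical helper name and the sample values are made-up placeholders, not real weather readings:

```python
import pandas as pd

def tag_and_combine(tables_by_date):
    """Hypothetical helper: tables_by_date maps a date string to the
    DataFrame scraped for that date."""
    tagged = []
    for date, table in tables_by_date.items():
        # drop fully empty rows and work on a copy of the input
        table = table.dropna(axis=0, how='all').copy()
        # assigning a scalar fills the whole column with that date
        table['date'] = date
        tagged.append(table)
    # stack the per-date tables into one DataFrame
    return pd.concat(tagged, ignore_index=True)

# placeholder data standing in for the scraped tables
sample = {
    '2017-01-01': pd.DataFrame({'Temperature': [40, 42]}),
    '2017-01-02': pd.DataFrame({'Temperature': [35]}),
}
combined = tag_and_combine(sample)
print(combined)
```

    Tagging before concatenating is what keeps the dates distinct: each table receives its own value while it is still separate.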
    

    【Comments】:

    • When I apply this code I get this error message: IndexError: list index out of range
    • What is the value of page when the error occurs? That is, on which page does it happen?
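
    An `IndexError` at `pd.read_html(html)[1]` means the returned list held fewer than two tables. One possible cause, which is an assumption and not confirmed in this thread, is that the page had not finished rendering when `page_source` was read. A hedged sketch of a retry helper; `read_table_with_retry` and its parameters are hypothetical names, and `fetch_tables` stands in for something like `lambda: pd.read_html(driver.page_source)`:

```python
import time

def read_table_with_retry(fetch_tables, index=1, attempts=3, delay=1.0):
    """Hypothetical helper: call fetch_tables() until it returns more than
    `index` tables. Returns the table at `index`, or None if every attempt
    came up short."""
    for attempt in range(attempts):
        try:
            tables = fetch_tables()
        except ValueError:
            # pd.read_html raises ValueError when it finds no tables at all
            tables = []
        if len(tables) > index:
            return tables[index]
        if attempt < attempts - 1:
            time.sleep(delay)  # give the page time to render, then retry
    return None
```

    Returning None instead of raising lets the calling loop record which date failed and move on to the next page.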
    【Solution 2】:

    Add this to the first loop:

    df[1].insert(loc=0, column='Date', value=i)
    

    It works.
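
    For comparison, a minimal sketch of the `DataFrame.insert` variant on placeholder data (the sample values are made up, not real readings). The insert happens per table, before the tables are combined, so each table keeps its own date in the first column:

```python
import pandas as pd

# placeholder (date, values) pairs standing in for the scraped tables
scraped = [('2017-01-01', [40, 42]), ('2017-01-02', [35])]

tables = []
for date, temps in scraped:
    table = pd.DataFrame({'Temperature': temps})
    # loc=0 puts the new column first; the scalar fills every row
    table.insert(loc=0, column='Date', value=date)
    tables.append(table)

result = pd.concat(tables, ignore_index=True)
print(result)
```

    Calling `insert` once on an already combined table would broadcast a single date to every row, which is consistent with the symptom described in the question.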

    【Comments】:
