
[Question title]: Add column to the scraped tables
[Posted]: 2021-04-11 16:00:20
[Question description]:

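I am trying to add a column to each table scraped from the weather site https://www.wunderground.com/history/daily/us/dc/washington/KDCA, holding the unique date that the table belongs to.

I started with this code: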

from selenium import webdriver
import datetime
import time
import pandas as pd

driver = webdriver.Chrome('/Users/razanalthawwadi/Desktop/chromedriver')
link = 'https://www.wunderground.com/history/daily/us/va/arlington-county/KDCA/Date/'

def list_dates(start, end):
    """Create a list of dates between the 'start' date and the 'end' date."""
    # create datetime objects for the start and end dates
    start = datetime.datetime.strptime(start, '%Y-%m-%d')
    end = datetime.datetime.strptime(end, '%Y-%m-%d')
    # generate the list of dates between the start and end dates
    step = datetime.timedelta(days=1)
    dates = []
    while start <= end:
        dates.append(start.date())
        start += step
    # return the list of dates in string format
    return [str(date) for date in dates]

dates = list_dates('2017-01-01', '2017-12-31')
we = []      # one scraped table per date
datess = []  # dates visited so far
for i in dates:
    # print(i)
    datess.append(i)
    page = link + i
    driver.get(page)
    time.sleep(3)  # give the page time to render
    html = driver.page_source
    df = pd.read_html(html)
    we.append(df[1])

I tried using this loop, but it gives every table the same date:

for i in dates:
    wel.insert(loc=0, column='jj', value=i)

[Question comments]:

Tags: python web-scraping data-science

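[Solution 1]:

Just assign i to the dataframe as a date column inside the loop.

Note: this was not tested against all URLs. Also, wunderground used to have an API, so it may be worth looking into that rather than incurring the overhead of a browser.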

from selenium import webdriver
import datetime
import pandas as pd
import time

driver = webdriver.Chrome('/Users/razanalthawwadi/Desktop/chromedriver')
link='https://www.wunderground.com/history/daily/us/va/arlington-county/KDCA/Date/'

def list_dates(start,end):
    """Create a list of dates between the 'start' date and the 'end' date."""
    # create datetime object for the start and end dates
    start = datetime.datetime.strptime(start, '%Y-%m-%d')
    end = datetime.datetime.strptime(end, '%Y-%m-%d')
    # generates list of dates between start and end dates
    step = datetime.timedelta(days=1)
    dates = []
    
    while start <= end:
        dates.append(start.date())
        start += step
    # return the list of dates in string format
    return [str(date) for date in dates]

dates = list_dates('2017-01-01','2017-12-31')
 
frames = []  # keep each dated table instead of overwriting df on every pass
for i in dates:
    page = f'{link}{i}'
    driver.get(page)
    # time.sleep(3)
    html = driver.page_source
    df = pd.read_html(html)[1]
    df.dropna(axis=0, how='all', inplace=True)
    df['date'] = i  # the scalar is broadcast to every row of this table
    frames.append(df)
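
Once each table carries its date, the per-day frames can be stacked into one table with pd.concat. A minimal sketch, assuming small hand-built DataFrames stand in for the scraped tables (the column names and values below are made up for illustration and are not taken from the site):

```python
import pandas as pd

# Hypothetical stand-ins for the tables scraped on each date.
scraped = {
    '2017-01-01': pd.DataFrame({'Time': ['12:52 AM', '1:52 AM'],
                                'Temperature': [41, 40]}),
    '2017-01-02': pd.DataFrame({'Time': ['12:52 AM'],
                                'Temperature': [38]}),
}

frames = []
for date, table in scraped.items():
    table = table.copy()   # leave the original table untouched
    table['date'] = date   # scalar is broadcast to every row
    frames.append(table)

# Stack the dated tables and renumber the rows from 0.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Because the date is written before concatenation, each row keeps the date of the page it came from.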

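[Comments]:

When I apply this code I get this error message: IndexError: list index out of range
What is the value of page when the error occurs? That is, on which page does the error happen?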
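
The IndexError suggests that pd.read_html returned fewer than two tables for some page, so indexing with [1] failed. The thread does not confirm the cause; a page that had not finished rendering or had no data for that date are both only guesses. One way to find the offending date is to check the list length before indexing. A sketch, where pick_table is a hypothetical helper and the lists of DataFrames stand in for what pd.read_html would return:

```python
import pandas as pd

def pick_table(tables, index, date):
    """Return a copy of tables[index] tagged with date, or None if it is missing."""
    if len(tables) <= index:
        return None
    table = tables[index].copy()
    table['date'] = date
    return table

# Hypothetical read_html results: the second date lacks the table at index 1.
results = {
    '2017-01-01': [pd.DataFrame({'a': [0]}), pd.DataFrame({'Temperature': [41]})],
    '2017-01-02': [pd.DataFrame({'a': [0]})],
}

frames, skipped = [], []
for date, tables in results.items():
    table = pick_table(tables, 1, date)
    if table is None:
        skipped.append(date)  # remember which pages to inspect or retry
    else:
        frames.append(table)

print('skipped dates:', skipped)
```

The skipped list then answers the question of which page triggers the error.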

[Solution 2]:

Add this to the first loop:

df[1].insert(loc=0, column='Date', value=i)

It works.
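
DataFrame.insert modifies the table in place and puts the new column at the given position, so loc=0 makes the date the first column, whereas df['Date'] = i would append it as the last one. A small sketch with a made-up table standing in for df[1]:

```python
import pandas as pd

# Made-up stand-in for one scraped table.
table = pd.DataFrame({'Time': ['12:52 AM', '1:52 AM'],
                      'Temperature': [41, 40]})
i = '2017-01-01'

# Insert the date as the first column; the scalar fills every row.
table.insert(loc=0, column='Date', value=i)
print(table)
```

Note that insert raises a ValueError if a column with that name already exists, so it should run once per table.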

[Comments]: