【Question title】: Creating a dataframe where one of the arrays has a different length
【Posted】: 2019-01-05 15:17:47
【Question】:

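I am learning to scrape data from websites with Python, and I am extracting weather information for San Francisco from this page. I got stuck when combining the data into a pandas DataFrame. Is it possible to create a DataFrame in which each row has a different length?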

I have tried two approaches based on the answers linked here, but neither does what I want. Both shift the values of the temps column up. Here is a screenshot of what I am trying to explain.

First approach: https://stackoverflow.com/a/40442094/10179259

Second approach: https://stackoverflow.com/a/19736406/10179259

import requests
from bs4 import BeautifulSoup
import pandas as pd


page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")

soup = BeautifulSoup(page.content, 'html.parser')

seven_day = soup.find(id="seven-day-forecast")

forecast_items = seven_day.find_all(class_="tombstone-container")

periods=[pt.get_text() for pt in seven_day.select('.tombstone-container .period-name')]

short_descs=[sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]

temps=[t.get_text() for t in seven_day.select('.tombstone-container .temp')]

descs = [d['alt'] for d in seven_day.select('.tombstone-container img')]


#print(len(periods), len(short_descs), len(temps), len(descs))

weather = pd.DataFrame({
        "period": periods, #length is 9
        "short_desc": short_descs, #length is 9
        "temp": temps, #problem here length is 8
        #"desc":descs #length is 9
    })


print(weather)

I want the first row of the temp column to be NaN. Thanks.

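【Comments】:

  • To answer your question "Is it possible to create a DataFrame in which each row has a different length?": no, that is not possible unless you pad the other columns with NaN. But that is usually not the right approach.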
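
The padding mentioned in the comment above can be sketched roughly as follows. The sample lists are hypothetical stand-ins for the scraped data, not the real forecast. Wrapping each list in a `pd.Series` lets pandas align the columns on the index and fill the gap with NaN, but the NaN lands in the last row, which is the "values shifted up" effect the question describes.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped lists.
periods = ["NOW", "Today", "Tonight"]
temps = ["High: 56 °F", "Low: 49 °F"]  # one value short

# Each Series gets a default 0..n-1 index; pandas aligns the columns
# on that index and fills the missing trailing position with NaN.
weather = pd.DataFrame({
    "period": pd.Series(periods),
    "temp": pd.Series(temps),
})
print(weather)
```

Because the alignment is purely positional, "High: 56 °F" ends up next to "NOW" rather than "Today", so this kind of padding only helps when the missing value really belongs at the end.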

Tags: python pandas dataframe


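【Solution 1】:

You can loop over each forecast_items value and use next with iter to pick the first matching value, assigning NaN in the dictionary when no value exists: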

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")

soup = BeautifulSoup(page.content, 'html.parser')

seven_day = soup.find(id="seven-day-forecast")

# Each tombstone-container holds the data for one forecast period
forecast_items = seven_day.find_all(class_="tombstone-container")

out = []
for x in forecast_items:
    # next(iter(lst), np.nan) returns the first element, or NaN if lst is empty
    periods = next(iter([t.get_text() for t in x.select('.period-name')]), np.nan)
    short_descs = next(iter([t.get_text() for t in x.select('.short-desc')]), np.nan)
    temps = next(iter([t.get_text() for t in x.select('.temp')]), np.nan)
    descs = next(iter([d['alt'] for d in x.select('img')]), np.nan)
    out.append({'period':periods, 'short_desc':short_descs, 'temp':temps, 'descs':descs})

# One dictionary per container, so every row has all four keys
weather = pd.DataFrame(out)
print(weather)
                                               descs               period  \
0                                                     NOW until4:00pm Sat   
1  Today: Showers, with thunderstorms also possib...                Today   
2  Tonight: Showers likely and possibly a thunder...              Tonight   
3  Sunday: A chance of showers before 11am, then ...               Sunday   
4  Sunday Night: Rain before 11pm, then a chance ...          SundayNight   
5  Monday: A 40 percent chance of showers.  Cloud...               Monday   
6  Monday Night: A 30 percent chance of showers. ...          MondayNight   
7  Tuesday: A 50 percent chance of rain.  Cloudy,...              Tuesday   
8  Tuesday Night: Rain.  Cloudy, with a low aroun...         TuesdayNight   

                               short_desc         temp  
0                           Wind Advisory          NaN  
1                       Showers andBreezy  High: 56 °F  
2                           ShowersLikely   Low: 49 °F  
3                     Heavy Rainand Windy  High: 56 °F  
4  Heavy Rainand Breezythen ChanceShowers   Low: 52 °F  
5                           ChanceShowers  High: 58 °F  
6                           ChanceShowers   Low: 53 °F  
7                             Chance Rain  High: 59 °F  
8                                    Rain   Low: 53 °F  
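
To see the `next(iter(...), default)` idiom in isolation, here is a minimal sketch that needs no network access. The helper name `first_or_nan` and the sample lists are hypothetical, chosen only for illustration. `math.nan` stands in for `np.nan`, which is the same float value.

```python
import math

def first_or_nan(values):
    """Return the first element of values, or NaN if values is empty."""
    # next() falls back to its second argument when the iterator is exhausted.
    return next(iter(values), math.nan)

# Hypothetical per-container results: the first container has no temperature.
temps_per_container = [[], ["High: 56 °F"], ["Low: 49 °F"]]

temps = [first_or_nan(found) for found in temps_per_container]
print(temps)
```

Since each container contributes exactly one value (real or NaN), the resulting list always has the same length as the other columns, and the NaN stays in the row where the value was actually missing.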

【Comments】:

  • I had never heard of the next and iter functions. That's great, thank you very much.