【问题标题】:Converting pyspark DataFrame with date column to Pandas results in AttributeError将带有日期列的 pyspark DataFrame 转换为 Pandas 会导致 AttributeError
【发布时间】:2019-11-29 13:18:58
【问题描述】:

我有以下数据框(pyspark)-

 |-- DATE: date (nullable = true)
 |-- ID: string (nullable = true)
 |-- A: double (nullable = true)
 |-- B: double (nullable = true)

尝试将数据框转换为pandas -

res2 = res.toPandas()

我收到以下错误 - AttributeError: Can only use .dt accessor with datetimelike values

详细错误 -

    AttributeError                            Traceback (most recent call last)
<ipython-input-29-471067d510fa> in <module>
----> 1 res2 = res.toPandas()

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/dataframe.py in toPandas(self)
   2123                         table = pyarrow.Table.from_batches(batches)
   2124                         pdf = table.to_pandas()
-> 2125                         pdf = _check_dataframe_convert_date(pdf, self.schema)
   2126                         return _check_dataframe_localize_timestamps(pdf, timezone)
   2127                     else:

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_dataframe_convert_date(pdf, schema)
   1705     """
   1706     for field in schema:
-> 1707         pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
   1708     return pdf
   1709 

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_series_convert_date(series, data_type)
   1690     """
   1691     if type(data_type) == DateType:
-> 1692         return series.dt.date
   1693     else:
   1694         return series

/opt/anaconda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5061         if (name in self._internal_names_set or name in self._metadata or
   5062                 name in self._accessors):
-> 5063             return object.__getattribute__(self, name)
   5064         else:
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):

/opt/anaconda/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    169             # we're accessing the attribute of the class, i.e., Dataset.geo
    170             return self._accessor
--> 171         accessor_obj = self._accessor(obj)
    172         # Replace the property with the accessor object. Inspired by:
    173         # http://www.pydanny.com/cached-property.html

/opt/anaconda/lib/python3.7/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
    322             pass  # we raise an attribute error anyway
    323 
--> 324         raise AttributeError("Can only use .dt accessor with datetimelike "
    325                              "values")

AttributeError: Can only use .dt accessor with datetimelike values

有什么办法解决吗?也许转换原始数据帧中的某些内容?

【问题讨论】:

  • 你能打印res.show()的输出吗?
  • @cs95 将输出添加到原始帖子中。谢谢!
  • 嗯,看不出究竟是什么问题。也许您可以尝试将日期列转换为时间戳,然后再试一次:from pyspark.sql.functions import to_timestamp; res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas()
  • @cs95 成功了!谢谢! (我可以接受这个作为答案)。

标签: pandas dataframe pyspark pyspark-sql


【解决方案1】:

作为一种解决方法,您可以考虑将date 列转换为timestamp(这更符合pandas 的datetime 类型)。

from pyspark.sql.functions import to_timestamp
res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas()

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2017-02-11
    • 2018-02-15
    • 2013-11-13
    相关资源
    最近更新 更多