Read the data in as a fixed-width table and drop the first column.
In [30]: df = pd.read_fwf(StringIO(data),widths=[3,20,27]).drop(['Unnamed: 0'],axis=1)
In [31]: df
Out[31]:
             Timestamp  What sweets do you like
0  23/11/2013 13:22:34                Chocolate
1  23/11/2013 13:22:39   Toffee, Popcorn, Fruit
2  23/11/2013 13:22:45            Fudge, Toffee
3  23/11/2013 13:22:48                  Popcorn
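The `data` string used above is not shown in this section. As a minimal sketch of the same read step, the block below builds a hypothetical two-row fixed-width sample (the `rows` and `sample` names and the row contents are assumptions for illustration) laid out to match `widths=[3,20,27]`, then reads it and drops the unnamed first column.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: a 3-character index column, a 20-character
# timestamp column, and a 27-character free-text column.
rows = [
    ("", "Timestamp", "What sweets do you like"),
    ("0", "23/11/2013 13:22:34", "Chocolate"),
    ("1", "23/11/2013 13:22:39", "Toffee, Popcorn, Fruit"),
]
sample = "\n".join(f"{a:<3}{b:<20}{c:<27}" for a, b, c in rows)

fwf = pd.read_fwf(StringIO(sample), widths=[3, 20, 27])

# The blank header cell of the first column is named "Unnamed: 0" by pandas.
fwf = fwf.drop(columns=["Unnamed: 0"])
print(fwf)
```

Because the fields are sliced by character position, the commas inside the last column are harmless here, which is why a fixed-width read suits this data better than a comma-separated one.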
Convert the timestamps to a proper datetime64 dtype. This is not needed for this exercise,
but it is almost always what you want.
In [32]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])
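The timestamps here are day-first (23/11/2013). When a day value is 12 or lower, the day and month can be confused, so it may be safer to spell out the format. A small sketch, assuming the `%d/%m/%Y %H:%M:%S` layout seen in the output above; the second value in `raw` is made up to show the ambiguous case.

```python
import pandas as pd

# Hypothetical values; the second one is ambiguous without a format.
raw = pd.Series(["23/11/2013 13:22:34", "01/12/2013 09:00:00"])

# Explicit day-first format: 01/12/2013 is read as 1 December, not 12 January.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
print(parsed)
```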
Assign new column names.
In [33]: df.columns = ['date','sweets']
In [34]: df
Out[34]:
                  date                  sweets
0  2013-11-23 13:22:34               Chocolate
1  2013-11-23 13:22:39  Toffee, Popcorn, Fruit
2  2013-11-23 13:22:45           Fudge, Toffee
3  2013-11-23 13:22:48                 Popcorn
In [35]: df.dtypes
Out[35]:
date      datetime64[ns]
sweets            object
dtype: object
Split the sweets column from a single string into a list.
In [37]: df['sweets'].str.split(r',\s*')
Out[37]:
0                 [Chocolate]
1    [Toffee, Popcorn, Fruit]
2             [Fudge, Toffee]
3                   [Popcorn]
Name: sweets, dtype: object
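The pattern `,\s*` matches a comma followed by any amount of whitespace, so uneven spacing after the commas does not leak into the list items. A short sketch on made-up strings (the Series `s` is hypothetical); the `regex=True` argument assumes pandas 1.4 or later and only makes explicit what a multi-character pattern already implies.

```python
import pandas as pd

# Hypothetical input with inconsistent spacing after the commas.
s = pd.Series(["Chocolate", "Toffee,Popcorn,  Fruit", "Fudge, Toffee"])

# A raw string keeps \s from being read as a Python escape sequence.
parts = s.str.split(r",\s*", regex=True)
print(parts.tolist())
```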
This is the key step: it creates a dummy (indicator) matrix marking where each value is present.
In [38]: df['sweets'].str.split(r',\s*').apply(lambda x: pd.Series(1,index=x))
Out[38]:
   Chocolate  Fruit  Fudge  Popcorn  Toffee
0          1    NaN    NaN      NaN     NaN
1        NaN      1    NaN        1       1
2        NaN    NaN      1      NaN       1
3        NaN    NaN    NaN        1     NaN
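To see why this step yields a matrix, note that the lambda turns each list into a small Series whose labels are the sweets and whose values are all 1. When `apply` returns a Series per row, pandas assembles the results into a DataFrame over the union of labels and leaves NaN wherever a label is absent. A sketch on a hypothetical Series of lists (`lists` is an assumed name); the final sort is only there to make the column order deterministic.

```python
import pandas as pd

# Hypothetical lists, standing in for the output of the split step.
lists = pd.Series([["Chocolate"], ["Toffee", "Popcorn"], ["Fudge", "Toffee"]])

# One row per list: labels are the sweets, values are 1, gaps become NaN.
dummies = lists.apply(lambda x: pd.Series(1, index=x))

# Sort the columns alphabetically so the layout does not depend on row order.
dummies = dummies.sort_index(axis=1)
print(dummies)
```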
Fill the NaNs with 0, then use astype(bool) so that the final result holds True/False values.
Then concatenate it onto the original frame.
In [40]: pd.concat([df,df['sweets'].str.split(r',\s*').apply(lambda x: pd.Series(1,index=x)).fillna(0).astype(bool)],axis=1)
Out[40]:
                  date                  sweets  Chocolate  Fruit  Fudge  Popcorn  Toffee
0  2013-11-23 13:22:34               Chocolate       True  False  False    False   False
1  2013-11-23 13:22:39  Toffee, Popcorn, Fruit      False   True  False     True    True
2  2013-11-23 13:22:45           Fudge, Toffee      False  False   True    False    True
3  2013-11-23 13:22:48                 Popcorn      False  False  False     True   False
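As an alternative to the split-and-apply approach used above (not the method shown in the session), `Series.str.get_dummies` can build the indicator columns directly. The sketch below rebuilds only the `sweets` column from the output above and first normalizes the separator to `|`, which assumes that no sweet name itself contains a `|` character.

```python
import pandas as pd

# Only the sweets column, copied from the output above.
df = pd.DataFrame(
    {"sweets": ["Chocolate", "Toffee, Popcorn, Fruit", "Fudge, Toffee", "Popcorn"]}
)

# Collapse "," plus optional whitespace into a single-character separator,
# then let get_dummies create one 0/1 column per distinct sweet.
flags = (
    df["sweets"]
    .str.replace(r",\s*", "|", regex=True)
    .str.get_dummies(sep="|")
    .astype(bool)
)

result = pd.concat([df, flags], axis=1)
print(result)
```

This avoids creating one intermediate Series per row, which may matter on larger frames; the row-wise `apply` version remains easier to adapt when each list needs custom handling.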