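[Posted]: 2018-03-15 17:41:02
[Question]:
Scenario
The dataset I want to import contains quite a lot of NaN values. I am using the SurPRISE package for Python (written by Nicolas Hug) rather than Pandas, because the method for predicting the NaN values is available in that package.
Problem
The dataset post_df1.csv looks like this: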
uid iid rat
1 303.0 785.0 3.000000
2 291.0 1042.0 4.000000
3 234.0 1184.0 2.000000
4 102.0 768.0 2.000000
5 181.0 1081.0 1.000000
...
194 944.0 110.0 NaN
195 944.0 111.0 NaN
196 944.0 112.0 NaN
197 944.0 113.0 NaN
198 944.0 114.0 5.000000
199 944.0 115.0 5.000000
Importing with SurPRISE
from surprise import Dataset, Reader
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1, 5))
df = Dataset.load_from_file('post_df1.csv', reader=reader)
This returns the error:
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 173, in load_from_file
return DatasetAutoFolds(ratings_file=file_path, reader=reader)
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 306, in __init__
self.raw_ratings = self.read_ratings(self.ratings_file)
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 205, in read_ratings
itertools.islice(f, self.reader.skip_lines, None)]
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 455, in parse_line
return uid, iid, float(r) + self.offset, timestamp
ValueError: could not convert string to float:
I cannot figure out where the string is! When I read post_df1.csv with Pandas, I get the following:
post_df1.dtypes
uid float64
iid float64
rat float64
dtype: object
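A note on where the string may come from: every field read from a text file is a string until something converts it, and the traceback message shows nothing after the colon, which suggests the value passed to float() was empty. Pandas writes NaN as an empty field by default (`na_rep=''` in `to_csv`), so the rows with NaN ratings are plausible culprits. The sketch below is a minimal stdlib check, not SurPRISE's own parser; the helper name `find_unparseable_ratings` and the sample rows are hypothetical, and it assumes three tab-separated columns.

```python
def find_unparseable_ratings(lines, sep="\t"):
    """Return (line_number, raw_rating) pairs whose third field fails float()."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        # A short row is treated like a row with an empty rating.
        raw_rating = fields[2] if len(fields) > 2 else ""
        try:
            float(raw_rating)
        except ValueError:
            bad.append((line_number, raw_rating))
    return bad


# Hypothetical sample mimicking a tab-separated export without header or index.
sample = [
    "303.0\t785.0\t3.0\n",
    "944.0\t110.0\t\n",  # a NaN rating written as an empty field
    "944.0\t114.0\t5.0\n",
]
print(find_unparseable_ratings(sample))
```

To check the real file, the same helper could be called with `open('post_df1.csv')` in place of `sample`.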
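Questions
- Could the package be treating the entire dataset as strings when it reads the file?
- I noticed in the error that the return statement in dataset.py adds an offset to the float and also returns a timestamp. How can I restrict it to uid, iid, rat / float?
return uid, iid, float(r) + self.offset, timestamp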
Edit #1
So, this is how post_df1 and post_df2 are formed. For post_df1, I also tried taking the values from row 1 onward, in case row 0 is the header.
# PRE PROCESSED CLUSTER 0 -- Named to POST DataFrame1
# Compare integers with ==; `is` tests object identity, not value.
if flag1 == 1:
    print pre_df01
    post_df1 = pre_df01.iloc[1:, :]
elif flag1 == 2:
    print pre_df02
    post_df1 = pre_df02.iloc[1:, :]
elif flag1 == 3:
    print pre_df03
    post_df1 = pre_df03.iloc[1:, :]
# PRE PROCESSED CLUSTER 1 -- Named to POST DataFrame2
if flag2 == 1:
    print pre_df11
    post_df2 = pre_df11
elif flag2 == 2:
    print pre_df12
    post_df2 = pre_df12
elif flag2 == 3:
    print pre_df13
    post_df2 = pre_df13
Here, I have already tried dropping the header and the index so that the file does not contain any string values.
# EXPORT TO CSV & LOAD AGAIN IN PROGRAM
post_df1.to_csv("post_df1.csv", sep='\t', index=False, header=False)
post_df2.to_csv("post_df2.csv", sep='\t', index=False, header=False)
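Dropping the header and index does not remove the NaN ratings themselves, which `to_csv` writes as empty fields by default. One possible workaround, sketched below with the standard library only, is to copy the exported file while skipping rows whose rating is missing. This assumes the unrated rows are the ones to be predicted later and so are not needed in the file being loaded; the helper name `write_rated_rows`, the output file name, and the sample rows are hypothetical.

```python
import math
import os
import tempfile


def write_rated_rows(src_path, dst_path, sep="\t"):
    """Copy rows whose third field is a finite number; return how many were kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(sep)
            try:
                rating = float(fields[2])
            except (IndexError, ValueError):
                continue  # empty or non-numeric rating: skip the row
            if math.isnan(rating) or math.isinf(rating):
                continue  # a literal "nan" or "inf" parses but is not a usable rating
            dst.write(line)
            kept += 1
    return kept


# Demonstration on a temporary file with hypothetical rows.
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "post_df1.csv")
clean_path = os.path.join(tmp_dir, "post_df1_rated.csv")
with open(raw_path, "w") as f:
    f.write("303.0\t785.0\t3.0\n944.0\t110.0\t\n944.0\t114.0\t5.0\n")

kept_rows = write_rated_rows(raw_path, clean_path)
print(kept_rows)
```

The cleaned file keeps the same three-column, tab-separated layout, so it could then be passed to the loader with the same `line_format` and `sep` settings as before.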
[Comments]:
Tags: python python-2.7 pandas csv dataframe