【Question Title】:Pandas unable to parse csv with multiple lines within a cell
【Posted】:2017-10-04 12:06:03
【Question】:

I have a csv file, Decoded.csv:

Query,Doc,article_id,data_source
5000,how to get rid of serve burn acne,1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
2 Leave the paste on your skin overnight then wash it with cold water the next morning. 
3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible.,459,random
5001,what is hypospadia,A birth defect of the male urethra.,409,dummy
5002,difference between alimentary canal and accessory organs,The alimentary canal is the tube going from the mouth to the anus. The accessory organs are the organs located along that canal which produce enzymes to aid the digestion process.,461,nytimes

It contains 3 queries: 5000, 5001 and 5002. The Doc value for query 5000 spans multiple lines (the numbered steps 1, 2 and 3), which confuses pandas.

My Python code is as follows:

def main():
    import pandas as pd
    dataframe = pd.read_csv("Decoded.csv")
    queries, docs = dataframe['Query'], dataframe['Doc']
    for idx in range(len(queries)):
        print("idx: ", idx, " ", queries[idx], " <-> ", docs[idx])
        query_doc_appended = (queries[idx] + " " + docs[idx])
    print(query_doc_appended)

if __name__ == '__main__':
    main()

It fails. Please point out how to get rid of the newlines so that Query 5000 gets its complete set of Doc statements.

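【Question Comments】:

  • Any error message? What does your data file look like? It's unclear.
  • The data file Decoded.csv is given in the question itself (Query,Doc,article_id,data_source...), and the error is: Traceback (most recent call last): line 53, in main() line 49, in main query_doc_appended = (queries[idx] + " " + docs[idx]) TypeError: unsupported operand type(s) for +: 'float' and 'str' idx: 0 how to get rid of serve burn acne 1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars. idx: 1 nan nan
  • What error message do you get when you run this program?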

Tags: python python-2.7 python-3.x pandas csv


【Solution 1】:

Your Query 5001 row has too many fields: it contains 5 fields, while the header declares only 4 columns.

5001,what is hypospadia,A birth defect of the male urethra.,409,dummy

You can fix this by wrapping the Doc content in double quotes in Decoded.csv.
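
If the file is generated from Python, the quoting does not have to be added by hand. The following is a minimal sketch, assuming the rows are available as Python lists: the standard library's `csv.writer`, in its default `csv.QUOTE_MINIMAL` mode, wraps any field that contains a comma, a double quote or a line break in double quotes. The Doc text and variable names here are shortened, illustrative stand-ins, not the exact contents of Decoded.csv.

```python
import csv
import io

# Illustrative Doc value containing both a comma and line breaks.
doc = (
    "1 Rose water and sandalwood: Make a paste, then apply it.\n"
    "2 Leave the paste on overnight.\n"
    "3 Do this regularly."
)
rows = [
    ["Query", "Doc", "article_id", "data_source"],
    [5000, doc, 459, "random"],
]

buffer = io.StringIO()
# Default quoting (csv.QUOTE_MINIMAL) quotes only the fields that need it.
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back yields four fields per row, line breaks intact.
parsed = list(csv.reader(io.StringIO(text, newline="")))
```

To write to a real file instead of an in-memory buffer, open it with `newline=""` as the `csv` module documentation recommends.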

【Comments】:

    【Solution 2】:

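    There are 2 problems:

    • To allow multi-line fields, the field data must be enclosed in double quotes.
    • Your field data also contains commas.

    So the csv should look like this: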

    Query,Doc,article_id,data_source
    5000,"how to get rid of serve burn acne,1 Rose water and sandalwood: Make a paste of rose water and sandalwood and gently apply it on your acne scars.
    2 Leave the paste on your skin overnight then wash it with cold water the next morning. 
    3 Do this regularly together with other natural treatments for acne scars to get rid of the scars as quickly as possible.",459,random
    5001,"what is hypospadia,A birth defect of the male urethra.",409,dummy
    5002,"difference between alimentary canal and accessory organs,The alimentary canal is the tube going from the mouth to the anus. The accessory organs are the organs located along that canal which produce enzymes to aid the digestion process.",461,nytimes
    
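
A minimal sketch of reading a file quoted this way, assuming pandas is installed. The CSV text is inlined through `io.StringIO` with shortened Doc values so the example is self-contained; with the real file you would pass "Decoded.csv" to `pd.read_csv` instead. Since `Query` is parsed as an integer column, it is converted with `str()` before concatenation, and the line breaks inside `Doc` are collapsed with `split()`/`join()`, which is one possible way to get a single-line statement.

```python
import io

import pandas as pd

# Shortened version of the quoted CSV shown above (illustrative data).
csv_text = '''Query,Doc,article_id,data_source
5000,"how to get rid of serve burn acne,1 Rose water and sandalwood: Make a paste.
2 Leave the paste on your skin overnight.
3 Do this regularly.",459,random
5001,"what is hypospadia,A birth defect of the male urethra.",409,dummy
'''

# Quoted fields may contain commas and line breaks; pandas keeps them in one cell.
dataframe = pd.read_csv(io.StringIO(csv_text))

for query, doc in zip(dataframe["Query"], dataframe["Doc"]):
    # Query is an integer, so convert it before concatenating with a string.
    # " ".join(doc.split()) replaces every run of whitespace, including
    # newlines, with a single space.
    query_doc_appended = str(query) + " " + " ".join(doc.split())
    print(query_doc_appended)
```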

    If any of these fields contains a double quote, it must be escaped by doubling it (writing it as two double quotes).
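
The doubling rule can be seen with the standard library's `csv` module, which applies it by default when writing. This is a small sketch; the row (query 5003 and its Doc text) is hypothetical and not part of the original data.

```python
import csv
import io

# Hypothetical Doc text containing double quotes and a comma.
doc = 'Apply the "paste" gently, then rinse.'

buffer = io.StringIO()
csv.writer(buffer).writerow([5003, doc, 470, "example"])
line = buffer.getvalue()
# The field is wrapped in quotes and each inner quote is doubled:
# 5003,"Apply the ""paste"" gently, then rinse.",470,example
print(line)

# Parsing the line restores the original text with single double quotes.
parsed = next(csv.reader(io.StringIO(line)))
```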

    【Comments】:
