How can I load data from txt using pandas? Answers

【Question title】: How can I load data from txt using pandas?
【Posted】: 2018-09-01 05:17:48
【Question】:

I have read the question "Load data from txt with pandas". However, my data format is a bit different. Here is a sample of the data:

product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"
review/text: Synopsis: On the daily trek from Juarez, Mexico to ... 

product/productId: B003AI2VGA
review/userId: A328S9RN3U5M68
review/profileName: Grady Harp
review/helpfulness: 4/4
review/score: 3.0
review/time: 1181952000
review/summary: Worthwhile and Important Story Hampered by Poor Script and Production
review/text: THE VIRGIN OF JUAREZ is based on true events...

.
.

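I plan to do sentiment analysis, so I only want the text and score lines from each record. Does anyone know how to do this with pandas? Or do I need to read the file and parse each line myself to extract the review and the score?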

【Question comments】:

Tags: python pandas dataframe loaddata

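【Solution 1】:

As far as I know, pandas cannot read that file format directly.

I suggest writing a Python program that reads your file and outputs a CSV file. Let's call it sentiment.csv, with contents like this: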

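productId,userId,profileName,helpfulness,score,time,summary,text
B003AI2VGA,A141HP4LYPWMSR,Brian E. Erland "Rainbow Sphinx",7/7,3.0,1182729600,"There Is So Much Darkness Now ~ Come For The Miracle",Synopsis: On the daily trek from Juarez, Mexico to ...
B003AI2VGA,A328S9RN3U5M68,Grady Harp,4/4,3.0,1181952000,Worthwhile and Important Story Hampered by Poor Script and Production,THE VIRGIN OF JUAREZ is based on true events...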

Then simply use: df = pd.read_csv('sentiment.csv')
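
A converter along these lines could be sketched as follows. This is only a sketch under stated assumptions: the function name `reviews_to_csv`, the column list `FIELDS`, and the file names in the usage comment are hypothetical, and it assumes that records are separated by blank lines and that every field fits on a single line. It streams the input line by line, so the whole file never has to fit in memory, and it leaves the quoting of commas and quotation marks inside fields to the standard `csv` module.

```python
import csv

# Column names derived from the part of each key after the slash (an assumption).
FIELDS = ["productId", "userId", "profileName", "helpfulness",
          "score", "time", "summary", "text"]


def reviews_to_csv(src, dst):
    """Stream 'key: value' records from src and write one CSV row per record.

    src and dst are open text file objects. Records are assumed to be
    separated by blank lines.
    """
    writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    record = {}
    for line in src:
        line = line.strip()
        if not line:
            # Blank line: the current record is complete.
            if record:
                writer.writerow(record)
                record = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line, skip it
        # 'review/score' -> 'score', 'product/productId' -> 'productId'
        record[key.split("/")[-1]] = value.strip()
    if record:
        # The last record may not be followed by a blank line.
        writer.writerow(record)


# Hypothetical usage with placeholder file names:
# with open("reviews.txt", encoding="utf-8", errors="replace") as src, \
#         open("sentiment.csv", "w", newline="", encoding="utf-8") as dst:
#     reviews_to_csv(src, dst)
```

Fields that are missing from a record are written as empty strings, and keys that are not in `FIELDS` are ignored.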

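【Comments】:

The file is very large, almost 10 GB. Would converting it to CSV and then reading it be slow?
The converted file will be much smaller than 10 GB. The problem with the original file is that it repeats the metadata (the field names) on every line. In the converted file, the metadata is the first line and the rest is data. As for using Python to convert it to CSV: if performance becomes a problem, you can split the original file into smaller files and process those, then merge the resulting files at the end.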
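
Once the data is in CSV form, pandas does not have to load all of it at once. As a sketch of one option (not something proposed in the comments above), `read_csv` accepts `usecols` to keep only the needed columns and `chunksize` to iterate over the file in pieces. The column names and the tiny in-memory sample below are assumptions for illustration; in practice the first argument would be the path to the converted file.

```python
import pandas as pd
from io import StringIO

# Small in-memory stand-in for the converted CSV file (illustrative data).
sample_csv = StringIO(
    "productId,userId,score,text\n"
    'B003AI2VGA,A141HP4LYPWMSR,3.0,"Synopsis: On the daily trek from Juarez, Mexico to ..."\n'
    "B003AI2VGA,A328S9RN3U5M68,3.0,THE VIRGIN OF JUAREZ is based on true events...\n"
)

# usecols keeps only the two columns needed for sentiment analysis;
# chunksize makes read_csv return an iterator of DataFrames.
chunks = pd.read_csv(sample_csv, usecols=["score", "text"], chunksize=1)

n_rows = 0
for chunk in chunks:
    # Process each chunk here (for example, pass it to a model).
    n_rows += len(chunk)
```

A chunk size of 1 is used only to make the iteration visible on two rows; a real run would use a much larger value.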

【Solution 2】:

Here is one way:

import pandas as pd
from io import StringIO

mystr = StringIO("""product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"
review/text: Synopsis: On the daily trek from Juarez, Mexico to ... 

product/productId: B003AI2VGA
review/userId: A328S9RN3U5M68
review/profileName: Grady Harp
review/helpfulness: 4/4
review/score: 3.0
review/time: 1181952000
review/summary: Worthwhile and Important Story Hampered by Poor Script and Production
review/text: THE VIRGIN OF JUAREZ is based on true events...""")

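# replace mystr with 'file.txt'
# on_bad_lines='skip' (pandas 1.3+) replaces error_bad_lines=False,
# which was removed in pandas 2.0; malformed lines are silently dropped
df = pd.read_csv(mystr, header=None, sep='|', on_bad_lines='skip')

# split each line into key and value at the first colon
df = pd.DataFrame(df[0].str.split(':', n=1).values.tolist())
# keep only the score and text rows
df = df.loc[df[0].isin({'review/text', 'review/score'})]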

Result:

               0                                                  1
4   review/score                                                3.0
7    review/text   Synopsis: On the daily trek from Juarez, Mexi...
12  review/score                                                3.0
15   review/text    THE VIRGIN OF JUAREZ is based on true events...

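【Comments】:

Thanks for your answer. I tried it, but it gave me this error: File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 59372, saw 14
You can try the error_bad_lines=False parameter (on_bad_lines='skip' in pandas 1.3 and later), as in the updated answer above, but this is at your own risk: malformed lines will be skipped.
Thanks. One more question: I want to match each review text with its score, but right now they are on different rows. Do you have any idea how to combine them into one row, so that each row represents one (review text, score) pair?
@Coding_Rabbit, that is definitely possible. I suggest you search SO first, or ask it as a separate question.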
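
On the follow-up question of pairing each review text with its score, one possible approach (sketched here, not taken from the thread) is to number the records and then pivot the key/value rows into columns. The sketch assumes that every record starts with a `product/productId` line and that each key appears at most once per record; the names `raw`, `kv`, `wanted`, and `pairs` are illustrative.

```python
import pandas as pd
from io import StringIO

# Shortened sample in the same format as the question (illustrative data).
raw = StringIO("""product/productId: B003AI2VGA
review/score: 3.0
review/text: Synopsis: On the daily trek from Juarez, Mexico to ...

product/productId: B003AI2VGA
review/score: 3.0
review/text: THE VIRGIN OF JUAREZ is based on true events...""")

# One Series element per non-blank line.
lines = pd.Series(raw.read().splitlines())
lines = lines[lines.str.strip() != ""]

# Split each line into key and value at the first colon.
kv = lines.str.split(":", n=1, expand=True)
kv.columns = ["key", "value"]
kv["value"] = kv["value"].str.strip()

# A new record starts at every 'product/productId' line.
kv["record"] = (kv["key"] == "product/productId").cumsum()

# Keep the two fields of interest and turn them into columns,
# giving one row per record.
wanted = kv[kv["key"].isin(["review/score", "review/text"])]
pairs = wanted.pivot(index="record", columns="key", values="value")
pairs["review/score"] = pairs["review/score"].astype(float)
```

Here `pairs` has one row per review, with `review/score` and `review/text` as columns. `pivot` raises an error if a key is repeated within a record, which makes violations of the assumption visible instead of silently merging rows.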

【Solution 3】:

I think @sanrio's answer is probably the most straightforward, but here is an option that does the string manipulation in pandas:

import numpy as np
import pandas as pd

# read the whole file into memory, one list item per line
with open('your_text_file.txt') as f:
    text_lines = f.readlines()

# create pandas Series object where each value is a text line from your file
s = pd.Series(text_lines)

# remove the new-lines
s = s.str.strip()

# extract some columns using regex and represent in a dataframe
# (raw string so that \s is passed to the regex engine unchanged)
df = s.str.split(r'\s?(.*)/([^:]*):(.*)', expand=True)

# remove irrelevant columns
df = df.replace('', np.nan).dropna(how='all', axis=1)

def gb_organize(df_):
    """
    Organize a sub-dataframe from group-by operation.
    """
    df_ = df_.dropna()
    return pd.DataFrame(df_[3].values, index=df_[2].values).T

# pass a Series object to .groupby to iterate over consecutive non-null rows
df_result = df.groupby(df.isna().all(axis=1).cumsum(), group_keys=False).apply(gb_organize)

df_result = df_result.set_index(['productId', 'userId'])

# then you can access the records you want with the following:
df_result[['score', 'text']]

【Comments】: