Read text file into numpy array: answers

【Question title】: Read textfile into numpy array
【Posted】: 2017-12-23 12:51:49
【Question】:

I am trying to load a textfile into a numpy array.

The structure is as follows:

THE 77534223
AND 30997177
ING 30679488
ENT 17902107
ION 17769261
HER 15277018
FOR 14686159
THA 14222073
NTH 14115952
[...]

But my attempt fails:

import numpy as np

data = np.genfromtxt("english_trigrams.txt", dtype=(str,int), delimiter=' ')
print(data)

[['TH' '77']
 ['AN' '30']
 ['IN' '30']
 ..., 
 ['JX' '1']
 ['JQ' '1']
 ['JQ' '1']]

I would like an (x,2) array, with the first column of dtype str and the second column of dtype int.

Thanks a lot!

P.S.:

Python 3.6.1
NumPy 1.13.0

【Question discussion】:

Maybe try np.loadtxt
Possible duplicate of How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?
np.loadtxt("english_trigrams.txt", dtype=[('f0', '|S3'),('f1', '<i8')])
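Just out of curiosity, do you intend to change the 77534223 in the file to 77?
"I would like an (x,2) array, with the first column of dtype str and the second column of dtype int." numpy cannot do that. What you can get is a 1-D structured array, using the approach suggested by @MaximilianPeters.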

Tags: python numpy genfromtxt

【Solution 1】:

Various ways of loading this text:

In [470]: txt=b"""THE 77534223
     ...: AND 30997177
     ...: ING 30679488
     ...: ENT 17902107
     ...: ION 17769261
     ...: HER 15277018
     ...: FOR 14686159
     ...: THA 14222073
     ...: NTH 14115952"""

Let genfromtxt deduce the correct column dtypes (dtype=None):

In [471]: data = np.genfromtxt(txt.splitlines(),dtype=None)
In [472]: data
Out[472]: 
array([(b'THE', 77534223), (b'AND', 30997177), (b'ING', 30679488),
       (b'ENT', 17902107), (b'ION', 17769261), (b'HER', 15277018),
       (b'FOR', 14686159), (b'THA', 14222073), (b'NTH', 14115952)],
      dtype=[('f0', 'S3'), ('f1', '<i4')])
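
The result above is a 1-D structured array, so each column is reached by field name rather than by a second index. A minimal sketch of that access pattern, assuming a NumPy version that supports the `encoding` argument (1.14 or later); passing `encoding="utf-8"` is my addition so that the text field comes back as str instead of bytes:

```python
import numpy as np

# Three sample lines in the same "TRIGRAM COUNT" format as the question.
lines = ["THE 77534223", "AND 30997177", "ING 30679488"]

# dtype=None lets genfromtxt infer one dtype per column.
data = np.genfromtxt(lines, dtype=None, encoding="utf-8")

# Default field names are f0, f1, ...; each field is an ordinary 1-D array.
words = data["f0"]
counts = data["f1"]

print(data.shape, data.dtype.names)
print(words[0], counts[0])
```

So the shape is `(n,)` with two fields, not `(n, 2)`.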

dtype=(str, int) is not the right dtype specification; it behaves like yours, but keeps only 1 character per element:

In [473]: data = np.genfromtxt(txt.splitlines(),dtype=(str, int))
In [474]: data
Out[474]: 
array([['T', '7'],
       ['A', '3'],
       ['I', '3'],
       ['E', '1'],
       ['I', '1'],
       ['H', '1'],
       ['F', '1'],
       ['T', '1'],
       ['N', '1']],
      dtype='<U1')
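
A 2-D result like the one above always has a single dtype shared by every element, which is why an (x,2) array with one str column and one int column is not available. If an all-string (x,2) table is acceptable, the sketch below is one way to get it without truncation. It uses `np.loadtxt` (mentioned in the comments) with `dtype=str`, and the variable names are only illustrative:

```python
import io
import numpy as np

# Sample text in the question's format; io.StringIO stands in for the file.
text = "THE 77534223\nAND 30997177\nING 30679488\n"

# With a plain (non-structured) dtype, every cell is stored as a string.
table = np.loadtxt(io.StringIO(text), dtype=str)

# The counts are still text here and have to be converted separately.
counts = table[:, 1].astype(np.int64)

print(table.shape)
print(table[0, 0], counts[0])
```

The numeric column then lives in its own array, so the structured-array route is usually the more convenient one.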

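The string form 'str,int' is a little better: the result is a structured array with correct ints, but the string field has zero width ('<U'), so the text is lost: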

In [475]: data = np.genfromtxt(txt.splitlines(),dtype='str,int')
In [476]: data
Out[476]: 
array([('', 77534223), ('', 30997177), ('', 30679488), ('', 17902107),
       ('', 17769261), ('', 15277018), ('', 14686159), ('', 14222073),
       ('', 14115952)],
      dtype=[('f0', '<U'), ('f1', '<i4')])

Giving the string field an explicit width ('U10') works, with a result similar to the dtype=None case:

In [477]: data = np.genfromtxt(txt.splitlines(),dtype='U10,int')
In [478]: data
Out[478]: 
array([('THE', 77534223), ('AND', 30997177), ('ING', 30679488),
       ('ENT', 17902107), ('ION', 17769261), ('HER', 15277018),
       ('FOR', 14686159), ('THA', 14222073), ('NTH', 14115952)],
      dtype=[('f0', '<U10'), ('f1', '<i4')])
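
Putting this together for a file on disk, here is a hedged sketch that names the fields explicitly instead of relying on f0/f1. The field names `trigram` and `count`, the 3-character width and the temporary file are assumptions made for illustration; the real `english_trigrams.txt` would be passed by path in the same way:

```python
import os
import tempfile
import numpy as np

# Write a small sample in the question's format to a temporary file.
sample = "THE 77534223\nAND 30997177\nING 30679488\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(sample)
    path = fh.name

try:
    # Hypothetical field names: a 3-character string and a 64-bit integer.
    trigram_dtype = [("trigram", "U3"), ("count", "i8")]
    data = np.genfromtxt(path, dtype=trigram_dtype, encoding="utf-8")
finally:
    os.remove(path)

# Fields behave like ordinary 1-D arrays, so they can be zipped or indexed.
counts_by_trigram = dict(zip(data["trigram"].tolist(), data["count"].tolist()))
top = data[np.argmax(data["count"])]
print(top["trigram"], top["count"])
```

Using `i8` rather than the platform default integer avoids depending on whether the default is 32 or 64 bits.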

【Discussion】: