Python matplotlib histogram very slow: answers

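【Question title】: Python matplotlib histogram very slow
【Posted】: 2019-02-13 19:30:35
【Question description】:

I am trying to plot a histogram of data from a .csv file, but when I run it, it is extremely slow. I waited about 20 minutes and still did not get the plot. What could the problem be?

The following lines are my code.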

import pandas as pd
import matplotlib.pyplot as plt

spy = pd.read_csv( 'SPY.csv' )
stock_price_spy = spy.values[ :, 5 ]

n, bins, patches = plt.hist( stock_price_spy, 50 )
plt.show()

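【Question discussion】:

How big is the CSV?
Not very big. stock_price_gs has length 4871.
stock_price_gs is not defined in your code. Did you mean plt.hist( stock_price_spy, 50 )?
Sorry, that was a typo.
In that case the code should be correct and should produce the figure within a fraction of a second. However, you never actually ask for it to be shown, do you? plt.show(), or are you using a notebook?

Tags: python pandas matplotlib histogram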

【Solution 1】:

I did the following, and it seems to solve the problem.

It seems that stock_price_spy = spy[ 'Adj Close' ].values gives a real numpy ndarray.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

spy = pd.read_csv( 'SPY.csv' )
stock_price_spy = spy[ 'Adj Close' ].values

plt.hist( stock_price_spy, bins = 100, label = 'S&P 500 ETF', alpha = 0.8 )
plt.show()
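
The difference between the two ways of selecting the column can be inspected through the array dtype. The sketch below uses a small made-up DataFrame (the column layout Date, Open, High, Low, Close, Adj Close, Volume and all values are assumptions, not the asker's data). When a DataFrame mixes strings and numbers, spy.values falls back to a single object-dtype array, so a slice of it is an array of Python objects, whereas selecting the column by name keeps its float64 dtype. This is a plausible explanation for the slowdown, not a confirmed diagnosis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for SPY.csv: one string column plus numeric columns.
spy = pd.DataFrame({
    "Date": ["2019-02-11", "2019-02-12", "2019-02-13"],
    "Open": [270.0, 271.0, 274.0],
    "High": [271.0, 274.0, 275.0],
    "Low": [269.0, 270.5, 273.0],
    "Close": [270.5, 273.9, 274.8],
    "Adj Close": [268.1, 271.5, 272.3],
    "Volume": [1000, 1200, 1100],
})

# Slicing the whole-frame array: mixed column types force dtype=object.
mixed = spy.values[:, 5]
print(mixed.dtype)  # object

# Selecting the column by name keeps the numeric dtype.
typed = spy["Adj Close"].values
print(typed.dtype)  # float64

# If positional slicing is needed, an explicit conversion restores floats.
converted = spy.values[:, 5].astype(float)
print(converted.dtype)  # float64
```

Checking the dtype before plotting is a cheap way to see whether matplotlib is being handed numbers or generic Python objects.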

【Discussion】:

【Solution 2】:

In fact, you are using a rather flawed way to achieve your goal; you need to use numpy to improve performance.

import numpy as np
import matplotlib.pyplot as plt

stock_price_spy = np.loadtxt('SPY.csv', dtype=float, delimiter=',', skiprows=1, usecols=4)

# Only the 5th column of the CSV is loaded here (usecols=4 is zero-based), which keeps memory use down.

n, bins, patches = plt.hist( stock_price_spy, 50 )
plt.show()

I have not tested it, but it should work.
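
One detail worth checking in the np.loadtxt call above is that usecols is zero-based, so usecols=4 selects the 5th column, while the question's spy.values[ :, 5 ] selects the 6th. The sketch below illustrates this on a tiny in-memory CSV; the header layout and the numbers are assumptions made for illustration, not the contents of the real SPY.csv.

```python
import io

import numpy as np

# Hypothetical two-row CSV; the column order is an assumption.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2019-02-11,270.0,271.0,269.0,270.5,268.1,1000
2019-02-12,271.0,274.0,270.5,273.9,271.5,1200
"""

# usecols=4 -> 5th column ("Close" in this layout).
close = np.loadtxt(io.StringIO(csv_text), dtype=float,
                   delimiter=",", skiprows=1, usecols=4)

# usecols=5 -> 6th column ("Adj Close"), matching spy.values[:, 5].
adj_close = np.loadtxt(io.StringIO(csv_text), dtype=float,
                       delimiter=",", skiprows=1, usecols=5)

print(close)
print(adj_close)
```

Because only the requested column is parsed, the string Date column does not interfere with dtype=float.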

I suggest you use Intel's optimized version of Python (the Intel Python distribution), which is better at handling this kind of workload.

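Adding test code, because some people are trying to mislead without offering real arguments: pandas uses DataFrames, which behave like dictionaries, not numpy arrays. A numpy array is almost twice as fast.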

import numpy as np
import pandas as pd
import random
import csv
import matplotlib.pyplot as plt
import time

# Create a random 6-column, 4871-row CSV file to simulate the problem.
rows = 4871
columns = 6
fields = ['one', 'two', 'three', 'four', 'five', 'six']

# The with-block closes the file so all rows are flushed before reading.
with open("random.csv", "w", newline="") as f:
    write_a_csv = csv.DictWriter(f, fieldnames=fields)
    write_a_csv.writeheader()  # header row, skipped later with skiprows=1
    for i in range(rows):
        write_a_csv.writerow({name: random.random() for name in fields})

# pandas version. time.clock() was removed in Python 3.8,
# so time.perf_counter() is used instead.
start_old = time.perf_counter()
spy = pd.read_csv('random.csv')
print(type(spy))
stock_price_spy = spy.values[:, 5]
n, bins, patches = plt.hist(stock_price_spy, 50)

# plt.show() blocks until the window is closed; that time is included below.
plt.show()
end_old = time.perf_counter()
total_time_old = end_old - start_old
print(total_time_old)

# numpy version.
start_new = time.perf_counter()

stock_price_spy_new = np.loadtxt('random.csv', dtype=float,
                                 delimiter=',', skiprows=1, usecols=4)
print(type(stock_price_spy_new))
# Only the 5th column of the CSV is loaded here (usecols=4 is zero-based).

n, bins, patches = plt.hist(stock_price_spy_new, 50)
plt.show()
end_new = time.perf_counter()

total_time_new = end_new - start_new
print(total_time_new)

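【Discussion】:

For some reason, the earlier comments here were deleted. It still seems necessary to give future readers some context. This answer does not answer the question, and it also contains some flaws. Measuring time with a wall clock is always dangerous. Here it causes whichever method is measured first to always appear slower. With proper timing measurements, one finds that for the setup chosen here the pandas solution is faster. While the plotting takes the same time, the difference comes from the read-in, where the pandas solution outperforms numpy by about a factor of 2.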
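
A minimal sketch of what a more careful measurement of the read-in step might look like, assuming the same 6 x 4871 random CSV as in the answer. It times only the file reading (no plotting, no window that has to be closed by hand), repeats each read several times with timeit, and reports the best run. The function names, the temporary file location, and the repeat counts are choices made for this illustration; the resulting numbers depend on the machine and library versions, so none are quoted here.

```python
import csv
import os
import random
import tempfile
import timeit

import numpy as np
import pandas as pd

rows = 4871
fields = ["one", "two", "three", "four", "five", "six"]

# Write the test file once, with a header row, into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "random.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    for _ in range(rows):
        writer.writerow([random.random() for _ in fields])


def read_with_pandas():
    # Read the file and pull out the last column as a numpy array.
    return pd.read_csv(path)["six"].values


def read_with_numpy():
    # Read only the last column (zero-based index 5), skipping the header.
    return np.loadtxt(path, dtype=float, delimiter=",", skiprows=1, usecols=5)


# Each measurement runs the function 5 times; the fastest of 5 measurements
# is kept to reduce the influence of caching and background activity.
pandas_time = min(timeit.repeat(read_with_pandas, number=5, repeat=5)) / 5
numpy_time = min(timeit.repeat(read_with_numpy, number=5, repeat=5)) / 5

print(f"pandas read: {pandas_time:.4f} s per call")
print(f"numpy read:  {numpy_time:.4f} s per call")
```

Keeping plotting out of the timed section matters because plt.show() blocks until the window is closed, which would otherwise dominate whatever is being measured.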