【Posted】: 2017-07-14 01:39:46
【Question】:
This is a follow-up to this question: pandas replace only part of a column
Here is my current input:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date
pd.set_option('expand_frame_repr', False)
df = data.DataReader('GE', 'yahoo', date(2000, 1, 1), date(2000, 2, 1))
df['x'] = np.where(df['Open'] > df['High'].shift(-2), 1, np.nan)
print(df.round(2))
# this section of code works perfectly for an integer based index.......
ii = df[pd.notnull(df['x'])].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj
for ci in jj:
    df.loc[ci:ci+2,'x'] = 1.0
# end of section that works perfectly for an integer based index......
print(df.round(2))
Here is my current output:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 NaN
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 NaN
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 NaN
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 15, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
  File "C:\stocks\question4 for stack overflow.py", line 15, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
TypeError: Cannot cast ufunc greater input from dtype('<m8[ns]') to dtype('<m8') with casting rule 'same_kind'
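The dtype in this error comes from np.diff: on a datetime index it returns timedelta64 values, not integers, so the gaps are elapsed time rather than row counts. A minimal sketch of that behavior, using four dates copied from the table above (the names idx, gaps, and is_big_gap are illustrative only):

```python
import numpy as np
import pandas as pd

# Four dates taken from the output above
idx = pd.to_datetime(["2000-01-13", "2000-01-14", "2000-01-18", "2000-01-20"])

# Differences between datetimes are timedeltas, not plain integers
gaps = np.diff(idx.to_numpy())
print(gaps.dtype)

# Comparing against an explicit timedelta is well defined
is_big_gap = gaps > np.timedelta64(2, "D")
print(is_big_gap.tolist())  # [False, True, False]
```

Note that Jan 14 and Jan 18 are adjacent rows but four calendar days apart, so a time-based threshold does not mean the same thing as the row-count threshold used in the integer-index version.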
What I want to do is change column "x" into non-overlapping sets of three consecutive 1s. The desired output is:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 NaN
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 NaN
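So January 5, 18, and 31 change from NaN to 1.0.
As the comments in the code say, the second part of the code works perfectly for an integer-based index. However, it does not work with a datetime index of dtype datetime64[ns]. I think (hopefully) the second part of the code needs only a minor adjustment to work.
Thanks in advance, David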
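One way to keep the integer logic is to run it on row positions instead of on the date labels. The sketch below assumes that is acceptable; it rebuilds only the x column from the table above (the price columns are omitted), and pos, gaps, starts, and col are illustrative names:

```python
import numpy as np
import pandas as pd

# Rebuild the 21 trading dates from the table (2000-01-17 does not appear in it)
dates = pd.bdate_range("2000-01-03", "2000-02-01")
dates = dates[dates != pd.Timestamp("2000-01-17")]

# Rows where x is 1.0 in the current output; everything else is NaN
x = np.full(len(dates), np.nan)
x[[0, 1, 8, 9, 12, 13, 14, 17, 18]] = 1.0
df = pd.DataFrame({"x": x}, index=dates)

# Integer row positions of the non-null entries (stands in for the label-based ii)
pos = np.flatnonzero(df["x"].notna().to_numpy())
gaps = np.diff(pos)  # plain integers, so "> 2" works again

# Same rule as the original code: a block starts at the first mark
# and at every mark that follows a gap greater than 2
starts = [pos[0]] + [pos[i] for i in range(1, len(pos)) if gaps[i - 1] > 2]

col = df.columns.get_loc("x")
for s in starts:
    # iloc slices exclude the end point, so s:s+3 covers three rows
    df.iloc[s:s + 3, col] = 1.0

print(df)
```

The loc version used ci:ci+2 because label slices include both end points; with iloc the equivalent three-row slice is s:s+3.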
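-------------- Follow-up ------------------
Thanks for sticking with me, b2002. I am really trying to keep the top solution because of its brevity. When I run your code out of the box, the output is as follows:
Original output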
...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]...
... a[ci:ci+2] = 1.0...
Open High Low Close Volume Adj Close x ii dd jj-before jj-after desired
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.68 1.0 1
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.49 1.0 1
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.44 1.0 2 x x
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.82 1.0 3 1
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.94 NaN 4 1
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.93 NaN 5 1
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.98 NaN 6 1
2000-01-12 151.06 153.25 150.56 152.00 18342300 30.08 NaN 7 1
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.42 1.0 1
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.88 1.0 1
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.29 1.0 10 3 x x x
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.43 1.0 11 1
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.88 1.0 1
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.52 1.0 1
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.33 1.0 1
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.41 1.0 15 4 z z
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.99 1.0 16 1
2000-01-27 141.56 141.75 137.06 141.75 19243500 28.05 1.0 1
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.52 1.0 1
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.52 1.0 19 3 x x x
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.91 1.0 20 1
I really want to understand what is going on, so I set up the columns ii, dd, jj before, jj after, and desired. When I adjust the input to:
...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]...
... a[ci:ci+1] = 1.0...
This is the output:
Open High Low Close Volume Adj Close x
Date
2000-01-03 153.00 153.69 149.19 150.00 22069800 29.45 1.0
2000-01-04 147.25 148.00 144.00 144.00 22121400 28.27 1.0
2000-01-05 143.75 147.00 142.56 143.75 27292800 28.22 1.0
2000-01-06 143.12 146.94 142.63 145.67 19873200 28.60 NaN
2000-01-07 148.00 151.88 147.00 151.31 20141400 29.70 NaN
2000-01-10 152.69 154.06 151.12 151.25 15226500 29.69 NaN
2000-01-11 151.00 152.69 150.62 151.50 15123000 29.74 NaN
2000-01-12 151.06 153.25 150.56 152.00 18342300 29.84 NaN
2000-01-13 153.13 154.94 153.00 153.75 14953500 30.18 1.0
2000-01-14 153.38 154.63 149.56 151.00 18480300 29.64 1.0
2000-01-18 149.62 149.62 146.75 148.00 18296700 29.05 1.0
2000-01-19 146.50 150.94 146.25 148.72 14849700 29.19 NaN
2000-01-20 149.06 149.75 142.63 145.94 30759000 28.65 1.0
2000-01-21 147.94 148.25 143.94 144.13 24005400 28.29 1.0
2000-01-24 145.31 145.94 136.44 138.13 27116100 27.12 1.0
2000-01-25 138.06 140.38 137.00 138.50 25387500 27.19 1.0
2000-01-26 140.50 142.19 138.88 141.44 15856800 27.77 NaN
2000-01-27 141.56 141.75 137.06 141.75 19243500 27.83 1.0
2000-01-28 140.31 140.50 133.63 134.00 29846700 26.31 1.0
2000-01-31 134.00 135.94 133.06 134.00 21782700 26.31 1.0
2000-02-01 134.25 137.00 134.00 136.00 27339000 26.70 NaN
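The only problem is January 25, where np.diff gives a value of 4. I just need the code to skip values of 4 so that the existing set of three 1s is left alone. I tried to modify dd before it goes into jj, and neither of these two attempts worked: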
dd[dd == 4] = 1
dd = [3 if x==4 else x for x in dd]
I also tried modifying the jj entry with this:
jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
which gives this error message:
Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 109, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
  File "C:\stocks\question4 for stack overflow.py", line 109, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
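This error is raised because dd == 4 compares the whole array and yields one boolean per element, which the or operator cannot reduce to a single True or False. A minimal sketch with made-up values for ii and dd (not the question's real data) shows the element-wise form:

```python
import numpy as np

# Hypothetical stand-ins for the question's ii and dd
dd = np.array([1, 7, 1, 3, 1, 1, 4, 1])
ii = np.arange(len(dd) + 1)

# (dd == 4) is an 8-element boolean array; asking for its truth value fails
try:
    bool(dd == 4)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Indexing a single element on both sides gives plain booleans that "or" can combine
picked = [int(ii[i]) for i in range(1, len(ii)) if dd[i - 1] == 4 or dd[i - 1] > 2]
print(picked)  # [2, 4, 7]
```

Since 4 is already greater than 2, the extra clause does not change which entries are selected; indexing with i-1 only makes the expression run.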
Does anyone have any ideas?
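One possible alternative, not taken from the original thread, is to drop np.diff and walk the rows once: whenever a marked row is found, fill a block of three and jump past it. This is a sketch under the assumption that a new block may start right after the previous one ends; fill_triples and width are hypothetical names, and only the x column is rebuilt here:

```python
import numpy as np
import pandas as pd

def fill_triples(flags, width=3):
    """Return a boolean mask with a non-overlapping block of `width` rows
    starting at each marked row that is not already inside a block."""
    flags = np.asarray(flags, dtype=bool)
    out = np.zeros(len(flags), dtype=bool)
    i = 0
    while i < len(flags):
        if flags[i]:
            out[i:i + width] = True  # slicing past the end is harmless
            i += width               # skip the rows just filled
        else:
            i += 1
    return out

# x column as in the first "current output" table (positions of the 1.0 rows)
x = np.full(21, np.nan)
x[[0, 1, 8, 9, 12, 13, 14, 17, 18]] = 1.0
df = pd.DataFrame({"x": x})

mask = fill_triples(df["x"].notna())
df["x"] = np.where(mask, 1.0, np.nan)
print(np.flatnonzero(mask).tolist())
# [0, 1, 2, 8, 9, 10, 12, 13, 14, 17, 18, 19]
```

Because the scan works on positions only, it does not depend on whether the index holds integers or dates. A run longer than three marked rows would start a second block under this rule, which may or may not be what is wanted.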
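【Comments】:
-
You could try using ix for mixed label/integer-based access instead of loc, or reset_index, perform the transformation, and then set_index back to Date
-
Can you explain the logic of your code? What are you trying to do? Why do these rows need three consecutive 1s?
-
Parfait - This is just an example. There is no specific reason.
-
If your data is not too large and/or you are not too concerned about ultra-fast speed, the function I wrote can be imported from a separate file and executed in a single line. Regarding the shorter code, I should say that that code will run, and the code will run with the first line of my answer. If you like, I can help you set up the separate file and the import. Unfortunately, I have no time right now to work on the other code.
Tags: python pandas datetime indexing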