【问题标题】:How can we find the last row, in a specific column, in a data frame?我们如何在数据框中找到特定列中的最后一行?
【发布时间】:2021-01-10 05:59:26
【问题描述】:

我有一个名为 'data["df"]' 的数据框。数据如下所示。

如何获取名为“adjclose”的列中的最后一行?我想要的数字是“28.430000”。

我正在尝试将其附加到列表中。我测试了这段代码:

print(str(data.iloc[-1, data.columns.get_loc("adjclose")]) = 0))

我明白了:AttributeError: 'dict' object has no attribute 'iloc'

我在这里做错了什么?

这是我的完整代码。

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from yahoo_fin import stock_info as si
from collections import deque

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import os
import random


# set seed, so we can get the same results after rerunning several times
np.random.seed(314)
tf.random.set_seed(314)
random.seed(314)



def load_data(ticker, n_steps=50, scale=True, shuffle=True, lookup_step=1, 
                test_size=0.2, feature_columns=['adjclose', 'volume', 'open', 'high', 'low']):
    # see if ticker is already a loaded stock from yahoo finance
    if isinstance(ticker, str):
        # load it from yahoo_fin library
        df = si.get_data(ticker)
    elif isinstance(ticker, pd.DataFrame):
        # already loaded, use it directly
        df = ticker
    # this will contain all the elements we want to return from this function
    result = {}
    # we will also return the original dataframe itself
    result['df'] = df.copy()
    # make sure that the passed feature_columns exist in the dataframe
    for col in feature_columns:
        assert col in df.columns, f"'{col}' does not exist in the dataframe."
    if scale:
        column_scaler = {}
        # scale the data (prices) from 0 to 1
        for column in feature_columns:
            scaler = preprocessing.MinMaxScaler()
            df[column] = scaler.fit_transform(np.expand_dims(df[column].values, axis=1))
            column_scaler[column] = scaler

        # add the MinMaxScaler instances to the result returned
        result["column_scaler"] = column_scaler
    # add the target column (label) by shifting by `lookup_step`
    df['future'] = df['adjclose'].shift(-lookup_step)
    # last `lookup_step` columns contains NaN in future column
    # get them before droping NaNs
    last_sequence = np.array(df[feature_columns].tail(lookup_step))
    # drop NaNs
    df.dropna(inplace=True)
    sequence_data = []
    sequences = deque(maxlen=n_steps)
    for entry, target in zip(df[feature_columns].values, df['future'].values):
        sequences.append(entry)
        if len(sequences) == n_steps:
            sequence_data.append([np.array(sequences), target])
    # get the last sequence by appending the last `n_step` sequence with `lookup_step` sequence
    # for instance, if n_steps=50 and lookup_step=10, last_sequence should be of 59 (that is 50+10-1) length
    # this last_sequence will be used to predict in future dates that are not available in the dataset
    last_sequence = list(sequences) + list(last_sequence)
    # shift the last sequence by -1
    last_sequence = np.array(pd.DataFrame(last_sequence).shift(-1).dropna())
    # add to result
    result['last_sequence'] = last_sequence
    # construct the X's and y's
    X, y = [], []
    for seq, target in sequence_data:
        X.append(seq)
        y.append(target)
    # convert to numpy arrays
    X = np.array(X)
    y = np.array(y)
    # reshape X to fit the neural network
    X = X.reshape((X.shape[0], X.shape[2], X.shape[1]))
    # split the dataset
    result["X_train"], result["X_test"], result["y_train"], result["y_test"] = train_test_split(X, y, test_size=test_size, shuffle=shuffle)
    # return the result
    return result

def create_model(sequence_length, units=256, cell=LSTM, n_layers=2, dropout=0.3,
                loss="mean_absolute_error", optimizer="rmsprop", bidirectional=False):
    model = Sequential()
    for i in range(n_layers):
        if i == 0:
            # first layer
            if bidirectional:
                model.add(Bidirectional(cell(units, return_sequences=True), input_shape=(None, sequence_length)))
            else:
                model.add(cell(units, return_sequences=True, input_shape=(None, sequence_length)))
        elif i == n_layers - 1:
            # last layer
            if bidirectional:
                model.add(Bidirectional(cell(units, return_sequences=False)))
            else:
                model.add(cell(units, return_sequences=False))
        else:
            # hidden layers
            if bidirectional:
                model.add(Bidirectional(cell(units, return_sequences=True)))
            else:
                model.add(cell(units, return_sequences=True))
        # add dropout after each layer
        model.add(Dropout(dropout))
    model.add(Dense(1, activation="linear"))
    model.compile(loss=loss, metrics=["mean_absolute_error"], optimizer=optimizer)
    return model


# Window size or the sequence length
N_STEPS = 100
# Lookup step, 1 is the next day
LOOKUP_STEP = 1
# test ratio size, 0.2 is 20%
TEST_SIZE = 0.2
# features to use
FEATURE_COLUMNS = ["adjclose", "volume", "open", "high", "low"]
# date now
date_now = time.strftime("%Y-%m-%d")
### model parameters
N_LAYERS = 3
# LSTM cell
CELL = LSTM
# 256 LSTM neurons
UNITS = 256
# 40% dropout
DROPOUT = 0.4
# whether to use bidirectional RNNs
BIDIRECTIONAL = False
### training parameters
# mean absolute error loss
# LOSS = "mae"
# huber loss
LOSS = "huber_loss"
OPTIMIZER = "adam"
BATCH_SIZE = 64
EPOCHS = 2


# save the dataframe
from datetime import date
from datetime import datetime

start_date = date(2020, 3, 1)
end_date = date(2020, 9, 18)
days = np.busday_count(start_date, end_date)



# Apple stock market

tickers = ['CHIS']

thelen = len(tickers)

z=0
all_stocks=[]

for ticker in tickers:
    ticker_data_filename = os.path.join("data", f"{ticker}_{date_now}.csv")
    # model name to save, making it as unique as possible based on parameters
    model_name = f"{date_now}_{ticker}-{LOSS}-{OPTIMIZER}-{CELL.__name__}-seq-{N_STEPS}-step-{LOOKUP_STEP}-layers-{N_LAYERS}-units-{UNITS}"
    if BIDIRECTIONAL:
        model_name += "-b"
    
    
    # create these folders if they does not exist
    if not os.path.isdir("results"):
        os.mkdir("results")
    if not os.path.isdir("logs"):
        os.mkdir("logs")
    if not os.path.isdir("data"):
        os.mkdir("data")
        
    
    # load the data
    data = load_data(ticker, N_STEPS, lookup_step=LOOKUP_STEP, test_size=TEST_SIZE, feature_columns=FEATURE_COLUMNS)
    
    # save the dataframe
    data["df"].to_csv(ticker_data_filename)
    
    # construct the model
    model = create_model(N_STEPS, loss=LOSS, units=UNITS, cell=CELL, n_layers=N_LAYERS,
                        dropout=DROPOUT, optimizer=OPTIMIZER, bidirectional=BIDIRECTIONAL)
    
    # some tensorflow callbacks
    checkpointer = ModelCheckpoint(os.path.join("results", model_name + ".h5"), save_weights_only=True, save_best_only=True, verbose=1)
    tensorboard = TensorBoard(log_dir=os.path.join("logs", model_name))
    
    history = model.fit(data["X_train"], data["y_train"],
                        batch_size=BATCH_SIZE,
                        epochs=EPOCHS,
                        validation_data=(data["X_test"], data["y_test"]),
                        callbacks=[checkpointer, tensorboard],
                        verbose=1)
    
    model.save(os.path.join("results", model_name) + ".h5")
    
    
    # after the model ends running...or during training, run this
    # tensorboard --logdir="logs"
    # http://localhost:6006/
    
    
    data = load_data(ticker, N_STEPS, lookup_step=LOOKUP_STEP, test_size=TEST_SIZE,
                    feature_columns=FEATURE_COLUMNS, shuffle=False)
    
    # construct the model
    model = create_model(N_STEPS, loss=LOSS, units=UNITS, cell=CELL, n_layers=N_LAYERS,
                        dropout=DROPOUT, optimizer=OPTIMIZER, bidirectional=BIDIRECTIONAL)
    
    model_path = os.path.join("results", model_name) + ".h5"
    model.load_weights(model_path)
    
    
    # evaluate the model
    mse, mae = model.evaluate(data["X_test"], data["y_test"], verbose=0)
    # calculate the mean absolute error (inverse scaling)
    mean_absolute_error = data["column_scaler"]["adjclose"].inverse_transform([[mae]])[0][0]
    #print("Mean Absolute Error:", mean_absolute_error)
    
    
    def predict(model, data, classification=False):
        # retrieve the last sequence from data
        last_sequence = data["last_sequence"][:N_STEPS]
        # retrieve the column scalers
        column_scaler = data["column_scaler"]
        # reshape the last sequence
        last_sequence = last_sequence.reshape((last_sequence.shape[1], last_sequence.shape[0]))
        # expand dimension
        last_sequence = np.expand_dims(last_sequence, axis=0)
        # get the prediction (scaled from 0 to 1)
        prediction = model.predict(last_sequence)
        # get the price (by inverting the scaling)
        predicted_price = column_scaler["adjclose"].inverse_transform(prediction)[0][0]
        return predicted_price
    
    
    # predict the future price
    future_price = predict(model, data)
    print(f"Future price after {LOOKUP_STEP} days is ${future_price:.2f}")

    z=z+1
    print(str(z) + ' of ' + str(thelen))

    
    # print(new_seriesdata['Adj Close'].iloc[-1])
    all_stocks.append(ticker + ' act: ' + str(future_price) + ' prd: ' + str(future_price))

【问题讨论】:

  • 看来您的data 不是数据框,而是字典。请提供一个完整的例子。但是,根据您的评论“我有一个名为 'data["df"]' 的数据框”,也许 data 不是您的数据框,但 data['df'] 是?
  • 一旦你有了你的数据框,请注意有一个有据可查的函数来获取最后的记录,称为tail()。像your_df.tail(1)['adjclose'] 这样简单的东西可能就是你所需要的。
  • 这能回答你的问题吗? How to get the last N rows of a pandas DataFrame?
  • 如果是字典,只需使用data['adjclose'][-1]即可获取所需项目

标签: python python-3.x dataframe


【解决方案1】:

正如 Grismar 回答的那样,您的当前被定义为字典。如果您以 DataFrame 格式获取它,您还可以将其表示为列中的值列表:

a = your_df['adjclose']

然后检索列表的最后一个值(相应的最后一行)。

print(a[len(a)-1])

【讨论】:

  • 谢谢大家。我有点迷失在这里。我尝试了这些选项: df['adjclose'] Yields: KeyError: 'adjclose' df.tail(1)['adjclose'] Yields: KeyError: 'adjclose' df['adjclose'][-1] Yields: KeyError: 'adjclose'
  • 你能发布你是如何构建你的df的吗?似乎 adjclose 不存在于您的 df 中。
  • 我刚刚发布了我的完整代码。这可能有点令人困惑,但希望它比伤害更有用!
  • @Grismar & pavel,你们都是对的。这是做同一件事的两种不同方法。非常感谢!!还有一个问题。为什么我在 'data["df"]' 中有 450 条记录,但我在 'days' 中只有 147 天。我希望 'data["df"]' 中有 147 条记录。为什么会有差异?
  • @Grismar & pavel,我想通了。我刚刚对数据框进行了切片,现在我得到了我想要的。谢谢大家!!
猜你喜欢
  • 2023-01-31
  • 2010-09-09
  • 1970-01-01
  • 1970-01-01
  • 2022-10-12
  • 2020-06-10
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多