Deep Direct Reinforcement Learning for Financial Signal Representation and Trading
Published in IEEE Transactions on Neural Networks and Learning Systems (an SCI Q1 journal)
Paper structure:
1. Introduction (survey part; framed around RL for decision execution and DL for feature learning)
2. Related Works (background on RL and DL)
3. Direct Deep Reinforcement Learning (the core part; the order in which the model is built)
4. DRNN Learning (model initialization and training method)
5. Experimental Verifications (comparison of multiple methods)
Terminology
TC: trade commission (transaction fee)
Abstract
Abstract—Can we train the computer to beat experienced traders for financial asset trading? In this paper, we try to address this challenge by introducing a recurrent deep neural network (NN) for real-time financial signal representation and trading. Our model is inspired by two biologically related learning concepts, deep learning (DL) and reinforcement learning (RL). In the framework, the DL part automatically senses the dynamic market condition for informative feature learning. Then, the RL module interacts with the deep representations and makes trading decisions to accumulate the ultimate rewards in an unknown environment. The learning system is implemented in a complex NN that exhibits both deep and recurrent structures. Hence, we propose a task-aware backpropagation through time method to cope with the gradient vanishing issue in deep training. The robustness of the neural system is verified on both the stock and the commodity futures markets under broad testing conditions.
Introduction
Training intelligent agents for automated financial asset trading is a time-honored topic that has been widely discussed in modern artificial intelligence [1]. Essentially, the process of trading is well depicted as an online decision-making problem involving two critical steps: market condition summarization and optimal action execution. Compared with conventional learning tasks, dynamic decision making is more challenging due to the lack of supervised information from human experts. It thus requires the agent to explore an unknown environment all by itself and to simultaneously make correct decisions in an online manner.
Such self-learning pursuits have encouraged the long-term development of RL, a biologically inspired framework whose theory is deeply rooted in the neuroscientific study of behavior control [2]–[4]. From the theoretical point of view, stochastic optimal control problems were well formulated in a pioneering work [2]. In practical applications, the successes of RL have been extensively demonstrated in a number of tasks, including robot navigation [5], Atari game playing [6], and helicopter control [7]. In some tests, RL even outperforms human experts in conducting optimal control policies [6], [8]. Hence, it leads to an interesting question in the context of trading: can we train an RL model to beat experienced human traders on the financial markets? Compared with conventional RL tasks, algorithmic trading is much more difficult due to the following two challenges.
The first challenge stems from the difficulty of summarizing and representing the financial environment. Financial data contain a large amount of noise, jumps, and movements, leading to highly nonstationary time series. To mitigate data noise and uncertainty, handcrafted financial features, e.g., moving averages or stochastic technical indicators [9], are usually extracted to summarize the market conditions. The search for ideal indicators for technical analysis [10] has been extensively studied in quantitative finance. However, a widely known drawback of technical analysis is its poor generalization ability. For instance, the moving average feature is good enough to describe a trend but may suffer significant losses in a mean-reversion market [11]. Rather than exploiting predefined handcrafted features, can we learn more robust feature representations directly from the data?
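As a concrete illustration of such handcrafted features, here is a minimal moving-average sketch; the window sizes and the toy price series are illustrative, not taken from the paper:

```python
import numpy as np

def moving_average(prices, window):
    """Simple moving average over a sliding window (a classic handcrafted indicator)."""
    kernel = np.ones(window) / window
    return np.convolve(prices, kernel, mode="valid")

# On a trending series the short MA stays above the long MA, so the indicator works well.
trend = np.arange(20, dtype=float)          # steadily increasing toy prices
short_ma = moving_average(trend, 3)
long_ma = moving_average(trend, 10)
print(short_ma[-1] > long_ma[-1])           # True: the crossover signal agrees with the trend
```

In a mean-reversion market, the same crossover rule would keep entering in the direction of moves that are about to reverse, which is exactly the generalization failure discussed above.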
The second challenge is due to the dynamic behavior of trading action execution. Placing trading orders is systematic work that should take a number of practical factors into consideration. Frequently changing the trading position (long or short) contributes nothing to the profits but leads to great losses due to the transaction cost (TC) and slippage. Accordingly, in addition to the current market condition, the historic actions and the corresponding positions also need to be explicitly modeled in the policy-learning part. Without adding extra complexities, how can we incorporate such memory phenomena into the trading system?
In addressing the aforementioned two questions, in this paper, we introduce a novel recurrent deep neural network (RDNN) structure for simultaneous environment sensing and recurrent decision making for online financial asset trading. The bulk of the RDNN is composed of two parts: a deep neural network (DNN) for feature learning and a recurrent neural network (RNN) for RL. To further improve the robustness of market summarization, fuzzy learning concepts are introduced to reduce the uncertainty of the input data. While DL has shown great promise in many signal processing problems such as image and speech recognition, to the best of our knowledge, this is the first paper to implement DL in designing a real trading system for financial signal representation and self-taught reinforcement trading. The whole learning model leads to a highly complicated NN that involves both deep and recurrent structures. To handle the recurrent structure, the backpropagation through time (BPTT) method is exploited to unfold the RNN into a series of time-dependent stacks without feedback. When propagating the RL score back to all the layers, the gradient vanishing issue is inevitably involved in the training phase, because the unfolded NN exhibits extremely deep structures in both the feature-learning and time-expansion parts. Hence, we introduce a more reasonable training method, called task-aware BPTT, to overcome this pitfall. In our approach, some virtual links from the objective function are directly connected to the deep layers during backpropagation (BP) training. This strategy gives the deep part a chance to see what is going on in the final objective and thus improves the learning efficiency. The deep direct reinforcement (DDR) trading system is tested on real financial markets for futures contract trading. In detail, we accumulate the historic prices of both the stock-index future (IF) and commodity futures; these real market data are directly used for performance verification.
The deep RL system will be compared with other trading systems under diverse testing conditions. The comparisons show that the DDR system and its fuzzy extension are robust to different market conditions and can make reliable profits on various futures markets. The remaining parts of this paper are organized as follows. Section II reviews related work on RL and DL. Section III introduces the detailed implementation of the RDNN trading model and its fuzzy extension. The proposed task-aware BPTT algorithm for RDNN training is presented in Section IV. Section V is the experimental part, where we verify the performance of the DDR and compare it with other trading systems. Section VI concludes this paper and indicates some future directions.
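A minimal sketch of the virtual-link idea in task-aware BPTT, assuming the links take the form of extra task-objective terms attached directly to a deep hidden layer; the head `Wv`, the weight `lam`, and all sizes are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
x, target = rng.normal(size=8), 1.0
W1 = rng.normal(scale=0.3, size=(8, 8))   # deep feature layer
W2 = rng.normal(scale=0.3, size=8)        # output layer
Wv = rng.normal(scale=0.3, size=8)        # hypothetical "virtual link" head on the deep layer
lam = 0.3                                 # assumed weight of the virtual-link term

def task_aware_objective(W1, W2, Wv, x, target, lam):
    h = np.tanh(W1 @ x)                   # deep hidden representation
    main = (W2 @ h - target) ** 2         # task objective at the final output
    virtual = (Wv @ h - target) ** 2      # the same task objective, linked straight to h
    return main + lam * virtual           # gradients w.r.t. W1 now get a short direct path

print(task_aware_objective(W1, W2, Wv, x, target, lam) >= 0.0)   # True
```

The point of the extra term is that the gradient with respect to `W1` no longer has to travel only through the long unfolded chain: the virtual link supplies a short path from the objective to the deep layer, which is what mitigates gradient vanishing.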
II. RELATED WORKS
RL [12] is a prevalent self-taught learning [13] paradigm that was developed to solve the Markov decision problem [14]. According to their learning objectives, typical RL methods can generally be categorized into two types: critic-based (learning value functions) and actor-based (learning actions) methods. Critic-based algorithms directly estimate the value functions and are perhaps the most widely used RL frameworks in the field. These value-function-based methods, e.g., TD-learning or Q-learning [15], are always applied to solve optimization problems defined in a discrete space. The optimization of value functions can always be solved by dynamic programming [16].
While the value-function-based methods (also known as critic-based methods) perform well for a number of problems, they are not a good paradigm for the trading problem, as indicated in [17] and [18]. This is because the trading environment is too complex to be approximated in a discrete space. On the other hand, in typical Q-learning, the definition of the value function always involves a term recording the future discounted returns [17]. The nature of trading requires counting the profits in an online manner: no future market information of any kind is allowed in either the sensory part or the policy-making part of a trading system. While value-function-based methods are plausible for offline scheduler problems [15], they are not ideal for dynamic online trading [17], [19]. Accordingly, rather than learning the value functions, a pioneering work [17] suggests learning the actions directly, which falls into the actor-based framework.
The actor-based RL defines a spectrum of continuous actions directly from a parameterized family of policies. In typical value-function-based methods, the optimization always relies on some complicated dynamic programming to derive the optimal action in each state. The optimization of actor-based learning is much simpler, requiring only a differentiable objective function with latent parameters. In addition, rather than describing diverse market conditions with some discrete states (as in Q-learning), the actor-based method learns the policy directly from the continuous sensory data (market features). In conclusion, the actor-based method exhibits two advantages: 1) a flexible objective for optimization and 2) continuous descriptions of the market condition. It is therefore a better framework for trading than the Q-learning approaches. In [17] and [19], actor-based learning is termed direct reinforcement learning (DRL), and we will also use DRL here for consistency.
While the DRL defines a good trading model, it does not shed light on the feature-learning side. It is known that robust feature representation is vital to machine learning performance. In the context of stock data learning, various feature representation strategies have been proposed from multiple views [20]–[22]. Failure to extract robust features may adversely affect the performance of a trading system handling market data with high uncertainty. In the field of direct reinforcement trading (DRT), Deng et al. [19] attempt to introduce the sparse coding model as a feature extractor for financial analysis. The sparse features achieve much more reliable performance than the DRL in trading stock IFs.
While admitting the general effectiveness of sparse coding for feature learning [23]–[25], [36], it is essentially a shallow data representation strategy whose performance is not comparable with state-of-the-art DL in a wide range of tests [26], [27]. DL is an emerging technique [28] that allows robust features to be learned from big data. The successes of DL techniques have been witnessed in image categorization [26] and speech recognition [29]. In these applications, DL mainly serves to automatically discover informative features from a large number of training samples. However, to the best of our knowledge, there is hardly any existing work on DL for financial signal mining. This paper tries to generalize the power of DL into the new field of financial signal processing and learning. The DL model will be combined with DRL to design a real-time trading system for financial asset trading.
III. DIRECT DEEP REINFORCEMENT LEARNING
A. Direct Reinforcement Trading
The main points of this section:
- Moody's DRL framework [30]: a typical DRL is essentially a single-layer RNN
- The transaction cost (TC) issue is addressed
- The total profit (TP) can serve directly as the reward
- The reward can then be replaced with alternatives such as the Sharpe ratio
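These points can be sketched as a one-layer recurrent trader in the style of Moody's DRL: the position delta_t = tanh(<w, f_t> + b + u * delta_{t-1}) lies in [-1, 1], each step earns delta_{t-1} * z_t (price change z_t) minus the TC term c * |delta_t - delta_{t-1}|, and the total profit U_T = sum of the rewards. The feature choice, parameter values, and toy prices are illustrative assumptions:

```python
import numpy as np

def drl_total_profit(prices, w, b, u, c=0.01):
    """Total profit U_T of a one-layer recurrent trader (Moody-style DRL sketch).

    delta_t = tanh(<w, f_t> + b + u * delta_{t-1}) is the position in [-1, 1];
    each step earns delta_{t-1} * z_t and pays the TC c * |delta_t - delta_{t-1}|.
    """
    z = np.diff(prices)                  # one-step price changes
    delta_prev, total = 0.0, 0.0
    for t in range(1, len(z)):
        f_t = z[t - 1: t]                # feature: the most recent price change (assumed)
        delta = np.tanh(np.dot(w, f_t) + b + u * delta_prev)
        total += delta_prev * z[t] - c * abs(delta - delta_prev)
        delta_prev = delta
    return total

prices = np.array([100.0, 101.0, 102.0, 103.0, 104.0])   # toy uptrend
print(drl_total_profit(prices, w=np.array([1.0]), b=0.0, u=0.5))
```

The recurrent term u * delta_{t-1} is what lets the trader remember its previous position, so the TC penalty can discourage needless position flips; replacing the accumulated TP with the Sharpe ratio would only change the objective, not the recurrence.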
B. Deep Recurrent Neural Network for DDR
Moody's DRL network is extended from a single layer to multiple layers: the number of hidden layers is set to 4, and each hidden layer has a fixed 128 nodes.
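A hedged sketch of that deep feature stack; the input dimension, the tanh activation at every layer, and the random initialization are assumptions, and only the 4 x 128 shape comes from the notes:

```python
import numpy as np

def deep_features(f_t, layers):
    """Pass raw features through the DNN part (4 hidden layers x 128 units per the notes)."""
    h = f_t
    for W, b in layers:
        h = np.tanh(W @ h + b)            # tanh keeps activations bounded, like the trading layer
    return h

rng = np.random.default_rng(0)
dims = [50, 128, 128, 128, 128]           # input size 50 is an assumption; 4 hidden layers of 128
layers = [(rng.normal(scale=0.1, size=(dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(4)]
print(deep_features(rng.normal(size=50), layers).shape)   # (128,)
```

The 128-dimensional output of this stack is what replaces the raw input of the single-layer DRL as the feature vector f_t fed to the recurrent trading layer.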
C. Fuzzy Extensions to Reduce Uncertainties
The deep configuration solves the feature-learning problem in the RNN well. However, another important issue, data uncertainty in financial data, should also be considered carefully. Unlike other kinds of signals such as images or speech, financial sequences contain a large amount of unpredictable uncertainty due to the stochastic nature of trading. In addition, other factors, such as the global economic environment and company rumors, may also affect the direction of the financial signal in real time. Reducing the uncertainty in the raw data is therefore an important way to enhance the robustness of financial signal mining.

In artificial intelligence, fuzzy learning is an ideal paradigm for reducing the uncertainty of raw data [33], [34]. Rather than adopting precise descriptions of a phenomenon, fuzzy systems prefer to assign fuzzy linguistic values to the input data. Such a fuzzified representation is easily obtained by comparing the real data with a number of fuzzy rough sets and then deriving the corresponding fuzzy membership degrees. The learning system then makes robust control decisions using only these fuzzy representations.

For the financial problem discussed here, the fuzzy rough sets can be defined naturally according to the basic movements of the stock price: the fuzzy sets are defined as the increasing group, the decreasing group, and the no-trend group. The parameters of the fuzzy membership functions can then be predefined according to the context of the problem, or they can be learned in a fully data-driven manner. The financial problem is so complicated that it is hard to construct the fuzzy membership functions by hand from past experience, so we prefer to learn the membership functions directly; this idea is detailed in Section IV.

In a fuzzy neural network, the fuzzy representation part is typically connected to the input vector f_t (the green nodes) with different membership functions [35]. Note that in our setting, we follow the pioneering work [35] in assigning k different fuzzy degrees to each dimension of the input vector. In Fig. 2, due to space limitations, each input variable is connected to only two fuzzy nodes (k = 2); in our practical implementation, k is fixed to 3 to describe the increasing, decreasing, and no-trend cases. Mathematically, the i-th fuzzy membership function v_i(·): R → [0, 1] maps the i-th input to a fuzzy degree. A Gaussian membership function with mean m and variance σ² is utilized in our system, following [37] and [38]. After the fuzzy representations are obtained, they are directly connected to the deep transformation layers for deep feature learning.

In summary, the fuzzy DRNN (FDRNN) consists of three parts: fuzzy representation, deep transformation, and DRT. Viewed as a unified system, the FDRNN plays the roles of data preprocessing (uncertainty reduction), feature learning (deep transformation), and trading policy making (RL). In the overall optimization framework, three groups of parameters need to be learned: the trading parameters Θ = (w, b, u), the fuzzy representation v(·), and the deep transformation g_d(·). In this optimization, U_T is the ultimate reward function of RL, δ_t is the policy approximated by the FDRNN, and f_t is the high-level feature representation of the current market condition learned by DL.
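The Gaussian membership described above can be sketched as follows, with k = 3 sets for the decreasing, no-trend, and increasing cases; the centers and variances are illustrative placeholders, since the paper learns them in a data-driven way:

```python
import numpy as np

def fuzzy_degrees(x, means, sigmas):
    """Map one input value to k Gaussian fuzzy membership degrees in [0, 1]."""
    return np.exp(-(x - means) ** 2 / (2.0 * sigmas ** 2))

# k = 3 fuzzy sets: decreasing / no-trend / increasing (centers are illustrative,
# not the learned values from the paper).
means = np.array([-1.0, 0.0, 1.0])
sigmas = np.array([0.5, 0.5, 0.5])
print(fuzzy_degrees(0.0, means, sigmas))   # the "no-trend" set fires strongest at x = 0
```

Applying this map to every dimension of f_t triples the input width (k degrees per dimension); those bounded degrees, rather than the noisy raw values, are what the deep transformation layers consume.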