https://medium.com/@ranko.mosic/reinforcement-learning-based-trading-application-at-jp-morgan-chase-f829b8ec54f2
( A Business Insider article covers the same topic for readers who do not have an FT subscription. ) The intent is to reduce market impact and provide best trade execution results for large orders.
It is a complex application with many moving parts:
The slides mention two reinforcement learning algorithms:
Sarsa
Q-learning
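The practical difference between the two algorithms is a single line in the update rule: Sarsa (on-policy) bootstraps on the next action the policy actually takes, while Q-learning (off-policy) bootstraps on the greedy next action. A minimal tabular sketch, with illustrative states, actions and hyperparameters that are not from LOXM:

```python
# Toy tabular updates showing the Sarsa vs Q-learning difference.
# ALPHA (learning rate) and GAMMA (discount) are illustrative values.
ALPHA, GAMMA = 0.1, 0.99

def sarsa_update(Q, s, a, r, s_next, a_next):
    # Sarsa: bootstrap on the action the policy actually took next.
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions):
    # Q-learning: bootstrap on the greedy (max-value) next action.
    best = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
```

Either update can be run inside the same episode loop; only the bootstrap target differs.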
Rewards are both immediate rewards ( the price spread captured on each fill ) and terminal ( end-of-episode ) rewards such as completion, order duration and market penalties ( these are, of course, negative rewards that punish the agent along those dimensions ).
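One hedged sketch of how such a reward structure could be wired up: an immediate per-fill reward plus terminal penalties for incomplete execution, excessive duration and market impact. The function names, signatures and penalty weights below are illustrative assumptions, not LOXM's actual reward function:

```python
# Hypothetical reward structure: immediate spread reward per fill,
# plus negative terminal rewards along the dimensions named above.

def step_reward(spread_captured: float) -> float:
    # Immediate reward: price spread captured on this child-order fill.
    return spread_captured

def terminal_reward(filled_qty: float, target_qty: float,
                    duration: float, max_duration: float,
                    impact_cost: float) -> float:
    # Completion penalty: punish any unfilled portion of the parent order.
    completion_penalty = -1.0 * max(0.0, target_qty - filled_qty) / target_qty
    # Duration penalty: punish running past the allotted execution window.
    duration_penalty = -0.5 * max(0.0, duration - max_duration) / max_duration
    # Market penalty: punish measured market impact (weights are made up).
    market_penalty = -1.0 * impact_cost
    return completion_penalty + duration_penalty + market_penalty
```

A fully filled order finished on time with no measured impact would score a terminal reward of zero; anything worse goes negative.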
Action values are memorized as the weights of a deep neural network: function approximation via a neural network is used since the state-action space is too big to be handled in tabular form. We assume stochastic gradient descent is used to train the network through its forward and backpropagation passes ( hence the Deep designation ).
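To make the function-approximation idea concrete, here is a minimal semi-gradient Q-learning sketch with a one-hidden-layer network trained by SGD, using only NumPy. The state dimension, action count, network size and learning rate are all illustrative assumptions:

```python
import numpy as np

# Minimal Q-learning with neural-network function approximation:
# the Q-function lives in the network weights, not in a table.
rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HIDDEN = 4, 3, 16   # illustrative sizes
ALPHA, GAMMA = 0.01, 0.99                 # illustrative hyperparameters

# One hidden layer: state features in, one Q-value per action out.
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
W2 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))

def q_values(s):
    h = np.maximum(0.0, s @ W1)   # ReLU hidden layer (forward pass)
    return h, h @ W2              # hidden activations, Q-values

def sgd_step(s, a, r, s_next, done):
    global W1, W2
    h, q = q_values(s)
    # Semi-gradient TD target: bootstrap on the max Q of the next state.
    target = r if done else r + GAMMA * q_values(s_next)[1].max()
    td_error = target - q[a]
    # Backpropagate the TD error through the chosen action's output only.
    grad_q = np.zeros(N_ACTIONS)
    grad_q[a] = -td_error
    grad_W2 = np.outer(h, grad_q)
    grad_h = W2 @ grad_q
    grad_h[h <= 0] = 0.0          # ReLU gradient mask
    grad_W1 = np.outer(s, grad_h)
    W1 -= ALPHA * grad_W1         # SGD weight updates
    W2 -= ALPHA * grad_W2
    return td_error
```

Repeated updates on the same transition drive the TD error toward zero, which is exactly the "memorized in the weights" behavior described above.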
This builds on earlier academic research on reinforcement learning for optimized trade execution ( Kearns and Nevmyvaka 2006 ).
The latest LOXM developments will be presented at the QuantMinds Conference in Lisbon ( May 2018 ).
Q-learning is probably used for the same purpose ( market impact reduction ).