On-Policy Trust Region Policy Optimisation with Replay Buffers

Kangin, D; Pugeault, N

dc.contributor.author	Kangin, D
dc.contributor.author	Pugeault, N
dc.date.accessioned	2019-01-31T14:34:31Z
dc.date.issued	2019-01-18
dc.description.abstract	Building upon the recent success of deep reinforcement learning methods, we investigate whether on-policy reinforcement learning can be improved by reusing the data from several consecutive policies. On-policy methods bring many benefits, such as the ability to evaluate each resulting policy. However, they usually discard all the information about the policies which existed before. In this work, we propose an adaptation of the replay buffer concept, borrowed from the off-policy learning setting, to create a method that combines the advantages of on- and off-policy learning. To achieve this, the proposed algorithm generalises the $Q$-, value and advantage functions for data from multiple policies. The method uses trust region optimisation while avoiding some of the common problems of algorithms such as TRPO or ACKTR: it uses hyperparameters to replace the trust region selection heuristics, and a trainable covariance matrix instead of a fixed one. In many cases, the method improves on the results not only of state-of-the-art trust region on-policy learning algorithms such as PPO, ACKTR and TRPO, but also of their off-policy counterpart DDPG.	en_GB
dc.description.sponsorship	Engineering and Physical Sciences Research Council (EPSRC)	en_GB
dc.identifier.citation	Working paper in arXiv	en_GB
dc.identifier.grantnumber	EP/N035399/1	en_GB
dc.identifier.uri	http://hdl.handle.net/10871/35684
dc.language.iso	en	en_GB
dc.publisher	arXiv.org	en_GB
dc.relation.url	http://arxiv.org/abs/1901.06212v1	en_GB
dc.rights	© 2019 The Author(s)	en_GB
dc.title	On-Policy Trust Region Policy Optimisation with Replay Buffers	en_GB
dc.type	Working Paper	en_GB
dc.date.available	2019-01-31T14:34:31Z
dc.rights.uri	http://www.rioxx.net/licenses/all-rights-reserved	en_GB
dcterms.dateAccepted	2019-01-18
exeter.funder	::Engineering and Physical Sciences Research Council (EPSRC)	en_GB
rioxxterms.version	AO	en_GB
rioxxterms.licenseref.startdate	2019-01-18
rioxxterms.type	Working paper	en_GB
refterms.dateFOA	2019-01-31T14:34:33Z
refterms.panel	B	en_GB

Files in this item

Name: 1901.06212v1.pdf
Size: 3.524Mb
Format: PDF

This item appears in the following Collection(s)

Computer Science
