Trust Region Policy Optimization

From Wikipedia, the free encyclopedia

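Trust Region Policy Optimization (TRPO) is a model-free reinforcement learning algorithm. It is an on-policy method that constrains the KL divergence between the old and new policies to stay below a fixed threshold, so that each update cannot move the new policy too far from the old one.[1] Methods such as PPO were later developed on the basis of TRPO.[2]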
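The KL constraint described above can be illustrated with a small sketch. It assumes a tabular softmax policy whose parameters are per-state logits, and it only shows the trust-region check: a proposed update is shrunk until the mean KL divergence between the old and new policies falls under a threshold. The names `backtrack_step`, `max_kl`, and `shrink`, and the default value 0.01, are hypothetical choices for illustration. A full TRPO implementation also computes the update direction from a surrogate objective and checks that the objective improves, which is omitted here.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_categorical(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(old_policy, new_policy):
    """Average KL(old || new) over a batch of states."""
    kls = [kl_categorical(p, q) for p, q in zip(old_policy, new_policy)]
    return sum(kls) / len(kls)


def backtrack_step(old_logits, direction, max_kl=0.01, shrink=0.5, max_tries=10):
    """Shrink a proposed logit update until the mean KL constraint holds.

    old_logits and direction are lists of per-state lists (hypothetical layout).
    Returns (step, kl); step is 0.0 if no tried step satisfies the constraint.
    """
    old_policy = [softmax(row) for row in old_logits]
    step = 1.0
    for _ in range(max_tries):
        new_logits = [
            [l + step * d for l, d in zip(lrow, drow)]
            for lrow, drow in zip(old_logits, direction)
        ]
        new_policy = [softmax(row) for row in new_logits]
        kl = mean_kl(old_policy, new_policy)
        if kl <= max_kl:
            return step, kl  # update stays inside the trust region
        step *= shrink  # too far from the old policy: try a smaller step
    return 0.0, 0.0  # reject the update and keep the old policy


if __name__ == "__main__":
    old = [[0.0, 0.0], [0.0, 0.0]]      # two states, two actions, uniform policy
    proposed = [[2.0, -2.0], [-2.0, 2.0]]  # a deliberately large update
    print(backtrack_step(old, proposed))
```

In this sketch the full proposed step changes the policy far more than the threshold allows, so the step is halved repeatedly until the new policy is close enough to the old one.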

References

  1. "Trust Region Policy Optimization". OpenAI Spinning Up. Retrieved 2022-06-19.
  2. "Proximal Policy Optimization". OpenAI Spinning Up. Retrieved 2022-06-18.