P72

Sampling-based Policy Optimization

✅ Sampling-based methods.

  • Iterative methods
    Goal: find the optimal policy \(\pi (s;\theta )\) that minimizes the objective \(J(\theta )=\sum_{t=0}^{T}h(s_t,a_t) \)

    • Initialize policy parameters \(\theta \) of \(\pi (s;\theta )\)
    • Repeat:
      • Propose a set of candidate parameters \(\{\theta _i \}\) according to the current estimate of \(\theta \)
      • Simulate the agent under the control of each \( \pi ( \theta _i)\)
      • Evaluate the objective function \( J (\theta_i )\) on the simulated state-action sequences
      • Update the estimation of \(\theta \) based on {\( J (\theta_i )\)}
  • Example: CMA-ES models \( \theta\) as a Gaussian distribution and updates the mean and covariance of that Gaussian at each iteration.
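The iterative loop above can be sketched as follows. This is a minimal sketch: it uses a simplified cross-entropy-style update (a diagonal Gaussian refit to the elite samples) rather than the full covariance adaptation of real CMA-ES, and `objective` is a stand-in for simulating the agent and summing \(h(s_t,a_t)\) over the rollout.

```python
import numpy as np

def objective(theta):
    # Placeholder for J(theta); a real task would simulate the agent
    # under pi(s; theta) and accumulate h(s_t, a_t) over the rollout.
    return np.sum((theta - 1.0) ** 2)

def sampling_based_optimize(J, dim, iters=100, pop=16, elite=4, seed=0):
    """Simplified CMA-ES-style loop: model theta as a Gaussian,
    sample candidates, and refit the Gaussian to the best ones."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        # Propose a set of candidate parameters {theta_i}
        thetas = mean + std * rng.standard_normal((pop, dim))
        # Evaluate the objective J(theta_i) on each candidate
        scores = np.array([J(t) for t in thetas])
        # Update the estimate of theta from the elite set
        best = thetas[np.argsort(scores)[:elite]]
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-8
    return mean

theta = sampling_based_optimize(objective, dim=3)
```

Real CMA-ES additionally adapts a full covariance matrix and a global step size; libraries such as `pycma` implement this properly.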

P73

Example: Locomotion Controller with Linear Policy

🔎 [Liu et al. 2012 – Terrain Runner]

P74

Stage 1a: Open-loop Policy

Find open-loop control using SAMCON

✅ Use open-loop trajectory optimization to obtain the open-loop control trajectory.

P76

Stage 1b: Linear Feedback Policy

✅ Use feedback control to update the control signal. Since a linear relationship is assumed, the control offset can be computed directly from the state offset.

P78

Stage 1b: Reduced-order Closed-loop Policy

✅ Factor \(M\) into two matrices, \(M_{A\times B} = M_{A\times C}\cdot M_{C\times B}\). If \(C\) is small, this significantly reduces the number of matrix parameters.
✅ Benefits: (1) fewer parameters, which simplifies optimization; (2) unneeded information in the state is discarded.
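The factorization can be illustrated with the running-task dimensions quoted later in these notes (9 controls, 12 states, reduced dimension 3); the dimension names below are for illustration only.

```python
import numpy as np

n_a, n_s, n_c = 9, 12, 3  # control dim, state dim, reduced dim (running task)

rng = np.random.default_rng(0)
# Full feedback gain matrix M: n_a * n_s parameters
M_full = rng.standard_normal((n_a, n_s))
# Factorized form M = U @ V with a small inner dimension n_c
U = rng.standard_normal((n_a, n_c))
V = rng.standard_normal((n_c, n_s))
M_reduced = U @ V  # same shape as M_full, but rank <= n_c

params_full = n_a * n_s            # 9 * 12 = 108
params_reduced = n_a * n_c + n_c * n_s  # 27 + 36 = 63
print(params_full, params_reduced)
```

The factorized matrix has the same shape as the full one but at most rank \(C\), which is exactly what discards the unneeded directions in the state.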

P79

Some practical engineering tricks

Manually-selected States: s

  • Running: 12 dimensions

✅ (1) root rotation, (2) center-of-mass position, (3) center-of-mass velocity, (4) stance-foot position

P80

Manually-selected Controls: a

  • for all skills: 9 dimensions

✅ Apply feedback to only a few joints.

P81

Optimization

$$ \delta a=M\delta s+\hat{a} $$

  • Optimize \(M\)
    • CMA, Covariance Matrix Adaptation ([Hansen 2006])
    • For the running task:
      • #optimization variables: full \(M\): \(12 \times 9 = 108\); reduced-order: \(12 \times 3 + 3 \times 9 = 63\)
    • 12 minutes on 24 cores
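Applying the optimized policy \(\delta a = M\delta s + \hat{a}\) at runtime is a single matrix-vector product. A minimal sketch with the running-task dimensions (the random values stand in for gains found by CMA):

```python
import numpy as np

n_s, n_a = 12, 9  # manually selected state / control dims for running

rng = np.random.default_rng(0)
M = 0.1 * rng.standard_normal((n_a, n_s))   # feedback gains (from CMA)
a_hat = 0.01 * rng.standard_normal(n_a)     # constant control offset

def feedback_policy(delta_s):
    """Linear feedback: delta_a = M @ delta_s + a_hat."""
    return M @ delta_s + a_hat

# With zero state deviation the correction reduces to the offset term.
delta_a = feedback_policy(np.zeros(n_s))
```

The corrections `delta_a` are then added to the open-loop controls from Stage 1a on the selected joints.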

This article is from CaterpillarStudyGroup. Please credit the source when reposting.

https://caterpillarstudygroup.github.io/GAMES105_mdbook/