Basic Concepts

Published: 2023-06-10 11:01:46, author: POLAYOR

State

$$
s_i \quad , \quad S = \{s_i\}
$$

  • Denotes a state and the state space (the set of all states)

Action

$$
a_i \quad , \quad A = \{a_i\}
$$

  • Denotes an action and the action space (the set of all actions)
  • Can be written as a tabular representation

Policy

$$
\pi \quad , \quad \pi (a_i | s_j) = c_{k}
$$

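  • Gives, as a conditional probability, how likely each action is to be taken in a given state
  • The probabilities over all actions for one state sum to 1
  • Can be written as a tabular representation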

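Deterministic policy (the deterministic case)

For a state s_j, one action a_i has probability 1 and every other action has probability 0 in that state.

Stochastic policy (the non-deterministic case)

No action has probability 1 in the state, so the action is drawn at random according to π.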
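A minimal sketch of the tabular representation, assuming a hypothetical problem with two states and three actions. The names `policy`, `is_valid`, and `is_deterministic` are illustrative and do not come from the original notes. Each row holds π(a | s) for one state, so each row should sum to 1.

```python
import math

# Hypothetical tabular policy: policy[s][a] = pi(a | s).
policy = {
    "s1": {"a1": 1.0, "a2": 0.0, "a3": 0.0},  # deterministic in s1
    "s2": {"a1": 0.5, "a2": 0.5, "a3": 0.0},  # stochastic in s2
}

def is_valid(policy):
    """Every probability lies in [0, 1] and each state's row sums to 1."""
    return all(
        all(0.0 <= p <= 1.0 for p in row.values())
        and math.isclose(sum(row.values()), 1.0)
        for row in policy.values()
    )

def is_deterministic(row):
    """True if one action has probability 1 (the others are then 0)."""
    return any(math.isclose(p, 1.0) for p in row.values())
```

With this table, `is_deterministic(policy["s1"])` holds while `is_deterministic(policy["s2"])` does not, which mirrors the two cases described above.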

Reward

  • positive reward -> encouragement
  • negative reward -> punishment

$$
p(r=-1|s_1, a_1) = 1 \quad \text{and} \quad p(r \neq -1 | s_1,a_1) = 0
$$

Discount rate

$$
\gamma \in [0,1)
$$

Discounted return

$$
\begin{aligned}
\text{discounted return} &= r_1 + \gamma r_2 + \gamma ^2 r_3 + \gamma ^3 r_4 + \gamma ^4 r_5 + \gamma ^5 r_6 + \dots \\
\text{In the case: }& r_1 =0 , r_2=0 , r_3=0 , r_4=1 , r_5=1 , r_6=1 , \dots \\
\text{discounted return} &= \gamma ^3 (1+ \gamma + \gamma ^2 + \dots) \\
&=\gamma ^3 \frac{1}{1-\gamma}.
\end{aligned}
$$
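The closed form can be checked numerically. The sketch below assumes γ = 0.9 (an arbitrary choice) and truncates the infinite tail of 1s at 1000 steps; `discounted_return` is a hypothetical helper, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite list of rewards."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

gamma = 0.9  # assumed value for illustration
# Rewards 0, 0, 0 followed by a long run of 1s; the run is cut off at
# 1000 steps as a stand-in for the infinite sequence.
rewards = [0, 0, 0] + [1] * 1000
approx = discounted_return(rewards, gamma)
closed_form = gamma ** 3 / (1 - gamma)
```

Because the discarded terms are weighted by powers of γ above 1000, `approx` and `closed_form` agree to within floating-point noise.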

Roles:

  1. the sum becomes finite;

  2. it balances the far and near future rewards:

    • If γ is close to 0, the value of the discounted return is dominated by the rewards obtained in the near future.

    • If γ is close to 1, the value of the discounted return is dominated by the rewards obtained in the far future.
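To make the second role concrete, the sketch below measures what fraction of the discounted return comes from the first few steps when every reward equals 1. The helper `near_share`, the cutoff of 3 steps, and the 1000-step horizon are all assumptions chosen for illustration.

```python
def near_share(gamma, near_steps=3, horizon=1000):
    """Fraction of the discounted return contributed by the first
    near_steps rewards, assuming every reward equals 1."""
    weights = [gamma ** k for k in range(horizon)]
    return sum(weights[:near_steps]) / sum(weights)

share_small = near_share(0.1)   # gamma close to 0
share_large = near_share(0.99)  # gamma close to 1
```

With γ = 0.1 the first three rewards account for almost the entire return, while with γ = 0.99 they contribute only a few percent, so later rewards carry most of the weight.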

Markov decision process (MDP)

Markov property: the memoryless property (the next state and reward do not depend on the earlier history)
$$
\begin{aligned}
p(s_{t+1}|a_{t+1},s_t, \dots ,a_1,s_0) &= p(s_{t+1}|a_{t+1},s_t), \\
p(r_{t+1}|a_{t+1},s_t, \dots ,a_1,s_0) &= p(r_{t+1}|a_{t+1},s_t).
\end{aligned}
$$

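  • A Markov process is a sequence of states whose transitions are governed by probabilities and satisfy the Markov property
  • A Markov decision process adds decisions (actions chosen by a policy) and rewards; once the policy is fixed, the Markov decision process reduces to a Markov process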
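Once a policy is fixed, the state sequence of an MDP evolves as a Markov process whose state-to-state probabilities are the sum over actions of π(a | s) · p(s' | s, a). A minimal sketch, assuming a hypothetical two-state MDP with actions `stay` and `go`; every number and name below is made up for illustration.

```python
# Hypothetical two-state MDP: transition[s][a][s2] = p(s2 | s, a).
transition = {
    "s1": {"stay": {"s1": 1.0, "s2": 0.0}, "go": {"s1": 0.2, "s2": 0.8}},
    "s2": {"stay": {"s1": 0.0, "s2": 1.0}, "go": {"s1": 0.7, "s2": 0.3}},
}
# Hypothetical policy: policy[s][a] = pi(a | s).
policy = {
    "s1": {"stay": 0.5, "go": 0.5},
    "s2": {"stay": 0.9, "go": 0.1},
}

def induced_chain(transition, policy):
    """State-to-state probabilities: sum over a of pi(a|s) * p(s2|s,a)."""
    chain = {}
    for s, actions in transition.items():
        chain[s] = {}
        for a, next_probs in actions.items():
            for s2, p in next_probs.items():
                chain[s][s2] = chain[s].get(s2, 0.0) + policy[s][a] * p
    return chain

chain = induced_chain(transition, policy)
```

Each row of `chain` sums to 1, and the actions no longer appear in it: the policy has been averaged out, leaving only state transitions.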