an RL policy optimization method in which the critic model is the same size as the policy model
samples a group of outputs from the given policy and optimizes the policy model.
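The text does not give the objective that follows, so the snippet below is only a minimal sketch of one common way to use a sampled group: score each output and normalize its reward against the group's own mean and standard deviation, as group-relative methods such as GRPO do. The function name `group_relative_advantages`, the `eps` parameter, and the reward values are hypothetical, chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    Assumption: one scalar reward per sampled output, all from the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group average get a positive advantage and those below get a negative one, so the group itself serves as the baseline and no separate critic model is needed for this step.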
and RL.