The Wayback Machine - https://web.archive.org/web/20201129210409/https://github.com/tensorlayer/tensorlayer/issues/1072

Bug of PPO #1072

Open
zkx741481546 opened this issue Mar 13, 2020 · 3 comments

Comments


@zkx741481546 zkx741481546 commented Mar 13, 2020

```python
ratio = tf.exp(pi.log_prob(action) - old_pi.log_prob(action))
surr = ratio * adv
...
loss = -tf.reduce_mean(tf.minimum(surr, tf.clip_by_value(ratio, 1. - self.epsilon, 1. + self.epsilon) * adv))
```

`tf.minimum` should be applied to the ratio rather than to `surr`. Because `surr = ratio * adv` and `adv` can contain negative values, `tf.minimum` may return a very large negative value (something like -1e10), which causes the actor's loss to blow up.
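A small numeric sketch (plain Python, with hypothetical values for `epsilon`, `ratio`, and `adv`) of the difference between the two forms discussed above, for a large ratio and a negative advantage:

```python
epsilon = 0.2    # hypothetical clip range
ratio = 5.0      # hypothetical pi/old_pi, far outside the clip range
adv = -2.0       # hypothetical negative advantage

# Repo's form: min(ratio * adv, clip(ratio) * adv)
surr = ratio * adv                                   # -10.0
clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)  # clip ratio to [0.8, 1.2] -> 1.2
repo_term = min(surr, clipped * adv)                 # min(-10.0, -2.4) = -10.0

# Form proposed in this issue: min(ratio, clip(ratio)) * adv
proposed_term = min(ratio, clipped) * adv            # 1.2 * -2.0 = -2.4

print(repo_term, proposed_term)  # -10.0 -2.4
```

With a negative advantage, the repo's form keeps the large-magnitude unclipped term, while the proposed form bounds the per-sample contribution by the clipped ratio.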

@zkx741481546 zkx741481546 commented Mar 13, 2020

It should be like this:

```python
self.cliped_ratio = tf.clip_by_value(self.ratio, 1. - METHOD['epsilon'],
                                     1. + METHOD['epsilon'])
self.min_temp = tf.minimum(self.ratio, self.cliped_ratio)
self.aloss = -tf.reduce_mean(self.min_temp * self.tfadv)
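A self-contained sketch of the proposed computation, using NumPy in place of the TensorFlow ops and made-up batch values (the ratios, advantages, and the value of `METHOD['epsilon']` are all assumptions for illustration):

```python
import numpy as np

epsilon = 0.2                        # assumed value of METHOD['epsilon']
ratio = np.array([0.5, 1.0, 3.0])    # hypothetical pi/old_pi per sample
adv = np.array([1.0, -1.0, -2.0])    # hypothetical advantages

# Clip the ratio first, take the elementwise minimum, then multiply by adv
cliped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
min_temp = np.minimum(ratio, cliped_ratio)
aloss = -np.mean(min_temp * adv)
print(aloss)
```

Here the third sample's contribution is bounded by the clipped ratio (1.2) rather than the raw ratio (3.0), which is the behavior the fix is after.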


@quantumiracle quantumiracle commented Mar 13, 2020

Why does the negative value cause a failure in the actor loss?
You can also refer to OpenAI baselines here, which follows a similar procedure to our repo.


@zkx741481546 zkx741481546 commented Mar 13, 2020

> Why does the negative value cause a failure in the actor loss?
> You can also refer to OpenAI baselines here, which follows a similar procedure to our repo.

I plotted the loss and the reward. When there is a very small negative value, such as 1e-10, the loss becomes extremely large compared to normal, and the reward stops increasing.
I then tried a lower learning rate, and no such 1e-10 value came out.
I wonder whether the same happens with my code above, since it is more robust.
