The Wayback Machine - https://web.archive.org/web/20201129210409/https://github.com/tensorlayer/tensorlayer/issues/1072

Bug of PPO #1072

Open
zkx741481546 opened this issue Mar 13, 2020 · 3 comments

Comments


@zkx741481546 zkx741481546 commented Mar 13, 2020

```python
ratio = tf.exp(pi.log_prob(action) - old_pi.log_prob(action))
surr = ratio * adv
...
loss = -tf.reduce_mean(tf.minimum(surr, tf.clip_by_value(ratio, 1. - self.epsilon, 1. + self.epsilon) * adv))
```

`tf.minimum` should be applied to the ratio rather than to `surr`. Because `surr = ratio * adv` and `adv` can contain negative values, `tf.minimum` may return a very large negative value (something like -1e10), which causes the actor's loss to blow up.
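A small numeric sketch (plain Python, with hypothetical values for `epsilon`, `ratio`, and `adv`) of the difference between the two forms discussed above, for a large ratio and a negative advantage:

```python
epsilon = 0.2    # hypothetical clip range
ratio = 5.0      # hypothetical pi/old_pi, far outside the clip range
adv = -2.0       # hypothetical negative advantage

# Repo's form: min(ratio * adv, clip(ratio) * adv)
surr = ratio * adv                                   # -10.0
clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)  # clip ratio to [0.8, 1.2] -> 1.2
repo_term = min(surr, clipped * adv)                 # min(-10.0, -2.4) = -10.0

# Form proposed in this issue: min(ratio, clip(ratio)) * adv
proposed_term = min(ratio, clipped) * adv            # 1.2 * -2.0 = -2.4

print(repo_term, proposed_term)  # -10.0 -2.4
```

With a negative advantage, the repo's form keeps the large-magnitude unclipped term, while the proposed form bounds the per-sample contribution by the clipped ratio.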

@zkx741481546 zkx741481546 commented Mar 13, 2020

It should be like this:

```python
self.cliped_ratio = tf.clip_by_value(self.ratio, 1. - METHOD['epsilon'],
                                     1. + METHOD['epsilon'])
self.min_temp = tf.minimum(self.ratio, self.cliped_ratio)
self.aloss = -tf.reduce_mean(self.min_temp * self.tfadv)
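A self-contained sketch of the proposed computation, using NumPy in place of the TensorFlow ops and made-up batch values (the ratios, advantages, and the value of `METHOD['epsilon']` are all assumptions for illustration):

```python
import numpy as np

epsilon = 0.2                        # assumed value of METHOD['epsilon']
ratio = np.array([0.5, 1.0, 3.0])    # hypothetical pi/old_pi per sample
adv = np.array([1.0, -1.0, -2.0])    # hypothetical advantages

# Clip the ratio first, take the elementwise minimum, then multiply by adv
cliped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
min_temp = np.minimum(ratio, cliped_ratio)
aloss = -np.mean(min_temp * adv)
print(aloss)
```

Here the third sample's contribution is bounded by the clipped ratio (1.2) rather than the raw ratio (3.0), which is the behavior the fix is after.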


@quantumiracle quantumiracle commented Mar 13, 2020

Why does the negative value cause a failure in the actor loss?
You can also refer to OpenAI baselines here, which follows a similar procedure to our repo.


@zkx741481546 zkx741481546 commented Mar 13, 2020

> Why does the negative value cause a failure in the actor loss?
> You can also refer to OpenAI baselines here, which follows a similar procedure to our repo.

I plotted the loss and the reward. When there is a very small negative value, such as 1e-10, the loss becomes extremely large compared to normal, and the reward stops increasing.
I then tried a lower learning rate, and no such 1e-10 value came out.
I wonder whether the same happens with my code above, since it is more robust.
