In REINFORCE we first select actions that take us from the beginning of a cart-pole game until (typically) the pole tips over or the cart goes out of bounds. We do this without updating parameters. We save those actions and, in effect, replay the entire scenario, this time computing the loss and updating the parameters. Note that if we had also saved the actions' softmax probabilities, we could compute the loss without repeating all the computation that feeds into the loss function. Explain why this nevertheless does not work, i.e., why REINFORCE would learn nothing if we skipped the duplicated computation.
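The two-pass structure described above can be sketched as follows. This is a minimal illustrative sketch in plain Python, not the cart-pole setup itself: the toy environment, the single-weight-per-action softmax policy, and the names `run_episode` and `policy_gradient_update` are all assumptions made for illustration. The point to notice is that the second pass recomputes the action probabilities from the saved states, which is the duplicated computation the exercise asks about.

```python
import math
import random

random.seed(0)

# Hypothetical policy: two actions, one weight per action, logits = theta[a] * state.
theta = [0.0, 0.0]

def softmax_probs(state):
    """Action probabilities under the current parameters theta."""
    logits = [theta[a] * state for a in range(2)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def env_step(state, action):
    """Trivial stand-in environment: reward 1 per step, episode ends at random."""
    return state, 1.0, random.random() < 0.2

def run_episode(max_steps=20):
    """First pass: play the episode, saving only (state, action, reward).
    No parameters are updated here."""
    state, traj = 1.0, []
    for _ in range(max_steps):
        probs = softmax_probs(state)
        a = 0 if random.random() < probs[0] else 1
        next_state, reward, done = env_step(state, a)
        traj.append((state, a, reward))
        state = next_state
        if done:
            break
    return traj

def policy_gradient_update(traj, lr=0.01):
    """Second pass: recompute the probabilities from the saved states
    (the duplicated computation) and take a policy-gradient step."""
    G = sum(r for (_, _, r) in traj)  # undiscounted return, for simplicity
    for (s, a, _) in traj:
        probs = softmax_probs(s)
        for i in range(2):
            # d/d theta_i of log pi(a|s) for this linear-softmax policy
            grad_log = ((1.0 if i == a else 0.0) - probs[i]) * s
            theta[i] += lr * G * grad_log

traj = run_episode()
policy_gradient_update(traj)
```

In an autodiff framework the second pass would rebuild the computation graph from states to probabilities to loss; thinking about what that graph provides, compared with a saved table of probability values, points toward the answer to the exercise.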