In REINFORCE, we first select actions that take us from the beginning of a cart-pole game until (typically) the pole tips over or the cart goes out of bounds. We do this without updating parameters. We save those actions and, in effect, go through the entire scenario all over again, this time computing the loss and updating parameters. Note that if we had saved the actions and their softmax probabilities, we could compute the loss without repeating the computation that feeds into the loss function. Explain why this nevertheless does not work, that is, why REINFORCE would not learn anything if we skipped the duplicated computation.
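One way to see the issue is to compare the two losses as functions of the parameters. The sketch below is only an illustration under stated assumptions: a toy linear softmax policy, made-up states, actions, and returns, and a finite-difference gradient standing in for the automatic differentiation a real framework would use. All names (`policy_probs`, `loss_recomputed`, `loss_from_saved`, `finite_diff_grad`) are hypothetical. The point it tries to show is that saved probabilities are plain numbers with no remaining connection to the parameters, so a loss built from them has the right value but a zero gradient.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(weights, state):
    # Hypothetical toy policy: one weight row per action, logits = W @ state.
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)

def loss_recomputed(weights, states, actions, returns):
    # Second pass: probabilities are recomputed from the current weights,
    # so this loss really is a function of the weights.
    return -sum(g * math.log(policy_probs(weights, s)[a])
                for s, a, g in zip(states, actions, returns))

def loss_from_saved(saved_probs, returns):
    # Shortcut: the saved probabilities are constants; the weights never appear.
    return -sum(g * math.log(p) for p, g in zip(saved_probs, returns))

def finite_diff_grad(f, weights, eps=1e-6):
    # Forward-difference estimate of d f / d weights (stand-in for autodiff).
    base = f(weights)
    grad = []
    for i, row in enumerate(weights):
        grad_row = []
        for j in range(len(row)):
            bumped = [r[:] for r in weights]
            bumped[i][j] += eps
            grad_row.append((f(bumped) - base) / eps)
        grad.append(grad_row)
    return grad

# Made-up episode data: two steps of a four-dimensional cart-pole-like state.
weights = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.4]]
states = [[0.0, 0.1, 0.02, -0.1], [0.01, 0.3, 0.01, -0.4]]
actions = [1, 0]
returns = [2.0, 1.0]

# First pass: act and save the probability of each chosen action.
saved_probs = [policy_probs(weights, s)[a] for s, a in zip(states, actions)]

grad_recomputed = finite_diff_grad(
    lambda w: loss_recomputed(w, states, actions, returns), weights)
grad_saved = finite_diff_grad(
    lambda w: loss_from_saved(saved_probs, returns), weights)

print("loss (recomputed):", loss_recomputed(weights, states, actions, returns))
print("loss (saved probs):", loss_from_saved(saved_probs, returns))
print("gradient via recomputation:", grad_recomputed)
print("gradient via saved probs:  ", grad_saved)
```

Both losses agree numerically, but only the recomputed one changes when a weight is perturbed. In an automatic-differentiation framework the same thing happens for a structural reason: the saved probabilities were pulled out as values, so no computation graph links them back to the parameters, and the optimizer has nothing to backpropagate through.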




May 18, 2022