Consider an undiscounted MDP having three states (1, 2, 3), with rewards -1, -2, 0, respectively. State 3 is a terminal state. In states 1 and 2 there are two possible actions: a and b. The transition model is as follows:

- In state 1, action a moves the agent to state 2 with probability 0.6 and makes the agent stay put with probability 0.4.
- In state 2, action a moves the agent to state 1 with probability 0.6 and makes the agent stay put with probability 0.4.
- In either state 1 or state 2, action b moves the agent to state 3 with probability 0.2 and makes the agent stay put with probability 0.8.

Answer the following questions:

1. What can be determined qualitatively about the optimal policy in states 1 and 2?
2. Apply policy iteration, showing each step in full, to determine the optimal policy and the values of states 1 and 2. Assume that the initial policy has action b in both states.
3. What happens to policy iteration if the initial policy has action a in both states? Does discounting help? Does the optimal policy depend on the discount factor?
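For reference, the policy-iteration procedure asked about in question 2 can be checked mechanically. Below is a minimal Python sketch, assuming the MDP exactly as stated above; the encodings (`T`, `R`, the function names) are my own, and policy evaluation solves the 2x2 linear system V(s) = R(s) + Σ P(s'|s,π(s)) V(s') exactly via Cramer's rule, with V(3) = 0 at the terminal state:

```python
# Policy iteration for the 3-state undiscounted MDP described above.
# States 1 and 2 are non-terminal (rewards -1 and -2); state 3 is terminal.
R = {1: -1.0, 2: -2.0}

# Transition model: T[(state, action)] = {next_state: probability}
T = {
    (1, 'a'): {2: 0.6, 1: 0.4},
    (2, 'a'): {1: 0.6, 2: 0.4},
    (1, 'b'): {3: 0.2, 1: 0.8},
    (2, 'b'): {3: 0.2, 2: 0.8},
}

def evaluate(policy):
    """Exact policy evaluation: solve (I - P_pi) V = R for states 1 and 2
    with V(3) = 0, using Cramer's rule on the resulting 2x2 system."""
    a11 = 1.0 - T[(1, policy[1])].get(1, 0.0)
    a12 = -T[(1, policy[1])].get(2, 0.0)
    a21 = -T[(2, policy[2])].get(1, 0.0)
    a22 = 1.0 - T[(2, policy[2])].get(2, 0.0)
    det = a11 * a22 - a12 * a21
    v1 = (R[1] * a22 - a12 * R[2]) / det
    v2 = (a11 * R[2] - a21 * R[1]) / det
    return {1: v1, 2: v2, 3: 0.0}

def policy_iteration(policy):
    """Alternate exact evaluation with greedy improvement until stable."""
    while True:
        V = evaluate(policy)
        new_policy = {}
        for s in (1, 2):
            # Greedy improvement: pick the action with the best Q-value.
            q = {act: R[s] + sum(p * V[sp] for sp, p in T[(s, act)].items())
                 for act in ('a', 'b')}
            new_policy[s] = max(q, key=q.get)
        if new_policy == policy:
            return policy, V
        policy = new_policy

pi, V = policy_iteration({1: 'b', 2: 'b'})   # initial policy: b in both states
# Expected result: b in state 1, a in state 2, with V(1) = -5, V(2) = -25/3.
```

Starting from b in both states keeps every intermediate policy proper (the terminal state stays reachable), so the undiscounted evaluation step is always well defined. The divergence question in part 3 shows up here as a singular system (det = 0) when the policy takes a in both states with no discounting.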

Jun 03, 2022