From bd23fb7eff26454b1c4ef882a7ebca4e00b06af9 Mon Sep 17 00:00:00 2001 From: Aabir Abubaker Kar Date: Mon, 26 Feb 2018 17:35:38 -0500 Subject: [PATCH 1/2] Fixed typos and added inline LaTeX --- mdp.ipynb | 313 ++++++++++++++++++++++++++++++------------------------ 1 file changed, 174 insertions(+), 139 deletions(-) diff --git a/mdp.ipynb b/mdp.ipynb index 59d8b8e3a..7882d0f85 100644 --- a/mdp.ipynb +++ b/mdp.ipynb @@ -247,7 +247,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "metadata": { "collapsed": true }, @@ -279,7 +279,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": { "collapsed": true }, @@ -316,7 +316,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": { "collapsed": true }, @@ -525,16 +525,16 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 5, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -553,7 +553,7 @@ "\n", "Now that we have looked how to represent MDPs. Let's aim at solving them. Our ultimate goal is to obtain an optimal policy. We start with looking at Value Iteration and a visualisation that should help us understanding it better.\n", "\n", - "We start by calculating Value/Utility for each of the states. The Value of each state is the expected sum of discounted future rewards given we start in that state and follow a particular policy _pi_. The value or the utility of a state is given by\n", + "We start by calculating Value/Utility for each of the states. The Value of each state is the expected sum of discounted future rewards given we start in that state and follow a particular policy $pi$. The value or the utility of a state is given by\n", "\n", "$$U(s)=R(s)+\\gamma\\max_{a\\epsilon A(s)}\\sum_{s'} P(s'\\ |\\ s,a)U(s')$$\n", "\n", @@ -682,40 +682,40 @@ "source": [ "psource(value_iteration)" ] - }, + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "It takes as inputs two parameters, an MDP to solve and epsilon the maximum error allowed in the utility of any state. It returns a dictionary containing utilities where the keys are the states and values represent utilities.
Value Iteration starts with arbitrary initial values for the utilities, calculates the right side of the Bellman equation and plugs it into the left hand side, thereby updating the utility of each state from the utilities of its neighbors. \n", + "It takes as inputs two parameters, an MDP to solve and epsilon, the maximum error allowed in the utility of any state. It returns a dictionary containing utilities where the keys are the states and values represent utilities.
Value Iteration starts with arbitrary initial values for the utilities, calculates the right side of the Bellman equation and plugs it into the left hand side, thereby updating the utility of each state from the utilities of its neighbors. \n",
    "This is repeated until equilibrium is reached. \n",
    "It works on the principle of _Dynamic Programming_ - using precomputed information to simplify the subsequent computation. \n",
    "If $U_i(s)$ is the utility value for state $s$ at the $i$ th iteration, the iteration step, called Bellman update, looks like this:\n",
    "\n",
    "$$ U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \epsilon A(s)} \sum_{s'} P(s'\ |\ s,a)U_{i}(s') $$\n",
    "\n",
    "As you might have noticed, `value_iteration` has an infinite loop. How do we decide when to stop iterating? \n",
    "The concept of _contraction_ successfully explains the convergence of value iteration. \n",
    "Refer to **Section 17.2.3** of the book for a detailed explanation. \n",
    "In the algorithm, we calculate a value $delta$ that measures the difference in the utilities of the current time step and the previous time step. \n",
    "\n",
    "$$\delta = \max{(\delta, \begin{vmatrix}U_{i + 1}(s) - U_i(s)\end{vmatrix})}$$\n",
    "\n",
    "This value of delta decreases as the values of $U_i$ converge.\n",
    "We terminate the algorithm if the $delta$ value is less than a threshold value determined by the hyperparameter _epsilon_.\n",
    "\n",
    "$$\delta \lt \epsilon \frac{(1 - \gamma)}{\gamma}$$\n",
    "\n",
    "To summarize, the Bellman update is a _contraction_ by a factor of $\gamma$ on the space of utility vectors. \n",
    "Hence, from the properties of contractions in general, it follows that `value_iteration` always converges to a unique solution of the Bellman equations whenever $\gamma$ is less than 1.\n",
    "We then terminate the algorithm when a reasonable approximation is achieved.\n",
    "In practice, it often occurs that the policy $\pi$ becomes optimal long before the utility function converges. For the given 4 x 3 environment with $\gamma = 0.9$, the policy $\pi$ is optimal when $i = 4$ (at the 4th iteration), even though the maximum error in the utility function is still 0.46. <br>
This is illustrated by **figure 17.6** in the book. Hence, to increase computational efficiency, we often use another method to solve MDPs, called Policy Iteration, which we will see in the later part of this notebook. \n",
    "<br>
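Before running the repository's implementation on the grid below, a minimal self-contained sketch of this update and of the termination test may make the loop easier to follow. Everything in it (the function name, the dict-based `T`/`R` interface and the toy two-state problem) is invented for illustration; it is not the `value_iteration` shown by `psource(value_iteration)` above.

```python
# A hedged, minimal value-iteration sketch -- not the mdp.py implementation.
# T[s][a] is a list of (probability, next_state) pairs and R[s] is the reward;
# both, like the toy two-state problem below, are invented for illustration.

def value_iteration_sketch(states, actions, T, R, gamma=0.9, epsilon=0.001):
    U = {s: 0 for s in states}                       # arbitrary initial utilities
    while True:
        U_next, delta = {}, 0
        for s in states:
            # Bellman update: R(s) + gamma * max_a sum_s' P(s'|s,a) * U_i(s')
            U_next[s] = R[s] + gamma * max(
                sum(p * U[s1] for p, s1 in T[s][a]) for a in actions(s))
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon * (1 - gamma) / gamma:    # termination test from above
            return U

# Toy problem: from either state the agent can 'stay' or 'hop' towards the other.
states = ['a', 'b']
actions = lambda s: ['stay', 'hop']
T = {'a': {'stay': [(1.0, 'a')], 'hop': [(0.8, 'b'), (0.2, 'a')]},
     'b': {'stay': [(1.0, 'b')], 'hop': [(0.8, 'a'), (0.2, 'b')]}}
R = {'a': 0.0, 'b': 1.0}
print(value_iteration_sketch(states, actions, T, R))
```

With `epsilon = 0.001` and `gamma = 0.9`, the loop stops once `delta` drops below `0.001 * (1 - 0.9) / 0.9`, roughly `0.00011`.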
For now, let us solve the **sequential_decision_environment** GridMDP using `value_iteration`." ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -734,7 +734,7 @@ " (3, 2): 1.0}" ] }, - "execution_count": 6, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -752,7 +752,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -781,7 +781,7 @@ "" ] }, - "execution_count": 2, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -795,23 +795,23 @@ "metadata": {}, "source": [ "### AIMA3e\n", - "__function__ VALUE-ITERATION(_mdp_, _ε_) __returns__ a utility function \n", - " __inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s′_ | _s_, _a_), \n", - "      rewards _R_(_s_), discount _γ_ \n", - "   _ε_, the maximum error allowed in the utility of any state \n", - " __local variables__: _U_, _U′_, vectors of utilities for states in _S_, initially zero \n", - "        _δ_, the maximum change in the utility of any state in an iteration \n", - "\n", - " __repeat__ \n", - "   _U_ ← _U′_; _δ_ ← 0 \n", - "   __for each__ state _s_ in _S_ __do__ \n", - "     _U′_\\[_s_\\] ← _R_(_s_) + _γ_ max_a_ ∈ _A_(_s_) Σ _P_(_s′_ | _s_, _a_) _U_\\[_s′_\\] \n", - "     __if__ | _U′_\\[_s_\\] − _U_\\[_s_\\] | > _δ_ __then__ _δ_ ← | _U′_\\[_s_\\] − _U_\\[_s_\\] | \n", - " __until__ _δ_ < _ε_(1 − _γ_)/_γ_ \n", - " __return__ _U_ \n", - "\n", - "---\n", - "__Figure ??__ The value iteration algorithm for calculating utilities of states. The termination condition is from Equation (__??__)." + "__function__ VALUE-ITERATION(_mdp_, _ε_) __returns__ a utility function \n", + " __inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s′_ | _s_, _a_), \n", + "      rewards _R_(_s_), discount _γ_ \n", + "   _ε_, the maximum error allowed in the utility of any state \n", + " __local variables__: _U_, _U′_, vectors of utilities for states in _S_, initially zero \n", + "        _δ_, the maximum change in the utility of any state in an iteration \n", + "\n", + " __repeat__ \n", + "   _U_ ← _U′_; _δ_ ← 0 \n", + "   __for each__ state _s_ in _S_ __do__ \n", + "     _U′_\\[_s_\\] ← _R_(_s_) + _γ_ max_a_ ∈ _A_(_s_) Σ _P_(_s′_ | _s_, _a_) _U_\\[_s′_\\] \n", + "     __if__ | _U′_\\[_s_\\] − _U_\\[_s_\\] | > _δ_ __then__ _δ_ ← | _U′_\\[_s_\\] − _U_\\[_s_\\] | \n", + " __until__ _δ_ < _ε_(1 − _γ_)/_γ_ \n", + " __return__ _U_ \n", + "\n", + "---\n", + "__Figure ??__ The value iteration algorithm for calculating utilities of states. The termination condition is from Equation (__??__)." ] }, { @@ -1366,18 +1366,13 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 11, "metadata": {}, - "outputs": [], - "source": [ - "pseudocode('Policy-Iteration')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### AIMA3e\n", + "outputs": [ + { + "data": { + "text/markdown": [ + "### AIMA3e\n", "__function__ POLICY-ITERATION(_mdp_) __returns__ a policy \n", " __inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s′_ | _s_, _a_) \n", " __local variables__: _U_, a vector of utilities for states in _S_, initially zero \n", @@ -1395,6 +1390,42 @@ "\n", "---\n", "__Figure ??__ The policy iteration algorithm for calculating an optimal policy." 
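The pseudocode above translates almost line for line into Python. The sketch below is hedged: it reuses the hypothetical dict-based `states`/`actions`/`T`/`R` interface from the value-iteration sketch earlier and approximates POLICY-EVALUATION with a fixed number of simplified Bellman updates, so it illustrates the idea rather than reproducing the `policy_iteration` function in `mdp.py`.

```python
import random

def expected_utility(a, s, U, T):
    """Expected utility of doing action a in state s under utility estimate U."""
    return sum(p * U[s1] for p, s1 in T[s][a])

def policy_iteration_sketch(states, actions, T, R, gamma=0.9, k=20):
    U = {s: 0 for s in states}
    pi = {s: random.choice(actions(s)) for s in states}    # initially random policy
    while True:
        # Approximate POLICY-EVALUATION: k simplified Bellman updates (no max),
        # because the current policy already fixes the action in every state.
        for _ in range(k):
            U = {s: R[s] + gamma * expected_utility(pi[s], s, U, T) for s in states}
        # Policy improvement: switch to a greedy action wherever it beats pi[s].
        unchanged = True
        for s in states:
            best = max(actions(s), key=lambda a: expected_utility(a, s, U, T))
            if expected_utility(best, s, U, T) > expected_utility(pi[s], s, U, T):
                pi[s], unchanged = best, False
        if unchanged:
            return pi
```

On the toy two-state problem from the earlier sketch, `policy_iteration_sketch(states, actions, T, R)` settles on hopping from `'a'` and staying in `'b'`.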
+ ], + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pseudocode('Policy-Iteration')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### AIMA3e\n", + "__function__ POLICY-ITERATION(_mdp_) __returns__ a policy \n", + " __inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s′_ | _s_, _a_) \n", + " __local variables__: _U_, a vector of utilities for states in _S_, initially zero \n", + "        _π_, a policy vector indexed by state, initially random \n", + "\n", + " __repeat__ \n", + "   _U_ ← POLICY\\-EVALUATION(_π_, _U_, _mdp_) \n", + "   _unchanged?_ ← true \n", + "   __for each__ state _s_ __in__ _S_ __do__ \n", + "     __if__ max_a_ ∈ _A_(_s_) Σ_s′_ _P_(_s′_ | _s_, _a_) _U_\\[_s′_\\] > Σ_s′_ _P_(_s′_ | _s_, _π_\\[_s_\\]) _U_\\[_s′_\\] __then do__ \n", + "       _π_\\[_s_\\] ← argmax_a_ ∈ _A_(_s_) Σ_s′_ _P_(_s′_ | _s_, _a_) _U_\\[_s′_\\] \n", + "       _unchanged?_ ← false \n", + " __until__ _unchanged?_ \n", + " __return__ _π_ \n", + "\n", + "---\n", + "__Figure ??__ The policy iteration algorithm for calculating an optimal policy." ] }, { @@ -1410,12 +1441,16 @@ "![title](images/grid_mdp.jpg)\n", "
This is the environment for our agent.\n", "We assume for now that the environment is _fully observable_, so that the agent always knows where it is.\n", - "We also assume that the transitions are **Markovian**, that is, the probability of reaching state _s'_ from state _s_ only on _s_ and not on the history of earlier states.\n", + "We also assume that the transitions are **Markovian**, that is, the probability of reaching state $s'$ from state $s$ depends only on $s$ and not on the history of earlier states.\n", "Almost all stochastic decision problems can be reframed as a Markov Decision Process just by tweaking the definition of a _state_ for that particular problem.\n", "
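As a concrete example of such a reframing, the 4 x 3 world in the figure can be written down as a grid of per-state rewards (with `None` marking the unreachable wall square) plus a list of terminal states. The snippet below assumes the `GridMDP` constructor in this repository's `mdp` module takes exactly these arguments, mirroring how `sequential_decision_environment` (used earlier with `value_iteration`) is built; treat the signature and the `-0.04` living reward as assumptions to be cross-checked against `mdp.py`.

```python
from mdp import GridMDP   # assumed import path within the aima-python repository

# Rows are listed top to bottom; None is the blocked square, and the +1 and -1
# squares are the two terminal states of the 4 x 3 world shown above.
grid_world = GridMDP([[-0.04, -0.04, -0.04, +1],
                      [-0.04,  None, -0.04, -1],
                      [-0.04, -0.04, -0.04, -0.04]],
                     terminals=[(3, 2), (3, 1)])
```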
\n", - "However, the actions of our agent in this environment are unreliable.\n", - "In other words, the motion of our agent is stochastic. \n", - "More specifically, the agent does the intended action with a probability of _0.8_, but with probability _0.1_, it moves to the right and with probability _0.1_ it moves to the left of the intended direction.\n", + "However, the actions of our agent in this environment are unreliable. In other words, the motion of our agent is stochastic. \n", + "

\n",
    "More specifically, the agent may: \n",
    "* move correctly in the intended direction with a probability of _0.8_, \n",
    "* move $90^\circ$ to the right of the intended direction with a probability of _0.1_, \n",
    "* move $90^\circ$ to the left of the intended direction with a probability of _0.1_.\n",
    "<br>

\n", "The agent stays put if it bumps into a wall.\n", "![title](images/grid_mdp_agent.jpg)" ] @@ -1429,7 +1464,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -1552,7 +1587,7 @@ "This is the function that gives the agent a rough estimate of how good being in a particular state is, or how much _reward_ an agent receives by being in that state.\n", "The agent then tries to maximize the reward it gets.\n", "As the decision problem is sequential, the utility function will depend on a sequence of states rather than on a single state.\n", - "For now, we simply stipulate that in each state s, the agent receives a finite reward _R(s)_.\n", + "For now, we simply stipulate that in each state $s$, the agent receives a finite reward $R(s)$.\n", "\n", "For any given state, the actions the agent can take are encoded as given below:\n", "- Move Up: (0, 1)\n", @@ -1565,9 +1600,9 @@ "We cannot have fixed action sequences as the environment is stochastic and we can eventually end up in an undesirable state.\n", "Therefore, a solution must specify what the agent shoulddo for _any_ state the agent might reach.\n", "
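Before moving on to policies, the motion model and the action encoding just described can be made concrete in a few lines. This is a self-contained sketch: the helper names are invented here and the real model lives inside the `GridMDP` class, but it reproduces the 0.8 / 0.1 / 0.1 behaviour from the bullet list above.

```python
# Action vectors exactly as encoded above.
UP, DOWN, LEFT, RIGHT = (0, 1), (0, -1), (-1, 0), (1, 0)

def turn_right(a):
    """Rotate an action vector 90 degrees clockwise, e.g. UP -> RIGHT."""
    x, y = a
    return (y, -x)

def turn_left(a):
    """Rotate an action vector 90 degrees counter-clockwise, e.g. UP -> LEFT."""
    x, y = a
    return (-y, x)

def motion_model(action):
    """Distribution over the directions the agent actually moves in."""
    if action is None:                  # 'Do nothing', e.g. in a terminal state
        return [(1.0, None)]
    return [(0.8, action),              # intended direction
            (0.1, turn_right(action)),  # slips 90 degrees to the right
            (0.1, turn_left(action))]   # slips 90 degrees to the left

print(motion_model(UP))   # [(0.8, (0, 1)), (0.1, (1, 0)), (0.1, (-1, 0))]
```

Bumping into a wall is not handled here; as described above, the resulting state simply stays the current square when the chosen direction is blocked.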
\n", - "Such a solution is known as a **policy** and is usually denoted by **π**.\n", + "Such a solution is known as a **policy** and is usually denoted by $\\pi$.\n", "
\n",
    "The **optimal policy** is the policy that yields the highest expected utility and is usually denoted by $\pi^*$.\n",
    "<br>
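In code, a policy is then nothing more than a mapping from states to actions (with `None` at the terminals), and a greedy policy can be read off from a converged utility table by a one-step lookahead. The sketch below reuses the hypothetical dict-based interface from the value-iteration sketch earlier, so it is an illustration rather than the helper the `mdp` module itself provides for this.

```python
def greedy_policy(states, actions, T, U):
    """For every state, pick the action with the highest expected utility under U."""
    return {s: max(actions(s),
                   key=lambda a: sum(p * U[s1] for p, s1 in T[s][a]))
            for s in states}

# With the toy two-state MDP from the value-iteration sketch above:
#   U  = value_iteration_sketch(states, actions, T, R)
#   pi = greedy_policy(states, actions, T, U)    # -> {'a': 'hop', 'b': 'stay'}
```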
\n", "The `GridMDP` class has a useful method `to_arrows` that outputs a grid showing the direction the agent should move, given a policy.\n", "We will use this later to better understand the properties of the environment." @@ -1575,7 +1610,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -1697,7 +1732,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -1828,7 +1863,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 15, "metadata": { "collapsed": true }, @@ -1853,7 +1888,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 16, "metadata": { "collapsed": true }, @@ -1871,7 +1906,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 17, "metadata": {}, "outputs": [ { @@ -1898,7 +1933,7 @@ "![title](images/-0.04.jpg)\n", "
\n", "Notice that, because the cost of taking a step is fairly small compared with the penalty for ending up in `(4, 2)` by accident, the optimal policy is conservative. \n", - "In state `(3, 1)` it recommends taking the long way round, rather than taking the shorter way and risking getting a large negative reward of -1 in `(4, 2)`" + "In state `(3, 1)` it recommends taking the long way round, rather than taking the shorter way and risking getting a large negative reward of -1 in `(4, 2)`." ] }, { @@ -1912,7 +1947,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 18, "metadata": { "collapsed": true }, @@ -1926,7 +1961,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -1972,7 +2007,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 20, "metadata": { "collapsed": true }, @@ -1986,7 +2021,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 21, "metadata": {}, "outputs": [ { @@ -2017,7 +2052,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The living reward for each state is now more negative than the most negative terminal. Life is so painful that the agent heads for the nearest exit as even the worst exit is less painful than the current state." + "The living reward for each state is now lower than the least rewarding terminal. Life is so _painful_ that the agent heads for the nearest exit as even the worst exit is less painful than any living state." ] }, { @@ -2031,7 +2066,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 22, "metadata": { "collapsed": true }, @@ -2045,7 +2080,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 23, "metadata": {}, "outputs": [ { @@ -2141,7 +2176,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.5.3" + "version": "3.6.1" }, "widgets": { "state": { @@ -2166,7 +2201,7 @@ "022a5fdfc8e44fb09b21c4bd5b67a0db": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2197,7 +2232,7 @@ "0675230fb92f4539bc257b768fb4cd10": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2213,7 +2248,7 @@ "0783e74a8c2b40cc9b0f5706271192f4": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2241,7 +2276,7 @@ "098f12158d844cdf89b29a4cd568fda0": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2266,7 +2301,7 @@ "0b65fb781274495ab498ad518bc274d4": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2375,7 +2410,7 @@ "1af711fe8e4f43f084cef6c89eec40ae": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2391,7 +2426,7 @@ "1c5c913acbde4e87a163abb2e24e6e38": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2416,7 +2451,7 @@ "200e3ebead3d4858a47e2f6d345ca395": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2534,7 +2569,7 @@ "2d3acd8872c342eab3484302cac2cb05": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2544,7 +2579,7 @@ "2e1351ad05384d058c90e594bc6143c1": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2557,7 +2592,7 @@ "2f5438f1b34046a597a467effd43df11": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2594,7 +2629,7 @@ "319425ba805346f5ba366c42e220f9c6": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2613,7 +2648,7 @@ "332a89c03bfb49c2bb291051d172b735": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2662,7 +2697,7 @@ 
"388571e8e0314dfab8e935b7578ba7f9": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2684,7 +2719,7 @@ "3a21291c8e7249e3b04417d31b0447cf": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2697,7 +2732,7 @@ "3b22d68709b046e09fe70f381a3944cd": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2707,7 +2742,7 @@ "3c1b2ec10a9041be8a3fad9da78ff9f6": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2732,7 +2767,7 @@ "3e5b9fd779574270bf58101002c152ce": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2742,7 +2777,7 @@ "3e8bb05434cb4a0291383144e4523840": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2791,7 +2826,7 @@ "428e42f04a1e4347a1f548379c68f91b": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2807,7 +2842,7 @@ "4379175239b34553bf45c8ef9443ac55": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2820,7 +2855,7 @@ "4421c121414d464bb3bf1b5f0e86c37b": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2851,7 +2886,7 @@ "4731208453424514b471f862804d9bb8": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2900,7 +2935,7 @@ "4d281cda33fa489d86228370e627a5b0": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2919,7 +2954,7 @@ "4ec035cba73647358d416615cf4096ee": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2944,7 +2979,7 @@ "5141ae07149b46909426208a30e2861e": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -2981,7 +3016,7 @@ "55a1b0b794f44ac796bc75616f65a2a1": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3042,7 +3077,7 @@ "595c537ed2514006ac823b4090cf3b4b": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3103,7 +3138,7 @@ "5f823979d2ce4c34ba18b4ca674724e4": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3143,14 +3178,14 @@ "644dcff39d7c47b7b8b729d01f59bee5": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, "6455faf9dbc6477f8692528e6eb90c9a": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3163,7 +3198,7 @@ "665ed2b201144d78a5a1f57894c2267c": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3206,7 +3241,7 @@ "6a28f605a5d14589907dba7440ede2fc": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3231,7 +3266,7 @@ "6d7effd6bc4c40a4b17bf9e136c5814c": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3280,7 +3315,7 @@ "72dfe79a3e52429da1cf4382e78b2144": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3311,7 +3346,7 @@ "75e344508b0b45d1a9ae440549d95b1a": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3369,7 +3404,7 @@ "7f2f98bbffc0412dbb31c387407a9fed": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3400,7 +3435,7 @@ "82e2820c147a4dff85a01bcddbad8645": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3503,21 +3538,21 @@ "8cffde5bdb3d4f7597131b048a013929": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, "8db2abcad8bc44df812d6ccf2d2d713c": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, "8dd5216b361c44359ba1233ee93683a4": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3563,7 +3598,7 @@ "933904217b6045c1b654b7e5749203f5": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3591,7 +3626,7 @@ "94f2b877a79142839622a61a3a081c03": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3613,7 +3648,7 @@ 
"97207358fc65430aa196a7ed78b252f0": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3626,7 +3661,7 @@ "986c6c4e92964759903d6eb7f153df8a": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3669,14 +3704,14 @@ "9d5e9658af264ad795f6a5f3d8c3c30f": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, "9d7aa65511b6482d9587609ad7898f54": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3695,7 +3730,7 @@ "9efb46d2bb0648f6b109189986f4f102": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3711,7 +3746,7 @@ "9f43f85a0fb9464e9b7a25a85f6dba9c": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3724,7 +3759,7 @@ "9faa50b44e1842e0acac301f93a129c4": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3749,7 +3784,7 @@ "a1840ca22d834df2b145151baf6d8241": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3786,7 +3821,7 @@ "a39cfb47679c4d2895cda12c6d9d2975": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3817,7 +3852,7 @@ "a87c651448f14ce4958d73c2f1e413e1": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3926,7 +3961,7 @@ "b7e4c497ff5c4173961ffdc3bd3821a9": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3951,7 +3986,7 @@ "b9c138598fce460692cc12650375ee52": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -3970,7 +4005,7 @@ "bbe5dea9d57d466ba4e964fce9af13cf": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4004,7 +4039,7 @@ "beb0c9b29d8d4d69b3147af666fa298b": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4071,7 +4106,7 @@ "c74bbd55a8644defa3fcef473002a626": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4138,7 +4173,7 @@ "ce3a0e82e80d48b9b2658e0c52196644": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4148,7 +4183,7 @@ "ce8d3cd3535b459c823da2f49f3cc526": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4218,7 +4253,7 @@ "d83329fe36014f85bb5d0247d3ae4472": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4252,7 +4287,7 @@ "dc7376a2272e44179f237e5a1c7f6a49": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4349,7 +4384,7 @@ "e4e5dd3dc28d4aa3ab8f8f7c4a475115": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4365,7 +4400,7 @@ "e64ab85e80184b70b69d01a9c6851943": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4462,7 +4497,7 @@ "f262055f3f1b48029f9e2089f752b0b8": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4493,7 +4528,7 @@ "f3df35ce53e0466e81a48234b36a1430": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4572,7 +4607,7 @@ "f9458080ed534d25856c67ce8f93d5a1": { "views": [ { - "cell_index": 27.0 + "cell_index": 27 } ] }, @@ -4633,4 +4668,4 @@ }, "nbformat": 4, "nbformat_minor": 1 -} \ No newline at end of file +} From d8af22bd3f5f70a9d2f077aa1c1a7266d933a847 Mon Sep 17 00:00:00 2001 From: Aabir Abubaker Kar Date: Mon, 26 Feb 2018 20:53:36 -0500 Subject: [PATCH 2/2] Fixed more backslashes --- mdp.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mdp.ipynb b/mdp.ipynb index 7882d0f85..910b49040 100644 --- a/mdp.ipynb +++ b/mdp.ipynb @@ -697,12 +697,12 @@ "As you might have noticed, `value_iteration` has an infinite loop. How do we decide when to stop iterating? \n", "The concept of _contraction_ successfully explains the convergence of value iteration. 
\n",
    "Refer to **Section 17.2.3** of the book for a detailed explanation. \n",
    "In the algorithm, we calculate a value $\delta$ that measures the difference in the utilities of the current time step and the previous time step. \n",
    "\n",
    "$$\delta = \max{(\delta, \begin{vmatrix}U_{i + 1}(s) - U_i(s)\end{vmatrix})}$$\n",
    "\n",
    "This value of delta decreases as the values of $U_i$ converge.\n",
    "We terminate the algorithm if the $\delta$ value is less than a threshold value determined by the hyperparameter _epsilon_.\n",
    "\n",
    "$$\delta \lt \epsilon \frac{(1 - \gamma)}{\gamma}$$\n",
    "\n",