1 Exact Dynamic Programming
1.1 DeterministicDynamicProgramming 2
1.1.1 DeterministicProblems 2
1.1.2 TheDynamicProgrammingAlgorithm 7
1.1.3 Approximation inValue Space 12
1.2 StochasticDynamicProgramming 14
1.3 Examples,Variations, and Simplifications 18
1.3.1 Deterministic ShortestPathProblems 19
1.3.2 DiscreteDeterministicOptimization 21
1.3.3 Problemswith aTermination State 25
1.3.4 Forecasts 26
1.3.5 Problems with Uncontrollable State Components 29
1.3.6 PartialState Information andBelief States 34
1.3.7 LinearQuadraticOptimalControl 38
1.3.8 SystemswithUnknownParameters -Adaptive
Control 40
1.4 ReinforcementLearning andOptimalControl - Some
Terminology 43
1.5 Notes and Sources 45
2 Approximation in Value Space
2.1 ApproximationApproaches inReinforcementLearning 50
2.1.1 General Issues ofApproximation inValue Space 54
2.1.2 Off-Line andOn-LineMethods 56
2.1.3 Model-Based Simplification of the Lookahead
Minimization 57
2.1.4 Model-Free off-Line Q-Factor Approximation 58
2.1.5 Approximation inPolicy Space onTop of
ApproximationinValue Space 61
2.1.6 When is Approximation in Value Space Effective? 62
2.2 Multistep Lookahead 64
??ii
viii Contents
2.2.1 Multistep Lookahead and Rolling Horizon 65
2.2.2 Multistep Lookahead and Deterministic Problems 67
2.3 Problem Approximation 69
2.3.1 Enforced Decomposition 69
2.3.2 Probabilistic Approximation - Certainty Equivalent
Control 76
2.4 Rollout and the Policy Improvement Principle 83
2.4.1 On-Line Rollout for Deterministic Discrete
Optimization 84
2.4.2 Stochastic Rollout and Monte Carlo Tree Search 95
2.4.3 Rollout with an Expert 104
2.5 On-Line Rollout for Deterministic Infinite-Spaces Problems -
Optimization Heuristics 106
2.5.1 Model Predictive Control 108
2.5.2 Target Tubes and the Constrained Controllability
Condition 115
2.5.3 Variants of Model Predictive Control 118
2.6 Notes and Sources 120
3 Parametric Approximation
3.1 Approximation Architectures 126
3.1.1 Linear and Nonlinear Feature-Based Architectures 126
3.1.2 Training of Linear and Nonlinear Architectures 134
3.1.3 Incremental Gradient and Newton Methods 135
3.2 Neural Networks 149
3.2.1 Training of Neural Networks 153
3.2.2 Multilayer and Deep Neural Networks 157
3.3 Sequential Dynamic Programming Approximation 161
3.4 Q-Factor Parametric Approximation 162
3.5 Parametric Approximation in Policy Space by
Classification 165
3.6 Notes and Sources 171
4 Infinite Horizon Dynamic Programming
4.1 An Overview of Infinite Horizon Problems 174
4.2 Stochastic Shortest Path Problems 177
4.3 Discounted Problems 187
4.4 Semi-Markov Discounted Problems 192
4.5 Asynchronous Distributed Value Iteration 197
4.6 Policy Iteration 200
4.6.1 Exact Policy Iteration 200
4.6.2 Optimistic and Multistep Lookahead Policy
Iteration 205
4.6.3 Policy Iteration for Q-factors 208
Contents i??
4.7 Notes and Sources 209
4.8 Appendix: MathematicalAnalysis 211
4.8.1 Proofs for Stochastic ShortestPathProblems 212
4.8.2 Proofs forDiscountedProblems 217
4.8.3 ConvergenceofExact andOptimistic
Policy Iteration 218
5 Infinite Horizon Reinforcement Learning
5.1 Approximation in Value Space - Performance Bounds 222
5.1.1 LimitedLookahead 224
5.1.2 Rollout and Approximate Policy Improvement 227
5.1.3 ApproximatePolicy Iteration 232
5.2 FittedValue Iteration 235
5.3 Simulation-BasedPolicy IterationwithParametric
Approximation 239
5.3.1 Self-Learning andActor-CriticMethods 239
5.3.2 Model-Based Variant of a Critic-Only Method 241
5.3.3 Model-FreeVariant of aCritic-OnlyMethod 243
5.3.4 Implementation Issues ofParametricPolicy
Iteration 246
5.3.5 Convergence Issues ofParametricPolicy Iteration -
Oscillations 249
5.4 Q-Learning 253
5.4.1 Optimistic Policy Iteration with Parametric Q-Factor
Approximation- SARSAandDQN 255
5.5 AdditionalMethods -TemporalDifferences 256
5.6 Exact andApproximateLinearProgramming 267
5.7 Approximation inPolicy Space 270
5.7.1 Training byCostOptimization -PolicyGradient,
Cross-Entropy,andRandomSearchMethods 276
5.7.2 Expert-BasedSupervisedLearning 286
5.7.3 ApproximatePolicy Iteration,Rollout, and
ApproximationinPolicySpace 288
5.8 Notes and Sources 293
5.9 Appendix: MathematicalAnalysis 298
5.9.1 Performance Bounds for Multistep Lookahead 299
5.9.2 Performance Bounds for Rollout 301
5.9.3 Performance Bounds for Approximate Policy
Iteration 304
6 Aggregation
6.1 AggregationwithRepresentativeStates 308
6.1.1 Continuous State and Control Space Discretization p 314
6.1.2 Continuous State Space - POMDP Discretization 315
?? Contents
6.2 AggregationwithRepresentativeFeatures 317
6.2.1 Hard Aggregation and Error Bounds 320
6.2.2 AggregationUsingFeatures 322
6.3 Methods for Solving theAggregateProblem 328
6.3.1 Simulation-BasedPolicy Iteration 328
6.3.2 Simulation-Based Value Iteration 331
6.4 Feature-BasedAggregationwith aNeuralNetwork 332
6.5 BiasedAggregation 334
6.6 Notes and Sources 337
6.7 Appendix: MathematicalAnalysis 340
References 345
Index 369