Premium control with reinforcement learning

1. Introduction

An insurance company's claim costs and investment earnings fluctuate randomly over time. The insurance company needs to determine the premiums before the coverage periods start, that is, before knowing what claim costs will appear and without knowing how its invested capital will develop. Hence, the insurance company is facing a dynamic stochastic control problem. The problem is complicated by delays and feedback effects: premiums are paid before claim costs materialise, and premium levels affect whether the company attracts or loses customers.

An insurance company wants a steady high surplus. The optimal dividend problem introduced by de Finetti (1957) (and solved by Gerber, 1969) has the objective to maximise the expected present value of future dividends. Its solution takes into account that paying dividends too generously is suboptimal, since a too high probability of default affects the expected present value of future dividends negatively. A practical problem with implementing the optimal premium rule (a rule that maps the state of the stochastic environment to a premium level) obtained from solving the optimal dividend problem is that the premiums would fluctuate more than is feasible in a real insurance market with competition. A good premium rule needs to generate premiums that do not fluctuate wildly over time.

For a mutual insurance company, unlike a private company owned by shareholders, maximising dividends is not the main objective. Instead, the premiums should be low and suitably averaged over time, while making sure that the surplus is sufficiently high to avoid a too high probability of default. Solving this multiple-objective optimisation problem is the focus of the present paper. Similar premium control problems have been studied by Martin-Löf (1983, 1994), and these papers have been a source of inspiration for our work.

Martin-Löf (1983) carefully sets up the balance equations for the key economic variables relevant for the performance of the insurance company and studies the premium control problem as a linear control problem under certain simplifying assumptions enabling application of linear control theory. The paper analyses the effects of delays in the insurance dynamical system on the linear control law with feedback and discusses designs of the premium control that ensure that the probability of default is small.

Martin-Löf (1994) considers an application of general optimal control theory in a setting similar to, but simpler than, the setting considered in Martin-Löf (1983). The paper derives and discusses the optimal premium rule that achieves low and averaged premiums and also targets sufficient solvency of the insurance company.

The literature on optimal control theory in insurance is vast, see for example the textbook treatment by Schmidli (2008) and references therein. Our aim is to provide solutions to realistic premium control problems in order to allow the optimal premium rule to be used with confidence by insurance companies. In particular, we avoid considering convenient stochastic models that may fit well with optimal control theory but fail to take key features of real dynamical insurance systems into account.
Instead, we consider an insurance company that has enough data to suggest realistic models for the insurance environment, but the complexity of these models does not allow for explicit expressions for the transition probabilities of the dynamical system. In this sense, the model of the environment is not fully known. However, the models can be used for simulating the behaviour of the environment.

Increased computing power and methodological advances during recent decades make it possible to revisit the problems studied in Martin-Löf (1983, 1994) and, in doing so, allow for more complex and realistic dynamics of the insurance dynamical system. Allowing realistic complex dynamics means that optimal premium rules, if they can be obtained, will not only give insurance companies guidance on how to set premiums but actually provide premium rules that they can use with confidence. The methodological advances that we use in this work are reinforcement learning and, in particular, reinforcement learning combined with function approximation, see for example Bertsekas and Tsitsiklis (1996) and Sutton and Barto (2018) and references therein. In the present paper, we focus on the temporal difference control algorithms SARSA and Q-learning. SARSA was first proposed by Rummery and Niranjan (1994) and named by Sutton (1995). Q-learning was introduced by Watkins (1989). By using reinforcement learning methods combined with function approximation, we obtain premium rules in terms of Markovian controls for Markov decision processes whose state spaces are much larger and more realistic than what was considered in the premium control problem studied in Martin-Löf (1994).

There exist other methods for solving general stochastic control problems with a known model of the environment, see for example Han and E (2016) and Germain et al. (2021). However, the deep learning methods in these papers are developed to solve fixed finite-time horizon problems. The stochastic control problem considered in the present paper has an indefinite-time horizon, since the terminal time is random and unbounded. A random terminal time also causes problems for the computation of gradients in deep learning methods. There are reinforcement learning methods, such as policy gradient methods (see e.g., Williams, 1992; Sutton et al., 1999, or for an overview Sutton and Barto, 2018, Ch. 13), that enable direct approximation of the premium rule by neural networks (or other function approximators) when the terminal time is random. However, for problems where the terminal time can be quite large (as in the present paper), these methods likely require an additional approximation of the value function (so-called actor-critic methods).

In the mathematical finance literature, there has recently been significant interest in the use of reinforcement learning combined with function approximation, in particular related to hedging, for instance the influential paper by Buehler et al. (2019) on deep hedging. Carbonneau (2021) uses the methodology in Buehler et al. (2019) and studies approaches to risk management of long-term financial derivatives motivated by guarantees and options embedded in life-insurance products. Another approach to deep hedging based on reinforcement learning for managing risks stemming from long-term life-insurance products is presented in Chong et al. (2021). Dynamic pricing has been studied extensively in the operations research literature.
For instance, the problem of finding the optimal balance between learning an unknown demand function and maximising revenue is related to reinforcement learning; we refer to den Boer and Zwart (2014) and references therein. Reinforcement learning is used in Krasheninnikova et al. (2019) for determining a renewal pricing strategy in an insurance setting. However, the problem formulation and solution method are different from what is considered in the present paper. Krasheninnikova et al. (2019) considers retention of customers while maximising revenue and does not take claim costs into account. Furthermore, in Krasheninnikova et al. (2019) the state space is discretised in order to use standard Q-learning, while the present paper solves problems with a large or infinite state space by combining reinforcement learning with function approximation.

The paper is organised as follows. Section 2 describes the relevant insurance economics by presenting the involved cash flows, key economic quantities such as surplus, earned premium and reserves, and how such quantities are connected to each other through their dynamics or balance equations. Section 2 also introduces stochastic models (simple, intermediate and realistic) giving a complete description of the stochastic environment in which the insurance company operates and aims to determine an optimal premium rule. The stochastic models will serve us by enabling simulation of data from which the reinforcement learning methods gradually learn the stochastic environment in the search for optimal premium rules. The models are necessarily somewhat complex, since we want to take realistic insurance features into account, such as delays between accidents and payments and random fluctuations in the number of policyholders, partly due to varying premium levels.

Section 3 sets up the premium control problem we aim to solve in terms of a Markov decision process and standard elements of stochastic control theory such as the Bellman equation. Finding the optimal premium rule by directly solving the Bellman (optimality) equation numerically is not possible when considering state spaces for the Markov decision process matching a realistic model for the insurance dynamical system. Therefore, we introduce reinforcement learning methods in Section 4. In particular, we present basic theory for the temporal difference learning methods Q-learning and SARSA. We explain why these methods will not be able to provide us with reliable estimates of optimal premium rules unless we restrict ourselves to simplified versions of the insurance dynamical system. We argue that SARSA combined with function approximation of the so-called action-value function will allow us to determine optimal premium rules. We also highlight several pitfalls that the designer of the reinforcement learning method must be aware of and make sure to avoid.

Section 5 presents the necessary details in order to solve the premium control problem using SARSA with function approximation. We analyse the effects of different model/method choices on the performance of different reinforcement learning techniques and compare the performance of the optimal premium rule with those of simpler benchmark rules.

Finally, Section 6 concludes the paper.
We emphasise that the premium control problem studied in the present paper is easily adjusted to fit the features of a particular insurance company, and that the excellent performance of a carefully set up reinforcement learning method with function approximation provides the insurance company with an optimal premium rule that can be used in practice and communicated to stakeholders.

2. A stochastic model of the insurance company

The number of contracts written during year $t+1$ is denoted $N_{t+1}$, known at the end of year $t+1$. The premium per contract $P_t$ during year $t+1$ is decided at the end of year t. Hence, $P_t$ is $\mathcal{F}_t$-measurable, where $\mathcal{F}_t$ denotes the $\sigma$-algebra representing the available information at the end of year t. Contracts are assumed to be written uniformly in time over the year and provide insurance coverage for one year. Therefore, assuming that the premium income is earned linearly with time, the earned premium during year $t+1$ is

\begin{equation*}\text{EP}_{t+1}=\frac{1}{2}\left(P_tN_{t+1}+P_{t-1}N_t\right),\end{equation*}

that is, for contracts written during year $t+1$, on average half of the premium income $P_tN_{t+1}$ will be earned during year $t+1$, and half during year $t+2$. Since only half of the premium income $P_tN_{t+1}$ is earned during year $t+1$, the other half, which should cover claims during year $t+2$, will be stored in the premium reserve. The balance equation for the premium reserve is $V_{t+1}=V_t+P_tN_{t+1}-\text{EP}_{t+1}$. Note that when we add cash flows or reserves occurring at time $t+1$ to cash flows or reserves occurring at time t, the time t amounts should be interpreted as adjusted for the time value of money. We choose not to write this out explicitly in order to simplify notation.

That contracts are written uniformly in time over the year means that $I_{t,k}$, the incremental payment to policyholders during year $t+k$ for accidents during year $t+1$, will consist partly of payments to contracts written during year $t+1$ and partly of payments to contracts written during year t. Hence, we assume that $I_{t,k}$ depends on both $N_{t+1}$ and $N_t$. Table 1 shows a claims triangle with entries $I_{j,k}$ representing incremental payments to policyholders during year $j+k$ for accidents during year $j+1$. For ease of presentation (other choices could of course be made), we will assume that the maximum delay between an accident and a resulting payment is four years. Entries $I_{j,k}$ with $j+k\leq t$ are $\mathcal{F}_t$-measurable and coloured blue in Table 1. Let

(2.1)
\begin{align}\text{IC}_{t+1}&=I_{t,1}+\operatorname{E}\!\left[I_{t,2}+I_{t,3}+I_{t,4}\mid \mathcal{F}_{t+1}\right],\quad \text{PC}_{t+1}=I_{t,1}+I_{t-1,2}+I_{t-2,3}+I_{t-3,4},\nonumber\\
\text{RP}_{t+1}&=\operatorname{E}\!\left[I_{t-3,4}+I_{t-2,3}+I_{t-2,4}+I_{t-1,2}+I_{t-1,3}+I_{t-1,4}\mid\mathcal{F}_t\right]\nonumber\\
&\quad-\operatorname{E}\!\left[I_{t-3,4}+I_{t-2,3}+I_{t-2,4}+I_{t-1,2}+I_{t-1,3}+I_{t-1,4}\mid\mathcal{F}_{t+1}\right],\end{align}

where IC, PC and RP denote, respectively, incurred claims, paid claims and runoff profit. The balance equation for the claims reserve is $E_{t+1}=E_{t}+\text{IC}_{t+1}-\text{RP}_{t+1}-\text{PC}_{t+1}$. The profit or loss during year $t+1$ depends on changes in the reserves:

\begin{align*}\text{P&L}_{t+1} = P_tN_{t+1} - \text{PC}_{t+1} + \text{IE}_{t+1}-\text{OE}_{t+1} - (E_{t+1}-E_t + V_{t+1}-V_t),\end{align*}

where IE denotes investment earnings and OE denotes operating expenses. The dynamics of the surplus fund is therefore

(2.2)
\begin{align}G_{t+1}=G_t+\text{P&L}_{t+1}=G_t+\text{EP}_{t+1}+\text{IE}_{t+1}-\text{OE}_{t+1}-\text{IC}_{t+1}+\text{RP}_{t+1}.\end{align}

Table 1. Paid claim amounts from accidents during years $t-2,\dots,t+1$.
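The balance equations above can be stepped forward mechanically once a year's cash flows are known. The following is a minimal Python sketch under that assumption; the function and argument names are ours and purely illustrative, not taken from the paper.

```python
# Minimal sketch of one year of the accounting recursions in Section 2,
# assuming the year-(t+1) quantities (premiums, claims, expenses, investment
# earnings) are already given as plain numbers.
def step_accounts(G, V, E, P_t, P_prev, N_new, N_t, IC, RP, PC, IE, OE):
    EP = 0.5 * (P_t * N_new + P_prev * N_t)        # earned premium
    V_next = V + P_t * N_new - EP                  # premium reserve balance
    E_next = E + IC - RP - PC                      # claims reserve balance
    PnL = P_t * N_new - PC + IE - OE - (E_next - E) - (V_next - V)
    G_next = G + PnL                               # = G + EP + IE - OE - IC + RP, cf. (2.2)
    return G_next, V_next, E_next
```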
We consider three models of increasing complexity. The simple model allows us to solve the premium control problem with classical methods. In this situation, we can compare the results obtained with classical methods with the results obtained with more flexible methods, allowing the assessment of the performance of a chosen flexible method. Classical solution methods are not feasible for the intermediate model. However, the similarity between the simple and intermediate models allows us to understand how increasing model complexity affects the optimal premium rule. Finally, we consider a realistic model, where the models for the claims payments and investment earnings align more closely with common distributional assumptions for these quantities. Since the simple model is a simplified version of the intermediate model, we begin by defining the intermediate model in Section 2.1, followed by the simple model in Section 2.2. In Section 2.3, the more realistic models for claims payments and investment earnings are defined.

2.1. Intermediate model

We choose to model the key random quantities as integer-valued random variables with conditional distributions that are either Poisson or Negative Binomial distributions. Other choices of distributions on the integers are possible without any major effects on the analysis that follows. Let

(2.3)
\begin{align}\mathcal{L}\!\left(N_{t+1}\mid \mathcal{F}_t\right)=\mathcal{L}(N_{t+1}\mid P_t)=\textsf{Pois}\!\left(aP_t^{b}\right),\end{align}

where $a>0$ is a constant, and $b<0$ is the price elasticity of demand. The notation says that the conditional distribution of the number of contracts written during year $t+1$ given the information at the end of year t depends on that information only through the premium decided at the end of year t for those contracts.

Let $\widetilde{N}_{t+1}=(N_{t+1}+N_t)/2$ denote the number of contracts during year $t+1$ that provide coverage for accidents during year $t+1$. Let

(2.4)
\begin{align}\text{OE}_{t+1}=\beta_0 +\beta_1 \widetilde{N}_{t+1},\end{align}

saying that the operating expenses have both a fixed part and a variable part proportional to the number of active contracts. The appearance of $\widetilde N_{t+1}$ instead of $N_{t+1}$ in the expressions above is due to the assumption that contracts are written uniformly in time over the year and that accidents occur uniformly in time over the year.

Let $\alpha_1,\dots,\alpha_4\in [0,1]$ with $\sum_{i=1}^4\alpha_i = 1$. The constant $\alpha_k$ is, for a given accident year, the expected fraction of claim costs paid during development year k.
Let

(2.5)
\begin{align}\mathcal{L}\!\left(I_{t,k}\mid \mathcal{F}_{t},\widetilde{N}_{t+1}\right)=\mathcal{L}\!\left(I_{t,k}\mid \widetilde{N}_{t+1}\right)=\textsf{Pois}\!\left(\alpha_k\mu\widetilde N_{t+1}\right),\end{align}

where $\mu$ denotes the expected claim cost per contract. We assume that different incremental claims payments $I_{j,k}$ are conditionally independent given information about the corresponding numbers of contracts written. Formally, the elements in the set

\begin{align*}\left\{I_{j,k}\,:\,j\in\{t-l,\dots,t\}, k\in \{1,\dots,4\}\right\}\end{align*}

are conditionally independent given $N_{t-l},\dots,N_{t+1}$. Therefore, using (2.1) and (2.5),

(2.6)
\begin{align}\mathcal{L}\!\left(\text{PC}_{t+1}\mid \mathcal{F}_t, \widetilde{N}_{t+1}\right) &=\mathcal{L}\!\left(\text{PC}_{t+1}\mid \widetilde{N}_{t-2},\dots,\widetilde{N}_{t+1}\right)=\textsf{Pois}\!\left(\alpha_1\mu \widetilde{N}_{t+1}+\dots+\alpha_4\mu \widetilde{N}_{t-2}\right),\nonumber \\
\text{IC}_{t+1}-\text{RP}_{t+1}&=\text{PC}_{t+1}+\left(\alpha_2+\alpha_3+\alpha_4\right)\mu \widetilde{N}_{t+1} - \alpha_2\mu \widetilde{N}_t-\alpha_3\mu \widetilde{N}_{t-1}-\alpha_4\mu \widetilde{N}_{t-2}. \end{align}

The model for the investment earnings $\text{IE}_{t+1}$ is chosen so that $G_t\leq 0$ implies $\text{IE}_{t+1}=0$, since $G_t\leq 0$ means that nothing is invested. Moreover, we assume that

(2.7)
\begin{align}\mathcal{L}(\text{IE}_{t+1}+G_t\mid \mathcal{F}_t,G_t>0)=\mathcal{L}(\text{IE}_{t+1}+G_t\mid G_t,G_t>0)=\textsf{NegBin}\bigg(\nu G_t,\frac{1+\xi}{1+\xi+\nu}\bigg),\end{align}

where $\textsf{NegBin}(r,p)$ denotes the negative binomial distribution with probability mass function

\begin{align*}k\mapsto \binom{k+r-1}{k}(1-p)^{r}p^k,\end{align*}

which corresponds to mean and variance

\begin{align*}\operatorname{E}\![\text{IE}_{t+1}+G_t\mid G_t,G_t>0] & = \frac{p}{1-p}r = (1+\xi)G_t,\\[4pt]
\operatorname{Var}(\text{IE}_{t+1}+G_t\mid G_t,G_t>0) & = \frac{p}{(1-p)^2}r=\frac{1+\xi+\nu}{\nu}(1+\xi)G_t.\end{align*}

Given a premium rule $\pi$ that, given the state $S_t=(G_t,P_{t-1},N_{t-3},N_{t-2},N_{t-1},N_t)$, generates the premium $P_t$, the system $(S_t)$ evolves in a Markovian manner according to the transition probabilities that follow from (2.3)–(2.7) and (2.2). Notice that if we consider a less long-tailed insurance product so that $\alpha_3=\alpha_4=0$ (at most one year delay from occurrence of the accident to final payment), then the dimension of the state space reduces to four, that is $S_t=(G_t,P_{t-1},N_{t-1},N_t)$.

2.2. Simple model

Consider the situation where the insurer has a fixed number N of policyholders, who at some initial time point bought insurance policies with automatic contract renewal for the price $P_t$ during year $t+1$. The state at time t is $S_t=(G_t,P_{t-1})$. In this simplified setting, $\text{OE}_{t+1}=\beta_0+\beta_1 N$, all payments $I_{t,k}$ are independent, $\mathcal{L}(I_{t,k})=\textsf{Pois}(\alpha_k\mu N)$, $\text{IC}_{t+1}-\text{RP}_{t+1}=\text{PC}_{t+1}$ and $\mathcal{L}(\text{PC}_{t+1})=\textsf{Pois}(\mu N)$.
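To make the transition mechanism concrete, the following is a minimal Python sketch of one year of the intermediate model (the simple model is the special case with a fixed number of contracts). All names are illustrative; the negative binomial investment earnings in (2.7) are drawn via a Gamma–Poisson mixture, which is one standard way of sampling this distribution and reproduces the mean and variance stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-year transition sketch for the intermediate model (Section 2.1).
def intermediate_step(G, P, N_prev, a, b, mu, beta0, beta1, alphas, xi, nu):
    N_new = rng.poisson(a * P ** b)                    # contracts written, cf. (2.3)
    N_tilde = 0.5 * (N_new + N_prev)                   # contracts exposed during the year
    OE = beta0 + beta1 * N_tilde                       # operating expenses, cf. (2.4)
    I_new = rng.poisson([alpha * mu * N_tilde for alpha in alphas])   # payments, cf. (2.5)
    if G > 0:                                          # nothing is invested if G <= 0
        lam = rng.gamma(shape=nu * G, scale=(1.0 + xi) / nu)
        IE = rng.poisson(lam) - G                      # IE + G ~ NegBin(nu*G, (1+xi)/(1+xi+nu))
    else:
        IE = 0.0
    return N_new, N_tilde, OE, I_new, IE
```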
2.3. Realistic model

In this model, we change the distributional assumptions for both investment earnings and the incremental claims payments from the previously used integer-valued distributions. Let $(Z_{t})$ be a sequence of iid standard normals and let

\begin{align*}\mathcal{L}\!\left(\text{IE}_{t+1}+G_t \mid G_t,G_t>0\right) = \mathcal{L}\!\left(G_t\exp\{\widetilde \mu + \widetilde\sigma Z_{t+1}\}\mid G_t,G_t>0\right).\end{align*}

Let $C_{t,j}$ denote the cumulative claims payments for accidents occurring during year $t+1$ up to and including development year j. Hence, $I_{t,1}=C_{t,1}$, and $I_{t,j} = C_{t,j}-C_{t,j-1}$ for $j>1$. We use the following model for the cumulative claims payments:

(2.8)
\begin{align}C_{t,1}&=c_0\widetilde{N}_{t+1}\exp\!\left\{\nu_0Z_{t,1}-\nu_0^2/2\right\}, \nonumber\\[4pt]
C_{t,j+1}&=C_{t,j}\exp\!\left\{\mu_j+\nu_jZ_{t,j+1}\right\}, \quad j=1,\dots,J-1,\end{align}

where $c_0$ is interpreted as the average claims payment per policyholder during the first development year, and the $Z_{t,j}$ are iid standard normals. Then,

\begin{align*}\operatorname{E}\!\left[C_{t,j+1}\mid C_{t,j}\right]&=C_{t,j}\exp\!\left\{\mu_j+\nu_j^2/2\right\}, \\[5pt]
\operatorname{Var}\!\left(C_{t,j+1}\mid C_{t,j}\right)&=C_{t,j}^2\left(\exp\!\left\{\nu_j^2\right\}-1\right)\exp\!\left\{2\mu_j+\nu_j^2\right\}.\end{align*}

We do not impose restrictions on the parameters $\mu_j$ and $\nu_j^2$, and as a consequence values $f_j=\exp\!\left\{\mu_j+\nu_j^2/2\right\}\in (0,1)$ are allowed (allowing for negative incremental paid amounts $C_{t,j+1}-C_{t,j}<0$). This is in line with the model assumption $\operatorname{E}\!\left[C_{t,j+1} \mid C_{t,1},\dots,C_{t,j}\right]=f_jC_{t,j}$ for some $f_j>0$ of the classical distribution-free Chain Ladder model by Mack (1993). Moreover, with $\mu_{0,t}=\log \!\left(c_0\widetilde{N}_{t+1}\right)-\nu_0^2/2$,

(2.9)
\begin{align} \mathcal{L}\!\left(\log C_{t,1}\right)=\textsf{N}\!\left(\mu_{0,t},\nu_0^2\right), \quad \mathcal{L}\bigg(\log\frac{C_{t,j+1}}{C_{t,j}}\bigg)=\textsf{N}\!\left(\mu_j,\nu_j^2\right).\end{align}

Given an (incremental) development pattern $(\alpha_1,\dots,\alpha_J)$, $\sum_{j=1}^J\alpha_j=1$, $\alpha_j\geq 0$,

\begin{align*}\alpha_1=\frac{1}{\prod_{k=1}^{J-1}f_k}, \quad \alpha_j=\frac{(f_{j-1}-1)\prod_{k=1}^{j-2}f_k}{\prod_{k=1}^{J-1}f_k}, \quad j=2,\dots,J,\end{align*}

with the convention $\prod_{s=a}^b c_s=1$ if $a>b$, where $f_j=\exp\{\mu_j+\nu_j^2/2\}$. We have

\begin{align*}&\text{IC}_{t+1}=C_{t,1}\prod_{k=1}^{J-1}f_k,\quad \text{PC}_{t+1}=C_{t,1}+\sum_{k=1}^{J-1}\left(C_{t-k,k+1}-C_{t-k,k}\right),\\
&\text{RP}_{t+1}=\sum_{k=1}^{J-1}\left(C_{t-k,k}f_k-C_{t-k,k+1}\right)\prod_{j=k+1}^{J-1}f_j.\end{align*}

The state is $S_t=\left(G_t,P_{t-1},N_{t-1},N_t,C_{t-1,1},\dots,C_{t-J+1,J-1}\right)$.
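A minimal sketch of how one accident year's cumulative payments could be simulated under (2.8); parameter names follow the text, the function name is ours, and the sketch is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate one accident year's cumulative payments C_{t,1}, ..., C_{t,J}
# under the log-normal development model (2.8); c0, nu0 and the length-(J-1)
# lists mus, nus are the parameters from the text, N_tilde the exposure.
def simulate_cumulative_claims(N_tilde, c0, nu0, mus, nus):
    C = [c0 * N_tilde * np.exp(nu0 * rng.standard_normal() - 0.5 * nu0 ** 2)]
    for mu_j, nu_j in zip(mus, nus):                   # development years 2, ..., J
        C.append(C[-1] * np.exp(mu_j + nu_j * rng.standard_normal()))
    return np.array(C)   # incremental payments are np.diff(C) with C[0] prepended
```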
3. The control problem

We consider a set of states $\mathcal S^+$, a set of non-terminal states $\mathcal S\subseteq \mathcal S^+$, and for each $s\in \mathcal S$ a set of actions $\mathcal A(s)$ available from state s, with $\mathcal A=\cup_{s\in\mathcal S}\mathcal A(s)$. We assume that $\mathcal A$ is discrete (finite or countable). In order to simplify notation and limit the need for technical details, we will here and in Section 4 restrict our presentation to the case where $\mathcal S^+$ is also discrete. However, we emphasise that when using function approximation in Section 4.2 the update Equation (4.3) for the weight vector is still valid when the state space is uncountable, as is the case for the realistic model.

For each $s\in\mathcal S$, $s^{\prime}\in\mathcal S^+$, $a\in\mathcal A(s)$, we define the reward received after taking action a in state s and transitioning to state $s^{\prime}$, $-f(a,s,s^{\prime})$, and the probability of transitioning from state s to state $s^{\prime}$ after taking action a, $p(s^{\prime}|s,a)$. We assume that rewards and transition probabilities are stationary (time-homogeneous). This defines a Markov decision process (MDP). A policy $\pi$ specifies how to determine what action to take in each state. A stochastic policy describes, for each state, a probability distribution on the set of available actions. A deterministic policy is a special case of a stochastic policy, specifying a degenerate probability distribution, that is, a one-point distribution.

Our objective is to find the premium policy that minimises the expected value of the premium payments over time, but that also results in $(P_t)$ being more averaged over time, and further ensures that the surplus $(G_t)$ is large enough so that the risk that the insurer cannot pay the claim costs and other expenses is small. By combining rewards with either constraints on the available actions from each state or the definition of terminal states, this will be accomplished with a single objective function, see further Sections 3.1–3.2. We formulate this in terms of an MDP, that is, we want to solve the following optimisation problem:

(3.1)
\begin{equation}\begin{aligned} \underset{\pi}{\text{minimise }} &\operatorname{E}_{\pi} \!\bigg[\sum_{t=0}^T \gamma^t f(P_t,S_t,S_{t+1})\mid S_0=s\bigg],\end{aligned}\end{equation}

where $\pi$ is a policy generating the premium $P_t$ given the state $S_t$, $\mathcal{A}(s)$ is the set of premium levels available from state s, $\gamma$ is the discount factor, f is the cost function, and $\operatorname{E}_{\pi}\![{\cdot}]$ denotes the expectation given that policy $\pi$ is used. Note that the discount factor $\gamma^t$ should not be interpreted as the price of a zero-coupon bond maturing at time t, since the cost that is discounted does not represent an economic cost. Instead, $\gamma$ reflects how much weight is put on costs that are immediate compared to costs further in the future. The transition probabilities are

\begin{align*}p\!\left(s^{\prime}| s,a\right) = \operatorname{P}\!(S_{t+1}=s^{\prime}\mid S_t=s,P_t=a),\end{align*}

and we consider stationary policies, letting $\pi(a|s)$ denote the probability of taking action a in state s under policy $\pi$,

\begin{align*}\pi(a|s)=\operatorname{P}_{\pi}\!(P_t=a\mid S_t=s).\end{align*}

If there are no terminal states, we have $T=\infty$ and $\mathcal S^+ =\mathcal S$. We want to choose $\mathcal A(s)$, $s\in\mathcal S$, f, and any terminal states such that the objective discussed above is achieved. We will do this in two ways, see Sections 3.1 and 3.2.
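For a given premium rule, the objective in (3.1) can be estimated by straightforward Monte Carlo simulation of the environment. The sketch below assumes a hypothetical step(state, action) function that simulates one transition and returns the next state, the cost f and a termination flag; it is illustrative only.

```python
# Monte Carlo estimate of the expected discounted cost in (3.1) for a fixed
# policy, assuming step(s, a) -> (s_next, cost, terminated).
def discounted_cost(policy, step, s0, gamma, horizon=100, n_episodes=1000):
    total = 0.0
    for _ in range(n_episodes):
        s, disc, ep_cost = s0, 1.0, 0.0
        for _ in range(horizon):
            a = policy(s)                     # premium chosen in the current state
            s, cost, terminated = step(s, a)
            ep_cost += disc * cost
            disc *= gamma
            if terminated:                    # absorbing/default state reached
                break
        total += ep_cost
    return total / n_episodes
```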
The value function of state s under a policy $\pi$ generating the premium $P_t$ is defined as

\begin{align*}v_{\pi}(s)&=\operatorname{E}_{\pi} \bigg[\sum_{t=0}^T \gamma^t ({-}f(P_t,S_t,S_{t+1}))\mid S_0=s\bigg].\end{align*}

The Bellman equation for the value function is

\begin{align*}v_{\pi}(s)&=\sum_{a\in\mathcal A(s)}\pi(a|s)\sum_{s^{\prime}\in\mathcal S}p\!\left(s^{\prime}| s,a\right)\big({-}f(a,s,s^{\prime})+\gamma v_{\pi}(s^{\prime})\big).\end{align*}

When the policy is deterministic, we let $\pi$ be a mapping from $\mathcal S$ to $\mathcal A$, and

\begin{align*}v_{\pi}(s)&=\sum_{s^{\prime}\in\mathcal S}p(s^{\prime}| s,\pi(s))\big({-}f(\pi(s),s,s^{\prime})+\gamma v_{\pi}(s^{\prime})\big).\end{align*}

The optimal value function is $v_*(s)=\sup_{\pi}v_{\pi}(s)$. When the action space is finite, the supremum is attained, which implies the existence of an optimal deterministic stationary policy (see Puterman, 2005, Cor. 6.2.8; for other sufficient conditions for attainment of the supremum, see Puterman, 2005, Thm. 6.2.10). Hence, if the transition probabilities are known, we can use the Bellman optimality equation to find $v_*(s)$:

\begin{align*}v_*(s)=\max_{a\in\mathcal{A}(s)}\sum_{s^{\prime}\in\mathcal S}p\!\left(s^{\prime}| s,a\right)\big({-}f(a,s,s^{\prime})+\gamma v_{*}(s^{\prime})\big).\end{align*}

We use policy iteration in order to find the solution numerically. Let $k=0$, and choose some initial deterministic policy $\pi_k(s)$ for all $s\in\mathcal{S}$. Then

(i) Determine $V_{k}(s)$ as the unique solution to the system of equations

\begin{align*}V_{k}(s) = \sum_{s^{\prime}\in\mathcal{S}}p\!\left(s^{\prime}|s,\pi_k(s)\right)\Big({-}f\!\left(\pi_k(s),s,s^{\prime}\right)+\gamma V_k(s^{\prime})\Big).\end{align*}

(ii) Determine an improved policy $\pi_{k+1}(s)$ by computing

\begin{align*}\pi_{k+1}(s)=\mathop{\operatorname{argmax}}\limits_{a\in\mathcal{A}(s)}\sum_{s^{\prime}\in\mathcal{S}}p\!\left(s^{\prime}|s,a\right)\Big({-}f(a,s,s^{\prime})+\gamma V_k(s^{\prime})\Big).\end{align*}

(iii) If $\pi_{k+1}(s)\neq\pi_k(s)$ for some $s\in \mathcal{S}$, then increase k by 1 and return to step (i).

Note that if the state space is large enough, solving the system of equations in step (i) directly might be too time-consuming. In that case, this step can be solved by an additional iterative procedure, called iterative policy evaluation, see for example Sutton and Barto (2018, Ch. 4.1).
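The scheme above translates directly into code when the transition probabilities and rewards can be enumerated, as in the simple model. Below is a minimal, generic sketch for a finite MDP with arrays p[s, a, s'] and r[s, a, s'] assumed to be given; it is not tied to the authors' implementation.

```python
import numpy as np

# Policy iteration for a finite MDP: p[s, a, s2] are transition probabilities,
# r[s, a, s2] the (negative-cost) rewards, gamma the discount factor.
def policy_iteration(p, r, gamma):
    n_s, n_a, _ = p.shape
    policy = np.zeros(n_s, dtype=int)
    while True:
        # (i) policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = p[np.arange(n_s), policy]                   # shape (n_s, n_s)
        r_pi = (P_pi * r[np.arange(n_s), policy]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
        # (ii) policy improvement: greedy with respect to the current V
        Q = (p * r).sum(axis=2) + gamma * p @ V            # shape (n_s, n_a)
        new_policy = Q.argmax(axis=1)
        # (iii) stop when the policy no longer changes
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```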
3.1. MDP with constraint on the action space

The premiums $(P_t)$ will be averaged if we minimise $\sum_t c(P_t)$, where c is an increasing, strictly convex function. Thus, for the first MDP, we let $f(a,s,s^{\prime})=c(a)$. To ensure that the surplus $(G_t)$ does not become negative too often, we combine this with the constraint saying that the premium needs to be chosen so that the expected value, given the current state, of the surplus stays nonnegative, that is,

(3.2)
\begin{align}\mathcal{A}(S_t) = \{P_t:\operatorname{E}_{\pi}\![G_{t+1}\mid S_t]\geq 0\},\end{align}

and the optimisation problem becomes

(3.3)
\begin{align} \underset{\pi}{\text{minimise }} \operatorname{E}_{\pi} \bigg[\sum_{t=0}^\infty \gamma^t c(P_t)\mid S_0=s\bigg]\quad \text{subject to } \operatorname{E}_{\pi}\![G_{t+1}\mid S_t]\geq 0 \text{ for all } t.\end{align}

The choice of the convex function c, together with the constraint, will affect how quickly the premium can be lowered as the surplus or previous premium increases, and how quickly the premium must be increased as the surplus or previous premium decreases. Different choices of c affect how well different parts of the objective are achieved. Hence, one choice of c might put a higher emphasis on the premium being more averaged over time but slightly higher, while another choice might promote a lower premium level that is allowed to vary a bit more from one time point to another. Furthermore, it is not clear from the start which choice of c will lead to a specific result; thus, designing the reward signal might require searching through trial and error for the cost function that achieves the desired result.

3.2. MDP with a terminal state

The constraint (3.2) requires a prediction of $N_{t+1}$ according to (2.3). However, estimating the price elasticity in (2.3) is a difficult task; hence, it would be desirable to solve the optimisation problem without having to rely on this prediction. To this end, we remove the constraint on the action space, that is, we let $\mathcal A(s) = \mathcal A$ for all $s\in\mathcal S$, and instead introduce a terminal state which has a larger negative reward than all other states. This terminal state is reached when the surplus $G_t$ is below some predefined level, and it can be interpreted as the state where the insurer defaults and has to shut down. If we let $\mathcal G$ denote the set of non-terminal states for the first state variable (the surplus), then

\begin{align*}f\!\left(P_t,S_{t},S_{t+1}\right)= h\!\left(P_t,S_{t+1}\right)=\begin{cases}c(P_t), & \quad \text{if } G_{t+1}\geq \min\mathcal G,\\c(\!\max\mathcal A)(1+\eta), & \quad\text{if } G_{t+1}<\min\mathcal G,\end{cases}\end{align*}

where $\eta>0$. The optimisation problem becomes

(3.4)
\begin{equation}\begin{aligned} \underset{\pi}{\text{minimise }} &\operatorname{E}_{\pi} \!\bigg[\sum_{t=0}^T \gamma^t h(P_t,S_{t+1})\mid S_0=s\bigg],\quad T = \min\!\left\{t\,:\,G_{t+1}<\min\mathcal G\right\}.\end{aligned}\end{equation}

The reason for choosing $\eta>0$ is to ensure that the reward when transitioning to the terminal state is lower than the reward when using action $\max \mathcal A$ (the maximal premium level), that is, it should be more costly to terminate and restart than to attempt to increase the surplus when the surplus is low. The particular value of the parameter $\eta>0$ together with the choice of the convex function c determines the reward signal, that is, the compromise between minimising the premium, averaging the premium and ensuring that the risk of default is low. One way of choosing $\eta$ is to set it high enough so that the reward when terminating is lower than the total reward using any other policy. Then, we require that

\begin{align*}(1+\eta) c(\max\mathcal A) > \sum_{t=0}^\infty \gamma^tc(\max\mathcal A)=\frac{1}{1-\gamma} c(\max\mathcal A),\end{align*}

that is, $\eta>\gamma/(1-\gamma)$. This choice of $\eta$ will put a higher emphasis on ensuring that the risk of default is low, compared with using a lower value of $\eta$.
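As an illustration, the per-period cost of the terminal-state formulation can be written as a small function; c, the premium grid and the lower surplus bound are placeholders to be supplied by the modeller, and the sketch is not the authors' code.

```python
# Per-period cost h in the MDP with a terminal state (Section 3.2):
# the ordinary cost c(p) while solvent, and a penalty on default.
def h(p, G_next, c, premium_grid, G_min, eta):
    if G_next >= G_min:
        return c(p)
    return c(max(premium_grid)) * (1.0 + eta)   # default is costlier than any premium
```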
3.3. Choice of cost function

The function c is chosen to be an increasing, strictly convex function. That it is increasing captures the objective of a low premium. As discussed in Martin-Löf (1994), that it is convex means that the premiums will be more averaged, since

\begin{align*}\frac{1}{T}\sum_{t=1}^Tc(p_t)\geq c\bigg(\frac{1}{T}\sum_{t=1}^T p_t\bigg).\end{align*}

The more convex the shape of the function, the more stable the premium will be over time. One could also force stability by adding to the cost function a term related to the absolute value of the difference between successive premium levels. We have chosen a slightly simpler cost function, defined by c and, for the case with terminal states, by the parameter $\eta$.

As for the specific choice of the function c used in Section 5, we have simply used the function suggested in Martin-Löf (1994), but with slightly adjusted parameter values. Whether the function c, together with the constraint or the choice of terminal states and the value of $\eta$, leads to the desired goal of a low, stable premium and a low probability of default needs to be determined on a case-by-case basis, since we have three competing objectives, and different insurers might put different weight on each of them. This is part of designing the reward function. Hence, adjusting c and $\eta$ will change how much weight is put on each of the three objectives, and the results in Section 5 can be used as a basis for adjustments.

4. Reinforcement learning

If the model of the environment is not fully known, or if the state space or action space is not finite, the control problem can no longer be solved by classical dynamic programming approaches. Instead, we can utilise different reinforcement learning algorithms.

4.1. Temporal-difference learning

Temporal-difference (TD) methods can learn directly from real or simulated experience of the environment. Given a specific policy $\pi$ which determines the action taken in each state, and the sampled or observed state at time t, $S_{t}$, state at time $t+1$, $S_{t+1}$, and reward $R_{t+1}$, the iterative update for the value function, using the one-step TD method, is

\begin{align*}V(S_t) \gets V(S_t)+\alpha_t\big(R_{t+1}+\gamma V(S_{t+1})-V(S_t)\big),\end{align*}

where $\alpha_t$ is a step size parameter. Hence, the target for the TD update is $R_{t+1}+\gamma V(S_{t+1})$. Thus, we update $V(S_t)$, which is an estimate of $v_{\pi}(S_t)=\operatorname{E}_{\pi}\![R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t]$, based on another estimate, namely $V(S_{t+1})$.
The intuition behind using $R_{t+1}+\gamma V(S_{t+1})$ as the target in the update is that this is a slightly better estimate of $v_{\pi}(S_t)$, since it consists of an actual (observed or sampled) reward at $t+1$ and an estimate of the value function at the next observed state.

It has been shown, for example in Dayan (1992), that the value function (for a given policy $\pi$) computed using the one-step TD method converges to the true value function if the step size parameter $0\leq\alpha_t\leq 1$ satisfies the following stochastic approximation conditions:

\begin{align*}\sum_{k=1}^\infty \alpha_{t^k(s)} = \infty, \quad \sum_{k=1}^\infty \alpha_{t^k(s)}^2<\infty, \quad \text{for all } s \in \mathcal S,\end{align*}

where $t^k(s)$ is the time step when state s is visited for the kth time.

4.1.1. TD control algorithms

The one-step TD method described above gives us an estimate of the value function for a given policy $\pi$. To find the optimal policy using TD learning, a TD control algorithm, such as SARSA or Q-learning, can be used. The goal of these algorithms is to estimate the optimal action-value function $q_{*}(s,a) = \max_{\pi}q_{\pi}(s,a)$, where $q_{\pi}$ is the action-value function for policy $\pi$,

\begin{align*}q_{\pi}(s,a) = \operatorname{E}_{\pi}\bigg[\sum_{t=0}^\infty \gamma^t R_{t+1} \mid S_0=s,A_0=a\bigg].\end{align*}

To keep a more streamlined presentation, we will here focus on the algorithm SARSA. The main reason for this has to do with the topic of the next section, namely function approximation. While there are some convergence results for SARSA with function approximation, there are none for standard Q-learning with function approximation. In fact, there are examples of divergence when combining off-policy training (as is done in Q-learning) with function approximation, see for example Sutton and Barto (2018, Ch. 11). However, some numerical results for the simple model with standard Q-learning can be found in Section 5, and we do provide complete details on Q-learning in the Supplemental Material, Section 2.

The iterative update for the action-value function, using SARSA, is

\begin{align*}Q\!\left(S_t,A_t\right) \gets Q\!\left(S_t,A_t\right)+\alpha_t\!\left(R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q\!\left(S_t,A_t\right)\right).\end{align*}

Hence, we need to generate transitions from state-action pairs $\left(S_t,A_t\right)$ to state-action pairs $(S_{t+1},A_{t+1})$ and observe the rewards $R_{t+1}$ obtained during each transition. To do this, we need a behaviour policy, that is, a policy that determines which action is taken in the state we are currently in when the transitions are generated. Thus, SARSA gives an estimate of the action-value function $q_{\pi}$ given the behaviour policy $\pi$. Under the condition that all state-action pairs continue to be updated, and that the behaviour policy is greedy in the limit, it has been shown in Singh et al. (2000) that SARSA converges to the true optimal action-value function if the step size parameter $0\leq\alpha_t\leq1$ satisfies the following stochastic approximation conditions:

(4.1)
\begin{align} \sum_{k=1}^\infty \alpha_{t^k(s,a)} = \infty, \quad \sum_{k=1}^\infty \alpha_{t^k(s,a)}^2<\infty, \quad \text{for all } s \in \mathcal S, a\in\mathcal{A}(s),\end{align}

where $t^k(s,a)$ is the time step when a visit in state s is followed by taking action a for the kth time.
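In code, the tabular SARSA update is a one-liner; the sketch below assumes Q is a dictionary keyed by (state, action) pairs and is purely illustrative.

```python
# Tabular SARSA update for one observed transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}).
def sarsa_update(Q, s, a, r_next, s_next, a_next, alpha, gamma):
    target = r_next + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```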
To ensure that all state-action pairs continue to be updated, the behaviour policy needs to be exploratory. At the same time, we want to exploit what we have learned so far by choosing actions that we believe will give us large future rewards. A common choice of policy that compromises in this way between exploration and exploitation is the $\varepsilon$-greedy policy, which with probability $1-\varepsilon$ chooses the action that maximises the action-value function in the current state, and with probability $\varepsilon$ chooses any other action uniformly at random:

\begin{align*}\pi(a|s)=\begin{cases}1-\varepsilon,&\quad \text{if } a = \operatorname{argmax}_a Q(s,a),\\[4pt]\dfrac{\varepsilon}{|\mathcal A|-1}, & \quad \text{otherwise}.\end{cases}\end{align*}

Another example is the softmax policy

\begin{align*}\pi(a| s) = \frac{e^{Q(s,a)/\tau }}{\sum_{a \in \mathcal{A}(s)} e^{Q(s,a)/\tau }}.\end{align*}

To ensure that the behaviour policy $\pi$ is greedy in the limit, it needs to be changed over time towards the greedy policy that maximises the action-value function in each state. This can be accomplished by letting $\varepsilon$ or $\tau$ slowly decay towards zero.

4.2. Function approximation

The methods discussed thus far are examples of tabular solution methods, where the value functions can be represented as tables. These methods are suitable when the state and action spaces are not too large, for example for the simple model in Section 2.2. However, when the state space and/or action space is very large, or even continuous, these methods are not feasible, due to not being able to fit tables of this size in memory, and/or due to the time required to visit all state-action pairs multiple times. This is the case for the intermediate and realistic models presented in Sections 2.1 and 2.3. In both models, we allow the number of contracts written per year to vary, which increases the dimension of the state space. For the intermediate model, it also has the effect that the surplus process, depending on the parameter values chosen, can take non-integer values. For the simple model, $\mathcal S = \mathcal G \times \mathcal A$, and with the parameters chosen in Section 5, we have $|\mathcal G|=171$ and $|\mathcal A|=100$. For the intermediate model, if we let $\mathcal N$ denote the set of integer values that $N_t$ is allowed to take values in, then $\mathcal S = \mathcal G\times \mathcal A\times\mathcal N^l$, where l denotes the maximum number of development years. With the parameters chosen in Section 5, the total number of states is approximately $10^8$ for the intermediate model. For the realistic model, several of the state variables are continuous, that is, the state space is no longer finite.

Thus, to solve the optimisation problem for the intermediate and the realistic model, we need approximate solution methods, in order to generalise from the states that have been experienced to other states. In approximate solution methods, the value function $v_{\pi}(s)$ (or action-value function $q_{\pi}(s,a)$) is approximated by a parameterised function, $\hat v(s;\,w)$ (or $\hat q(s,a;\,w)$). When the state space is discrete, it is common to minimise the following objective function:

(4.2)
\begin{align} J(w)=\sum_{s\in\mathcal S}\mu_{\pi}(s)\big(v_{\pi}(s)-\hat v(s;\,w)\big)^2,\end{align}

where $\mu_{\pi}(s)$ is the fraction of time spent in state s. For the model without terminal states, $\mu_{\pi}$ is the stationary distribution under policy $\pi$.
For the model with terminal states, to determine the fraction of time spent in each transient state, we need to compute the expected number of visits $\eta_{\lambda,\pi}(s)$ to each transient state $s\in\mathcal S$ before reaching a terminal (absorbing) state, where $\lambda(s) = \operatorname{P}(S_0=s)$ is the initial distribution. For ease of notation, we omit $\lambda$ from the subscript below, and write $\eta_{\pi}$ and $\operatorname{P}_{\pi}$ instead of $\eta_{\lambda,\pi}$ and $\operatorname{P}_{\lambda,\pi}$. Let $p(s|s^{\prime})$ be the probability of transitioning from state $s^{\prime}$ to state s under policy $\pi$, that is, $p(s| s^{\prime}) =\operatorname{P}_{\pi}(S_{t}=s\mid S_{t-1}=s^{\prime})$. Then,

\begin{align*}\eta_{\pi}(s) &= \operatorname{E}_{\pi}\left[\sum_{t=0}^\infty \mathbf{1}_{\{S_t=s\}} \right]=\lambda(s) + \sum_{t=1}^\infty \operatorname{P}_{\pi}\!(S_t=s)\\
&=\lambda(s) +\sum_{t=1}^\infty\sum_{s^{\prime}\in \mathcal S}p(s|s^{\prime})\operatorname{P}_{\pi}\!(S_{t-1}= s^{\prime})= \lambda(s) + \sum_{ s^{\prime}\in \mathcal S}p(s|s^{\prime})\sum_{t=0}^\infty\operatorname{P}_{\pi}(S_{t}=s^{\prime})\\
&=\lambda(s) + \sum_{ s^{\prime}\in \mathcal S}p(s| s^{\prime})\eta_{\pi}(s^{\prime}),\end{align*}

or, in matrix form, $\eta_{\pi} = \lambda + P^\top\eta_{\pi}$, where P is the part of the transition matrix corresponding to transitions between transient states. If we label the states $0,1,\ldots,|\mathcal S|$ (where state 0 represents all terminal states), then $P=(p_{ij}\,:\,i,j\in\{1,2,\ldots,|\mathcal S|\})$, where $p_{ij}=p(j\mid i)$. After solving this system of equations, the fraction of time spent in each transient state under policy $\pi$ can be computed according to

\begin{align*}\mu_{\pi}(s) = \frac{\eta_{\pi}(s)}{\sum_{s^{\prime}\in\mathcal S}\eta_{\pi}(s^{\prime})}, \quad \text{for all } s \in\mathcal S.\end{align*}
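In matrix form this is a single linear solve; the sketch below assumes the transient-to-transient transition matrix P and the initial distribution λ are available as NumPy arrays.

```python
import numpy as np

# Fraction of time spent in each transient state under a fixed policy:
# solve eta = lam + P^T eta and normalise, cf. the derivation above.
def occupancy(P, lam):
    eta = np.linalg.solve(np.eye(len(lam)) - P.T, lam)
    return eta / eta.sum()
```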
This computation of $\mu_{\pi}$ relies on the model of the environment being fully known and the transition probabilities explicitly computable, as is the case for the simple model in Section 2.2. However, for the situation at hand, where we need to resort to function approximation and determine $\hat v(s;\,w)$ (or $\hat q(s,a;\,w)$) by minimising (4.2), we cannot explicitly compute $\mu_{\pi}$. Instead, $\mu_{\pi}$ in (4.2) is captured by learning incrementally from real or simulated experience, as in semi-gradient TD learning. Using semi-gradient TD learning, the iterative update for the weight vector w becomes

\begin{align*}w_{t+1} = w_t+\alpha_t\!\left(R_{t+1}+\gamma\hat v\!\left(S_{t+1};\,w_t\right)-\hat v\!\left(S_t;\,w_t\right)\right)\nabla \hat v\!\left(S_t;\,w_t\right).\end{align*}

This update can be used to estimate $v_{\pi}$ for a given policy $\pi$, generating transitions from state to state by taking actions according to this policy. Similarly to standard TD learning (Section 4.1), the target $R_{t+1}+\gamma\hat v\!\left(S_{t+1};\,w_t\right)$ is an estimate of the true (unknown) $v_{\pi}(S_{t+1})$. The name "semi-gradient" comes from the fact that the update is not based on the true gradient of $\left(R_{t+1}+\gamma\hat v\!\left(S_{t+1};\,w_t\right)-\hat v\!\left(S_t;\,w_t\right)\right)^2$; instead, the target is seen as fixed when the gradient is computed, despite the fact that it depends on the weight vector $w_t$.

As in the previous section, estimating the value function given a specific policy is not our final goal – instead we want to find the optimal policy. Hence, we need a TD control algorithm with function approximation. One example of such an algorithm is semi-gradient SARSA, which estimates $q_\pi$. The iterative update for the weight vector is

(4.3)
\begin{align} w_{t+1} = w_t+\alpha_t\!\left(R_{t+1}+\gamma\hat q\!\left(S_{t+1},A_{t+1};\,w_t\right)-\hat q\!\left(S_t,A_t;\,w_t\right)\right)\nabla \hat q\!\left(S_t,A_t;\,w_t\right).\end{align}

As with standard SARSA, we need a behaviour policy that generates transitions from state-action pairs to state-action pairs, and that both explores and exploits, for example an $\varepsilon$-greedy or softmax policy. Furthermore, for the algorithm to estimate $q_{*}$ we need the behaviour policy to be changed over time towards the greedy policy. However, convergence guarantees only exist when using linear function approximation, see Section 4.2.1 below.

4.2.1. Linear function approximation

The simplest form of function approximation is linear function approximation. The value function is approximated by $\hat v(s;\,w) =w^\top x(s)$, where x(s) are basis functions. Using the Fourier basis as defined in Konidaris et al. (2011), the ith basis function for the Fourier basis of order n is (here $\pi\approx 3.14$ is a number)

\begin{align*}x_i(s) = \cos\!\left(\pi s^\top c^{(i)}\right),\end{align*}

where $s=\left(s_1,s_2,\ldots,s_k\right)^\top$, $c^{(i)}=\left(c_1^{(i)},\ldots,c_k^{(i)}\right)^\top$, and k is the dimension of the state space. The $c^{(i)}$'s are given by the k-tuples over the set $\{0,\ldots,n\}$, and hence, $i=1,\ldots,(n+1)^k$. This means that $x(s)\in\mathbb{R}^{(n+1)^k}$. One-step semi-gradient TD learning with linear function approximation has been shown to converge to a weight vector $w^*$. However, $w^*$ is not necessarily a minimiser of J. Tsitsiklis and Van Roy (1997) derive the upper bound

\begin{align*}J(w^*)\leq\frac{1}{1-\gamma}\min_{w}J(w).\end{align*}

Since $\gamma$ is often close to one, this bound can be quite large.

Using linear function approximation for estimating the action-value function, we have $\hat q(s,a;\,w) = w^\top x(s,a)$, and the ith basis function for the Fourier basis of order n is

\begin{align*}x_i(s,a) = \cos\!\left(\pi \!\left(s^\top c^{(i)}_{1:k} + ac_{k+1}^{(i)}\right)\right),\end{align*}

where $s=\left(s_1,\ldots,s_k\right)^\top$, $c^{(i)}_{1:k}=\left(c_1^{(i)},\ldots,c_k^{(i)}\right)^\top$, $c_j^{(i)}\in\{0,\ldots,n\}$, $j=1,\ldots,k+1$, and $i=1,\ldots,(n+1)^{k+1}$.
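A minimal sketch of the Fourier features and the semi-gradient SARSA update (4.3) in the linear case, assuming the state variables and the action have already been rescaled to [0, 1]; the helper names are ours and the sketch is illustrative only.

```python
import itertools
import numpy as np

# Order-n Fourier features x(s, a) for a rescaled state-action vector sa in [0, 1]^(k+1).
def fourier_features(sa, n):
    sa = np.asarray(sa)
    C = np.array(list(itertools.product(range(n + 1), repeat=sa.size)))  # (n+1)^(k+1) tuples
    return np.cos(np.pi * C @ sa)

# Semi-gradient SARSA update (4.3) for q_hat(s, a; w) = w^T x(s, a);
# the gradient of a linear approximator is simply the feature vector x.
def semi_gradient_sarsa(w, x, r_next, x_next, alpha, gamma, terminal=False):
    q_next = 0.0 if terminal else w @ x_next
    return w + alpha * (r_next + gamma * q_next - w @ x) * x
```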
The convergence results for semi-gradient SARSA with linear function approximation depend on what type of policy is used in the algorithm. When using an $\varepsilon$-greedy policy, the weights have been shown to converge to a bounded region and might oscillate within that region, see Gordon (2001). Furthermore, Perkins and Precup (2003) have shown that if the policy improvement operator $\Gamma$ is Lipschitz continuous with constant $L>0$ and $\varepsilon$-soft, then SARSA will converge to a unique policy. The policy improvement operator maps every $q\in\mathbb{R}^{|\mathcal S||\mathcal A|}$ to a stochastic policy and gives the updated policy after iteration t as $\pi_{t+1}=\Gamma(q^{(t)})$, where $q^{(t)}$ corresponds to a vectorised version of the state-action values after iteration t, that is $q^{(t)}=xw_t$ for the case where we use linear function approximation, where $x\in\mathbb{R}^{|\mathcal S||\mathcal A|\times d}$ is a matrix with rows $x(s,a)^\top$, for each $s\in\mathcal S, a\in\mathcal A$, and d is the number of basis functions. That $\Gamma$ is Lipschitz continuous with constant L means that $\lVert \Gamma(q)-\Gamma(q^{\prime}) \rVert_2\leq L\lVert q-q^{\prime} \rVert_2$ for all $q,q^{\prime}\in\mathbb{R}^{|\mathcal S||\mathcal A|}$. That $\Gamma$ is $\varepsilon$-soft means that it produces a policy $\pi=\Gamma(q)$ that is $\varepsilon$-soft, that is, $\pi(a|s)\geq\varepsilon/|\mathcal A|$ for all $s\in\mathcal S$ and $a\in\mathcal A$. In both Gordon (2001) and Perkins and Precup (2003), the policy improvement operator was not applied at every time step; hence, it is not the online SARSA algorithm considered in the present paper that was investigated. The convergence of online SARSA under the assumption that the policy improvement operator is Lipschitz continuous with a small enough constant L was later shown in Melo et al. (2008). The softmax policy is Lipschitz continuous, see further the Supplemental Material, Section 3.

However, the value of the Lipschitz constant L that ensures convergence depends on the problem at hand, and there is no guarantee that the policy the algorithm converges to is optimal. Furthermore, for SARSA to approximate the optimal action-value function, we need the policy to get closer to the greedy policy over time, for example by decreasing the temperature parameter when using a softmax policy. Thus, the Lipschitz constant L, which is inversely proportional to the temperature parameter, will increase as the algorithm progresses, making the convergence results in Perkins and Precup (2003) and Melo et al. (2008) less likely to hold. As discussed in Melo et al. (2008), this is not an issue specific to the softmax policy. Any Lipschitz continuous policy that over time gets closer to the greedy policy will in fact approach a discontinuous policy, and hence, the Lipschitz constant of the policy might eventually become too large for the convergence result to hold. Furthermore, the results in Perkins and Precup (2003) and Melo et al. (2008) are not derived for a Markov decision process with an absorbing state. Despite this, it is clear from the numerical results in Section 5 that a softmax policy performs substantially better than an $\varepsilon$-greedy policy, and for the simple model it approximates the true optimal policy well.

The convergence results in Gordon (2001), Perkins and Precup (2003) and Melo et al. (2008) are based on the stochastic approximation conditions

(4.4)
\begin{align} \sum_{t=1}^\infty \alpha_{t} = \infty, \quad \sum_{t=1}^\infty \alpha_{t}^2<\infty,\end{align}

where $\alpha_t$ is the step size parameter used at time step t. Note that when using tabular methods (see Section 4.1), we had a separate step size sequence for each state-action pair. Here, this is not the case. This is a consequence both of the fact that a vector of this size might not be possible to store in memory when the state space is large, and of the fact that we want to generalise from state-action pairs visited to other state-action pairs that are rarely or never visited, making the number of visits to each state-action pair less relevant.

5. Numerical illustrations

5.1. Simple model

We use the following parameter values: $\gamma=0.9$, $N=10$, $\mu=5$, $\beta_0=10$, $\beta_1=1$, $\xi=0.05$, and $\nu=1$. This means that the expected yearly total cost for the insurer is 70 and the expected yearly cost per customer is 7. We emphasise that parameter values are meant to be interpreted in suitable units to fit the application in mind.
The cost function is

(5.1)
\begin{align} c(p)=p+c_1\left(c_2^{p}-1\right),\end{align}

with $c_1=1$ and $c_2=1.2$. For the model with a terminal state, we use $\eta=10>\gamma/(1-\gamma)$, as suggested in Section 3.2. The premium level is truncated and discretised according to $\mathcal A = \{0.2,0.4,\ldots,19.8,20.0\}$. As the model for the MDP is formulated, $\operatorname{P}(G_t=g)>0$ for all integers g. However, for actions/premiums that are considered with the aim of solving the optimisation problem, it is sufficient to consider only a finite range of integer values for $G_t$, since transitions to values outside this range have negligible probability. Specifically, we will only consider $\{-20,-19,\dots,149,150\}$ as possible surplus values. In order to ensure that transition probabilities sum to one, we must adjust the probabilities of transitions to the limiting surplus values according to the original probabilities of exiting the range of possible surplus values. For details on the transition probabilities and truncation for the simple model, see the Supplemental Material, Section 1.

Remark 5.1 (Cost function). The cost function (5.1) was suggested in Martin-Löf (1994), since it is an increasing, convex function and thus will lead to the premium being more averaged over time. However, in Martin-Löf (1994), $c_2=1.5$ was used in the calculations. We have chosen a slightly lower value of $c_2$ because too extreme rewards can lead to numerical problems when using SARSA with linear function approximation.

Remark 5.2 (Truncation). Truncating the surplus process at 150 does not have a material effect on the optimal policy. However, the minimum value (here -20) will have an effect on the optimal policy for the MDP with a terminal state and should be seen as another parameter value that needs to be chosen to determine the reward signal, see Section 3.2.

5.1.1. Policy iteration

The top row in Figure 1 shows the optimal policy and the stationary distribution under the optimal policy, for the simple model with a constraint on the action space (Section 3.1) using policy iteration. The bottom row in Figure 1 shows the optimal policy and the fraction of time spent in each state under the optimal policy, for the simple model with a terminal state (Section 3.2) using policy iteration. In both cases, the premium charged increases as the surplus or the previously charged premium decreases. Based on the fraction of time spent in each state under each of these two policies, we note that in both cases the average premium level is close to the expected cost per contract (7), but the average surplus level is slightly lower when using the policy for the model with a constraint on the action space compared to when using the policy for the model with the terminal state. However, the policies obtained for these two models are quite similar, and since (as discussed in Section 3.2) the model with the terminal state is more appropriate in more realistic settings, we focus the remainder of the analysis only on the model with the terminal state.

Figure 1. Simple model using policy iteration. Top: with constraint. Bottom: with terminal state. First and second column: optimal policy. Third column: fraction of time spent in each state under the optimal policy.
5.1.2. Linear function approximation

We have a 2-dimensional state space, and hence, $k+1=3$. When using the Fourier basis we should have $s\in[0,1]^{k}$, $a\in[0,1]$; hence, we rescale the inputs according to

\begin{align*}\tilde s_1 = \frac{s_1-\min\mathcal G}{\max\mathcal G-\min\mathcal G}, \quad\tilde s_2= \frac{s_2-\min\mathcal A}{\max\mathcal A-\min\mathcal A}, \quad\tilde a = \frac{a-\min\mathcal A}{\max\mathcal A-\min\mathcal A},\end{align*}

and use $(\tilde s_1, \tilde s_2, \tilde a)^\top$ as input. We use a softmax policy, that is,

\begin{align*}\pi(a| s) = \frac{e^{\hat q(s,a;\,\boldsymbol{{w}})/\tau }}{\sum_{a \in \mathcal{A}(s)} e^{\hat q(s,a;\,\boldsymbol{{w}})/\tau }},\end{align*}

where $\tau$ is slowly decreased according to

\begin{align*}\tau_t = \max\!\left\{\tau_{\min},\tau_{0}d^{t-1}\right\}, \quad \tau_0=2,\quad \tau_{\min}=0.02,\quad d=0.99999,\end{align*}

where $\tau_t$ is the parameter used during episode t. This schedule for decreasing the temperature parameter is somewhat arbitrary, and the parameters have not been tuned. The choice of a softmax policy is based on the results in Perkins and Precup (2003) and Melo et al. (2008), discussed in Section 4.2.1. Since a softmax policy is Lipschitz continuous, convergence of SARSA to a unique policy is guaranteed, under the condition that the policy is also $\varepsilon$-soft and that the Lipschitz constant L is small enough. However, since the temperature parameter $\tau$ is slowly decreased, the policy chosen is not necessarily $\varepsilon$-soft for all states and time steps, and the Lipschitz constant increases as $\tau$ decreases. Despite this, our results show that, when using a 3rd order Fourier basis, the algorithm converges to a policy that approximates well the optimal policy derived with policy iteration, see the first column in Figure 2. The same cannot be said for an $\varepsilon$-greedy policy. In this case, the algorithm converges to a policy that in general charges a higher premium than the optimal policy derived with policy iteration, see the fourth column in Figure 2. For the $\varepsilon$-greedy policy, we decrease the parameter according to

\begin{align*}\varepsilon_t = \max\!\left\{\varepsilon_{\min},\varepsilon_{0}d^{t-1}\right\},\quad \varepsilon_0=0.2, \quad \varepsilon_{\min}=0.01,\end{align*}

where $\varepsilon_t$ is the parameter used during episode t.

Figure 2. Optimal policy for simple model with terminal state using linear function approximation. First column: 3rd order Fourier basis. Second column: 2nd order Fourier basis. Third column: 1st order Fourier basis. Fourth column: 3rd order Fourier basis with $\varepsilon$-greedy policy.

The starting state is selected uniformly at random from $\mathcal S$. Furthermore, since discounting will lead to rewards after a large number of steps having a very limited effect on the total reward, we run each episode for at most 100 steps, before resetting to a starting state, again selected uniformly at random from $\mathcal S$. This has the benefit of diversifying the states experienced, enabling us to achieve an approximate policy that is closer to the policy derived with dynamic programming as seen over the whole state space.
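The behaviour policy and the decay schedules above are straightforward to implement. The sketch below is illustrative; in particular, the decay factor for $\varepsilon$ is assumed to be the same d = 0.99999 as for the temperature, which the text does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Softmax behaviour policy over the premium grid, using the current
# approximate action values q_hat(s, a; w) for the visited state.
def softmax_action(q_values, tau):
    z = np.asarray(q_values) / tau
    p = np.exp(z - z.max())            # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Decay schedules for the temperature and the epsilon-greedy parameter,
# with t the episode index (t = 1, 2, ...).
def temperature(t, tau0=2.0, tau_min=0.02, d=0.99999):
    return max(tau_min, tau0 * d ** (t - 1))

def epsilon(t, eps0=0.2, eps_min=0.01, d=0.99999):
    return max(eps_min, eps0 * d ** (t - 1))
```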
The starting state is selected uniformly at random from $\mathcal S$. Furthermore, since discounting means that rewards received after a large number of steps have a very limited effect on the total reward, we run each episode for at most 100 steps before resetting to a starting state, again selected uniformly at random from $\mathcal S$. This has the benefit of diversifying the states experienced, enabling us to obtain an approximate policy that is closer to the policy derived with dynamic programming over the whole state space. The step size parameter used is
(5.2) \begin{align} \alpha_t = \min\!\left\{\alpha_0,\frac{1}{t^{0.5+\theta}}\right\},\end{align}
where $\alpha_t$ is the step size parameter used during episode t, and $0<\theta\leq0.5$. The largest $\alpha_0$ that ensures that the weights do not explode can be found via trial and error. However, the value of $\alpha_0$ obtained in this way coincides with the "rule of thumb" for setting the step size parameter suggested in Sutton and Barto (2018, Ch. 9.6), namely
\begin{align*}\alpha_0 = \frac{1}{\operatorname{E}_{\pi}\left[x^\top x\right]}, \quad \operatorname{E}_{\pi}\left[x^\top x\right] = \sum_{s,a}\mu_{\pi}(s)\pi(a|s) x(s,a)^\top x(s,a).\end{align*}
If $x\!\left(S_t,A_t\right)^\top x\!\left(S_t,A_t\right)\approx \operatorname{E}_{\pi}\left[x^\top x\right]$, then this step size ensures that the error (i.e., the difference between the updated estimate $w_{t+1}^\top x\!\left(S_t,A_t\right)$ and the target $R_{t+1}+\gamma w_t^\top x(S_{t+1},A_{t+1})$) is reduced to zero after one update. Hence, using a step size larger than $\alpha_0=\left(\operatorname{E}_{\pi}\left[x^\top x\right]\right)^{-1}$ risks overshooting the optimum, or even divergence of the algorithm. For the examples studied here, when using the Fourier basis of order n this becomes
\begin{align*}\operatorname{E}_{\pi}\left[x^\top x\right] &= \sum_{s,a}\mu_{\pi}(s)\pi(a|s)\sum_{i=1}^{(n+1)^{k+1}}\cos^2\Big(\pi(sc_{1:k}^{(i)}+ac_{k+1}^{(i)})\Big)\approx \frac{(n+1)^{k+1}}{2}.\end{align*}
For the simple model, we used $\alpha_0 = 0.2$ for $n=1$, $\alpha_0=0.07$ for $n=2$, and $\alpha_0=0.03$ for $n=3$. For the intermediate model, we used $\alpha_0 = 0.06$ for $n=1$, $\alpha_0=0.008$ for $n=2$, and $\alpha_0=0.002$ for $n=3$. For the realistic model, we used $\alpha_0=0.002$ for $n=3$. In all cases $\alpha_0\approx \left(\operatorname{E}_{\pi}\left[x^\top x\right] \right)^{-1}$. For $\theta$, we tried values in the set $\{0.001,0.1,0.2,0.3,0.4,0.5\}$. For the simple model, the best results were obtained with $\theta = 0.001$ irrespective of n. For the intermediate model, we used $\theta=0.5$ for $n=1$, $\theta=0.2$ for $n=2$, and $\theta=0.3$ for $n=3$. For the realistic model, we used $\theta=0.2$ for $n=3$.

Remark 5.3 (Step size). There are automatic methods for adapting the step size. One such method is the Autostep method from Mahmood et al. (2012), a tuning-free version of the Incremental Delta-Bar-Delta (IDBD) algorithm from Sutton (1992). When using this method, with parameters set as suggested by Mahmood et al. (2012), the algorithm performs marginally worse compared to the results below.

Figure 2 shows the optimal policy for the simple model with terminal state using linear function approximation with 3rd-, 2nd-, and 1st-order Fourier bases using a softmax policy, and with a 3rd-order Fourier basis using an $\varepsilon$-greedy policy. A comparison with Figure 1 shows that the approximate optimal policy using the 3rd-order Fourier basis is close to the optimal policy derived with policy iteration. Using a 1st- or 2nd-order Fourier basis also gives a reasonable approximation of the optimal policy, but worse performance. Combining the 3rd-order Fourier basis with an $\varepsilon$-greedy policy gives considerably worse performance.
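These choices enter the weight update of SARSA with linear function approximation. The following sketch (hypothetical names, assuming the linear approximation $\hat q(s,a;w)=w^\top x(s,a)$, the semi-gradient weight update referred to in Section 4.2, the step-size schedule (5.2) and the rule of thumb above) illustrates the step-size computation and a single update:

import numpy as np

def alpha_0(order, num_inputs):
    # Rule of thumb: alpha_0 = 1 / E[x^T x], with E[x^T x] approximately (order+1)^num_inputs / 2
    # for the Fourier basis; e.g. alpha_0(3, 3) is about 0.031, close to the 0.03 used for the simple model.
    return 2.0 / (order + 1) ** num_inputs

def step_size(episode, a0, theta=0.001):
    # Step-size schedule (5.2): alpha_t = min{alpha_0, 1 / t^(0.5 + theta)}, with t the episode number.
    return min(a0, 1.0 / episode ** (0.5 + theta))

def sarsa_update(w, x, reward, x_next, alpha, gamma=0.9, terminal=False):
    # One semi-gradient SARSA update for the linear approximation q_hat(s, a; w) = w^T x(s, a).
    target = reward if terminal else reward + gamma * (w @ x_next)
    return w + alpha * (target - w @ x) * x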
The same conclusions can be drawn from Table 2, which shows the expected total discounted reward per episode for these policies, together with the results for the optimal policy derived with policy iteration, the policy derived with Q-learning, and several benchmark policies (see Section 5.1.3). Clearly, the performance of the 3rd-order Fourier basis is very close to the performance of the optimal policy derived with policy iteration, and hence we conclude that linear function approximation with a 3rd-order Fourier basis using a softmax policy appears to converge to approximately the optimal policy. The policy derived with Q-learning shows worse performance than both the 3rd- and 2nd-order Fourier bases, even though the number of episodes run for the Q-learning algorithm is approximately a factor of 30 larger than the number of episodes run before convergence of SARSA with linear function approximation. Hence, even for this simple model, the number of states is too large for the Q-learning algorithm to converge within a reasonable amount of time. Furthermore, we see that all policies derived with linear function approximation using a softmax policy outperform the benchmark policies. Note that the optimal policy derived with policy iteration, the best constant policy, and the myopic policy with the terminal state require full knowledge of the underlying model and the transition probabilities, and the myopic policy with the constraint requires an estimate of the expected surplus one time step ahead, while the policies derived with function approximation or Q-learning only require real or simulated experience of the environment.

Table 2. Expected discounted total reward (uniformly distributed starting states) for the simple model with terminal state. The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Policy | Expected reward | Terminations
Policy iteration | −85.91 | 0.006
Q-learning | −86.50 | 0.002
Fourier 3 with softmax policy | −86.11 | 0.006
Fourier 2 with softmax policy | −86.30 | 0.007
Fourier 1 with softmax policy | −86.59 | 0.011
Fourier 3 with $\varepsilon$-greedy policy | −92.74 | 0.000
Best constant policy | −122.70 | 0.040
Myopic policy with terminal state, $p_{\min}=0.2$ | −97.06 | 0.133
Myopic policy with terminal state, $p_{\min}=5.8$ | −90.40 | 0.096
Myopic policy with constraint, $p_{\min}=0.2$ | −121.52 | 0.337
Myopic policy with constraint, $p_{\min}=6.4$ | −93.58 | 0.132

To analyse the difference between some of the policies, we simulate 300 episodes for the policy with the 3rd-order Fourier basis, the best constant policy, and the myopic policy with terminal state and $p_{\min}=5.8$, for a few different starting states, two of which can be seen in Figure 3. A star in the figure corresponds to one or more terminations. The total number of terminations (out of 300 episodes) is as follows: $S_0=({-}10,2)$: Fourier 3: 1, best constant: 291, myopic $p_{\min}=5.8$: 20; $S_0=(50,7)$: Fourier 3: 1, best constant: 0, myopic $p_{\min}=5.8$: 20. For other starting states, the comparison is similar to that in Figure 3. We see that the policy with the 3rd-order Fourier basis appears to outperform the myopic policy in all respects: on average the premium is lower, the premium is more stable over time, and there are very few defaults. The best constant policy is naturally the most stable, but in general leads to a higher premium than the other policies and will, for more strained starting states, quickly lead to a large number of terminations.
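The expected rewards and termination fractions reported in Table 2 (and in Tables 3 and 4 below) can be estimated by straightforward Monte Carlo simulation. The following is a minimal sketch of such an evaluation; policy, simulate_step and sample_start are hypothetical placeholders for the premium rule under study, a simulator of the environment and a sampler of uniformly distributed starting states:

import numpy as np

def evaluate(policy, simulate_step, sample_start, n_episodes=10000, gamma=0.9, max_steps=100):
    # Estimate the expected discounted total reward per episode and the fraction of episodes
    # that reach the terminal state within max_steps time steps.
    returns, terminations = [], 0
    for _ in range(n_episodes):
        state, total, discount = sample_start(), 0.0, 1.0
        for _ in range(max_steps):
            action = policy(state)
            state, reward, terminal = simulate_step(state, action)
            total += discount * reward
            discount *= gamma
            if terminal:
                terminations += 1
                break
        returns.append(total)
    return np.mean(returns), terminations / n_episodes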
Figure 3. Simple model. Top row: policy with 3rd-order Fourier basis. Bottom row: myopic policy with terminal state, $p_{\min}=5.8$. Left: starting state $S_0=({-}10,2)$. Right: starting state $S_0=(50,7)$. The red line shows the best constant policy. A star indicates at least one termination.

5.1.3. Benchmark policies

Best constant policy. The best constant policy is the solution to
\begin{align*}\underset{p}{\text{minimise }}&\operatorname{E} \!\left[\sum_{t=0}^T \gamma^t h\!\left(p,S_{t+1}\right)\right].\end{align*}
For both the simple and intermediate models, $p=7.4$ solves this optimisation problem. For details, see the Supplemental Material, Section 4.1.

Myopic policy for the MDP with constraint. The myopic policy maximises immediate rewards. For the model with a constraint on the action space, the myopic policy solves
(5.3) \begin{align} \underset{p}{\text{minimise }} c(p)\quad\text{subject to } \operatorname{E}\!\left[G_{1}\mid S_0=s,P_0=p\right]\geq 0.\end{align}
The myopic policy is given by the lowest premium level that satisfies the constraint. For details on how (5.3) is solved, see the Supplemental Material, Section 4.2.

For the simple, intermediate, and realistic models, the myopic policy charges the minimum premium level in a large number of states. Since this policy reduces the premium to the minimum level very quickly as the surplus or the previously charged premium increases, it is not likely to work particularly well. Hence, we suggest an additional benchmark policy where the minimum premium level is set to a higher value, $p_{\min}$. This adjusted myopic policy is given by $\tilde \pi(s) = \max\{\pi(s),p_{\min}\}$, where $\pi$ denotes the policy that solves (5.3). Based on simulations of the expected total discounted reward per episode for different values of $p_{\min}$, we conclude that $p_{\min}=6.4$ achieves the best results for both the simple and intermediate models, and $p_{\min}=10.5$ achieves the best results for the realistic model.

Myopic policy for the MDP with terminal state. For the model with a terminal state, the myopic policy is the solution to the optimisation problem
(5.4) \begin{align} \underset{p}{\text{minimise }} &\operatorname{E} \!\left[h(p,S_{1})\mid S_0=s,P_0=p\right].\end{align}
For details on how (5.4) is solved, see the Supplemental Material, Section 4.3. We also suggest an additional benchmark policy where the minimum premium level has been set to $p_{\min}=5.8$, the level that achieves the best results based on simulations of the expected total discounted reward per episode. For the intermediate and realistic models, this myopic policy is too complex and is therefore not a good benchmark.
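The adjusted myopic benchmark and the selection of $p_{\min}$ can be sketched as follows (hypothetical names; myopic_premium stands for a solver of (5.3) or (5.4) and evaluate_policy for a Monte Carlo estimate of the expected discounted total reward, as sketched above):

def adjusted_myopic(state, myopic_premium, p_min=6.4):
    # Adjusted myopic benchmark pi_tilde(s) = max{pi(s), p_min}.
    return max(myopic_premium(state), p_min)

def select_p_min(candidates, evaluate_policy, myopic_premium):
    # Choose p_min by simulating the expected discounted total reward for each candidate value.
    return max(candidates, key=lambda pm: evaluate_policy(
        lambda state: adjusted_myopic(state, myopic_premium, pm)))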
5.2. Intermediate model

We use the following parameter values: $\gamma=0.9$, $\mu=5$, $\beta_0=10$, $\beta_1=1$, $\xi=0.05$, $\nu=1$, $\alpha_1=0.7$, $\alpha_2=0.3$, $\eta=10$, $a=18$, $b=-0.3$, and the cost function (5.1) with $c_1=1$ and $c_2=1.2$. The premium level and the number of contracts written are truncated and discretised according to $\mathcal A = \{0.2,0.4,\ldots,19.8,20.0\}$ and $\mathcal N = \{0,1,\ldots,30\}$. The surplus process no longer takes only integer values (as in the simple model); instead, the values that the surplus process can take are determined by the parameter values chosen. However, it is still truncated to lie between $-20$ and 150. For the parameter values above, we have $\mathcal G = \{-20.00,-19.95,\ldots,149.95,150.00\}$.

Figure 4 shows the optimal policy for the intermediate model using linear function approximation, with 3rd-, 2nd-, and 1st-order Fourier bases, for $N_{t}=N_{t-1}=10$. Comparing the policy with the 3rd-order Fourier basis with the policy with the 2nd-order Fourier basis, the former appears to require a slightly lower premium when the surplus or the previously charged premium is very low. The policy with the 1st-order Fourier basis appears quite extreme compared to the other two policies. Comparing the policy with the 3rd-order Fourier basis for $N_{t}=N_{t-1}=10$ with the optimal policy for the simple model (bottom row in Figure 1), we note that $\pi^{i}(g,p,10,10)\neq \pi^s(g,p)$, where $\pi^s$ and $\pi^{i}$ denote, respectively, the policy for the simple and the intermediate model. There is a qualitative difference between these policies: even in a state where $N_t=N_{t-1}=10$ under the intermediate model, the policy from the simple model does not take into account the effect that the premium charged will have on the number of contracts issued at time $t+1$. The policy with the 3rd-order Fourier basis for $N_t,N_{t-1}\in\{5,10,15\}$ can be seen in Figure 5.

Figure 4. Optimal policy for the intermediate model with terminal state using linear function approximation, $N_t=N_{t-1}=10$. Left: 3rd-order Fourier basis. Middle: 2nd-order Fourier basis. Right: 1st-order Fourier basis.

Figure 5. Optimal policy for the intermediate model with terminal state using linear function approximation with 3rd-order Fourier basis, for $N_t,N_{t-1}\in\{5,10,15\}$.

To determine the performance of the policies for the intermediate model, we simulate the expected total discounted reward per episode for these policies. The results can be seen in Table 3. Here we clearly see that the policy with the 3rd-order Fourier basis outperforms the other policies and that the policy with the 1st-order Fourier basis performs quite badly, since it is not flexible enough to be used in this more realistic setting. We also compare the policies with the optimal policy derived with policy iteration for the simple model, while simulating from the intermediate model. Though this policy performs worse than the policies with 3rd- and 2nd-order Fourier bases, it outperforms the policy with the 1st-order Fourier basis. Note that the policies derived with function approximation only require real or simulated experience of the environment. The results for the myopic policy in Table 3 use the true parameters when computing the expected value of the surplus. Despite this, the policy derived with the 3rd-order Fourier basis outperforms the myopic policy.

Table 3. Expected discounted total reward based on simulation (uniformly distributed starting states). The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Policy | Expected reward | Terminations
Fourier 3 | −97.17 | 0.015
Fourier 2 | −104.41 | 0.025
Fourier 1 | −128.83 | 0.036
Policy from simple model | −116.70 | 0.075
Myopic policy with constraint, $p_{\min}=0.2$ | −360.69 | 0.988
Myopic policy with constraint, $p_{\min}=6.4$ | −100.92 | 0.169
Best constant policy | −131.85 | 0.060

To analyse the difference between some of the policies, we simulate 300 episodes for the policy with the 3rd-order Fourier basis, the best constant policy and the policy from the simple model, for a few different starting states, two of which can be seen in Figure 6. Each star in the figures corresponds to one or more terminations at that time point. The total number of terminations (out of 300 episodes) is as follows: $S_0=(0, 7, 10, 10)$: Fourier 3: 1, best constant: 13, simple: 5; $S_0=(100, 15, 5, 5)$: Fourier 3: 0, best constant: 0, simple: 10.
Comparing the policy with the 3rd-order Fourier basis with the policy from the simple model, we see that the former tends to give a lower premium on average and leads to very few defaults, but the premium is slightly more variable than the premium charged by the simple policy. This is not surprising, since the simple policy does not take the variation in the number of contracts issued into account. At the same time, this is to the detriment of the simple policy: since it cannot correctly take the risk stemming from the varying number of contracts into account, it leads to more defaults. For example, for the more strained starting state $S_0=({-}10,2,20,20)$ (not shown in the figure), the number of defaults for the policy with the 3rd-order Fourier basis is 91 out of 300, and for the simple policy it is 213 out of 300. Similarly, for starting state $S_0=(100, 15, 5, 5)$ (second column in Figure 6), the simple policy tends to set the premium much too low during the first time step, leading to more early defaults than for the starting state $S_0=(0, 7, 20, 20)$ (first column in Figure 6), despite the fact that the latter starting state has a much lower starting surplus.

Figure 6. Intermediate model. Top row: policy with 3rd-order Fourier basis. Bottom row: policy from the simple model. Left: starting state $S_0=(0, 7, 20, 20)$. Right: starting state $S_0=(100, 15, 5, 5)$. The red line shows the best constant policy. A star indicates at least one termination.

5.3. Realistic model

To estimate the parameters of the model for the cumulative claims payments in (2.8), we use the motor third party liability data considered in Miranda et al. (2012). The data consist of incremental runoff triangles for the number of reported accidents and incremental payments, with 10 development years. We have no information on the number of contracts. For parameter estimation only, we assume a constant number of contracts over the ten observed years, that is $\widetilde N_{t+1}=N$, and that the total number of claims for each accident year is approximately 5% of the number of contracts. We further assume that the total number of claims per accident year is well approximated by the number of reported claims over the first two development years. This leads to an estimate of the number of contracts $\widetilde N_{t+1}$ in (2.8) of $\widehat N = 2.17\cdot 10^5$. For parameter estimation, we assume that $\mu_{0,t}=\mu_0$ in (2.9). Hence, $\mu_0$ and $\nu_0^2$ are estimated as the sample mean and variance of $\left(\log\!\left(C_{t,1}\right)\right)_{t=1}^{10}$. Similarly, $\mu_j$ and $\nu_j^2$ are estimated as the sample mean and variance of $\left(\log\!\left(C_{t,j+1}/C_{t,j}\right)\right)_{t=1}^{10-j}$ for $j=1,\ldots,8$. Since we only have one observation for $j=9$, we let $\widehat \mu_9=\log\!\left(C_{1,10}/C_{1,9}\right)$ and $\widehat \nu_9=0$. The parameter $c_0$ is estimated by $\widehat c_0 = \exp\!\left\{\widehat \mu_0+\widehat \nu_0^2/2\right\}/\widehat N\approx 2.64$. The parameter estimates can be seen in Table 1 in the Supplemental Material, Section 5.
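The estimation step described above amounts to computing sample means and variances of log-developments in the runoff triangle. The following sketch illustrates this under the assumptions stated in the text (a constant number of contracts and a triangle of cumulative payments stored row-wise); the function name and data layout are hypothetical:

import numpy as np

def estimate_lognormal_development(C, n_contracts):
    # Estimate (mu_j, nu_j) in (2.8)-(2.9) from a cumulative runoff triangle.
    # C is a list of arrays; C[t] holds the cumulative payments C_{t,1}, C_{t,2}, ... for
    # accident year t (0-based), so row t has 10 - t observed entries for t = 0, ..., 9.
    n_years = len(C)
    log_first = np.log([C[t][0] for t in range(n_years)])
    mu = [log_first.mean()]
    nu = [log_first.std(ddof=1)]
    for j in range(1, n_years):
        ratios = np.log([C[t][j] / C[t][j - 1] for t in range(n_years - j)])
        mu.append(ratios.mean())
        # only one observation for the last development year: set nu to zero there
        nu.append(ratios.std(ddof=1) if len(ratios) > 1 else 0.0)
    c0 = np.exp(mu[0] + nu[0] ** 2 / 2) / n_contracts
    return np.array(mu), np.array(nu), c0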
For the model for investment earnings, the parameters are set to $\widetilde \sigma=0.03$ and $\widetilde \mu = \log(1.05)-\widetilde\sigma^2/2$, which gives similar variation in investment earnings as in the intermediate model (2.7) when the surplus is approximately 50. The remaining parameters are $\gamma=0.9$, $\beta_0=2\cdot 10^5$, $\beta_1=1$, $\eta=10$, and $b=-0.3$. The parameter a is set so that the expected number of contracts is $2\cdot 10^5$ when the premium level corresponds to the expected total cost per contract, $\beta_0/(2\cdot10^5)+\beta_1+c_0/\alpha_1 \approx 10.4$. Hence, $a\approx 4.03\cdot 10^5$. The premium level is truncated and discretised according to $\mathcal A = \{0.5,1.0,\ldots,29.5,30.0\}$. The cost function is as before (5.1), but now adjusted to give rewards of similar size as in the simple and intermediate settings. Hence, when computing the reward, the premium is adjusted to lie in $[0.2,20.0]$ according to
\begin{align*}c(p)=\widetilde p +c_1\left(c_2^{\widetilde p}-1\right),\quad \text{where }\widetilde p = 0.2 + \frac{p-0.5}{30-0.5}(20-0.2).\end{align*}
The number of contracts is truncated according to $\mathcal N= \{144170, \ldots, 498601\}$. This is based on the $0.001$-quantile of $\textsf{Pois}(a(\!\max\mathcal A)^b)$ and the $0.999$-quantile of $\textsf{Pois}(a(\min\mathcal A)^b)$. The truncation of the cumulative claims payments $C_{t,j}$ is based on the same quantile levels and lognormal distributions with parameters $\mu=\log(c_0\min\mathcal N)-\nu_0^2/2+\sum_{k=1}^{j-1}\mu_k$ and $\sigma^2 = \sum_{k=0}^{j-1}\nu_k^2$, and with parameters $\mu=\log(c_0\max\mathcal N)-\nu_0^2/2+\sum_{k=1}^{j-1}\mu_k$ and $\sigma^2 = \sum_{k=0}^{j-1}\nu_k^2$, respectively, for $j=1,\ldots,9$. The truncation for the cumulative claims payments can be seen in Table 2 in the Supplemental Material, Section 5. The surplus is truncated to lie in $[{-}0.6,4.5]\cdot 10^6$. Note that for the realistic model, with 10 development years, the state space becomes 13-dimensional. Using the full 3rd-order Fourier basis is not possible, since it consists of $4^{14}$ basis functions. We reduce the number of basis functions, allowing for more flexibility where the model is likely to need it. Specifically,
\begin{align*}&\left\{\left(c^{(i)}_1,c^{(i)}_2,c^{(i)}_3,c^{(i)}_4,c^{(i)}_{14}\right)\,:\,i=1,\dots,4^5\right\}=\{0,\dots,3\}^5, \\[4pt]&c^{(i)}_1=c^{(i)}_2=c^{(i)}_3=c^{(i)}_4=c^{(i)}_{14}=0 \quad \text{for } i=4^5+1,\dots,4^5+9,\end{align*}
and, for $j=1,\ldots,9$,
\begin{align*}c^{(i)}_{4+j}=\left\{\begin{array}{ll}1 & \text{for } i=4^5 + j, \\[5pt]0 & \text{for } i\neq 4^5+j.\end{array}\right.\end{align*}
This means less flexibility for the variables corresponding to the cumulative payments and no interaction terms between a variable corresponding to a cumulative payment and any other variable.
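A possible construction of this reduced set of Fourier index vectors is sketched below (hypothetical function names). It builds the full $(n+1)^5$ grid over the coordinates for the surplus, the previous premium, the two contract counts and the action, and adds one first-order term per cumulative-payment coordinate:

import itertools
import numpy as np

def reduced_fourier_coefficients(order=3, n_payment_dims=9):
    # Index vectors c^(i) for the reduced Fourier basis; inputs are the 13 rescaled state
    # variables followed by the rescaled action, i.e. 14 coordinates in total.
    dim = 4 + n_payment_dims + 1
    coeffs = []
    # full (order+1)^5 grid over coordinates 1-4 of the state and the action, payments set to 0
    for c in itertools.product(range(order + 1), repeat=5):
        v = np.zeros(dim, dtype=int)
        v[[0, 1, 2, 3, dim - 1]] = c
        coeffs.append(v)
    # one extra basis function per cumulative-payment coordinate: a single 1, no interactions
    for j in range(1, n_payment_dims + 1):
        v = np.zeros(dim, dtype=int)
        v[3 + j] = 1
        coeffs.append(v)
    return np.array(coeffs)                 # shape (4**5 + 9, 14) for order 3

def reduced_features(x, coeffs):
    # Fourier features cos(pi * c^(i) . x) for rescaled inputs x in [0, 1]^14.
    return np.cos(np.pi * coeffs @ x)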
Figure 7 shows the optimal policy for the realistic model using linear function approximation for $N_t,N_{t-1}\in\{1.75,2.00,2.50\}\cdot 10^5$, $C_{t-1,1} = c_0\cdot2\cdot 10^5$, and $C_{t-j,j} = c_0\cdot2\cdot 10^5\prod_{k=1}^{j-1}f_k$ for $j=2,\ldots,9$. To determine the performance of the approximate optimal policy for the realistic model, we simulate the expected total discounted reward per episode for this policy. The results can be seen in Table 4. The approximate optimal policy outperforms all benchmark policies. The best performing benchmark policy, the "interval policy", corresponds to choosing the premium level equal to the expected total cost per contract, when the number of contracts is $2\cdot 10^5$, as long as the surplus lies in the interval $[1.2,2.8]\cdot 10^6$. This is based on a target surplus of $2\cdot 10^6$ and choosing the $\phi\in\{0.1,0.2,\ldots,1.0\}$ that results in the best expected total reward, where the interval for the surplus is given by $[1-\phi,1+\phi]\cdot 2\cdot 10^6$. When the surplus $G_t$ is below (above) this interval, the premium is increased (decreased) in order to increase (decrease) the surplus. The premium for this benchmark policy is
\begin{align*}P_t =\begin{cases}\min\!\left\{10.5+\dfrac{(1-\phi)\cdot2\cdot10^6-G_t}{2\cdot10^5},\max\mathcal A\right\},\quad &\text{if } G_t<(1-\phi)\cdot2\cdot10^6,\\[9pt] 10.5, \quad & \text{if } (1-\phi)\cdot2\cdot10^6\leq G_t\leq (1+\phi)\cdot2\cdot10^6,\\[9pt] \max\!\left\{10.5+\dfrac{(1+\phi)\cdot2\cdot10^6-G_t}{2\cdot10^5},\min\mathcal A\right\},\quad &\text{if } (1+\phi)\cdot2\cdot10^6<G_t,\end{cases}\end{align*}
rounded to the nearest half integer so that it lies in $\mathcal A$. As before, the approximate optimal policy outperforms the benchmark policies, despite the fact that both the myopic and the interval policy use the true parameters when computing the expected surplus or the expected total cost per contract.

Figure 7. Optimal policy for the realistic model with terminal state using linear function approximation, for $N_t,N_{t-1}\in\{1.75, 2.00, 2.50\}\cdot10^5$, $C_{t-1,1} = c_0\cdot2\cdot 10^5$, and $C_{t-j,j} = c_0\cdot2\cdot 10^5\prod_{k=1}^{j-1}f_k$.

Table 4. Expected discounted total reward based on simulation (uniformly distributed starting states). The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Policy | Expected reward | Terminations
Fourier 3 | −93.14 | 0.019
Interval policy ($p=10.5$ when $G_t\in[1.2,2.8]\cdot 10^6$) | −106.70 | 0.034
Myopic policy with constraint, $p_{\min}=10.5$ | −112.70 | 0.041
Best constant policy, $p=11.5$ | −136.60 | 0.061

A comparison of the approximate optimal policy and the best benchmark policy, including a figure similar to Figure 6, is found in the Supplemental Material, Section 5.
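For reference, the interval benchmark policy can be written compactly in code. The sketch below follows the formula above, with $\phi=0.4$ (which gives the reported interval $[1.2,2.8]\cdot 10^6$), a base premium of 10.5 and $\mathcal A=\{0.5,1.0,\ldots,30.0\}$; the function name is hypothetical:

def interval_policy(g, phi=0.4, target=2e6, base_premium=10.5,
                    contracts=2e5, p_min=0.5, p_max=30.0):
    # Interval benchmark: charge the expected total cost per contract while the surplus lies
    # in [1 - phi, 1 + phi] * target, otherwise steer the surplus back towards the interval.
    lower, upper = (1 - phi) * target, (1 + phi) * target
    if g < lower:
        p = min(base_premium + (lower - g) / contracts, p_max)
    elif g > upper:
        p = max(base_premium + (upper - g) / contracts, p_min)
    else:
        p = base_premium
    # round to the nearest half integer and keep the premium in A
    return min(max(round(p * 2) / 2, p_min), p_max)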
6. Conclusion

Classical methods for solving premium control problems are suitable for simple dynamical insurance systems, where the model choice must to a large extent be based on making the problem solvable rather than on reflecting the real dynamics of the stochastic environment. For this reason, the practical use of optimal premium rules derived with classical methods is often limited.

Reinforcement learning methods enable us to solve premium control problems in realistic settings that adequately capture the complex dynamics of the system. Since these techniques can learn directly from real or simulated experience of the stochastic environment, they do not require explicit expressions for transition probabilities. Further, these methods can be combined with function approximation in order to overcome the curse of dimensionality, since the state space tends to be large in more realistic settings. This makes it possible to take key features of real dynamical insurance systems into account, for example payment delays and how the number of contracts issued in the future will vary depending on the premium rule. Hence, the optimal policies derived with these techniques can be used by insurance companies as a basis for decisions on how to set the premium.

We have illustrated strengths and weaknesses of different methods for solving the premium control problem for a mutual insurer and demonstrated that, given a complex dynamical system, the approximate policy derived with SARSA using function approximation outperforms several benchmark policies. In particular, it clearly outperforms the policy derived with classical methods based on a more simplistic model of the stochastic environment, which fails to take important aspects of a real dynamical insurance system into account. Furthermore, the use of these methods is not specific to the model choices made in Section 2. The present paper provides guidance on how to carefully design a reinforcement learning method with function approximation for the purpose of obtaining an optimal premium rule, which, together with models that fit the experience of the specific insurance company, allows for optimal premium rules that can be used in practice.

The models may be extended to include dependence on covariates. However, it should be noted that if we want to model substantial heterogeneity among policyholders and consider a large covariate set, then the action space becomes much larger and function approximation also for the policy may become necessary.
s1.sIntroductionsAn insurance company’s claim costs and investment earnings fluctuate randomly over time. The insurance company needs to determine the premiums before the coverage periods start, that is before knowing what claim costs will appear and without knowing how its invested capital will develop. Hence, the insurance company is facing a dynamic stochastic control problem. The problem is complicated because of delays and feedback effects: premiums are paid before claim costs materialise and premium levels affect whether the company attracts or loses customers.sAn insurance company wants a steady high surplus. The optimal dividend problem introduced by de Finetti (1957) (and solved by Gerber, 1969) has the objective to maximise the expected present value of future dividends. Its solution takes into account that paying dividends too generously is suboptimal since a probability of default that is too high affects the expected present value of future dividends negatively. A practical problem with implementing the optimal premium rule, that is a rule that maps the state of the stochastic environment to a premium level, obtained from solving the optimal dividend problem is that the premiums would be fluctuating more than what would be feasible for a real insurance market with competition. A good premium rule needs to generate premiums that do not fluctuate wildly over time.sFor a mutual insurance company, different from a private company owned by shareholders, maximising dividends is not the main objective. Instead the premiums should be low and suitably averaged over time, but also making sure that the surplus is sufficiently high to avoid a too high probability of default. Solving this multiple-objective optimisation problem is the focus of the present paper. Similar premium control problems have been studied by Martin-Löf (1983, 1994), and these papers have been a source of inspiration for our work.sMartin-Löf (1983) carefully sets up the balance equations for the key economic variables of relevance for the performance of the insurance company and studies the premium control problem as a linear control problem under certain simplifying assumptions enabling application of linear control theory. The paper analyses the effects of delays in the insurance dynamical system on the linear control law with feedback and discusses designs of the premium control that ensure that the probability of default is small.sMartin-Löf (1994) considers an application of general optimal control theory in a setting similar to, but simpler than, the setting considered in Martin-Löf (1983). The paper derives and discusses the optimal premium rule that achieves low and averaged premiums and also targets sufficient solvency of the insurance company.sThe literature on optimal control theory in insurance is vast, see for example the textbook treatment by Schmidli (2008) and references therein. Our aim is to provide solutions to realistic premium control problems in order to allow the optimal premium rule to be used with confidence by insurance companies. In particular, we avoid considering convenient stochastic models that may fit well with optimal control theory but fail to take key features of real dynamical insurance systems into account. Instead, we consider an insurance company that has enough data to suggest realistic models for the insurance environment, but the complexity of these models do not allow for explicit expressions for transition probabilities of the dynamical system. 
In this sense, the model of the environment is not fully known. However, the models can be used for simulating the behaviour of the environment.sIncreased computing power and methodological advances during the recent decades make it possible to revisit the problems studied in Martin-Löf (1983, 1994) and in doing so allow for more complex and realistic dynamics of the insurance dynamical system. Allowing realistic complex dynamics means that optimal premium rules, if possible to be obtained, will allow insurance companies to not only be given guidance on how to set premiums but actually have premium rules that they can use with certain confidence. The methodological advances that we use in this work is reinforcement learning and in particular reinforcement learning combined with function approximation, see for example Bertsekas and Tsitsiklis (1996) and Sutton and Barto (2018) and references therein. In the present paper, we focus on the temporal difference control algorithms SARSA and Q-learning. SARSA was first proposed by Rummery and Niranjan (1994) and named by Sutton (1995). Q-learning was introduced by Watkins (1989). By using reinforcement learning methods combined with function approximation, we obtain premium rules in terms of Markovian controls for Markov decision processes whose state spaces are much larger/more realistic than what was considered in the premium control problem studied in Martin-Löf (1994).sThere exist other methods for solving general stochastic control problems with a known model of the environment, see for example Han and E (2016) and Germain et al. (2021). However, the deep learning methods in these papers are developed to solve fixed finite-time horizon problems. The stochastic control problem considered in the present paper has an indefinite-time horizon, since the terminal time is random and unbounded. A random terminal time also causes problems for the computation of gradients in deep learning methods. There are reinforcement learning methods, such as policy gradient methods (see e.g., Williams, 1992; Sutton et al., 1999, or for an overview Sutton and Barto, 2018, Ch. 13), that enable direct approximation of the premium rule by neural networks (or other function approximators) when the terminal time is random. However, for problems where the terminal time can be quite large (as in the present paper) these methods likely require an additional approximation of the value function (so-called actor-critic methods).sIn the mathematical finance literature, there has recently been significant interest in the use of reinforcement learning, in particular related to hedging combined with function approximation, for instance the influential paper by Buehler et al. (2019) on deep hedging. Carbonneau (2021) uses the methodology in Buehler et al. (2019) and studies approaches to risk management of long-term financial derivatives motivated by guarantees and options embedded in life-insurance products. Another approach to deep hedging based on reinforcement learning for managing risks stemming from long-term life-insurance products is presented in Chong et al. (2021). Dynamic pricing has been studied extensively in the operations research literature. For instance, the problem of finding the optimal balance between learning an unknown demand function and maximising revenue is related to reinforcement learning. We refer to den Boer and Zwart (2014) and references therein. Reinforcement learning is used in Krasheninnikova et al. 
(2019) for determining a renewal pricing strategy in an insurance setting. However, the problem formulation and solution method are different from what is considered in the present paper. Krasheninnikova et al. (2019) considers retention of customers while maximising revenue and does not take claims costs into account. Furthermore, in Krasheninnikova et al. (2019) the state space is discretised in order to use standard Q-learning, while the present paper solves problems with a large or infinite state space by combining reinforcement learning with function approximation.sThe paper is organised as follows. Section 2 describes the relevant insurance economics by presenting the involved cash flows, key economic quantities such as surplus, earned premium, reserves and how such quantities are connected to each other and their dynamics or balance equations. Section 2 also introduces stochastic models (simple, intermediate and realistic) giving a complete description of the stochastic environment in which the insurance company operates and aims to determine an optimal premium rule. The stochastic model will serve us by enabling simulation of data from which the reinforcement learning methods gradually learn the stochastic environment in the search for optimal premium rules. The models are necessarily somewhat complex since we want to take realistic insurance features into account, such as delays between accidents and payments and random fluctuations in the number of policyholders, partly due to varying premium levels.sSection 3 sets up the premium control problem we aim to solve in terms of a Markov decision process and standard elements of stochastic control theory such as the Bellman equation. Finding the optimal premium rule by directly solving the Bellman (optimality) equation numerically is not possible when considering state spaces for the Markov decision process matching a realistic model for the insurance dynamical system. Therefore, we introduce reinforcement learning methods in Section 4. In particular, we present basic theory for the temporal difference learning methods Q-learning and SARSA. We explain why these methods will not be able to provide us with reliable estimates of optimal premium rules unless we restrict ourselves to simplified versions of the insurance dynamical system. We argue that SARSA combined with function approximation of the so-called action-value function will allow us to determine optimal premium rules. We also highlight several pitfalls that the designer of the reinforcement learning method must be aware of and make sure to avoid.sSection 5 presents the necessary details in order to solve the premium control problem using SARSA with function approximation. We analyse the effects of different model/method choices on the performance of different reinforcement learning techniques and compare the performance of the optimal premium rule with those of simpler benchmark rules.sFinally, Section 6 concludes the paper. We emphasise that the premium control problem studied in the present paper is easily adjusted to fit the features of a particular insurance company and that the excellent performance of a carefully set up reinforcement learning method with function approximation provides the insurance company with an optimal premium rule that can be used in practice and communicated to stakeholders.s2.sA stochastic model of the insurance companysThe number of contracts written during year s$t+1$sis denoted s$N_{t+1}$s, known at the end of year s$t+1$s. 
The premium per contract s$P_t$sduring year s$t+1$sis decided at the end of year t. Hence, s$P_t$sis s$\mathcal{F}_t$s-measurable, where s$\mathcal{F}_t$sdenotes the s$\sigma$s-algebra representing the available information at the end of year t. Contracts are assumed to be written uniformly in time over the year and provide insurance coverage for one year. Therefore, assuming that the premium income is earned linearly with time, the earned premium during year s$t+1$sis s\begin{equation*}\text{EP}_{t+1}=\frac{1}{2}\left(P_tN_{t+1}+P_{t-1}N_t\right),\end{equation*}sthat is for contracts written during year s$t+1$s, on average half of the premium income s$P_tN_{t+1}$swill be earned during year s$t+1$s, and half during year s$t+2$s. Since only half of the premium income s$P_tN_{t+1}$sis earned during year s$t+1$s, the other half, which should cover claims during year s$t+2$s, will be stored in the premium reserve. The balance equation for the premium reserve is s$V_{t+1}=V_t+P_tN_{t+1}-\text{EP}_{t+1}$s. Note that when we add cash flows or reserves occurring at time s$t+1$sto cash flows or reserves occurring at time t, the time t amounts should be interpreted as adjusted for the time value of money. We choose not to write this out explicitly in order to simplify notation.sThat contracts are written uniformly in time over the year means that s$I_{t,k}$s, the incremental payment to policyholders during year s$t+k$sfor accidents during year s$t+1$s, will consist partly of payments to contracts written during year s$t+1$sand partly of payments to contracts written during year t. Hence, we assume that s$I_{t,k}$sdepends on both s$N_{t+1}$sand s$N_t$s. Table 1 shows a claims triangle with entries s$I_{j,k}$srepresenting incremental payments to policyholders during year s$j+k$sfor accidents during year s$j+1$s. For ease of presentation, other choices could of course be made, we will assume that the maximum delay between an accident and a resulting payment is four years. Entries s$I_{j,k}$swith s$j+k\leq t$sare s$\mathcal{F}_t$s-measurable and coloured blue in Table 1. Let s\begin{align*}\text{IC}_{t+1}&=I_{t,1}+\operatorname{E}\!\left[I_{t,2}+I_{t,3}+I_{t,4}\mid \mathcal{F}_{t+1}\right],\quad \text{PC}_{t+1}=I_{t,1}+I_{t-1,2}+I_{t-2,3}+I_{t-3,4},\\\text{RP}_{t+1}&=\operatorname{E}\!\left[I_{t-3,4}+I_{t-2,3}+I_{t-2,4}+I_{t-1,2}+I_{t-1,3}+I_{t-1,4}\mid\mathcal{F}_t\right]\\&\quad-\operatorname{E}\!\left[I_{t-3,4}+I_{t-2,3}+I_{t-2,4}+I_{t-1,2}+I_{t-1,3}+I_{t-1,4}\mid\mathcal{F}_{t+1}\right],\end{align*}swhere IC, PC and RP denote, respectively, incurred claims, paid claims and runoff profit. The balance equation for the claims reserve is s$E_{t+1}=E_{t}+\text{IC}_{t+1}-\text{RP}_{t+1}-\text{PC}_{t+1}$s, where s(2.1)sThe profit or loss during year s$t+1$sdepends on changes in the reserves: s\begin{align*}\text{P&L}_{t+1} = P_tN_{t+1} - \text{PC}_{t+1} + \text{IE}_{t+1}-\text{OE}_{t+1} - (E_{t+1}-E_t + V_{t+1}-V_t),\end{align*}swhere IE denotes investment earnings and OE denotes operating expenses. The dynamics of the surplus fund is therefore s(2.2)s\begin{align}G_{t+1}=G_t+\text{P&L}_{t+1}=G_t+\text{EP}_{t+1}+\text{IE}_{t+1}-\text{OE}_{t+1}-\text{IC}_{t+1}+\text{RP}_{t+1}.\end{align}sTable 1.sPaid claim amounts from accidents during years s$t-2,\dots,t+1$s.sWe consider three models of increasing complexity. The simple model allows us to solve the premium control problem with classical methods. 
In this situation, we can compare the results obtained with classical methods with the results obtained with more flexible methods, allowing the assessment of the performance of a chosen flexible method. Classical solution methods are not feasible for the intermediate model. However, the similarity between the simple and intermediate model allows us to understand how increasing model complexity affects the optimal premium rule. Finally, we consider a realistic model, where the models for the claims payments and investment earnings align closer with common distributional assumptions for these quantities. Since the simple model is a simplified version of the intermediate model, we begin by defining the intermediate model in Section 2.1, followed by the simple model in Section 2.2. In Section 2.3, the more realistic models for claims payments and investment earnings are defined.s2.1.sIntermediate modelsWe choose to model the key random quantities as integer-valued random variables with conditional distributions that are either Poisson or Negative Binomial distributions. Other choices of distributions on the integers are possible without any major effects on the analysis that follows. Let s(2.3)s\begin{align}\mathcal{L}\!\left(N_{t+1}\mid \mathcal{F}_t\right)=\mathcal{L}(N_{t+1}\mid P_t)=\textsf{Pois}\!\left(aP_t^{b}\right),\end{align}swhere s$a>0$sis a constant, and s$b<0$sis the price elasticity of demand. The notation says that the conditional distribution of the number of contracts written during year s$t+1$sgiven the information at the end of year t depends on that information only through the premium decided at the end of year t for those contracts.sLet s$\widetilde{N}_{t+1}=(N_{t+1}+N_t)/2$sdenote the number of contracts during year s$t+1$sthat provide coverage for accidents during year s$t+1$s. Let s(2.4)s\begin{align}\text{OE}_{t+1}=\beta_0 +\beta_1 \widetilde{N}_{t+1},\end{align}ssaying that the operating expenses have both a fixed part and a variable part proportional to the number of active contracts. The appearance of s$\widetilde N_{t+1}$sinstead of s$N_{t+1}$sin the expressions above is due to the assumption that contracts are written uniformly in time over the year and that accidents occur uniformly in time over the year.sLet s$\alpha_1,\dots,\alpha_4\in [0,1]$swith s$\sum_{i=1}^4\alpha_i = 1$s. The constant s$\alpha_k$sis, for a given accident year, the expected fraction of claim costs paid during development year k. Let s(2.5)s\begin{align}\mathcal{L}\!\left(I_{t,k}\mid \mathcal{F}_{t},\widetilde{N}_{t+1}\right)=\mathcal{L}\!\left(I_{t,k}\mid \widetilde{N}_{t+1}\right)=\textsf{Pois}\!\left(\alpha_k\mu\widetilde N_{t+1}\right),\end{align}swhere s$\mu$sdenotes the expected claim cost per contract. We assume that different incremental claims payments s$I_{j,k}$sare conditionally independent given information about the corresponding numbers of contracts written. Formally, the elements in the set s\begin{align*}\left\{I_{j,k}\,:\,j\in\{t-l,\dots,t\}, k\in \{1,\dots,4\}\right\}\end{align*}sare conditionally independent given s$N_{t-l},\dots,N_{t+1}$s. 
Therefore, using (2.1) and (2.5), s(2.6)s\begin{align}\mathcal{L}\!\left(\text{PC}_{t+1}\mid \mathcal{F}_t, \widetilde{N}_{t+1}\right) &=\mathcal{L}\!\left(\text{PC}_{t+1}\mid \widetilde{N}_{t-2},\dots,\widetilde{N}_{t+1}\right)=\textsf{Pois}\!\left(\alpha_1\mu \widetilde{N}_{t+1}+\dots+\alpha_4\mu \widetilde{N}_{t-2}\right),\nonumber \\\text{IC}_{t+1}-\text{RP}_{t+1}&=\text{PC}_{t+1}+\left(\alpha_2+\alpha_3+\alpha_4\right)\mu \widetilde{N}_{t+1} - \alpha_2\mu \widetilde{N}_t-\alpha_3\mu \widetilde{N}_{t-1}-\alpha_4\mu \widetilde{N}_{t-2}. \end{align}sThe model for the investment earnings s$\text{IE}_{t+1}$sis chosen so that s$G_t\leq 0$simplies s$\text{IE}_{t+1}=0$ssince s$G_t\leq 0$smeans that nothing is invested. Moreover, we assume that s(2.7)s\begin{align}\mathcal{L}(\text{IE}_{t+1}+G_t\mid \mathcal{F}_t,G_t>0)=\mathcal{L}(\text{IE}_{t+1}+G_t\mid G_t,G_t>0)=\textsf{NegBin}\bigg(\nu G_t,\frac{1+\xi}{1+\xi+\nu}\bigg),\end{align}swhere s$\textsf{NegBin}(r,p)$sdenotes the negative binomial distribution with probability mass function s\begin{align*}k\mapsto \binom{k+r-1}{k}(1-p)^{r}p^k\end{align*}swhich corresponds to mean and variance s\begin{align*}\operatorname{E}\![\text{IE}_{t+1}+G_t\mid G_t,G_t>0] & = \frac{p}{1-p}r = (1+\xi)G_t,\\[4pt]\operatorname{Var}(\text{IE}_{t+1}+G_t\mid G_t,G_t>0) & = \frac{p}{(1-p)^2}r=\frac{1+\xi+\nu}{\nu}(1+\xi)G_t.\end{align*}sGiven a premium rule s$\pi$sthat given the state s$S_t=(G_t,P_{t-1},N_{t-3},N_{t-2},N_{t-1},N_t)$sgenerates the premium s$P_t$s, the system s$(S_t)$sevolves in a Markovian manner according to the transition probabilities that follows from (2.3)–(2.7) and (2.2). Notice that if we consider a less long-tailed insurance product so that s$\alpha_3=\alpha_4=0$s(at most one year delay from occurrence of the accident to final payment), then the dimension of the state space reduces to four, that is s$S_t=(G_t,P_{t-1},N_{t-1},N_t)$s.s2.2sSimple modelsConsider the situation where the insurer has a fixed number N of policyholders, who at some initial time point bought insurance policies with automatic contract renewal for the price s$P_t$syear s$t+1$s. The state at time t is s$S_t=(G_t,P_{t-1})$s. In this simplified setting, s$\text{OE}_{t+1}=\beta_0+\beta_1 N$s, all payments s$I_{t,k}$sare independent, s$\mathcal{L}(I_{t,k})=\textsf{Pois}(\alpha_k\mu N)$s, s$\text{IC}_{t+1}-\text{RP}_{t+1}=\text{PC}_{t+1}$sand s$\mathcal{L}(\text{PC}_{t+1})=\textsf{Pois}(\mu N)$s.s2.3sRealistic modelsIn this model, we change the distributional assumptions for both investment earnings and the incremental claims payments from the previously used integer-valued distributions. Let s$(Z_{t})$sbe a sequence of iid standard normals and let s\begin{align*}\mathcal{L}\!\left(\text{IE}_{t+1}+G_t \mid G_t,G_t>0\right) = \mathcal{L}\!\left(G_t\exp\{\widetilde \mu + \widetilde\sigma Z_{t+1}\}\mid G_t,G_t>0\right).\end{align*}sLet s$C_{t,j}$sdenote the cumulative claims payments for accidents occurring during year s$t+1$sup to and including development year j. Hence, s$I_{t,1}=C_{t,1}$s, and s$I_{t,j} = C_{t,j}-C_{t,j-1}$sfor s$j>1$s. We use the following model for the cumulative claims payments: s(2.8)s\begin{align}C_{t,1}&=c_0\widetilde{N}_{t+1}\exp\!\left\{\nu_0Z_{t,1}-\nu_0^2/2\right\}, \nonumber\\[4pt]C_{t,j+1}&=C_{t,j}\exp\!\left\{\mu_j+\nu_jZ_{t,j+1}\right\}, \quad j=1\dots,J-1,\end{align}swhere s$c_0$sis interpreted as the average claims payment per policyholder during the first development year, and s$Z_{t,j}$sare iid standard normals. 
Then, s\begin{align*}\operatorname{E}\!\left[C_{t,j+1}\mid C_{t,j}\right]&=C_{t,j}\exp\!\left\{\mu_j+\nu_j^2/2\right\}, \\[5pt]\operatorname{Var}\!\left(C_{t,j+1}\mid C_{t,j}\right)&=C_{t,j}^2\left(\exp\!\left\{\nu_j^2\right\}-1\right)\exp\!\left\{2\mu_j+\nu_j^2\right\}.\end{align*}sWe do not impose restrictions on the parameters s$\mu_j$sand s$\nu_j^2$sand as a consequence values s$f_j=\exp\!\left\{\mu_j+\nu_j^2/2\right\}\in (0,1)$sare allowed (allowing for negative incremental paid amounts s$C_{t,j+1}-C_{t,j}<0$s). This is in line with the model assumption s$\operatorname{E}\!\left[C_{t,j+1} \mid C_{t,1},\dots,C_{t,j}\right]=f_jC_{t,j}$sfor some s$f_j>0$sof the classical distribution-free Chain Ladder model by Mack (1993). Moreover, with s$\mu_{0,t}=\log \!\left(c_0\widetilde{N}_{t+1}\right)-\nu_0^2/2$s, s(2.9)s\begin{align} \mathcal{L}\!\left(\log C_{t,1}\right)=\textsf{N}\!\left(\mu_{0,t},\nu_0^2\right), \quad \mathcal{L}\bigg(\log\frac{C_{t,j+1}}{C_{t,j}}\bigg)=\textsf{N}\!\left(\mu_j,\nu_j^2\right).\end{align}sGiven an (incremental) development pattern s$(\alpha_1,\dots,\alpha_J)$s, s$\sum_{j=1}^J\alpha_j=1$s, s$\alpha_j\geq 0$s, s\begin{align*}\alpha_1=\frac{1}{\prod_{k=1}^{J-1}f_k}, \quad \alpha_j=\frac{(f_{j-1}-1)\prod_{k=1}^{j-2}f_k}{\prod_{k=1}^{J-1}f_k}, \quad j=2,\dots,J,\end{align*}swith the convention s$\prod_{s=a}^b c_s=1$sif s$a>b$s, where s$f_j=\exp\{\mu_j+\nu_j^2/2\}$s. We have s\begin{align*}&\text{IC}_{t+1}=C_{t,1}\prod_{k=1}^{J-1}f_k,\quad \text{PC}_{t+1}=C_{t,1}+\sum_{k=1}^{J-1}\left(C_{t-k,k+1}-C_{t-k,k}\right),\\&\text{RP}_{t+1}=\sum_{k=1}^{J-1}\left(C_{t-k,k}f_k-C_{t-k,k+1}\right)\prod_{j=k+1}^{J-1}f_j.\end{align*}sThe state is s$S_t=\left(G_t,P_{t-1},N_{t-1},N_t,C_{t-1,1},\dots,C_{t-J+1,J-1}\right)$s.s3.sThe control problemsWe consider a set of states s$\mathcal S^+$s, a set of non-terminal states s$\mathcal S\subseteq \mathcal S^+$s, and for each s$s\in \mathcal S$sa set of actions s$\mathcal A(s)$savailable from state s, with s$\mathcal A=\cup_{s\in\mathcal S}\mathcal A(s)$s. We assume that s$\mathcal A$sis discrete (finite or countable). In order to simplify notation and limit the need for technical details, we will here and in Section 4 restrict our presentation to the case where s$\mathcal S^+$sis also discrete. However, we emphasise that when using function approximation in Section 4.2 the update Equation (4.3) for the weight vector is still valid when the state space is uncountable, as is the case for the realistic model. For each s$s\in\mathcal S$s, s$s^{\prime}\in\mathcal S^+$s, s$a\in\mathcal A(s)$swe define the reward received after taking action a in state s and transitioning to ss′, s$-f(a,s,s^{\prime})$s, and the probability of transitioning from state s to state ss′ after taking action a, s$p(s^{\prime}|s,a)$s. We assume that rewards and transition probabilities are stationary (time-homogeneous). This defines a Markov decision process (MDP). A policy s$\pi$sspecifies how to determine what action to take in each state. A stochastic policy describes, for each state, a probability distribution on the set of available actions. 
A deterministic policy is a special case of a stochastic policy, specifying a degenerate probability distribution, that is a one-point distribution.sOur objective is to find the premium policy that minimises the expected value of the premium payments over time, but that also results in s$(P_t)$sbeing more averaged over time, and further ensures that the surplus s$(G_t)$sis large enough so that the risk that the insurer cannot pay the claim costs and other expenses is small. By combining rewards with either constraints on the available actions from each state or the definition of terminal states, this will be accomplished with a single objective function, see further Sections 3.1–3.2. We formulate this in terms of a MDP, that is we want to solve the following optimisation problem: s(3.1)s\begin{equation}\begin{aligned} \underset{\pi}{\text{minimise }} &\operatorname{E}_{\pi} \!\bigg[\sum_{t=0}^T \gamma^t f(P_t,S_t,S_{t+1})\mid S_0=s\bigg],\end{aligned}\end{equation}swhere s$\pi$sis a policy generating the premium s$P_t$sgiven the state s$S_t$s, s$\mathcal{A}(s)$sis the set of premium levels available from state s, s$\gamma$sis the discount factor, f is the cost function, and s$\operatorname{E}_{\pi}\![{\cdot}]$sdenotes the expectation given that policy s$\pi$sis used. Note that the discount factor s$\gamma^t$sshould not be interpreted as the price of a zero-coupon bond maturing at time t, since the cost that is discounted does not represent an economic cost. Instead s$\gamma$sreflects how much weight is put on costs that are immediate compared to costs further in the future. The transition probabilities are s\begin{align*}p\!\left(s^{\prime}| s,a\right) = \operatorname{P}\!(S_{t+1}=s^{\prime}\mid S_t=s,P_t=a),\end{align*}sand we consider stationary policies, letting s$\pi(a|s)$sdenote the probability of taking action a in state s under policy s$\pi$s, s\begin{align*}\pi(a|s)=\operatorname{P}_{\pi}\!(P_t=a\mid S_t=s).\end{align*}sIf there are no terminal states, we have s$T=\infty$s, and s$\mathcal S^+ =\mathcal S$s. We want to choose s$\mathcal A(s), s\in\mathcal S$s, f, and any terminal states such that the objective discussed above is achieved. We will do this in two ways, see Sections 3.1 and 3.2.sThe value function of state s under a policy s$\pi$sgenerating the premium s$P_t$sis defined as s\begin{align*}v_{\pi}(s)&=\operatorname{E}_{\pi} \bigg[\sum_{t=0}^T \gamma^t ({-}f(P_t,S_t,S_{t+1}))\mid S_0=s\bigg].\end{align*}sThe Bellman equation for the value function is s\begin{align*}v_{\pi}(s)&=\sum_{a\in\mathcal A(s)}\pi(a|s)\sum_{s^{\prime}\in\mathcal S}p\!\left(s^{\prime}| s,a\right)\big({-}f(a,s,s^{\prime})+\gamma v_{\pi}(s^{\prime})\big).\end{align*}sWhen the policy is deterministic, we let s$\pi$sbe a mapping from s$\mathcal S$sto s$\mathcal A$s, and s\begin{align*}v_{\pi}(s)&=\sum_{s^{\prime}\in\mathcal S}p(s^{\prime}| s,\pi(s))\big({-}f(\pi(s),s,s^{\prime})+\gamma v_{\pi}(s^{\prime})\big).\end{align*}sThe optimal value function is s$v_*(s)=\sup_{\pi}v_{\pi}(s)$s. When the action space is finite, the supremum is attained, which implies the existence of an optimal deterministic stationary policy (see Puterman, 2005, Cor. 6.2.8, for other sufficient conditions for attainment of the supremum, see Puterman, 2005, Thm. 6.2.10). 
Hence, if the transition probabilities are known, we can use the Bellman optimality equation to find s$v_*(s)$s: s\begin{align*}v_*(s)=\max_{a\in\mathcal{A}(s)}\sum_{s^{\prime}\in\mathcal S}p\!\left(s^{\prime}| s,a\right)\big({-}f(a,s,s^{\prime})+\gamma v_{*}(s^{\prime})\big).\end{align*}sWe use policy iteration in order to find the solution numerically. Let s$k=0$s, and choose some initial deterministic policy s$\pi_k(s)$sfor all s$s\in\mathcal{S}$s. Thens(i)sDetermine s$V_{k}(s)$sas the unique solution to the system of equations s\begin{align*}V_{k}(s) = \sum_{s^{\prime}\in\mathcal{S}}p\!\left(s^{\prime}|s,\pi_k(s)\right)\Big({-}f\!\left(\pi_k(s),s,s^{\prime}\right)+\gamma V_k(s^{\prime})\Big).\end{align*}s(ii)sDetermine an improved policy s$\pi_{k+1}(s)$sby computing s\begin{align*}\pi_{k+1}(s)=\mathop{\operatorname{argmax}}\limits_{a\in\mathcal{A}(s)}\sum_{s^{\prime}\in\mathcal{S}}p\!\left(s^{\prime}|s,a\right)\Big({-}f(a,s,s^{\prime})+\gamma V_k(s^{\prime})\Big).\end{align*}s(iii)sIf s$\pi_{k+1}(s)\neq\pi_k(s)$sfor some s$s\in \mathcal{S}$s, then increase k by 1 and return to step (i).sNote that if the state space is large enough, solving the system of equations in step (i) directly might be too time-consuming. In that case, this step can be solved by an additional iterative procedure, called iterative policy evaluation, see for example Sutton and Barto (2018, Ch. 4.1).s3.1.sMDP with constraint on the action spacesThe premiums s$(P_t)$swill be averaged if we minimise s$\sum_t c(P_t)$s, where c is an increasing, strictly convex function. Thus for the first MDP, we let s$f(a,s,s^{\prime})=c(a)$s. To ensure that the surplus s$(G_t)$sdoes not become negative too often, we combine this with the constraint saying that the premium needs to be chosen so that the expected value, given the current state, of the surplus stays nonnegative, that is s(3.2)s\begin{align}\mathcal{A}(S_t) = \{P_t:\operatorname{E}_{\pi}\![G_{t+1}\mid S_t]\geq 0\},\end{align}sand the optimisation problem becomes s(3.3)s\begin{align} \underset{\pi}{\text{minimise }} \operatorname{E}_{\pi} \bigg[\sum_{t=0}^\infty \gamma^t c(P_t)\mid S_0=s\bigg]\quad \text{subject to } \operatorname{E}_{\pi}\![G_{t+1}\mid S_t]\geq 0 \text{ for all } t.\end{align}sThe choice of the convex function c, together with the constraint, will affect how quickly the premium can be lowered as the surplus or previous premium increases, and how quickly the premium must be increased as the surplus or previous premium decreases. Different choices of c affect how well different parts of the objective are achieved. Hence, one choice of c might put a higher emphasis on the premium being more averaged over time but slightly higher, while another choice might promote a lower premium level that is allowed to vary a bit more from one time point to another. Furthermore, it is not clear from the start what choice of c will lead to a specific result, thus designing the reward signal might require searching through trial and error for the cost function that achieves the desired result.s3.2.sMDP with a terminal statesThe constraint (3.2) requires a prediction of s$N_{t+1}$saccording to (2.3). However, estimating the price elasticity in (2.3) is difficult task; hence, it would be desirable to solve the optimisation problem without having to rely on this prediction. 
To this end, we remove the constraint on the action space, that is we let s$\mathcal A(s) = \mathcal A$sfor all s$s\in\mathcal S$s, and instead introduce a terminal state which has a larger negative reward than all other states. This terminal state is reached when the surplus s$G_t$sis below some predefined level, and it can be interpreted as the state where the insurer defaults and has to shut down. If we let s$\mathcal G$sdenote the set of non-terminal states for the first state variable (the surplus), then s\begin{align*}f\!\left(P_t,S_{t},S_{t+1}\right)= h\!\left(P_t,S_{t+1}\right)=\begin{cases}c(P_t), & \quad \text{if } G_{t+1}\geq \min\mathcal G,\\c(\!\max\mathcal A)(1+\eta), & \quad\text{if } G_{t+1}<\min\mathcal G,\end{cases}\end{align*}swhere s$\eta>0$s. The optimisation problem becomes s(3.4)s\begin{equation}\begin{aligned} \underset{\pi}{\text{minimise }} &\operatorname{E}_{\pi} \!\bigg[\sum_{t=0}^T \gamma^t h(P_t,S_{t+1})\mid S_0=s\bigg],\quad T = \min\!\left\{t\,:\,G_{t+1}<\min\mathcal G\right\}.\end{aligned}\end{equation}sThe reason for choosing s$\eta>0$sis to ensure that the reward when transitioning to the terminal state is lower than the reward when using action s$\max \mathcal A$s(the maximal premium level), that is, it should be more costly to terminate and restart compared with attempting to increase the surplus when the surplus is low. The particular value of the parameter s$\eta>0$stogether with the choice of the convex function c determines the reward signal, that is the compromise between minimising the premium, averaging the premium and ensuring that the risk of default is low. One way of choosing s$\eta$sis to set it high enough so that the reward when terminating is lower than the total reward using any other policy. Then, we require that s\begin{align*}(1+\eta) c(\max\mathcal A) > \sum_{t=0}^\infty \gamma^tc(\max\mathcal A)=\frac{1}{1-\gamma} c(\max\mathcal A),\end{align*}sthat is s$\eta>\gamma/(1-\gamma)$s. This choice of s$\eta$swill put a higher emphasis on ensuring that the risk of default is low, compared with using a lower value of s$\eta$s.s3.3.sChoice of cost functionsThe function c is chosen to be an increasing, strictly convex function. That it is increasing captures the objective of a low premium. As discussed in Martin-Löf (1994), that it is convex means that the premiums will be more averaged, since s\begin{align*}\frac{1}{T}\sum_{t=1}^Tc(p_t)\geq c\bigg(\frac{1}{T}\sum_{t=1}^T p_t\bigg),\end{align*}sThe more convex shape the function has, the more stable the premium will be over time. One could also force stability by adding a term related to the absolute value of the difference between successive premium levels to the cost function. We have chosen a slightly simpler cost function, defined by c, and for the case with terminal states, by the parameter s$\eta$s.sAs for the specific choice of the function c used in Section 5, we have simply used the function suggested in Martin-Löf (1994), but with slightly adjusted parameter values. That the function c, together with the constraint or the choice of terminal states and the value of s$\eta$s, leads to the desired goal of a low, stable premium and a low probability of default needs to be determined on a case by case basis, since we have three competing objectives, and different insurers might put different weight on each of them. This is part of designing the reward function. 
Hence, adjusting c and $\eta$ changes how much weight is put on each of the three objectives, and the results in Section 5 can be used as a basis for such adjustments.

4. Reinforcement learning

If the model of the environment is not fully known, or if the state space or action space is not finite, the control problem can no longer be solved by classical dynamic programming approaches. Instead, we can use reinforcement learning algorithms.

4.1. Temporal-difference learning

Temporal-difference (TD) methods can learn directly from real or simulated experience of the environment. Given a specific policy $\pi$ which determines the action taken in each state, and the sampled or observed state at time t, $S_{t}$, the state at time $t+1$, $S_{t+1}$, and the reward $R_{t+1}$, the iterative update for the value function, using the one-step TD method, is

\begin{align*}V(S_t) \gets V(S_t)+\alpha_t\big(R_{t+1}+\gamma V(S_{t+1})-V(S_t)\big),\end{align*}

where $\alpha_t$ is a step size parameter. Hence, the target for the TD update is $R_{t+1}+\gamma V(S_{t+1})$. Thus, we update $V(S_t)$, which is an estimate of $v_{\pi}(S_t)=\operatorname{E}_{\pi}\![R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t]$, based on another estimate, namely $V(S_{t+1})$. The intuition behind using $R_{t+1}+\gamma V(S_{t+1})$ as the target is that it is a slightly better estimate of $v_{\pi}(S_t)$, since it consists of an actual (observed or sampled) reward at $t+1$ and an estimate of the value function at the next observed state.

It has been shown, for example in Dayan (1992), that the value function (for a given policy $\pi$) computed using the one-step TD method converges to the true value function if the step size parameter $0\leq\alpha_t\leq 1$ satisfies the following stochastic approximation conditions

\begin{align*}\sum_{k=1}^\infty \alpha_{t^k(s)} = \infty, \quad \sum_{k=1}^\infty \alpha_{t^k(s)}^2<\infty, \quad \text{for all } s \in \mathcal S,\end{align*}

where $t^k(s)$ is the time step when state s is visited for the kth time.

4.1.1. TD control algorithms

The one-step TD method described above gives an estimate of the value function for a given policy $\pi$. To find the optimal policy using TD learning, a TD control algorithm, such as SARSA or Q-learning, can be used. The goal of these algorithms is to estimate the optimal action-value function $q_{*}(s,a) = \max_{\pi}q_{\pi}(s,a)$, where $q_{\pi}$ is the action-value function for policy $\pi$,

\begin{align*}q_{\pi}(s,a) = \operatorname{E}_{\pi}\bigg[\sum_{t=0}^\infty \gamma^t R_{t+1} \mid S_0=s,A_0=a\bigg].\end{align*}

To keep the presentation streamlined, we focus here on the SARSA algorithm. The main reason for this has to do with the topic of the next section, namely function approximation. While there are some convergence results for SARSA with function approximation, there are none for standard Q-learning with function approximation. In fact, there are examples of divergence when combining off-policy training (as is done in Q-learning) with function approximation, see for example Sutton and Barto (2018, Ch. 11).
However, some numerical results for the simple model with standard Q-learning can be found in Section 5, and we provide complete details on Q-learning in the Supplemental Material, Section 2.

The iterative update for the action-value function, using SARSA, is

\begin{align*}Q\!\left(S_t,A_t\right) \gets Q\!\left(S_t,A_t\right)+\alpha_t\!\left(R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q\!\left(S_t,A_t\right)\right).\end{align*}

Hence, we need to generate transitions from state-action pairs $\left(S_t,A_t\right)$ to state-action pairs $(S_{t+1},A_{t+1})$ and observe the rewards $R_{t+1}$ obtained during each transition. To do this, we need a behaviour policy, that is, a policy that determines which action is taken in the current state when the transitions are generated. Thus, SARSA gives an estimate of the action-value function $q_{\pi}$ for the behaviour policy $\pi$. Under the condition that all state-action pairs continue to be updated, and that the behaviour policy is greedy in the limit, it has been shown in Singh et al. (2000) that SARSA converges to the true optimal action-value function if the step size parameter $0\leq\alpha_t\leq1$ satisfies the following stochastic approximation conditions

(4.1)
\begin{align} \sum_{k=1}^\infty \alpha_{t^k(s,a)} = \infty, \quad \sum_{k=1}^\infty \alpha_{t^k(s,a)}^2<\infty, \quad \text{for all } s \in \mathcal S, a\in\mathcal{A}(s),\end{align}

where $t^k(s,a)$ is the time step when a visit to state s is followed by taking action a for the kth time.

To ensure that all state-action pairs continue to be updated, the behaviour policy needs to be exploratory. At the same time, we want to exploit what we have learned so far by choosing actions that we believe will give large future rewards. A common choice of policy that compromises between exploration and exploitation in this way is the $\varepsilon$-greedy policy, which with probability $1-\varepsilon$ chooses the action that maximises the action-value function in the current state, and with probability $\varepsilon$ chooses any other action uniformly at random:

\begin{align*}\pi(a|s)=\begin{cases}1-\varepsilon,&\quad \text{if } a = \operatorname{argmax}_{a^{\prime}} Q(s,a^{\prime}),\\[4pt]\dfrac{\varepsilon}{|\mathcal A|-1}, & \quad \text{otherwise}.\end{cases}\end{align*}

Another example is the softmax policy

\begin{align*}\pi(a| s) = \frac{e^{Q(s,a)/\tau }}{\sum_{a^{\prime} \in \mathcal{A}(s)} e^{Q(s,a^{\prime})/\tau }}.\end{align*}

To ensure that the behaviour policy $\pi$ is greedy in the limit, it needs to be changed over time towards the greedy policy that maximises the action-value function in each state. This can be accomplished by letting $\varepsilon$ or $\tau$ slowly decay towards zero.

4.2. Function approximation

The methods discussed thus far are examples of tabular solution methods, where the value functions can be represented as tables. These methods are suitable when the state and action spaces are not too large, for example for the simple model in Section 2.2. However, when the state space and/or action space is very large, or even continuous, these methods are not feasible, either because tables of this size cannot be held in memory, or because of the time required to visit all state-action pairs multiple times. This is the case for the intermediate and realistic models presented in Sections 2.1 and 2.3. In both models, we allow the number of contracts written per year to vary, which increases the dimension of the state space.
For the intermediate model, it also has the effect that the surplus process, depending on the parameter values chosen, can take non-integer values. For the simple model, $\mathcal S = \mathcal G \times \mathcal A$, and with the parameters chosen in Section 5, we have $|\mathcal G|=171$ and $|\mathcal A|=100$. For the intermediate model, if we let $\mathcal N$ denote the set of integer values that $N_t$ is allowed to take, then $\mathcal S = \mathcal G\times \mathcal A\times\mathcal N^l$, where l denotes the maximum number of development years. With the parameters chosen in Section 5, the total number of states is approximately $10^8$ for the intermediate model. For the realistic model, several of the state variables are continuous, that is, the state space is no longer finite.

Thus, to solve the optimisation problem for the intermediate and the realistic model, we need approximate solution methods, in order to generalise from the states that have been experienced to other states. In approximate solution methods, the value function $v_{\pi}(s)$ (or action-value function $q_{\pi}(s,a)$) is approximated by a parameterised function $\hat v(s;\,w)$ (or $\hat q(s,a;\,w)$). When the state space is discrete, it is common to minimise the objective function

(4.2)
\begin{align} J(w)=\sum_{s\in\mathcal S}\mu_{\pi}(s)\big(v_{\pi}(s)-\hat v(s;\,w)\big)^2,\end{align}

where $\mu_{\pi}(s)$ is the fraction of time spent in state s. For the model without terminal states, $\mu_{\pi}$ is the stationary distribution under policy $\pi$. For the model with terminal states, to determine the fraction of time spent in each transient state, we need to compute the expected number of visits $\eta_{\lambda,\pi}(s)$ to each transient state $s\in\mathcal S$ before reaching a terminal (absorbing) state, where $\lambda(s) = \operatorname{P}(S_0=s)$ is the initial distribution. For ease of notation, we omit $\lambda$ from the subscript below and write $\eta_{\pi}$ and $\operatorname{P}_{\pi}$ instead of $\eta_{\lambda,\pi}$ and $\operatorname{P}_{\lambda,\pi}$. Let $p(s|s^{\prime})$ be the probability of transitioning from state $s^{\prime}$ to state s under policy $\pi$, that is, $p(s| s^{\prime}) =\operatorname{P}_{\pi}(S_{t}=s\mid S_{t-1}=s^{\prime})$. Then,

\begin{align*}\eta_{\pi}(s) &= \operatorname{E}_{\pi}\left[\sum_{t=0}^\infty \mathbf{1}_{\{S_t=s\}} \right]=\lambda(s) + \sum_{t=1}^\infty \operatorname{P}_{\pi}\!(S_t=s)\\&=\lambda(s) +\sum_{t=1}^\infty\sum_{s^{\prime}\in \mathcal S}p(s|s^{\prime})\operatorname{P}_{\pi}\!(S_{t-1}= s^{\prime})= \lambda(s) + \sum_{ s^{\prime}\in \mathcal S}p(s|s^{\prime})\sum_{t=0}^\infty\operatorname{P}_{\pi}(S_{t}=s^{\prime})\\&=\lambda(s) + \sum_{ s^{\prime}\in \mathcal S}p(s| s^{\prime})\eta_{\pi}(s^{\prime}),\end{align*}

or, in matrix form, $\eta_{\pi} = \lambda + P^\top\eta_{\pi}$, where P is the part of the transition matrix corresponding to transitions between transient states. If we label the states $0,1,\ldots,|\mathcal S|$ (where state 0 represents all terminal states), then $P=(p_{ij}\,:\,i,j\in\{1,2,\ldots,|\mathcal S|\})$, where $p_{ij}=p(j\mid i)$.
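In matrix form, the expected numbers of visits solve a small linear system, which is cheap to do whenever the transient-state transition matrix is available, as it is for the simple model. A minimal sketch, assuming `P` and `lam` have been precomputed as NumPy arrays (names are ours):

```python
import numpy as np

def expected_visits(P, lam):
    """Expected number of visits to each transient state before absorption.

    P   : (n, n) array, P[i, j] = p(j | i), transitions between transient states under pi
    lam : (n,) array, initial distribution lambda over the transient states
    Solves eta = lam + P^T eta, i.e. (I - P^T) eta = lam.
    """
    n = P.shape[0]
    eta = np.linalg.solve(np.eye(n) - P.T, lam)
    return eta   # normalising eta gives the fraction of time spent in each state
```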
After solving this system of equations, the fraction of time spent in each transient state under policy $\pi$ can be computed as

\begin{align*}\mu_{\pi}(s) = \frac{\eta_{\pi}(s)}{\sum_{s^{\prime}\in\mathcal S}\eta_{\pi}(s^{\prime})}, \quad \text{for all } s \in\mathcal S.\end{align*}

This computation of $\mu_{\pi}$ relies on the model of the environment being fully known and the transition probabilities being explicitly computable, as is the case for the simple model in Section 2.2. However, for the situation at hand, where we need to resort to function approximation and determine $\hat v(s;\,w)$ (or $\hat q(s,a;\,w)$) by minimising (4.2), we cannot explicitly compute $\mu_{\pi}$. Instead, $\mu_{\pi}$ in (4.2) is captured by learning incrementally from real or simulated experience, as in semi-gradient TD learning. Using semi-gradient TD learning, the iterative update for the weight vector w becomes

\begin{align*}w_{t+1} = w_t+\alpha_t\!\left(R_{t+1}+\gamma\hat v\!\left(S_{t+1};\,w_t\right)-\hat v\!\left(S_t;\,w_t\right)\right)\nabla \hat v\!\left(S_t;\,w_t\right).\end{align*}

This update can be used to estimate $v_{\pi}$ for a given policy $\pi$, generating transitions from state to state by taking actions according to this policy. As in standard TD learning (Section 4.1), the target $R_{t+1}+\gamma\hat v\!\left(S_{t+1};\,w_t\right)$ is an estimate of the true (unknown) $v_{\pi}(S_t)$. The name "semi-gradient" comes from the fact that the update is not based on the true gradient of $\left(R_{t+1}+\gamma\hat v\!\left(S_{t+1};\,w_t\right)-\hat v\!\left(S_t;\,w_t\right)\right)^2$; instead, the target is treated as fixed when the gradient is computed, despite the fact that it depends on the weight vector $w_t$.

As in the previous section, estimating the value function for a specific policy is not our final goal; instead, we want to find the optimal policy. Hence, we need a TD control algorithm with function approximation. One example of such an algorithm is semi-gradient SARSA, which estimates $q_\pi$. The iterative update for the weight vector is

(4.3)
\begin{align} w_{t+1} = w_t+\alpha_t\!\left(R_{t+1}+\gamma\hat q\!\left(S_{t+1},A_{t+1};\,w_t\right)-\hat q\!\left(S_t,A_t;\,w_t\right)\right)\nabla \hat q\!\left(S_t,A_t;\,w_t\right).\end{align}

As with standard SARSA, we need a behaviour policy that generates transitions from state-action pairs to state-action pairs and that both explores and exploits, for example an $\varepsilon$-greedy or softmax policy. Furthermore, for the algorithm to estimate $q_{*}$, the behaviour policy must be changed over time towards the greedy policy. However, convergence guarantees only exist when using linear function approximation, see Section 4.2.1 below.

4.2.1. Linear function approximation

The simplest form of function approximation is linear function approximation. The value function is approximated by $\hat v(s;\,w) =w^\top x(s)$, where x(s) are basis functions. Using the Fourier basis as defined in Konidaris et al. (2011), the ith basis function of the Fourier basis of order n is (here $\pi\approx 3.14$ denotes the mathematical constant, not a policy)

\begin{align*}x_i(s) = \cos\!\left(\pi s^\top c^{(i)}\right),\end{align*}

where $s=\left(s_1,s_2,\ldots,s_k\right)^\top$, $c^{(i)}=\left(c_1^{(i)},\ldots,c_k^{(i)}\right)^\top$, and k is the dimension of the state space. The $c^{(i)}$'s are given by the k-tuples over the set $\{0,\ldots,n\}$, and hence $i=1,\ldots,(n+1)^k$. This means that $x(s)\in\mathbb{R}^{(n+1)^k}$.
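To make the basis construction and the semi-gradient update concrete, the following is a minimal sketch, assuming inputs already rescaled to $[0,1]$ as in Section 5.1.2; the helper names are ours, and for the state-action value the (rescaled) action is simply treated as one additional input coordinate, as described below.

```python
import itertools
import numpy as np

def fourier_basis(order, dim):
    """Fourier basis of the given order: the coefficient vectors c^(i) are all
    dim-tuples over {0, ..., order}, giving (order + 1)**dim basis functions."""
    C = np.array(list(itertools.product(range(order + 1), repeat=dim)))

    def features(x):                      # x is the rescaled input in [0, 1]^dim
        return np.cos(np.pi * C @ x)

    return features, C.shape[0]

def sarsa_update(w, phi, sa, r, sa_next, alpha, gamma, terminal=False):
    """One semi-gradient SARSA step (4.3) with q_hat(s, a; w) = w @ phi(s, a).

    For a linear approximator the gradient of q_hat is simply the feature vector.
    If the next state is terminal, the bootstrapped term is dropped (an assumption
    consistent with the terminal-state formulation of Section 3.2).
    """
    x, x_next = phi(sa), phi(sa_next)
    target = r if terminal else r + gamma * (w @ x_next)
    return w + alpha * (target - w @ x) * x

# Example: 3rd-order basis on a 3-dimensional input (2 state variables + 1 action).
phi, d = fourier_basis(order=3, dim=3)
w = np.zeros(d)
```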
One-step semi-gradient TD learning with linear function approximation has been shown to converge to a weight vector $w^*$. However, $w^*$ is not necessarily a minimiser of J. Tsitsiklis and Van Roy (1997) derive the upper bound

\begin{align*}J(w^*)\leq\frac{1}{1-\gamma}\min_{w}J(w).\end{align*}

Since $\gamma$ is often close to one, this bound can be quite large.

Using linear function approximation for estimating the action-value function, we have $\hat q(s,a;\,w) = w^\top x(s,a)$, and the ith basis function of the Fourier basis of order n is

\begin{align*}x_i(s,a) = \cos\!\left(\pi \!\left(s^\top c^{(i)}_{1:k} + ac_{k+1}^{(i)}\right)\right),\end{align*}

where $s=\left(s_1,\ldots,s_k\right)^\top$, $c^{(i)}_{1:k}=\left(c_1^{(i)},\ldots,c_k^{(i)}\right)^\top$, $c_j^{(i)}\in\{0,\ldots,n\}$ for $j=1,\ldots,k+1$, and $i=1,\ldots,(n+1)^{k+1}$.

The convergence results for semi-gradient SARSA with linear function approximation depend on the type of policy used in the algorithm. When using an $\varepsilon$-greedy policy, the weights have been shown to converge to a bounded region, within which they might oscillate, see Gordon (2001). Furthermore, Perkins and Precup (2003) have shown that if the policy improvement operator $\Gamma$ is Lipschitz continuous with constant $L>0$ and $\varepsilon$-soft, then SARSA converges to a unique policy. The policy improvement operator maps every $q\in\mathbb{R}^{|\mathcal S||\mathcal A|}$ to a stochastic policy and gives the updated policy after iteration t as $\pi_{t+1}=\Gamma(q^{(t)})$, where $q^{(t)}$ is a vectorised version of the state-action values after iteration t, that is, $q^{(t)}=xw_t$ in the case of linear function approximation, where $x\in\mathbb{R}^{|\mathcal S||\mathcal A|\times d}$ is a matrix with rows $x(s,a)^\top$, for each $s\in\mathcal S$, $a\in\mathcal A$, and d is the number of basis functions. That $\Gamma$ is Lipschitz continuous with constant L means that $\lVert \Gamma(q)-\Gamma(q^{\prime}) \rVert_2\leq L\lVert q-q^{\prime} \rVert_2$ for all $q,q^{\prime}\in\mathbb{R}^{|\mathcal S||\mathcal A|}$. That $\Gamma$ is $\varepsilon$-soft means that it produces a policy $\pi=\Gamma(q)$ that is $\varepsilon$-soft, that is, $\pi(a|s)\geq\varepsilon/|\mathcal A|$ for all $s\in\mathcal S$ and $a\in\mathcal A$. In both Gordon (2001) and Perkins and Precup (2003), the policy improvement operator was not applied at every time step; hence, it is not the online SARSA algorithm considered in the present paper that was investigated. Convergence of online SARSA, under the assumption that the policy improvement operator is Lipschitz continuous with a small enough constant L, was later shown in Melo et al. (2008). The softmax policy is Lipschitz continuous, see further the Supplemental Material, Section 3.

However, the value of the Lipschitz constant L that ensures convergence depends on the problem at hand, and there is no guarantee that the policy the algorithm converges to is optimal. Furthermore, for SARSA to approximate the optimal action-value function, we need the policy to approach the greedy policy over time, for example by decreasing the temperature parameter when using a softmax policy. Thus, the Lipschitz constant L, which is inversely proportional to the temperature parameter, increases as the algorithm progresses, making the convergence results in Perkins and Precup (2003) and Melo et al. (2008) less likely to hold.
As discussed in Melo et al. (2008), this is not an issue specific to the softmax policy. Any Lipschitz continuous policy that over time gets closer to the greedy policy will in fact approach a discontinuous policy, and hence the Lipschitz constant of the policy might eventually become too large for the convergence result to hold. Furthermore, the results in Perkins and Precup (2003) and Melo et al. (2008) are not derived for a Markov decision process with an absorbing state. Despite this, it is clear from the numerical results in Section 5 that a softmax policy performs substantially better than an $\varepsilon$-greedy policy, and for the simple model it approximates the true optimal policy well.

The convergence results in Gordon (2001), Perkins and Precup (2003) and Melo et al. (2008) are based on the stochastic approximation conditions

(4.4)
\begin{align} \sum_{t=1}^\infty \alpha_{t} = \infty, \quad \sum_{t=1}^\infty \alpha_{t}^2<\infty,\end{align}

where $\alpha_t$ is the step size parameter used at time step t. Note that when using tabular methods (see Section 4.1), we had a separate step size sequence for each state-action pair; here this is not the case. This is both because a vector of that size might not fit in memory when the state space is large, and because we want to generalise from visited state-action pairs to state-action pairs that are rarely or never visited, which makes the number of visits to each individual state-action pair less relevant.

5. Numerical illustrations

5.1. Simple model

We use the following parameter values: $\gamma=0.9$, $N=10$, $\mu=5$, $\beta_0=10$, $\beta_1=1$, $\xi=0.05$, and $\nu=1$. This means that the expected yearly total cost for the insurer is 70 and the expected yearly cost per customer is 7. We emphasise that parameter values are meant to be interpreted in suitable units to fit the application in mind. The cost function is

(5.1)
\begin{align} c(p)=p+c_1\left(c_2^{p}-1\right),\end{align}

with $c_1=1$ and $c_2=1.2$. For the model with a terminal state, we use $\eta=10>\gamma/(1-\gamma)$, as suggested in Section 3.2. The premium level is truncated and discretised according to $\mathcal A = \{0.2,0.4,\ldots,19.8,20.0\}$. As the model for the MDP is formulated, $\operatorname{P}(G_t=g)>0$ for all integers g. However, for the actions/premiums that are relevant for solving the optimisation problem, it is sufficient to consider only a finite range of integer values for $G_t$, since transitions to values outside this range have negligible probability. Specifically, we only consider $\{-20,-19,\dots,149,150\}$ as possible surplus values. In order to ensure that transition probabilities sum to one, we adjust the probabilities of transitions to the limiting surplus values according to the original probabilities of exiting the range of possible surplus values. For details on the transition probabilities and truncation for the simple model, see the Supplemental Material, Section 1.

Remark 5.1 (Cost function). The cost function (5.1) was suggested in Martin-Löf (1994), since it is an increasing, convex function and thus leads to the premium being more averaged over time. However, in Martin-Löf (1994), $c_2=1.5$ was used in the calculations. We have chosen a slightly lower value of $c_2$ because too extreme rewards can lead to numerical problems when using SARSA with linear function approximation.
Remark 5.2 (Truncation). Truncating the surplus process at 150 does not have a material effect on the optimal policy. However, the minimum value (here −20) does have an effect on the optimal policy for the MDP with a terminal state and should be seen as another parameter that needs to be chosen to determine the reward signal, see Section 3.2.

5.1.1. Policy iteration

The top row in Figure 1 shows the optimal policy and the stationary distribution under the optimal policy for the simple model with a constraint on the action space (Section 3.1), using policy iteration. The bottom row in Figure 1 shows the optimal policy and the fraction of time spent in each state under the optimal policy for the simple model with a terminal state (Section 3.2), using policy iteration. In both cases, the premium charged increases as the surplus or the previously charged premium decreases. Based on the fraction of time spent in each state under each of these two policies, we note that in both cases the average premium level is close to the expected cost per contract (7), but the average surplus level is slightly lower under the policy for the model with a constraint on the action space than under the policy for the model with the terminal state. However, the policies obtained for these two models are quite similar, and since (as discussed in Section 3.2) the model with the terminal state is more appropriate in more realistic settings, we focus the remainder of the analysis on the model with the terminal state.

Figure 1. Simple model using policy iteration. Top: with constraint. Bottom: with terminal state. First and second column: optimal policy. Third column: fraction of time spent in each state under the optimal policy.

5.1.2. Linear function approximation

We have a 2-dimensional state space, and hence $k+1=3$. When using the Fourier basis we should have $s\in[0,1]^{k}$ and $a\in[0,1]$; hence, we rescale the inputs according to

\begin{align*}\tilde s_1 = \frac{s_1-\min\mathcal G}{\max\mathcal G-\min\mathcal G}, \quad\tilde s_2= \frac{s_2-\min\mathcal A}{\max\mathcal A-\min\mathcal A}, \quad\tilde a = \frac{a-\min\mathcal A}{\max\mathcal A-\min\mathcal A},\end{align*}

and use $(\tilde s_1, \tilde s_2, \tilde a)^\top$ as input. We use a softmax policy, that is,

\begin{align*}\pi(a| s) = \frac{e^{\hat q(s,a;\,\boldsymbol{{w}})/\tau }}{\sum_{a^{\prime} \in \mathcal{A}(s)} e^{\hat q(s,a^{\prime};\,\boldsymbol{{w}})/\tau }},\end{align*}

where $\tau$ is slowly decreased according to

\begin{align*}\tau_t = \max\!\left\{\tau_{\min},\tau_{0}d^{t-1}\right\}, \quad \tau_0=2,\quad \tau_{\min}=0.02,\quad d=0.99999,\end{align*}

where $\tau_t$ is the parameter used during episode t. This schedule for decreasing the temperature parameter is somewhat arbitrary, and the parameters have not been tuned. The choice of a softmax policy is based on the results in Perkins and Precup (2003) and Melo et al. (2008) discussed in Section 4.2.1. Since a softmax policy is Lipschitz continuous, convergence of SARSA to a unique policy is guaranteed, provided that the policy is also $\varepsilon$-soft and that the Lipschitz constant L is small enough. However, since the temperature parameter $\tau$ is slowly decreased, the policy is not necessarily $\varepsilon$-soft for all states and time steps, and the Lipschitz constant increases as $\tau$ decreases.
Despite this, our results show that the algorithm converges to a policy that approximates the optimal policy derived with policy iteration well when using a 3rd-order Fourier basis, see the first column in Figure 2. The same cannot be said for an $\varepsilon$-greedy policy. In this case, the algorithm converges to a policy that in general charges a higher premium than the optimal policy derived with policy iteration, see the fourth column in Figure 2. For the $\varepsilon$-greedy policy, we decrease the parameter according to

\begin{align*}\varepsilon_t = \max\!\left\{\varepsilon_{\min},\varepsilon_{0}d^{t-1}\right\},\quad \varepsilon_0=0.2, \quad \varepsilon_{\min}=0.01,\end{align*}

where $\varepsilon_t$ is the parameter used during episode t.

Figure 2. Optimal policy for the simple model with terminal state using linear function approximation. First column: 3rd-order Fourier basis. Second column: 2nd-order Fourier basis. Third column: 1st-order Fourier basis. Fourth column: 3rd-order Fourier basis with $\varepsilon$-greedy policy.

The starting state is selected uniformly at random from $\mathcal S$. Furthermore, since discounting means that rewards after a large number of steps have a very limited effect on the total reward, we run each episode for at most 100 steps before resetting to a new starting state, again selected uniformly at random from $\mathcal S$. This has the benefit of diversifying the states experienced, enabling us to obtain an approximate policy that is closer, over the whole state space, to the policy derived with dynamic programming. The step size parameter used is

(5.2)
\begin{align} \alpha_t = \min\!\left\{\alpha_0,\frac{1}{t^{0.5+\theta}}\right\},\end{align}

where $\alpha_t$ is the step size parameter used during episode t, and $0<\theta\leq0.5$. The largest $\alpha_0$ that ensures that the weights do not explode can be found via trial and error. However, the value of $\alpha_0$ obtained in this way coincides with the "rule of thumb" for setting the step size parameter suggested in Sutton and Barto (2018, Ch. 9.6), namely

\begin{align*}\alpha_0 = \frac{1}{\operatorname{E}_{\pi}\left[x^\top x\right]}, \quad \operatorname{E}_{\pi}\left[x^\top x\right] = \sum_{s,a}\mu_{\pi}(s)\pi(a|s) x(s,a)^\top x(s,a).\end{align*}

If $x\!\left(S_t,A_t\right)^\top x\!\left(S_t,A_t\right)\approx \operatorname{E}_{\pi}\left[x^\top x\right]$, then this step size ensures that the error (i.e., the difference between the updated estimate $w_{t+1}^\top x\!\left(S_t,A_t\right)$ and the target $R_{t+1}+\gamma w_t^\top x(S_{t+1},A_{t+1})$) is reduced to zero after one update. Hence, using a step size larger than $\alpha_0=\left(\operatorname{E}_{\pi}\left[x^\top x\right]\right)^{-1}$ risks overshooting the optimum, or even divergence of the algorithm. When using the Fourier basis of order n, this becomes, for the examples studied here,

\begin{align*}\operatorname{E}_{\pi}\left[x^\top x\right] &= \sum_{s,a}\mu_{\pi}(s)\pi(a|s)\sum_{i=1}^{(n+1)^{k+1}}\cos^2\Big(\pi(s^\top c_{1:k}^{(i)}+ac_{k+1}^{(i)})\Big)\approx \frac{(n+1)^{k+1}}{2}.\end{align*}

For the simple model, we have used $\alpha_0 = 0.2$ for $n=1$, $\alpha_0=0.07$ for $n=2$, and $\alpha_0=0.03$ for $n=3$. For the intermediate model, we used $\alpha_0 = 0.06$ for $n=1$, $\alpha_0=0.008$ for $n=2$, and $\alpha_0=0.002$ for $n=3$. For the realistic model, we have used $\alpha_0=0.002$ for $n=3$. In all cases $\alpha_0\approx \left(\operatorname{E}_{\pi}\left[x^\top x\right] \right)^{-1}$.
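The rule of thumb and the schedules above are easy to state in code. A minimal sketch with the numbers from this section (simple model, $k+1=3$); the function names are ours.

```python
def alpha0_rule_of_thumb(order, dim):
    """alpha_0 = 1 / E_pi[x^T x] ~= 2 / (order + 1)**dim for the Fourier basis,
    where dim = k + 1 is the number of inputs (state coordinates plus the action)."""
    return 2.0 / (order + 1) ** dim

def step_size(episode, alpha0, theta):
    """Schedule (5.2): alpha_t = min(alpha_0, 1 / t**(0.5 + theta))."""
    return min(alpha0, 1.0 / episode ** (0.5 + theta))

def temperature(episode, tau0=2.0, tau_min=0.02, d=0.99999):
    """Softmax temperature schedule tau_t = max(tau_min, tau_0 * d**(t - 1))."""
    return max(tau_min, tau0 * d ** (episode - 1))

# Simple model with a 3rd-order Fourier basis: alpha_0 ~= 2 / 4**3 ~= 0.03.
alpha0 = alpha0_rule_of_thumb(order=3, dim=3)
alpha = step_size(episode=1000, alpha0=alpha0, theta=0.001)
```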
For $\theta$, we tried values in the set $\{0.001,0.1,0.2,0.3,0.4,0.5\}$. For the simple model, the best results were obtained with $\theta = 0.001$, irrespective of n. For the intermediate model, we used $\theta=0.5$ for $n=1$, $\theta=0.2$ for $n=2$, and $\theta=0.3$ for $n=3$. For the realistic model, we used $\theta=0.2$ for $n=3$.

Remark 5.3 (Step size). There are automatic methods for adapting the step size. One such method is the Autostep method from Mahmood et al. (2012), a tuning-free version of the Incremental Delta-Bar-Delta (IDBD) algorithm from Sutton (1992). When using this method, with parameters set as suggested by Mahmood et al. (2012), the algorithm performs marginally worse than in the results below.

Figure 2 shows the optimal policy for the simple model with terminal state using linear function approximation with 3rd-, 2nd-, and 1st-order Fourier bases and a softmax policy, and with a 3rd-order Fourier basis and an $\varepsilon$-greedy policy. Comparison with Figure 1 shows that the approximate optimal policy using the 3rd-order Fourier basis is close to the optimal policy derived with policy iteration. Using a 1st- or 2nd-order Fourier basis also gives a reasonable approximation of the optimal policy, but with worse performance. Combining a 3rd-order Fourier basis with an $\varepsilon$-greedy policy gives considerably worse performance.

The same conclusions can be drawn from Table 2, which shows the expected total discounted reward per episode for these policies, together with the results for the optimal policy derived with policy iteration, the policy derived with Q-learning, and several benchmark policies (see Section 5.1.3). Clearly, the performance of the 3rd-order Fourier basis is very close to that of the optimal policy derived with policy iteration, and hence we conclude that linear function approximation with a 3rd-order Fourier basis and a softmax policy appears to converge approximately to the optimal policy. The policy derived with Q-learning performs worse than both the 3rd- and 2nd-order Fourier bases, even though the number of episodes run for the Q-learning algorithm is approximately a factor of 30 larger than the number of episodes run before convergence of SARSA with linear function approximation. Hence, even for this simple model, the number of states is too large for the Q-learning algorithm to converge within a reasonable amount of time. Furthermore, we see that all policies derived with linear function approximation and a softmax policy outperform the benchmark policies. Note that the optimal policy derived with policy iteration, the best constant policy, and the myopic policy with the terminal state require full knowledge of the underlying model and the transition probabilities, and the myopic policy with the constraint requires an estimate of the expected surplus one time step ahead, while the policies derived with function approximation or Q-learning only require real or simulated experience of the environment.

Table 2. Expected discounted total reward (uniformly distributed starting states) for the simple model with terminal state.
The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Policy                                                 Expected reward   Terminations
Policy iteration                                       −85.91            0.006
Q-learning                                             −86.50            0.002
Fourier 3 with softmax policy                          −86.11            0.006
Fourier 2 with softmax policy                          −86.30            0.007
Fourier 1 with softmax policy                          −86.59            0.011
Fourier 3 with $\varepsilon$-greedy policy             −92.74            0.000
Best constant policy                                   −122.70           0.040
Myopic policy with terminal state, $p_{\min}=0.2$      −97.06            0.133
Myopic policy with terminal state, $p_{\min}=5.8$      −90.40            0.096
Myopic policy with constraint, $p_{\min}=0.2$          −121.52           0.337
Myopic policy with constraint, $p_{\min}=6.4$          −93.58            0.132

To analyse the difference between some of the policies, we simulate 300 episodes for the policy with the 3rd-order Fourier basis, the best constant policy, and the myopic policy with terminal state, $p_{\min}=5.8$, for a few different starting states, two of which can be seen in Figure 3. A star in the figure corresponds to one or more terminations. The total numbers of terminations (out of 300 episodes) are as follows. $S_0=({-}10,2)$: Fourier 3: 1, best constant: 291, myopic $p_{\min}=5.8$: 20. $S_0=(50,7)$: Fourier 3: 1, best constant: 0, myopic $p_{\min}=5.8$: 20. For other starting states, the comparison is similar to that in Figure 3. We see that the policy with the 3rd-order Fourier basis appears to outperform the myopic policy in all respects: on average the premium is lower, the premium is more stable over time, and there are very few defaults. The best constant policy is naturally the most stable, but in general it leads to a higher premium than the other policies, and for more strained starting states it quickly leads to a large number of terminations.

Figure 3. Simple model. Top row: policy with 3rd-order Fourier basis. Bottom row: myopic policy with terminal state, $p_{\min}=5.8$. Left: starting state $S_0=({-}10,2)$. Right: starting state $S_0=(50,7)$. The red line shows the best constant policy. A star indicates at least one termination.

5.1.3. Benchmark policies

Best constant policy. The best constant policy is the solution to

\begin{align*}\underset{p}{\text{minimise }}&\operatorname{E} \!\left[\sum_{t=0}^T \gamma^t h\!\left(p,S_{t+1}\right)\right].\end{align*}

For both the simple and intermediate models, $p=7.4$ solves this optimisation problem. For details, see the Supplemental Material, Section 4.1.

Myopic policy for MDP with constraint. The myopic policy maximises immediate rewards. For the model with a constraint on the action space, the myopic policy solves

(5.3)
\begin{align} \underset{p}{\text{minimise }} c(p)\quad\text{subject to } \operatorname{E}\!\left[G_{1}\mid S_0=s,P_0=p\right]\geq 0.\end{align}

The myopic policy is given by the lowest premium level that satisfies the constraint. For details on how (5.3) is solved, see the Supplemental Material, Section 4.2.

For the simple, intermediate, and realistic models, the myopic policy charges the minimum premium level in a large number of states. Since this policy reduces the premium to the minimum level very quickly as the surplus or the previously charged premium increases, it is not likely to work that well. Hence, we suggest an additional benchmark policy where the minimum premium level is set to a higher value, $p_{\min}$. This adjusted myopic policy is given by $\tilde \pi(s) = \max\{\pi(s),p_{\min}\}$, where $\pi$ denotes the policy that solves (5.3).
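A minimal sketch of this adjusted myopic benchmark, assuming a hypothetical function `expected_next_surplus(state, premium)` returning $\operatorname{E}[G_1\mid S_0=s, P_0=p]$ (for the simple model this expectation is available explicitly, see the Supplemental Material):

```python
def adjusted_myopic_policy(state, premiums, expected_next_surplus, p_min=6.4):
    """Lowest premium satisfying E[G_1 | S_0 = s, P_0 = p] >= 0, floored at p_min.

    premiums: the admissible premium levels (the set A), in increasing order.
    """
    for p in sorted(premiums):
        if expected_next_surplus(state, p) >= 0.0:
            return max(p, p_min)          # pi_tilde(s) = max(pi(s), p_min)
    return max(premiums)                  # no feasible premium: charge the maximum
```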
Based on simulations of the total expected discounted reward per episode for different values of $p_{\min}$, we conclude that $p_{\min}=6.4$ achieves the best results for both the simple and intermediate models, and $p_{\min}=10.5$ achieves the best result for the realistic model.

Myopic policy for MDP with terminal state. For the model with a terminal state, the myopic policy is the solution to the optimisation problem

(5.4)
\begin{align} \underset{p}{\text{minimise }} &\operatorname{E} \!\left[h(p,S_{1})\mid S_0=s,P_0=p\right].\end{align}

For details on how (5.4) is solved, see the Supplemental Material, Section 4.3. We also suggest an additional benchmark policy where the minimum premium level has been set to $p_{\min}=5.8$, the level that achieves the best results based on simulations of the total expected discounted reward per episode. For the intermediate and realistic models, this myopic policy is too complex and is therefore not a good benchmark.

5.2. Intermediate model

We use the following parameter values: $\gamma=0.9$, $\mu=5$, $\beta_0=10$, $\beta_1=1$, $\xi=0.05$, $\nu=1$, $\alpha_1=0.7$, $\alpha_2=0.3$, $\eta=10$, $a=18$, $b=-0.3$, and the cost function (5.1) with $c_1=1$ and $c_2=1.2$. The premium level and the number of contracts written are truncated and discretised according to $\mathcal A = \{0.2,0.4,\ldots,19.8,20.0\}$ and $\mathcal N = \{0,1,\ldots,30\}$. The surplus process no longer takes only integer values (as in the simple model); instead, the values that the surplus process can take are determined by the chosen parameter values. However, it is still truncated to lie between −20 and 150. For the parameter values above, we have $\mathcal G = \{-20.00,-19.95,\ldots,149.95,150.00\}$.

Figure 4 shows the optimal policy for the intermediate model using linear function approximation, with 3rd-, 2nd-, and 1st-order Fourier bases, for $N_{t}=N_{t-1}=10$. Comparing the policy with the 3rd-order Fourier basis with the policy with the 2nd-order Fourier basis, the former appears to require a slightly lower premium when the surplus or the previously charged premium is very low. The policy with the 1st-order Fourier basis appears quite extreme compared to the other two policies. Comparing the policy with the 3rd-order Fourier basis for $N_{t}=N_{t-1}=10$ with the optimal policy for the simple model (bottom row in Figure 1), we note that $\pi^{i}(g,p,10,10)\neq \pi^s(g,p)$, where $\pi^s$ and $\pi^{i}$ denote, respectively, the policy for the simple and the intermediate model. There is a qualitative difference between these policies: even given that we are in a state where $N_t=N_{t-1}=10$ in the intermediate model, the policy from the simple model does not take into account the effect the premium charged will have on the number of contracts issued at time $t+1$. The policy with the 3rd-order Fourier basis for $N_t,N_{t-1}\in\{5,10,15\}$ can be seen in Figure 5.

Figure 4. Optimal policy for the intermediate model with terminal state using linear function approximation, $N_t=N_{t-1}=10$. Left: 3rd-order Fourier basis. Middle: 2nd-order Fourier basis. Right: 1st-order Fourier basis.

Figure 5. Optimal policy for the intermediate model with terminal state using linear function approximation with 3rd-order Fourier basis, for $N_t,N_{t-1}\in\{5,10,15\}$.

To determine the performance of the policies for the intermediate model, we simulate the expected total discounted reward per episode for these policies. The results can be seen in Table 3.
Here we clearly see that the policy with the 3rd-order Fourier basis outperforms the other policies and that the policy with the 1st-order Fourier basis performs quite badly, since it is not flexible enough to be used in this more realistic setting. We also compare the policies with the optimal policy derived with policy iteration for the simple model, while simulating from the intermediate model. Though this policy performs worse than the policies with 3rd- and 2nd-order Fourier bases, it outperforms the policy with the 1st-order Fourier basis. Note that the policies derived with function approximation only require real or simulated experience of the environment. The results for the myopic policy in Table 3 use the true parameters when computing the expected value of the surplus. Despite this, the policy derived with the 3rd-order Fourier basis outperforms the myopic policy.

Table 3. Expected discounted total reward based on simulation (uniformly distributed starting states). The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Policy                                           Expected reward   Terminations
Fourier 3                                        −97.17            0.015
Fourier 2                                        −104.41           0.025
Fourier 1                                        −128.83           0.036
Policy from simple model                         −116.70           0.075
Myopic policy with constraint, $p_{\min}=0.2$    −360.69           0.988
Myopic policy with constraint, $p_{\min}=6.4$    −100.92           0.169
Best constant policy                             −131.85           0.060

To analyse the difference between some of the policies, we simulate 300 episodes for the policy with the 3rd-order Fourier basis, the best constant policy and the policy from the simple model, for a few different starting states, two of which can be seen in Figure 6. Each star in the figures corresponds to one or more terminations at that time point. The total numbers of terminations (out of 300 episodes) are as follows. $S_0=(0, 7, 10, 10)$: Fourier 3: 1, best constant: 13, simple: 5. $S_0=(100, 15, 5, 5)$: Fourier 3: 0, best constant: 0, simple: 10. Comparing the policy with the 3rd-order Fourier basis with the policy from the simple model, we see that it tends on average to give a lower premium and leads to very few defaults, but is slightly more variable than the premium charged by the simple policy. This is not surprising, since the simple policy does not take the variation in the number of contracts issued into account. At the same time, this is to the detriment of the simple policy, since it cannot correctly take the risk from the varying number of contracts into account, hence leading to more defaults. For example, for the more strained starting state $S_0=({-}10,2,20,20)$ (not shown in the figure), the number of defaults for the policy with the 3rd-order Fourier basis is 91 out of 300, while for the simple policy it is 213 out of 300. Similarly, for starting state $S_0=(100, 15, 5, 5)$ (second column in Figure 6), the simple policy tends to set the premium much too low during the first time step, leading to more early defaults than for, for example, starting state $S_0=(0, 7, 20, 20)$ (first column in Figure 6), despite the fact that the latter starting state has a much lower starting surplus.

Figure 6. Intermediate model. Top row: policy with 3rd-order Fourier basis. Bottom row: policy from the simple model. Left: starting state $S_0=(0, 7, 20, 20)$. Right: starting state $S_0=(100, 15, 5, 5)$. The red line shows the best constant policy.
A star indicates at least one termination.

5.3. Realistic model

To estimate the parameters of the model for the cumulative claims payments in (2.8), we use the motor third party liability data considered in Miranda et al. (2012). The data consist of incremental runoff triangles for the number of reported accidents and incremental payments, with 10 development years. We have no information on the number of contracts. For parameter estimation only, we assume a constant number of contracts over the ten observed years, that is $\widetilde N_{t+1}=N$, and that the total number of claims for each accident year is approximately 5% of the number of contracts. We further assume that the total number of claims per accident year is well approximated by the number of reported claims over the first two development years. This leads to an estimate of the number of contracts $\widetilde N_{t+1}$ in (2.8) of $\widehat N = 2.17\cdot 10^5$. For parameter estimation, we assume that $\mu_{0,t}=\mu_0$ in (2.9). Hence, $\mu_0$ and $\nu_0^2$ are estimated as the sample mean and variance of $\left(\log\!\left(C_{t,1}\right)\right)_{t=1}^{10}$. Similarly, $\mu_j$ and $\nu_j^2$ are estimated as the sample mean and variance of $\left(\log\!\left(C_{t,j+1}/C_{t,j}\right)\right)_{t=1}^{10-j}$ for $j=1,\ldots,8$. Since we only have one observation for $j=9$, we let $\widehat \mu_9=\log\!\left(C_{1,10}/C_{1,9}\right)$ and $\widehat \nu_9=0$. The parameter $c_0$ is estimated by $\widehat c_0 = \exp\!\left\{\widehat \mu_0+\widehat \nu_0^2/2\right\}/\widehat N\approx 2.64$. The parameter estimates can be seen in Table 1 in the Supplemental Material, Section 5.

For the model for investment earnings, the parameters are set to $\widetilde \sigma=0.03$ and $\widetilde \mu = \log(1.05)-\widetilde\sigma^2/2$, which gives similar variation in investment earnings as in the intermediate model (2.7) when the surplus is approximately 50. The remaining parameters are $\gamma=0.9$, $\beta_0=2\cdot 10^5$, $\beta_1=1$, $\eta=10$, and $b=-0.3$. The parameter a is set so that the expected number of contracts is $2\cdot 10^5$ when the premium level corresponds to the expected total cost per contract, $\beta_0/(2\cdot10^5)+\beta_1+c_0/\alpha_1 \approx 10.4$. Hence, $a\approx 4.03\cdot 10^5$. The premium level is truncated and discretised according to $\mathcal A = \{0.5,1.0,\ldots,29.5,30.0\}$. The cost function is as before (5.1), but now adjusted to give rewards of similar size as in the simple and intermediate settings. Hence, when computing the reward, the premium is adjusted to lie in $[0.2,20.0]$ according to

\begin{align*}c(p)=\widetilde p +c_1\left(c_2^{\widetilde p}-1\right),\quad \text{where }\widetilde p = 0.2 + \frac{p-0.5}{30-0.5}(20-0.2).\end{align*}

The number of contracts is truncated according to $\mathcal N= \{144170, \ldots, 498601\}$. This is based on the $0.001$-quantile of $\textsf{Pois}(a(\!\max\mathcal A)^b)$ and the $0.999$-quantile of $\textsf{Pois}(a(\min\mathcal A)^b)$. The truncation of the cumulative claims payments $C_{t,j}$ is based on the same quantile levels and lognormal distributions with parameters $\mu=\log(c_0\min\mathcal N)-\nu_0^2/2+\sum_{k=1}^{j-1}\mu_k$ and $\sigma^2 = \sum_{k=0}^{j-1}\nu_k$, and with parameters $\mu=\log(c_0\max\mathcal N)-\nu_0^2/2+\sum_{k=1}^{j-1}\mu_k$ and $\sigma^2 = \sum_{k=0}^{j-1}\nu_k$, respectively, for $j=1,\ldots,9$.
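The estimation of $\mu_j$ and $\nu_j^2$ described above amounts to taking sample means and variances of log development factors of the cumulative triangle. A minimal sketch, assuming the cumulative payments are stored in a NumPy array `C` with `np.nan` in the unobserved lower-right part (array name and layout are assumptions):

```python
import numpy as np

def estimate_lognormal_development(C):
    """Estimate (mu_j, nu_j^2) from a cumulative runoff triangle.

    C : (n, n) array, C[t, j] = cumulative payments for accident year t,
        development year j + 1; unobserved cells are np.nan.
    Returns arrays mu_hat and nu2_hat of length n (index 0 = first development year).
    """
    n = C.shape[0]
    mu_hat, nu2_hat = np.full(n, np.nan), np.full(n, np.nan)
    # Development year 1: sample mean and variance of log C_{t,1}.
    x0 = np.log(C[:, 0])
    mu_hat[0], nu2_hat[0] = x0.mean(), x0.var(ddof=1)
    # Development factors j -> j + 1: log(C_{t,j+1} / C_{t,j}) over the observed years.
    for j in range(1, n):
        x = np.log(C[: n - j, j] / C[: n - j, j - 1])
        mu_hat[j] = x.mean()
        nu2_hat[j] = x.var(ddof=1) if len(x) > 1 else 0.0   # single observation: nu set to 0
    return mu_hat, nu2_hat
```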
The truncation for the cumulative claims payments can be seen in Table 2 in the Supplemental Material, Section 5. The surplus is truncated to lie in $[{-}0.6,4.5]\cdot 10^6$. Note that for the realistic model, with 10 development years, the state space becomes 13-dimensional. Using the full 3rd-order Fourier basis is not possible, since it consists of $4^{14}$ basis functions. We reduce the number of basis functions, allowing for more flexibility where the model is likely to need it. Specifically,

\begin{align*}&\left\{\left(c^{(i)}_1,c^{(i)}_2,c^{(i)}_3,c^{(i)}_4,c^{(i)}_{14}\right)\,:\,i=1,\dots,4^5\right\}=\{0,\dots,3\}^5, \\[4pt]&c^{(i)}_1=c^{(i)}_2=c^{(i)}_3=c^{(i)}_4=c^{(i)}_{14}=0 \quad \text{for } i=4^5+1,\dots,4^5+9,\end{align*}

and, for $j=1,\ldots,9$,

\begin{align*}c^{(i)}_{4+j}=\left\{\begin{array}{ll}1 & \text{for } i=4^5 + j, \\[5pt]0 & \text{for } i\neq 4^5+j.\end{array}\right.\end{align*}

This means less flexibility for the variables corresponding to the cumulative payments and no interaction terms between a variable corresponding to a cumulative payment and any other variable.

Figure 7 shows the optimal policy for the realistic model using linear function approximation for $N_t,N_{t-1}\in\{1.75,2.00,2.50\}\cdot 10^5$, $C_{t-1,1} = c_0\cdot2\cdot 10^5$, and $C_{t-j,j} = c_0\cdot2\cdot 10^5\prod_{k=1}^{j-1}f_k$ for $j=2,\ldots,9$. To determine the performance of the approximate optimal policy for the realistic model, we simulate the expected total discounted reward per episode for this policy. The results can be seen in Table 4. The approximate optimal policy outperforms all benchmark policies. The best performing benchmark policy, the "interval policy", corresponds to choosing the premium level equal to the expected total cost per contract when the number of contracts is $2\cdot 10^5$, as long as the surplus lies in the interval $[1.2,2.8]\cdot 10^6$. This is based on a target surplus of $2\cdot 10^6$ and choosing the $\phi\in\{0.1,0.2,\ldots,1.0\}$ that results in the best expected total reward, where the interval for the surplus is given by $[1-\phi,1+\phi]\cdot 2\cdot 10^6$. When the surplus $G_t$ is below (above) this interval, the premium is increased (decreased) in order to increase (decrease) the surplus. The premium for this benchmark policy is

\begin{align*}P_t =\begin{cases}\min\!\left\{10.5+\dfrac{(1-\phi)\cdot2\cdot10^6-G_t}{2\cdot10^5},\max\mathcal A\right\},\quad &\text{if } G_t<(1-\phi)\cdot2\cdot10^6,\\[9pt] 10.5, \quad & \text{if } (1-\phi)\cdot2\cdot10^6\leq G_t\leq (1+\phi)\cdot2\cdot10^6,\\[9pt] \max\!\left\{10.5+\dfrac{(1+\phi)\cdot2\cdot10^6-G_t}{2\cdot10^5},\min\mathcal A\right\},\quad &\text{if } (1+\phi)\cdot2\cdot10^6<G_t,\end{cases}\end{align*}

rounded to the nearest half integer so that it lies in $\mathcal A$. As before, the approximate optimal policy outperforms the benchmark policies, despite the fact that both the myopic and the interval policy use the true parameters when computing the expected surplus or the expected total cost per contract.

Figure 7. Optimal policy for the realistic model with terminal state using linear function approximation, for $N_t,N_{t-1}\in\{1.75, 2.00, 2.50\}\cdot10^5$, $C_{t-1,1} = c_0\cdot2\cdot 10^5$, and $C_{t-j,j} = c_0\cdot2\cdot 10^5\prod_{k=1}^{j-1}f_k$.

Table 4. Expected discounted total reward based on simulation (uniformly distributed starting states).
The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Policy                                                              Expected reward   Terminations
Fourier 3                                                           $-93.14$          0.019
Interval policy ($p=10.5$ when $G_t\in[1.2,2.8]\cdot 10^6$)         $-106.70$         0.034
Myopic policy with constraint, $p_{\min}=10.5$                      $-112.70$         0.041
Best constant policy, $p=11.5$                                      $-136.60$         0.061

A comparison of the approximate optimal policy and the best benchmark policy, including a figure similar to Figure 6, is found in the Supplemental Material, Section 5.

6. Conclusion

Classical methods for solving premium control problems are suitable only for simple dynamical insurance systems, where the model choice must to a large extent be based on how to make the problem solvable, rather than on reflecting the real dynamics of the stochastic environment. For this reason, the practical use of the optimal premium rules derived with classical methods is often limited.

Reinforcement learning methods enable us to solve premium control problems in realistic settings that adequately capture the complex dynamics of the system. Since these techniques can learn directly from real or simulated experience of the stochastic environment, they do not require explicit expressions for transition probabilities. Further, these methods can be combined with function approximation in order to overcome the curse of dimensionality, as the state space tends to be large in more realistic settings. This makes it possible to take key features of real dynamical insurance systems into account, for example payment delays and how the number of contracts issued in the future varies depending on the premium rule. Hence, the optimal policies derived with these techniques can be used as a basis for decisions on how to set premiums for insurance companies.

We have illustrated strengths and weaknesses of different methods for solving the premium control problem for a mutual insurer and demonstrated that, given a complex dynamical system, the approximate policy derived with SARSA using function approximation outperforms several benchmark policies. In particular, it clearly outperforms the policy derived with classical methods based on a more simplistic model of the stochastic environment, which fails to take important aspects of a real dynamical insurance system into account. Furthermore, the use of these methods is not specific to the model choices made in Section 2. The present paper provides guidance on how to carefully design a reinforcement learning method with function approximation for the purpose of obtaining an optimal premium rule, which, together with models that fit the experience of the specific insurance company, allows for optimal premium rules that can be used in practice.

The models may be extended to include dependence on covariates. However, it should be noted that if we want to model substantial heterogeneity among policyholders and consider a large covariate set, then the action space becomes much larger and function approximation also for the policy may become necessary.

Journal: Astin Bulletin, Cambridge University Press

Published: May 1, 2023

Keywords: Premium control; reinforcement learning; Markov decision process; C60; G22