The use of autoencoders for training neural networks with mixed categorical and numerical features

1. Introduction

In this paper, we deal with the following three important problems of training neural networks with mixed categorical and numerical features in supervised learning tasks:

- How to construct a numerical representation of categorical features which is fed, together with the numerical features, into the hidden layers of a neural network,
- Which architecture of a neural network we should build, in particular, how we should treat (concatenate) features of different types (in our case categorical and numerical features),
- How we should initialize the weights and the bias terms of a neural network to guide the network towards a point on the parameter surface where the network has a high predictive power and good generalization properties.

There are many possible approaches to these problems. We present an approach inspired by autoencoders.

Neural networks have recently gained a lot of attention in actuarial science. In particular, Noll et al. (2019) and Ferrario et al. (2020) were among the first to discuss applications of neural networks to actuarial non-life insurance pricing problems and to compare neural networks with generalized linear models. The current approach to supervised learning tasks in actuarial data science is to build a neural network where entity embeddings for categorical features are used. Entity embeddings were introduced by Guo and Berkhahn (2016) in the machine learning literature to help neural networks deal with sparse categorical data of high dimension. They were first promoted by Richman (2021), Noll et al. (2019) and Ferrario et al. (2020) in actuarial data science and were further developed by Blier-Wong et al. (2021), Kuo and Richman (2021) and Shi and Shi (2022). An entity embedding, as part of a neural network for a supervised learning task, is learned separately for each categorical feature and allows us to derive a real-valued representation of a categorical feature in a low-dimensional space. The numerical representations of the categorical features are concatenated with the numerical features, and they are fed together as input into the hidden layers of a neural network. The weights of the entity embeddings are learned together with all other parameters of the neural network, and the objective is to minimize a loss appropriate for the target response. All weights and bias terms of the network are initialized with random values from uniform distributions, and the backpropagation algorithm is used to learn the parameters of the network. To the best of our knowledge, the architecture described above is the only architecture of a neural network investigated to date for actuarial applications. In particular, other numerical representations of categorical features have not been tested as inputs to neural networks. In addition, advances in training algorithms for neural networks have not been discussed in actuarial data science. The only distinct training algorithm was proposed by Merz and Wüthrich (2019) and Schelldorfer and Wüthrich (2019), who promoted the Combined Actuarial Neural Network (CANN).
The goal of this paper is to challenge the current dominant approach to supervised learning tasks in actuarial data science with a new architecture of a neural network, in particular, with a new numerical representation of categorical features and with a new training algorithm.

It is known that, in order to achieve a high predictive power of a neural network, the input to the network should contain the most important information for the supervised learning task under consideration. A highly informative input can be effectively pre-processed in the hidden layers of the network to provide good predictions of the response. It has been shown in the machine learning literature that non-linear autoencoders, in particular denoising autoencoders, built with neural networks can capture the main factors of variation in the input and detect key characteristics of the multivariate and high-dimensional distribution of the input. As a result, representations of features derived with (denoising) autoencoders, learned in unsupervised learning tasks, can improve the predictive power of regression models if these representations are used as inputs to neural networks to predict the response, see for example Vincent et al. (2008, 2010).

Autoencoders without noise for numerical data have been investigated in actuarial data science. The benefits of autoencoders without noise in supervised and unsupervised learning tasks have been demonstrated by Gao and Wüthrich (2018), Hainaut (2018), Rentzmann and Wüthrich (2019), Blier-Wong et al. (2021, 2022), Miyata and Matsuyama (2022) and Grari et al. (2022). To the best of our knowledge, autoencoders for categorical features are less common in actuarial data science. The only exception we are aware of is the paper by Lee et al. (2019), in which the authors discuss how to build word embeddings, which are similar to, but not exactly, autoencoders for categorical data in the sense investigated in this paper.

The first contribution of this paper is that we investigate different types of autoencoders for categorical data. We demonstrate that we can benefit from non-linear autoencoders built with neural networks when the purpose is to derive informative representations of categorical features. We show that an autoencoder for categorical features should be of a different type than an autoencoder for numerical features. Most importantly, we deduce that the best autoencoder for categorical features, which extracts the most important information from the vector of categorical features, implies a different numerical representation of categorical features than the representation from entity embeddings currently used in supervised learning tasks in actuarial data science. From our experiments, we conclude that we should learn one joint numerical representation for all categorical features, rather than a separate representation for each feature, to build a more robust and informative representation of the categorical data. The second, and our main, contribution is that we use a joint numerical representation of categorical features, together with all other numerical features, as the input to the hidden layers of a neural network trained to predict a target response. In other words, we introduce a new architecture of a neural network with mixed categorical and numerical features for supervised learning tasks.
The change in the architecture compared to the current approach is that all one-hot encoded categorical features are transformed with one joint embedding into a real-valued representation in a low-dimensional space, and this joint representation of the categorical features is then concatenated with the numerical features. Finally, we fine-tune the numerical representation of the categorical features from a proper autoencoder by training its weights together with all other parameters of the neural network. Hence, the autoencoder for the categorical data is only used to derive an initial representation of the categorical features. This approach is known in the machine learning literature as pre-training of layers with autoencoders, see Vincent et al. (2008, 2010) and Erhan et al. (2009, 2010). Hence, we pre-train the joint embedding for categorical features in a neural network for a supervised learning task with our autoencoder for categorical data. From Erhan et al. (2009, 2010), we know that pre-training other layers of the neural network is also beneficial, and this can be achieved with an autoencoder for numerical data. We use both autoencoders without noise and denoising autoencoders in this research.

The benefits of using (denoising) autoencoders for pre-training layers of neural networks have been demonstrated in the literature (see the papers referred to above). In comparison to these papers:

- We perform our experiments with categorical data, instead of binary data, which means that we use a different autoencoder and a different type of corruption process for the denoising autoencoder,
- We perform our experiments with Poisson distributed data and the Poisson deviance loss, which is the most common loss function in actuarial data science, instead of the mean square loss and the cross-entropy loss commonly used in the machine learning literature,
- We propose and validate a new architecture of a neural network, with a joint embedding for all categorical features, for supervised learning tasks. This has not been considered to date in actuarial data science,
- We propose to scale appropriately the representation of the categorical features from an autoencoder for categorical data before an autoencoder for numerical data is built to pre-train the first hidden layer of a neural network. This significantly improves the approach to supervised learning tasks with our new architecture,
- We compare various initialization techniques and show that pre-training layers of a neural network with non-linear and over-complete/denoising autoencoders produces much better results than applications of classical linear and under-complete autoencoders without noise (MCA, PCA),
- We investigate the balance property, the bias and the stability of the predictions, which are crucial for actuarial pricing.

The main conclusion of this paper is that we can improve the current approach to modelling categorical features in supervised learning tasks, which uses separate entity embeddings, and the training algorithm, which randomly initializes the parameters of the neural network. The proposal is to change the architecture of the network by using a different numerical representation of the categorical features, learned with a joint embedding, and to initialize the layers of the network, in particular the joint embedding and the first hidden layer, with representations learned with (denoising) autoencoders in unsupervised learning tasks.

This paper is structured as follows.
In Section 2, we present the general setup for neural networks and our numerical experiments. In Section 3, we discuss autoencoders for categorical and numerical features. In Section 4, we focus on training neural networks with mixed categorical and numerical features. Details of our experiments and some additional results are presented in the Online Supplement. The R codes for training our categorical autoencoders are available at https://github.com/LukaszDelong/Autoencoders.

2. General setup

We assume that we have a data set consisting of $(y_i,{\boldsymbol{{x}}}_i)_{i=1}^n$, where $y_i$ describes the one-dimensional response for observation i and ${\boldsymbol{{x}}}_i=(x_{1,i},...,x_{j,i},...,x_{d,i})^{\prime}$ is a d-dimensional vector of features which characterizes the observation. We may omit the index i, which indicates the observation, and simply use $(y,{\boldsymbol{{x}}})$. The vector ${\boldsymbol{{x}}}$ consists of mixed categorical and numerical features. We assume that we have c categorical features and $d-c$ numerical features.

The categorical features are first one-hot encoded. Let $x_j$ denote a categorical feature with $m_j$ different labels $\left\{a^j_1,...,a^j_{m_j}\right\}$. This categorical feature is transformed into an $m_j$-dimensional vector of zeros and ones:
\begin{eqnarray*}x_{j} \mapsto {\boldsymbol{{x}}}^{cat}_{j}=\big(x_{j_1},...,x_{j_{m_j}}\big)^{\prime}=\big(\mathbf{1}\{x_{j}=a^j_1\},...,\mathbf{1}\{x_{j}=a^j_{m_j}\}\big)^{\prime}\in {\mathbb{R}}^{m_j}.\end{eqnarray*}
The dimension of the vector of features ${\boldsymbol{{x}}}=\big(({\boldsymbol{{x}}}^{cat})^{\prime},({\boldsymbol{{x}}}^{num})^{\prime}\big)^{\prime}=\big(({\boldsymbol{{x}}}_{1}^{cat})^{\prime},...,({\boldsymbol{{x}}}_{c}^{cat})^{\prime},x_{c+1},...,x_{d}\big)^{\prime}$ becomes $\sum_{j=1}^{c}m_j+d-c$. As far as the numerical features are concerned, we assume that each numerical feature takes its values in $[-1,1]$, that is, min–max scaler transformations are applied to the numerical features on the original scale.

In general, we use neural networks with $M \in {\mathbb N}$ hidden layers and $q_m\in {\mathbb N}$ neurons in each hidden layer $m=1,\ldots,M$. The network layers are defined with the mappings:
(2.1)
\begin{eqnarray}{\boldsymbol{{z}}} \in {\mathbb R}^{q_{m-1}}\;\mapsto\;\theta^{m}({\boldsymbol{{z}}})&=&\big(\theta^{m}_1({\boldsymbol{{z}}}),\ldots,\theta^{m}_{q_m}({\boldsymbol{{z}}})\big)^{\prime}\in {\mathbb R}^{q_m},\quad m=1,\ldots,M, \end{eqnarray}
(2.2)
\begin{eqnarray} {\boldsymbol{{z}}} \in {\mathbb R}^{q_{m-1}}\;\mapsto\;\theta^{m}_{r}({\boldsymbol{{z}}})&=&\chi^m\big(b_r^m+\langle{\boldsymbol{{w}}}^{m}_r,{\boldsymbol{{z}}}\rangle\big),\quad r=1,\ldots,q_m,\end{eqnarray}
where $\chi^m\;:\;{\mathbb R} \to {\mathbb R}$ denotes an activation function, ${\boldsymbol{{w}}}^m_r \in {\mathbb R}^{q_{m-1}}$ denotes the network weights, $b^m_r \in {\mathbb R}$ denotes the bias term, and $\langle \cdot , \cdot \rangle$ denotes the scalar product in ${\mathbb R}^{q_{m-1}}$. By $q_0$ we denote the dimension of the input vector to the network.
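For illustration, the layer mapping (2.1)–(2.2) can be written in a few lines of R. This is a didactic sketch only; the weight matrix, bias vector and activation function are generic placeholders and not part of the implementation used in this paper.

```r
# One network layer theta^m as in (2.1)-(2.2): W is the q_m x q_{m-1} weight
# matrix whose rows are the w_r^m, b is the q_m-dimensional bias vector, and
# chi is an activation function applied componentwise (tanh in our networks).
layer_map <- function(z, W, b, chi = tanh) {
  chi(as.vector(W %*% z) + b)
}
# A network with M hidden layers composes such maps, theta^M o ... o theta^1;
# the output layer (2.3) is the same map with the identity activation.
```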
The mapping:
(2.3)
\begin{eqnarray}{\boldsymbol{{z}}} \in {\mathbb R}^{q_{0}}\; \mapsto \Theta^{M+1}({\boldsymbol{{z}}})=\big(\Theta_1^{M+1}({\boldsymbol{{z}}}),\ldots,\Theta_{q_{M+1}}^{M+1}({\boldsymbol{{z}}})\big)^{\prime} \in {\mathbb R}^{q_{M+1}},\end{eqnarray}
with a composition of the network layers $\theta^1, \ldots, \theta^M$, and the components:
\begin{eqnarray*}{\boldsymbol{{z}}}\;\mapsto\;\Theta_r^{M+1}({\boldsymbol{{z}}})=b_r^{M+1}+\Big\langle{\boldsymbol{{w}}}_r^{M+1},\left(\theta^{M}\circ\cdots\circ\theta^1\right)({\boldsymbol{{z}}})\Big\rangle,\quad r=1,\ldots,q_{M+1},\end{eqnarray*}
gives us the prediction from the network in the output layer $M+1$ of dimension $q_{M+1}$ based on the input vector ${\boldsymbol{{z}}}$. The output (2.3) returns the prediction with the linear activation function, and this prediction can be transformed with an appropriate non-trainable and non-linear mapping if this is required for an application. If we set $M=0$ in (2.1)–(2.2), then we assume that the input vector is just linearly transformed to give the prediction in the output layer of dimension $q_1$ and, in this case, the components in (2.3) are given by
\begin{eqnarray*}{\boldsymbol{{z}}}\;\mapsto\;\Theta_r^{1}({\boldsymbol{{z}}})=b_r^{1}+\big\langle{\boldsymbol{{w}}}_r^{1},{\boldsymbol{{z}}}\big\rangle,\quad r=1,\ldots,q_{1}.\end{eqnarray*}

In our numerical experiments, we use the data set freMTPL2freq, which is included in the R package CASdatasets. The data set has 678,013 observations from insurance policies. The response Y describes the number of claims per policy. Each policy has nine features and an exposure: $({\boldsymbol{{x}}},Exp)$. This data set is extensively studied by Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020) in the context of applications of generalized linear models and neural networks to modelling the number of claims. We perform the same data cleaning and feature pre-processing as in these papers. For the purpose of our experiments, we work with the features presented in Table 1.

Table 1. Features used in our experiments.

6 categorical features: Area — 6 levels, VehPower — 6 levels, VehAge — 3 levels, DrivAge — 7 levels, VehBrand — 11 levels, Region — 22 levels.
2 numerical features: BonusMalus (capped at 150), log-Density.
1 binary feature: VehGas.

We consider a supervised learning task where the goal is to predict the number of claims for a policyholder characterized by $({\boldsymbol{{x}}},Exp)$ by estimating the regression function ${\mathbb{E}}[Y|{\boldsymbol{{x}}},Exp]$. The prediction is constructed with the neural network described above, and the one-dimensional output from the network (2.3) is transformed with the non-trainable and non-linear exponential transformation:
(2.4)
\begin{eqnarray}{\mathbb{E}}[Y|{\boldsymbol{{x}}},Exp]=e^{\log(Exp)+\Theta_1^{M+1}({\boldsymbol{{x}}})}.\end{eqnarray}
The parameters of the network are trained by minimizing the Poisson deviance loss function, see for example Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020). Our supervised learning task is solved with the help of unsupervised learning tasks where autoencoders are used.

3. Autoencoders

Let ${\boldsymbol{{x}}}$ denote a vector of (categorical, numerical, mixed) features of dimension p.
An autoencoder consists of two functions:
\begin{eqnarray*}\varphi\;:\; {\mathbb{R}}^p \mapsto {\mathbb{R}}^l,\quad \text{and} \quad \psi\;:\; {\mathbb{R}}^l \mapsto {\mathbb{R}}^p.\end{eqnarray*}
The mapping $\varphi$ is called the encoder, and $\psi$ is called the decoder. The mapping ${\boldsymbol{{x}}}\mapsto\varphi({\boldsymbol{{x}}})$ from the encoder gives an l-dimensional representation of the p-dimensional vector ${\boldsymbol{{x}}}$. The mapping ${\boldsymbol{{z}}}\mapsto\psi({\boldsymbol{{z}}})$ from the decoder tries to reconstruct the p-dimensional vector ${\boldsymbol{{x}}}$ from its l-dimensional representation ${\boldsymbol{{z}}}=\varphi({\boldsymbol{{x}}})$. We define the reconstruction function as
\begin{eqnarray*}\pi=\psi\circ\varphi \;:\;{\mathbb{R}}^p\mapsto{\mathbb{R}}^p.\end{eqnarray*}
For a data set with observations $({\boldsymbol{{x}}}_i)_{i=1}^n$, the goal is to find the functions $\varphi$ and $\psi$ such that the reconstruction error
\begin{eqnarray*}\frac{1}{n}\sum_{i=1}^{n}L(\pi({\boldsymbol{{x}}}_i),{\boldsymbol{{x}}}_i),\end{eqnarray*}
measured with a loss function L, is minimized. If we can find an autoencoder for which the reconstruction error is small, then we can claim that the encoder extracts the most important information from the multi-dimensional vector of features. Consequently, we can use the representation $\varphi({\boldsymbol{{x}}})$, instead of ${\boldsymbol{{x}}}$, as input to predict the response in our supervised learning task. The observed response y is not used when we train an autoencoder. We train autoencoders in a fully unsupervised fashion, but we will improve the representation based on the target y when we solve the supervised learning task.

Linear autoencoders are well known in statistics. By a linear autoencoder, we mean an autoencoder where both functions $\varphi$ and $\psi$ are linear. Classical examples of linear autoencoders include autoencoders built with Principal Component Analysis (PCA) for numerical data and Multiple Correspondence Analysis (MCA) for categorical data. We refer, for example, to Chapter 6.2 in Dixon et al. (2020) for the equivalence between the linear autoencoder built by minimizing the mean square reconstruction loss and the representation built with the PCA algorithm. For MCA and its relation to PCA, we refer, for example, to Pagès (2015) and Chavent et al. (2017).

In this paper, we are interested in non-linear autoencoders where at least one of the functions $\varphi$ or $\psi$ is non-linear. We use the notation (2.1)–(2.2) from the previous section. To build an autoencoder for the input ${\boldsymbol{{x}}}$, we use a neural network with one hidden layer, that is $M=1$. The dimension of the single hidden layer is set to $q_1=l$, and the dimensions of the input and the output are set to $q_0=q_2=p$. The activation function for the hidden layer depends on the type of data and is discussed in the sequel. The vector $(\theta^1_1({\boldsymbol{{x}}}),...,\theta^1_l({\boldsymbol{{x}}}))^{\prime}$ gives us the representation of the input ${\boldsymbol{{x}}}$ from the encoder. The vector $(\Theta^2_1({\boldsymbol{{x}}}),...,\Theta^2_p({\boldsymbol{{x}}}))^{\prime}$, transformed with a non-trainable and non-linear function if required for the data, gives us the reconstruction of the input ${\boldsymbol{{x}}}$ predicted with the decoder.
This means that $\psi$ also includes a non-trainable and non-linear transformation of the output (2.3) from the network if such a transformation is required for the application. Clearly, we could also build deep autoencoders with more hidden layers, but shallow autoencoders with one hidden layer are sufficient for our main application in Section 4.

If $l<p$, we construct under-complete autoencoders and we reduce the dimension of the input ${\boldsymbol{{x}}}$. Linear autoencoders built with the PCA and MCA algorithms are examples of under-complete autoencoders. If we choose $l=p$, then we can achieve a zero reconstruction error by learning the identity mapping. Interestingly, we can also learn over-complete autoencoders with $l>p$, and denoising autoencoders are examples of over-complete autoencoders. In order to construct a denoising autoencoder with $l>p$, we corrupt the input to the network. The objective for training a denoising autoencoder is to find the functions $\varphi$ and $\psi$ such that the reconstruction error
\begin{eqnarray*}\frac{1}{n}\sum_{i=1}^{n}L(\pi(\tilde{{\boldsymbol{{x}}}}_i),{\boldsymbol{{x}}}_i),\end{eqnarray*}
measured with a loss function L, is minimized. This time, the input $\tilde{{\boldsymbol{{x}}}}$ is a corrupted version of the input ${\boldsymbol{{x}}}$, constructed by adding noise to ${\boldsymbol{{x}}}$. It has been demonstrated in the machine learning literature that denoising autoencoders are very good at extracting the most important information from a multi-dimensional vector of features, see for example Vincent et al. (2008, 2010). We can also construct over-complete autoencoders using data without noise if a low number of epochs is used for training the autoencoder built with a neural network.

In the next two sections, we discuss autoencoders for numerical and categorical features.

3.1. Autoencoders for numerical features

As discussed in the Introduction, autoencoders without noise for numerical features have been investigated in various actuarial applications. In this paper, we adopt the approach from Rentzmann and Wüthrich (2019). We use the hyperbolic tangent activation function in the hidden layer ($\chi^1$), reconstruct the input using the linear prediction:
(3.1)
\begin{eqnarray}{\boldsymbol{{x}}}\in{\mathbb{R}}^p\mapsto\hat{{\boldsymbol{{x}}}}=\pi({\boldsymbol{{x}}})=\left(\Theta_1^2({\boldsymbol{{x}}}),...,\Theta_p^2({\boldsymbol{{x}}})\right)^{\prime}\in{\mathbb{R}}^p,\end{eqnarray}
and use the mean square error loss function L to measure the reconstruction error between the prediction $\hat{{\boldsymbol{{x}}}}=\pi({\boldsymbol{{x}}})$ and the input ${\boldsymbol{{x}}}$. We build a non-linear autoencoder since we use a non-linear activation function (the hyperbolic tangent) in the hidden layer. In contrast to Rentzmann and Wüthrich (2019), we allow for bias terms in the network since we use the min–max scaler transformation of the numerical features instead of zero mean and unit variance standardization. An example of the architecture of a neural network used in this paper to build an autoencoder for numerical features is presented in Figure 1.

Figure 1. Architecture of the autoencoder for numerical features used in the paper.

As far as denoising autoencoders are concerned, we apply two types of corruption processes to distort the input, see for example Vincent et al.
(2008, 2010):

- Gaussian disturbance (gaussian): For each observation, $i=1,...,n$, and each numerical feature in vector ${\boldsymbol{{x}}}_i$, the original input is corrupted with the transformation $x_{j,i}\mapsto\tilde{x}_{j,i}\sim N(x_{j,i},\sigma^2)$.
- Masking to zero (zero): For each observation, $i=1,...,n$, and a fraction v of numerical features in vector ${\boldsymbol{{x}}}_i$ chosen at random, the original input is corrupted with the transformation $x_{j,i}\mapsto\tilde{x}_{j,i}=0$.

3.2. Autoencoders for categorical features

We consider two types of architecture of autoencoders for categorical features. Neither has been explored in the actuarial literature, although they, or versions of them, appear in many applications of machine learning methods in various fields.

1. Separate autoencoders for each feature (Separate AEs): For a categorical feature $x_j$ with $m_j$ different labels and its one-hot representation ${\boldsymbol{{x}}}_j^{cat}=(x_{j_1},...,x_{j_{m_j}})^{\prime}$, we build a neural network (2.1)–(2.2) with $M=1, q_0=m_j, q_1=l_j, q_2=m_j$, where $l_j$ is the required dimension of the representation of the categorical feature. Since we use the one-hot representation of $x_j$ as the input to the network, there is no need to train bias terms in the hidden layer, so we set $b^1_r=0$ for $r=1,...,l_j$. However, it is still beneficial to train bias terms in the output layer in order to match the output expressed with probabilities (see below). The linear activation function for $\chi^1$ in the hidden layer is a natural choice here since the linear mappings $\langle{\boldsymbol{{w}}}^{1}_r,{\boldsymbol{{x}}}^{cat}_j\rangle$, for neurons $r=1,...,l_j$, yield unique constants for each label of the categorical feature, so there is no need to apply non-linear transformations to these constants. We reconstruct the input using the prediction:
(3.2)
\begin{eqnarray}{\boldsymbol{{x}}}_j^{cat}\in{\mathbb{R}}^{m_j}\mapsto\hat{{\boldsymbol{{x}}}}^{cat}_j=\pi({\boldsymbol{{x}}}_j^{cat})=\big(\pi_1({\boldsymbol{{x}}}_j^{cat}),...,\pi_{m_j}({\boldsymbol{{x}}}_j^{cat})\big)^{\prime}\in{\mathbb{R}}^{m_j},\end{eqnarray}
where
\begin{eqnarray*}\pi_r({\boldsymbol{{x}}}_j^{cat})=\frac{e^{\Theta^2_r({\boldsymbol{{x}}}_j^{cat})}}{\sum_{u=1}^{m_j}e^{\Theta^2_u({\boldsymbol{{x}}}_j^{cat})}},\quad r=1,...,m_j.\end{eqnarray*}
The soft-max activation function is applied to the output from the network (2.3) to derive the reconstructed input. The reconstruction function returns the probabilities that the reconstructed feature takes a particular label. The label with the highest predicted probability is the label predicted for the reconstructed feature. Since we now deal with a classification problem for the single categorical feature $x_j$, it is natural to use the cross-entropy loss function L to measure the reconstruction error between the prediction $\hat{{\boldsymbol{{x}}}}=\pi({\boldsymbol{{x}}})$ and the input ${\boldsymbol{{x}}}$:
(3.3)
\begin{eqnarray}L(\pi({\boldsymbol{{x}}}^{cat}_{j,i}),{\boldsymbol{{x}}}^{cat}_{j,i})=-\sum_{r=1}^{m_j}x^{cat}_{j_{r},i}\log\!\big(\pi_r({\boldsymbol{{x}}}^{cat}_{j,i})\big),\quad i=1,...,n.\end{eqnarray}
We build a non-linear autoencoder since we use a non-linear activation function (the soft-max function) in the output layer. The approach described above is applied to all categorical features in the data set.
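For concreteness, a minimal keras sketch of a Separate AE for a single categorical feature is given below. This is our own illustration with placeholder dimensions (for example, Region with 22 labels and a 2-dimensional representation) and the hypothetical one-hot matrix X_j; it is not the implementation behind the results in this paper.

```r
library(keras)

m_j <- 22L   # number of labels of the categorical feature (e.g. Region)
l_j <- 2L    # dimension of its representation

inp  <- layer_input(shape = m_j)
code <- inp  %>% layer_dense(units = l_j, activation = "linear",
                             use_bias = FALSE)                     # encoder, no bias terms
out  <- code %>% layer_dense(units = m_j, activation = "softmax")  # decoder with bias terms
ae_j <- keras_model(inputs = inp, outputs = out)

# Cross-entropy reconstruction loss (3.3); input and target are both the
# n x m_j one-hot matrix of the feature (here a hypothetical matrix X_j).
ae_j %>% compile(optimizer = "adam", loss = "categorical_crossentropy")
# ae_j %>% fit(x = X_j, y = X_j, epochs = 100, batch_size = 1000)
```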
An example of the architecture of a neural network used in this paper to build the autoencoder of type Separate AEs for categorical features (with 2 and 3 labels) is presented in Figure 2.

2. Joint autoencoder for all features (Joint AE): We consider a vector of categorical features $(x_1,...,x_c)$ with $(m_1,...,m_c)$ different labels and their one-hot representations ${\boldsymbol{{x}}}^{cat}=\big(({\boldsymbol{{x}}}_1^{cat})^{\prime},...,$ $({\boldsymbol{{x}}}_c^{cat})^{\prime}\big)^{\prime}$. Let $\bar{m}_0=0$ and $\bar{m}_j=\sum_{u=1}^j m_u$ for $j=1,...,c$. This time we build a neural network (2.1)–(2.2) with $M=1, q_0=\bar{m}_c, q_1=l, q_2=\bar{m}_c$, where l is the dimension of the required joint representation of all categorical features. We still set $b^1_r=0$ for $r=1,...,l$, train bias terms in the output layer and apply the linear activation function for $\chi^1$. We reconstruct the input using the prediction:
(3.4)
\begin{eqnarray}{\boldsymbol{{x}}}^{cat}\in{\mathbb{R}}^{\bar{m}_c}\mapsto\hat{{\boldsymbol{{x}}}}^{cat}=\pi({\boldsymbol{{x}}}^{cat})&=&\big(\pi_1({\boldsymbol{{x}}}^{cat}),...,\pi_{\bar{m}_1}({\boldsymbol{{x}}}^{cat}), ...,\nonumber\\[5pt] &&\pi_{\bar{m}_{j-1}+1}({\boldsymbol{{x}}}^{cat}),...,\pi_{\bar{m}_j}({\boldsymbol{{x}}}^{cat}),...,\nonumber\\[5pt] &&\pi_{\bar{m}_{c-1}+1}({\boldsymbol{{x}}}^{cat}),...,\pi_{\bar{m}_c}({\boldsymbol{{x}}}^{cat})\big)^{\prime}\in{\mathbb{R}}^{\bar{m}_c},\end{eqnarray}
where
\begin{eqnarray*}\pi_r({\boldsymbol{{x}}}^{cat})=\frac{e^{\Theta^2_r({\boldsymbol{{x}}}^{cat})}}{\sum_{u=\bar{m}_{j-1}+1}^{\bar{m}_j}e^{\Theta^2_u({\boldsymbol{{x}}}^{cat})}},\quad r=\bar{m}_{j-1}+1,...,\bar{m}_j,\quad j=1,...,c,\end{eqnarray*}
and $\pi_r({\boldsymbol{{x}}}^{cat})$, for $r=\bar{m}_{j-1}+1,...,\bar{m}_j$, return the probabilities that the categorical feature $x_j$ takes a particular label among its $m_j$ labels. The prediction of the label for $x_j$ is the label with the highest predicted probability among the $\pi_r({\boldsymbol{{x}}}^{cat})$. Clearly, we build a non-linear autoencoder. We remark that the soft-max activation functions are now applied to groups of neurons in the output layer from the network (2.3) which correspond to the labels of the categorical features. Hence, the decoder here returns probabilities in classification problems for all categorical features. This time, all neurons in the layers of the autoencoder (before the soft-max transformations are applied) share the parameters of one neural network. By applying the Separate AEs, we independently solve multiple classification problems for our categorical features with separate autoencoders, whereas by applying the Joint AE, we jointly solve multiple classification problems for our categorical features with one autoencoder. Such an approach is called multi-task learning in machine learning, see for example Caruana (1997) and Ruder (2017).
We use the cross-entropy loss function L to measure the reconstruction error between the prediction $\hat{{\boldsymbol{{x}}}}=\pi({\boldsymbol{{x}}})$ and the input ${\boldsymbol{{x}}}$:
(3.5)
\begin{eqnarray}L(\pi({\boldsymbol{{x}}}_i^{cat}),{\boldsymbol{{x}}}_i^{cat})=-\sum_{j=1}^c\sum_{r=1}^{m_j}x^{cat}_{j_{r},i}\log\!\big(\pi_r({\boldsymbol{{x}}}^{cat}_i)\big),\quad i=1,...,n.\end{eqnarray}
An example of the architecture of type Joint AE is presented in Figure 3.

Figure 2. Architecture of the autoencoder of type Separate AEs for categorical features.

Figure 3. Architecture of the autoencoder of type Joint AE for categorical features.

In order to build denoising autoencoders, we apply the following corruption processes for categorical features. For each observation, $i=1,...,n$, and a fraction v of categorical features in vector ${\boldsymbol{{x}}}_i$ chosen at random, the original input is corrupted with one of the transformations:

- Sampling a new label (sample): The original input is corrupted with the transformation $x_{j,i}\mapsto\tilde{x}_{j,i}\sim \hat{F}_{x_j}$ and one-hot encoded with $\tilde{x}_{j,i}\mapsto \tilde{{\boldsymbol{{x}}}}^{cat}_{j,i}$, where $\hat{F}_{x_j}$ is the empirical distribution of the feature $x_j$ in the data set. This corruption process can be seen as an extension of the salt-and-pepper noise for binary data to categorical data, see for example Vincent et al. (2008, 2010) for the salt-and-pepper noise for binary data.
- Masking to zero (zero): The original input and its one-hot encoding are corrupted with the transformation ${\boldsymbol{{x}}}^{cat}_{j,i}\mapsto\tilde{{\boldsymbol{{x}}}}^{cat}_{j,i}=\mathbf{0}'$, where $\mathbf{0}$ is a vector of zeros. This corruption process is an analogue of the masking technique applied to numerical features in Section 3.1.

We conclude this section with some remarks on the types of architecture of our autoencoders for categorical data:

(a) The approach with the Separate AEs has at least two disadvantages compared to the Joint AE. First, we have to train a number of autoencoders equal to the number of categorical features, which may be time-consuming. Secondly, and more importantly, we neglect possible dependencies between different categorical features when creating representations with separate and independent autoencoders. The second disadvantage is explored in Experiment 1 below. We consider the approach with the Separate AEs as a benchmark since it gives us a representation of categorical data which matches the representation of categorical data learned with entity embeddings in supervised learning tasks.

(b) If the categorical features are binary features, then our approach with the Joint AE is aligned with the approach for binary data used by Vincent et al. (2008, 2010) in their experiments. For binary data, autoencoders which coincide with our Joint AE are also used in Generative Adversarial Imputation Nets, see for example Yoon et al. (2018).

(c) Hespe (2020) recommends a multi-task learning autoencoder for categorical data which agrees with our Joint AE. He also describes single-task learning autoencoders learned with loss functions different from the cross-entropy.
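A minimal keras sketch of the Joint AE is given below, using the toy setting of Figures 2 and 3 (two categorical features with 2 and 3 labels); for the real data, the label counts would come from Table 1. It is our own illustration and not the authors' implementation: the group-wise soft-max of (3.4) is obtained by giving the decoder one soft-max head per categorical feature, and the total loss is then the sum of the per-feature cross-entropies, that is (3.5).

```r
library(keras)

m <- c(2L, 3L)   # label counts m_1, ..., m_c (toy values)
l <- 2L          # dimension of the joint representation (placeholder)

inp  <- layer_input(shape = sum(m))
code <- inp %>% layer_dense(units = l, activation = "linear",
                            use_bias = FALSE, name = "encoder")   # joint linear encoder
outs <- lapply(seq_along(m), function(j)                          # one soft-max head per feature
  code %>% layer_dense(units = m[j], activation = "softmax"))
ae_cat <- keras_model(inputs = inp, outputs = outs)

ae_cat %>% compile(optimizer = "adam",
                   loss = as.list(rep("categorical_crossentropy", length(m))))
# Input: the concatenated one-hot blocks; targets: the list of one-hot blocks, e.g.
# ae_cat %>% fit(x = X_cat, y = list(X_1, X_2), epochs = 15, batch_size = 1000)
```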
3.3. Experiment 1: the reconstruction ability of autoencoders

We compare the following four autoencoders for categorical data:

- Separate AEs,
- Joint AE,
- MCA — we build a linear autoencoder with the classical MCA algorithm, that is, instead of training a neural network, we apply Generalized Singular Value Decomposition (GSVD) to a matrix with centered one-hot encoded categorical features, see Pagès (2015) and Chavent et al. (2017),
- MCA as non-linear PCA — we build a non-linear autoencoder for numerical data, the one described in Section 3.1, on linearly transformed one-hot encoded categorical features. From Pagès (2015) and Chavent et al. (2017), MCA is PCA on centered one-hot encoded categorical data transformed with linear mappings (GSVD). Instead of building a linear autoencoder, which is equivalent to the PCA algorithm, on linearly transformed centered one-hot encoded categorical features, we build a non-linear autoencoder with the hyperbolic tangent activation function in the single hidden layer by minimizing the mean square reconstruction error.

From the data set freMTPL2freq with 678,013 observations, we sample 100,000 observations. We work with a smaller data set to speed up the calculations. We limit our attention to categorical features and consider the six categorical features from Table 1. Our data set with 100,000 observations is next split randomly into five data sets with 20,000 observations each. We build our autoencoders on each of these five sets and report the average metric over these five sets evaluated on the training set. As the metric, we use the cosine similarity measure, but the findings also hold, for example, for the number of correct predictions. In this experiment, we only build under-complete autoencoders without noise, as this is sufficient to derive the key conclusions. We train our autoencoders with 15, 100 and 500 epochs. We do not differentiate between a training, a validation and a test set (we do not discuss possible over-fitting) since we are only interested in evaluating the reconstruction errors of the autoencoders. Details are presented in Section 1 in the Online Supplement.

The dimension of the data matrix with the one-hot encoded categorical features is 54. We consider a range of dimensions of the representation of the categorical features: $q_1=l=6,8,10,12,15,20,30$. For the Separate AEs, we have to specify the number of neurons $l_j$ (the dimension of the representation) for each feature j. We assume that the number of neurons l, which defines the global dimension of the representation for all categorical features, is split across the individual categorical features evenly, if possible, and if not possible, a larger number of neurons is allocated to a feature with a larger number of labels. For example, if we choose $l=6$, then we build representations of dimension 1 for each feature; if we choose $l=12$, then we build representations of dimension 2 for each feature; but if we choose $l=8$, then we build representations of dimension 2 for Region and VehBrand (these two features have the two largest numbers of labels in the data set) and representations of dimension 1 for the remaining features.

We present the results in Figure 4. As expected, the cosine similarity increases with the number of neurons and the number of epochs.
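For reference, the cosine similarity between reconstructions and one-hot inputs can be computed as in the short sketch below; this is our reading of the metric, and the exact variant used for Figure 4 may differ.

```r
# Average cosine similarity between the rows of the one-hot input matrix X and
# the rows of the reconstruction matrix X_hat (both of dimension n x sum_j m_j).
cosine_similarity <- function(X, X_hat) {
  mean(rowSums(X * X_hat) / (sqrt(rowSums(X^2)) * sqrt(rowSums(X_hat^2))))
}
```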
For the large number of epochs (500), for which we achieve the smallest reconstruction errors for all our autoencoders in terms of the loss functions minimized in the training process, the autoencoders Separate AEs and Joint AE are very similar in terms of their reconstruction power measured with the cosine similarity, and they are much better than the remaining two autoencoders. The first conclusion confirms that categorical data have different intrinsic properties than numerical data, which are explored when a low-dimensional representation is built with an autoencoder, and categorical data should not be compressed with algorithms derived for numerical data (MCA is just PCA on linearly transformed data). The second conclusion is that for the low and the medium number of epochs (15, 100), especially for the low number of epochs, the performance of the Joint AE is superior in terms of its ability to reconstruct the input from a low-dimensional representation. In particular, our experiment shows that there are dependencies between the categorical features in the data set which are efficiently captured by the Joint AE at initial epochs (15 epochs) of the learning process of the autoencoder, and which cannot be captured by learning independent Separate AEs. Intuitively, dependencies between categorical features should allow the Joint AE to learn more robust and informative representations of categorical features, and the Joint AE should lead to better reconstruction errors compared to the Separate AEs. For the low number of epochs (15) and a low dimension of the representation (6, 8, 10), the Joint AE is very similar to the MCA, but the performance of the Joint AE improves quickly when we increase the number of epochs. Clearly, we can benefit from non-linear autoencoders when the purpose is to derive informative representations of categorical features.

Figure 4. Cosine similarity measures for autoencoders for categorical data.

As discussed in Section 3, if we can find an autoencoder for which the reconstruction error is small, then we can say that the encoder extracts the most important information from a multi-dimensional vector of features. Our example shows that representations of categorical features built with the Separate AEs may not be optimal in terms of their robustness and informativeness, especially if we do not want to spend much time on training autoencoders with a large number of epochs. It is known that the predictive power of neural networks and their generalization properties in supervised learning tasks depend on providing a good representation of the available information for its efficient pre-processing in the hidden layers before the final prediction of the response is constructed with the output. Since the Joint AE performs better than the Separate AEs in terms of providing a more robust and informative representation of categorical features, we may prefer to use the numerical representation of categorical features implied by the Joint AE, rather than the Separate AEs, as the input to neural networks built for supervised learning tasks. However, in all practical examples in actuarial data science to date, the numerical representation of categorical features which is fed into the hidden layers of a neural network matches the representation from the Separate AEs. We have to use a different type of architecture of a neural network to use the representation from the Joint AE.
This experiment may serve as a motivating example for what we present in the sequel.

4. Training neural networks with mixed categorical and numerical features

We now move to the main topic of this paper. Below, we discuss different approaches to training neural networks with mixed categorical and numerical features in supervised learning tasks. These approaches differ in the architecture of the neural network and in the initialization of the parameters of the neural network.

4.1. Architecture A1 with separate entity embeddings

Let us start by recalling the concept of an entity embedding developed by Guo and Berkhahn (2016). An entity embedding for a categorical feature $x_j$ is a neural network which maps the categorical feature $x_j$, with its one-hot representation ${\boldsymbol{{x}}}_j^{cat}$, into a vector of dimension $l_j$:
\begin{eqnarray*}{\boldsymbol{{x}}}^{cat}_{j}\in{\mathbb{R}}^{m_j} \mapsto {\boldsymbol{{x}}}^{ee}_{j}=(x^{ee}_{j_1},...,x^{ee}_{j_{l_j}})^{\prime}\in{\mathbb{R}}^{l_j},\end{eqnarray*}
where
\begin{eqnarray*}x^{ee}_{j_r}=\langle{\boldsymbol{{w}}}^{ee}_r,{\boldsymbol{{x}}}^{cat}_j\rangle,\quad r=1,...,l_j.\end{eqnarray*}
With an entity embedding, each label, from the set of $m_j$ possible labels $\{a^j_1,...,a^j_{m_j}\}$ of the categorical feature $x_j$, can be represented with a vector in the space ${\mathbb{R}}^{l_j}$. The parameter $l_j$ is the dimension of the embedding for the categorical feature $x_j$.

In Figure 5, we provide an example of the architecture of a neural network with mixed categorical and numerical features used in supervised learning tasks in actuarial data science. This architecture uses entity embeddings for categorical features and has been promoted by Richman (2021), Noll et al. (2019) and Ferrario et al. (2020). We present a simple example with two categorical features ${\boldsymbol{{x}}}^{cat}_1, {\boldsymbol{{x}}}^{cat}_2$, with 3 and 2 levels, and two numerical features $x_3$ and $x_4$. For ${\boldsymbol{{x}}}^{cat}_1$, we implement the entity embedding of dimension 2, and for ${\boldsymbol{{x}}}^{cat}_2$ — the entity embedding of dimension 1. More generally, within (2.1)–(2.2), we define a neural network with Architecture 1 (A1):

- For each categorical feature $x_j$, $j=1,..,c$, we build an entity embedding — a sub-network without hidden layers, that is $M=0$, where the input ${\boldsymbol{{z}}}={\boldsymbol{{x}}}^{cat}_j$, $q_0=m_j$ and the output $q_1=l_j$,
- Once all one-hot encoded categorical features are transformed with the linear mappings of the entity embeddings, the outputs from the entity embeddings, that is, the numerical representations of the categorical features, are concatenated with the numerical features to yield a new numerical vector of all features. This new vector is fed as the input into another sub-network with M hidden layers,
- We build a sub-network with M hidden layers with numbers of neurons $q_1,...,q_M$ and the hyperbolic tangent activation functions $\chi^1,...,\chi^M$ in the hidden layers, where the input ${\boldsymbol{{z}}}=\big(({\boldsymbol{{x}}}^{ee}_1)^{\prime},...,({\boldsymbol{{x}}}^{ee}_c)^{\prime},x_{c+1},...,x_d\big)^{\prime}$ and $q_0=\sum_{j=1}^cl_j+d-c$,
- All weights of the network (including the weights of the entity embeddings) are initialized with values sampled from uniform distributions with the Xavier initialization, see Glorot and Bengio (2010), and the bias terms are initialized with zero.

Figure 5. Architecture of type A1 with separate entity embeddings.

The goal of this paper is to challenge A1 with a new architecture and a new training process of a neural network. The results from Experiment 1 provide us with arguments regarding how we could change A1. We can now clearly observe that the numerical representations of the categorical features learned with the entity embeddings in A1 match, in their architectures, the numerical representations learned with the Separate AEs. From Section 3.3, we conclude that we could replace the numerical representations of the categorical features in A1 with the representation learned with the Joint AE. This leads us to introduce Architecture 2.

4.2. Architecture A2 with joint embedding

Instead of applying separate entity embeddings to each categorical feature, we now use a joint embedding for all categorical features. A joint embedding is understood here as a neural network with the following mapping:
\begin{eqnarray*}{\boldsymbol{{x}}}^{cat}=(({\boldsymbol{{x}}}^{cat}_1)^{\prime},...,({\boldsymbol{{x}}}^{cat}_c)^{\prime})^{\prime}\in{\mathbb{R}}^{\bar{m}_c} \mapsto {\boldsymbol{{x}}}^{\tilde{ee}}=(x^{\tilde{ee}}_{1},...,x^{\tilde{ee}}_{l})^{\prime}\in{\mathbb{R}}^{l},\end{eqnarray*}
where
\begin{eqnarray*}x^{\tilde{ee}}_{r}=\langle{\boldsymbol{{w}}}^{\tilde{ee}}_r,{\boldsymbol{{x}}}^{cat}\rangle,\quad r=1,...,l.\end{eqnarray*}
The parameter l is the dimension of the embedding for all categorical features $(x_1,...,x_c)$. We expect that $l<l_1+...+l_c$.

Our new architecture of a neural network with mixed categorical and numerical features, where the categorical features are modelled with a joint embedding, is presented in Figure 6. For ${\boldsymbol{{x}}}^{cat}_1,{\boldsymbol{{x}}}^{cat}_2$, we implement a joint embedding of dimension 3 — this is the only, but also a significant, difference between the architectures in Figures 5 and 6. Within (2.1)–(2.2), we define a neural network with Architecture 2 (A2), as follows (a keras sketch is given below):

- For all categorical features $(x_1,...,x_c)$, we build a joint embedding — a neural sub-network without hidden layers, that is $M=0$, where the input ${\boldsymbol{{z}}}={\boldsymbol{{x}}}^{cat}$, $q_0=\bar{m}_c$ and the output $q_1=l$,
- The next steps of building the network for predicting the response are the same as for A1,
- We initialize all weights with the Xavier initialization and set the bias terms equal to zero, as for A1.

Figure 6. Architecture of type A2 with joint embedding.

We can observe that the numerical representation of the categorical features learned with the joint embedding in A2 matches, in its architecture, the representation learned with the Joint AE. We have already discussed the advantages of this representation in unsupervised learning tasks, which should also hold in supervised learning tasks.
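A minimal keras sketch of A2 could look as follows. The dimensions follow the example in Section 4.4 where they are stated (54 one-hot columns, a joint embedding of dimension 8 and 3 numerical features); the number of neurons per hidden layer and the layer names are placeholders, and the sketch is not the implementation used for the results below.

```r
library(keras)

M_c <- 54L; l <- 8L; n_num <- 3L; q <- 20L   # q is a placeholder

cat_in <- layer_input(shape = M_c,   name = "x_cat")          # concatenated one-hot features
num_in <- layer_input(shape = n_num, name = "x_num")          # numerical features scaled to [-1, 1]
off_in <- layer_input(shape = 1L,    name = "log_exposure")   # log(Exp), non-trainable offset

embed <- cat_in %>%
  layer_dense(units = l, activation = "linear", use_bias = FALSE,
              name = "joint_embedding")

net <- layer_concatenate(list(embed, num_in)) %>%
  layer_dense(units = q, activation = "tanh", name = "hidden_1") %>%
  layer_dense(units = q, activation = "tanh") %>%
  layer_dense(units = q, activation = "tanh") %>%
  layer_dense(units = 1L, activation = "linear")

out <- layer_add(list(net, off_in)) %>%   # exposure offset and exponential output as in (2.4)
  layer_activation("exponential")

model <- keras_model(inputs = list(cat_in, num_in, off_in), outputs = out)
# keras' built-in "poisson" loss differs from the Poisson deviance only by terms
# that do not depend on the prediction, so minimizing it is equivalent for training.
model %>% compile(optimizer = "adam", loss = "poisson")
```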
In addition, we can expect that by learning a joint embedding for all categorical features, we allow all categorical features, and not only the labels of a single categorical feature, to share information about their impact on the response. As a result, we should be able to improve predictions of the response based on the experience collected from similar categorical features and their similar labels. Hence, the switch from A1 to A2 has intuitive foundations. To the best of our knowledge, A2 has not been considered to date in any actuarial data science problem.

4.3. Initialization of A1 and A2

The issue of initialization of the parameters of neural networks has been noticed in actuarial data science. Under A1, Merz and Wüthrich (2019) and Schelldorfer and Wüthrich (2019) propose the Combined Actuarial Neural Network (CANN) approach to initialize a neural network with predictions from a GLM — we call this architecture and training process A1_CANN. The idea is to add a skip connection to the output from the network with architecture A1. In mathematical terms, in A1 we use the prediction:
(4.1)
\begin{eqnarray}\lambda_i=e^{\log(Exp_i)+\Theta_1^{M+1}\big((({\boldsymbol{{x}}}^{ee}_{1,i})^{\prime},...,({\boldsymbol{{x}}}^{ee}_{c,i})^{\prime},x_{c+1,i},...,x_{d,i})^{\prime}\big)},\quad i=1,...,n,\end{eqnarray}
whereas in A1_CANN we use the prediction:
(4.2)
\begin{eqnarray}\lambda_i=e^{\log(Exp_i)+\eta^{GLM}_i+\Theta_1^{M+1}\big((({\boldsymbol{{x}}}^{ee}_{1,i})^{\prime},...,({\boldsymbol{{x}}}^{ee}_{c,i})^{\prime},x_{c+1,i},...,x_{d,i})^{\prime}\big)},\quad i=1,...,n,\end{eqnarray}
where $\eta^{GLM}_i$ denotes the prediction, on the linear scale, of the unit intensity (for exposure equal to one) from a Poisson GLM with a log link for observation i.

In A1 and A2, we initialize the weights of the embeddings for the categorical features with the Xavier initialization. However, since autoencoders extract important information about the features, we could initialize the weights of the embeddings with the weights from the encoder of the appropriate autoencoder and define the weights of the embeddings as non-trainable in the training process. This is reasonable, but may be sub-optimal for a supervised learning task, since the representation of the categorical features learned with an autoencoder without the information about the response would be kept fixed. To improve the representation from an autoencoder, we should fine-tune it in a supervised learning task with a target response. In the machine learning literature, Erhan et al. (2009, 2010) propose to pre-train layers of neural networks with denoising autoencoders, that is, to initialize neurons in layers of a neural network for a supervised learning task with representations of the neurons from denoising autoencoders built in unsupervised learning tasks for the input to the layers. We recover and modify their approach in this paper.

Apart from changing the architecture from A1 to A2, we initialize the weights and the bias terms in the joint embedding for the categorical features and the first hidden layer in A2 with the weights and the bias terms from the representations of the neurons in the layers learned with autoencoders. From Erhan et al. (2009, 2010), we know that the initialization procedure with autoencoders gives the largest gains in predictive power of a neural network when it is applied to the initial layers of the network.
We proceed in the following way (a code sketch of the weight transfer is given below):

- We build an autoencoder of type Joint AE (denoted as the 1st AE) for the categorical input $(x_1,...,x_c)$ using its one-hot representation ${\boldsymbol{{x}}}^{cat}=\big(({\boldsymbol{{x}}}^{cat}_1)^{\prime},...,({\boldsymbol{{x}}}^{cat}_c)^{\prime}\big)^{\prime}$. To build a denoising autoencoder, we corrupt the categorical input with the sample or the zero transformation, see Section 3.2,
- We take the weights from the encoder of the 1st AE, denoted by ${\boldsymbol{{w}}}^{enc}_r=(w^{enc}_{r,1},...,w^{enc}_{r,\bar{m}_c}), r=1,...,l$, and initialize the weights ${\boldsymbol{{w}}}^{\tilde{ee}}_r, r=1,...,l,$ of the joint embedding in A2 with these weights,
- We take the representation of the categorical features predicted by the 1st AE: ${\boldsymbol{{x}}}^{enc}=(x^{enc}_1,...,x^{enc}_l)^{\prime}$ with $x^{enc}_{r}=\langle{\boldsymbol{{w}}}^{enc}_r,{\boldsymbol{{x}}}^{cat}\rangle, r=1,...,l$, concatenate this vector with the vector of the numerical features $(x_{c+1},...,x_d)^{\prime}$ and create a new vector of numerical features ${\boldsymbol{{z}}}=(({\boldsymbol{{x}}}^{enc})^{\prime},x_{c+1},...,x_d)^{\prime}$,
- We build an autoencoder from Section 3.1 (denoted as the 2nd AE) for the numerical input ${\boldsymbol{{z}}}$. The dimension of the representation to be learned for the $(l+d-c)$-dimensional vector ${\boldsymbol{{z}}}$ is equal to $q_1$, where $q_1$ denotes the number of neurons used in the first hidden layer of the sub-network with M hidden layers, which is built for the input constructed by concatenating the representation from the joint embedding with the numerical features. To build a denoising autoencoder, we corrupt the numerical input with the gaussian or the zero transformation, see Section 3.1,
- We take the weights and the bias terms from the encoder of the 2nd AE and initialize the weights and the bias terms $b^1_r, {\boldsymbol{{w}}}^1_r, r=1,...,q_1,$ in the first hidden layer of the sub-network with M hidden layers with these weights and bias terms,
- All other weights are initialized with the Xavier initialization, and the bias terms are initialized with zero.

The initialization procedure applied here also clarifies why we were only interested in building autoencoders with one hidden layer in Section 3. For A2, and any initialization of its layers, we use the predictions:
(4.3)
\begin{eqnarray}\lambda_i=e^{\log(Exp_i)+\Theta_1^{M+1}\big((({\boldsymbol{{x}}}_i^{\tilde{ee}})^{\prime},x_{c+1,i},...,x_{d,i})^{\prime}\big)},\quad i=1,...,n.\end{eqnarray}
The autoencoders, which are trained without the information about the response, are only used to derive initial values of the neurons in the two layers of A2. These initial values are next fine-tuned by training the whole neural network to predict the target response. When training an autoencoder, we are only interested in extracting the most important discriminatory factors in the multi-dimensional input vector, which are next improved and optimally transformed by taking into account the target response. Since in this application the autoencoders are trained for a low number of epochs, in Experiment 1 we should only look at the results for epochs 15 and 100, which show clear advantages of the representation of categorical features learned with the Joint AE compared to the Separate AEs.
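The sketch of the weight transfer is given below. It assumes the hypothetical objects `ae_cat` (a fitted Joint AE whose linear bottleneck is named "encoder", as in the sketch in Section 3.2, fitted on the same one-hot input as the supervised network), `ae_num` (a fitted numerical autoencoder from Section 3.1 for the input z, with its tanh encoder layer also assumed to be named "encoder") and `model` (the A2 network sketched in Section 4.2); it illustrates the procedure above and is not the authors' code.

```r
library(keras)

# Joint embedding of A2 <- encoder of the 1st AE (no bias terms).
# In the paper, these encoder weights are first re-scaled so that the encoded
# values lie in [-1, 1]; see the re-scaling discussed below.
w_cat <- get_weights(get_layer(ae_cat, name = "encoder"))      # list(kernel)
set_weights(get_layer(model, name = "joint_embedding"), w_cat)

# First hidden layer of A2 <- encoder of the 2nd AE (weights and bias terms).
w_num <- get_weights(get_layer(ae_num, name = "encoder"))      # list(kernel, bias)
set_weights(get_layer(model, name = "hidden_1"), w_num)

# All other layers keep their Xavier/zero initialization; the whole network is
# then fine-tuned on the claim counts with the Poisson loss.
```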
The third step above, where we concatenate the numerical representation of the categorical features from the 1st AE with the other numerical features, deserves attention. We propose a modification of the pre-training strategy of layers with autoencoders which has not been considered by Erhan et al. (2009, 2010). It is known that the features fed into a neural network should live on the same scale in order to perform effective training of the network. We can easily control the numerical features and scale them to $[-1,1]$, which is done before the training process is started. However, we cannot expect the numerical representation of the categorical features learned with the 1st AE, that is, the values given by $x_r^{enc}=\langle{\boldsymbol{{w}}}^{enc}_r,{\boldsymbol{{x}}}^{cat}\rangle, r=1,...,l$, to yield predictions in $[-1,1]$. If the predictions from the encoder of the 1st AE live on a scale different from $[-1,1]$, which is the scale on which the numerical features live, then the input to the 2nd AE and the input to the first hidden layer of the sub-network with M hidden layers will contain features on different scales, and the training process of the neural networks may suffer from this inconsistency in scales. Fortunately, we can modify the weights and the bias terms of the encoder and the decoder of the 1st AE so that the reconstruction error is kept unchanged and the representation of the categorical features lives on the desired scale. This is possible due to the linear activation functions assumed and the bias terms chosen to be trained in the output layer of the autoencoder of type Joint AE, before the soft-max functions are applied. In the encoder part of the Joint AE, we re-define the weights:
\begin{eqnarray*}w^{enc}_{r,k}\mapsto w^{enc,*}_{r,k}=\frac{2}{\max_i\{x^{enc}_{r,i}\}-\min_i\{x^{enc}_{r,i}\}}w^{enc}_{r,k}-\frac{1}{c}\left(\frac{2\min_i\{x^{enc}_{r,i}\}}{\max_i\{x^{enc}_{r,i}\}-\min_i\{x^{enc}_{r,i}\}}+1\right),\end{eqnarray*}
for $r=1,...,l$ and $k=1,...,\bar{m}_c$. We can deduce that for these new representations we have $\langle{\boldsymbol{{w}}}^{enc,*}_r,{\boldsymbol{{x}}}_i^{cat}\rangle\in[-1,1]$ for all $r=1,...,l$ and all observations $i=1,...,n$. Since ${\boldsymbol{{x}}}_i^{cat}$ is always a vector with c elements equal to 1 and the remaining elements equal to zero, the constant term $-2\min_i\{x^{enc}_{r,i}\}/\big(\max_i\{x^{enc}_{r,i}\}-\min_i\{x^{enc}_{r,i}\}\big)-1$ from the min–max scaler transformation of the original predictions from the encoder $\langle{\boldsymbol{{w}}}^{enc}_r,{\boldsymbol{{x}}}_i^{cat}\rangle$, for each neuron r, can be absorbed by the new weights of the encoder by dividing the constant by c. Let $(b^{dec}_r, {\boldsymbol{{w}}}^{dec}_r)_{r=1}^{\bar{m}_c}$ denote the weights and the bias terms from the decoder. In the decoder part of the Joint AE, we now re-define:
\begin{eqnarray*}w^{dec}_{r,k}&\mapsto& w^{dec,*}_{r,k}=\frac{\max_i\{x^{enc}_{k,i}\}-\min_i\{x^{enc}_{k,i}\}}{2}w^{dec}_{r,k},\\[5pt] b^{dec}_{r}&\mapsto& b^{dec,*}_{r}=b^{dec}_r+\sum_{k=1}^l\Big(w^{dec,*}_{r,k}+\min_i\{x^{enc}_{k,i}\}w_{r,k}^{dec}\Big),\end{eqnarray*}
for $r=1,...,\bar{m}_c$ and $k=1,...,l$. We can conclude that the predictions in the output layer from the autoencoder with the modified weights and bias terms remain exactly the same as in the original autoencoder, hence the reconstruction error remains unchanged.
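A base R sketch of this re-scaling is given below. It assumes that the encoder and decoder kernels are stored in the usual dense-layer orientation (inputs in rows, units in columns); it illustrates the formulas above and is not the authors' code.

```r
# W_enc: mbar_c x l encoder kernel (column r = weights of encoder neuron r),
# W_dec: l x mbar_c decoder kernel, b_dec: decoder bias of length mbar_c,
# X_cat: n x mbar_c matrix of concatenated one-hot features (n_cat ones per row).
rescale_joint_ae <- function(W_enc, W_dec, b_dec, X_cat, n_cat) {
  Z  <- X_cat %*% W_enc                       # n x l encoder outputs <w_r^enc, x_i>
  lo <- apply(Z, 2, min); hi <- apply(Z, 2, max)
  sc    <- 2 / (hi - lo)
  shift <- 2 * lo / (hi - lo) + 1
  W_enc_new <- sweep(W_enc, 2, sc, "*") -
    matrix(shift / n_cat, nrow = nrow(W_enc), ncol = ncol(W_enc), byrow = TRUE)
  W_dec_new <- sweep(W_dec, 1, (hi - lo) / 2, "*")
  b_dec_new <- b_dec + colSums(W_dec_new + lo * W_dec)
  # All entries of X_cat %*% W_enc_new now lie in [-1, 1], while the decoder
  # output, and hence the reconstruction error, is unchanged.
  list(W_enc = W_enc_new, W_dec = W_dec_new, b_dec = b_dec_new)
}
```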
Since the bias terms are needed in the decoder to adjust the representation, in Section 3.2 we decided to train the bias terms in the decoder of the autoencoder for categorical data.

Let us conclude with remarks on our architectures A1–A2:

(a) We could initialize the representations of the categorical features in A1 with the weights from the encoders of the Separate AEs. Based on the results from Experiment 1, we expect that this type of initialization of A1 would not be an efficient way of improving the predictive power of neural networks, and we decided not to proceed with this approach in this paper. Moreover, training multiple autoencoders in unsupervised learning tasks for the initialization of a neural network for a supervised learning task would be time-consuming and would be unlikely to gain popularity in practical applications.

(b) Other architectures of neural networks are also possible. For example, we could consider an Architecture 3. First, the one-hot encoded categorical features are centered and linearly transformed with non-trainable mappings defined as in the MCA algorithm before the PCA algorithm is applied. Then, they could be treated as numerical data together with the other numerical features. Such an approach is proposed in the Factor Analysis of Mixed Data, see Pagès (2015) and Chavent et al. (2017). In other words, we could define the neurons in the first hidden layer of a neural network as linear transformations of linearly transformed one-hot encoded categorical features and numerical features. In Figure 6, we would remove the intermediate layer with grey neurons. We would only need the autoencoder for numerical data to pre-train the first hidden layer of the network, and the autoencoder for categorical data would not be needed at all. Based on the results from Experiment 1, we reject such an architecture because we believe that categorical data should be treated differently from numerical data. This view is also supported by the experiments presented by Brouwer (2004) and Yuan et al. (2020).

4.4. Experiment 2: the predictive power of A1 and A2

We study architectures and training processes of neural networks denoted by A1, A1_CANN, A2, A2_MCA, A2_1AE and A2_2AEs, where A1, A1_CANN and A2 are defined above and we introduce:

- A2_MCA — we only pre-train the joint embedding of A2 with a linear autoencoder, that is, we initialize the weights of the joint embedding for the categorical features with the weights from the encoder of a linear autoencoder built with the MCA algorithm. In our experiment, and also in general, we cannot apply a linear autoencoder built with the PCA algorithm as the 2nd AE since PCA only allows us to build under-complete autoencoders, whereas the number of neurons in the first hidden layer of a sub-network with M hidden layers is usually much larger than the dimension of the input to the layer — this remark can serve as an additional argument for using over-complete autoencoders for pre-training layers of neural networks rather than classical under-complete autoencoders,
- A2_1AE — we only pre-train the joint embedding of A2 with a non-linear autoencoder, that is, we initialize the weights of the joint embedding for the categorical features with the weights from the encoder of an autoencoder of type Joint AE. We use only one non-linear autoencoder since we want to directly compare MCA with a non-linear autoencoder,
We use only one non-linear autoencoder since we want to directly compare MCA with a non-linear autoencoder,

A2_2AEs — our main approach, in which we pre-train the joint embedding and the first hidden layer of A2 with two non-linear autoencoders, that is we initialize the weights of the joint embedding for the categorical features with parameters from the encoder of an autoencoder for categorical data of type Joint AE, and we initialize the weights and bias terms of the first hidden layer in the sub-network with M hidden layers with parameters from the encoder of an autoencoder for numerical data from Section 3.1.

A1_CANN is initialized with GLM1 from Schelldorfer and Wüthrich (2019), which is a Poisson GLM with log link function where the features in Table 1 are used as regressors and the categorical features are coded with dummy variables.

The dimension of the categorical input, which consists of the one-hot encoded categorical features, is equal to 54. We set the dimension of the representation of the categorical features to 8. For A1, we build separate representations of dimension 2 for Region and VehBrand and separate representations of dimension 1 for all other features — Area, VehPower, VehAge and DrivAge. This choice is compatible with Experiment 1 and with the choices made by Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020). For A2, we build a joint representation of dimension 8 for all categorical features. The dimension of the input to the first hidden layer, which consists of the numerical features and the numerical representation of the categorical features, is equal to 11, since we concatenate the representation of the categorical features learned with the embeddings with the three numerical features — BonusMalus, Density and VehGas. The number of neurons in the single hidden layer of the 1st AE is 8, as this number must coincide with the dimension of the representation of the categorical features for our supervised learning task. The number of neurons in the single hidden layer of the 2nd AE is equal to the number of neurons in the first hidden layer of the sub-network with M hidden layers. We consider sub-networks with $M=3$ hidden layers in our experiments below. We consider three possible choices for the numbers of neurons in the hidden layers in A2, similar to Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020), and we define the numbers of neurons for A1 so that the numbers of trainable parameters in A1 and A2 are equal, see Table 2.1 in Section 2 in the Online Supplement for the numbers of neurons.

Experiment 2 is conducted on the same 100,000 observations as Experiment 1. Since the predictive power of neural networks depends on their hyperparameters, in the first step of this experiment we perform hyperparameter optimization. With hyperparameter optimization, we also control over-fitting of the networks. We try to identify the best hyperparameters for the two autoencoders (the 1st AE and the 2nd AE) trained in an unsupervised process and the best hyperparameters for the neural networks with Architectures 1 and 2 trained in a supervised process. The hyperparameters optimized in the experiment, together with their best values, are presented in Section 2 in the Online Supplement, where the hyperparameter optimization process is described in detail. As a part of the hyperparameter optimization, we choose between denoising autoencoders and autoencoders without noise.
We point out that we prefer a denoising autoencoder for pre-training the joint embedding of A2, see Table 2.3 in Section 2 in the Online Supplement.

In the second step of this experiment, we study in more detail the predictive power of the best neural networks identified in the first step of the experiment for each architecture and training process. The set of 100,000 observations is split into a training, a validation and a test set in the proportions 3:1:1. We perform 100 calibrations for each best approach for A1, A1_CANN, A2, A2_MCA, A2_1AE and A2_2AEs. In each calibration, we train the autoencoders in unsupervised learning tasks, if required, and the neural network for the supervised learning task. The training process is the same as in the hyperparameter optimization. We train the networks on the training set by minimizing the Poisson loss, early stop the algorithm on the validation set and evaluate the predictive power of the trained networks by calculating the Poisson loss on the test set.

The box plots of the Poisson loss values on the test set in 100 calibrations are presented in Figure 7, and their key characteristics in Table 2. By initializing A1 with GLM1, we gain on average a small improvement in predictive power of 0.0052, and we increase the standard deviation of the loss from 0.0384 to 0.0708. In general, the predictive power of A1_CANN depends on the GLM used for the initialization of A1, and here we use one of the simplest GLMs investigated by Noll et al. (2019). As discussed by Schelldorfer and Wüthrich (2019), A1_CANN could only benefit from a very good initial GLM. If we switch from A1 to A2, then the average Poisson loss increases slightly, by 0.0095, and at the same time A2 has a standard deviation of the loss twice as high as A1. A1, A1_CANN and A2 are all close in terms of their predictive power, and we do not find strong evidence that A1_CANN and A2 are better than A1. In fact, we observe that the Poisson loss values achieved in the calibrations are more dispersed under A1_CANN and A2 than under A1. If we improve the training process of A2 by initializing its parameters with the parameters from the autoencoders, then the performance of A2 improves in terms of the predictive power, the standard deviation and the quantiles of the Poisson loss. By initializing the joint embedding of A2 with the linear autoencoder, we gain on average a small amount of predictive power of 0.0087. If we replace the linear autoencoder of type MCA with the non-linear denoising autoencoder of type Joint AE, then we observe on average a significant gain in predictive power of 0.0256 (A2_1AE vs. A2). This shows that linear autoencoders are not sufficient for pre-training layers of neural networks for supervised learning tasks and we have to rely on non-linear denoising autoencoders to initialize neural networks (it is not possible to use PCA for pre-training the first hidden layer of the sub-network with three hidden layers, and only the joint embedding can be pre-trained with MCA). If we pre-train A2 with the two autoencoders for the categorical and the numerical input, then we reduce the Poisson loss on average by 0.0620 (A2_2AEs vs. A2). When we move from A2_1AE to A2_2AEs, the improvement in the predictive power, on average 0.0364, is possible only if we re-scale the weights from the autoencoder for the categorical input before we train the autoencoder for the numerical input (see Section 4.3 for details on this step). Without this step, A2_2AEs would fail to provide superior results.
The size of the improvement in the predictive power when we switch from A2_1AE to A2_2AEs also depends on the choice of the autoencoder for the categorical input. Hence, the choice of the 1st AE (recall that we choose a denoising autoencoder) is important, even though the 2nd AE leads to a larger decrease in the average value of the Poisson loss. By pre-training A2 with our two autoencoders, we also decrease the standard deviation of the loss. Most importantly, we finally compare A2_2AEs with A1. We achieve an improvement in the Poisson loss on average of 0.0525 for A2_2AEs compared to A1. All reported quantiles are lower for A2_2AEs than for A1, and the distribution of the loss from A2_2AEs is shifted to the left compared to A1, but the standard deviation of the loss from A2_2AEs is slightly larger than the standard deviation of the loss from A1. We can conclude that our new architecture, with a joint embedding for all categorical features and initialized with parameters from (denoising) autoencoders, is better, in terms of its predictive power, than the classical architecture nowadays commonly used in actuarial data science, with separate entity embeddings for categorical features and random initialization of parameters. The improvement of the predictive power from 30.3950 to 30.3425 can indeed be interpreted as significant for this data set — Schelldorfer and Wüthrich (2019) demonstrate, for example, that the Poisson loss can decrease from 31.5064 to 31.4532 by optimizing the dimensions of the entity embeddings, or that the Poisson loss can decrease from 32.1490 to 32.10286 by boosting a GLM with one regressor transformed with a neural network (for BonusMalus, which achieves the largest improvement).

Figure 7. Distributions of the Poisson loss on the test set (for each network the dotted line represents the average loss in 100 calibrations).

Table 2. Distributions of the Poisson loss on the test set, their quantiles (q), average values (avg) and standard deviations (SD).

Statistics    A1        A1_CANN   A2        A2_MCA    A2_1AE    A2_2AEs
q0.05         30.3362   30.2875   30.3133   30.3142   30.2863   30.2767
q0.25         30.3684   30.3417   30.3481   30.3495   30.3322   30.3138
avg           30.3950   30.3898   30.4045   30.3958   30.3789   30.3425
q0.75         30.4130   30.4261   30.4338   30.4386   30.4157   30.3719
q0.95         30.4688   30.5101   30.5444   30.5104   30.4833   30.4126
SD            0.0384    0.0708    0.0758    0.0597    0.0624    0.0433

The bias of the predictors and the performance of auto-calibrated predictors are investigated in Section 3 in the Online Supplement.

5. Conclusions

We have presented a new approach to training neural networks with mixed categorical and numerical features for supervised learning tasks. We have illustrated that our new architecture of a network, with a joint embedding for all categorical features and network parameters properly initialized with parameters from (denoising) autoencoders learned in an unsupervised manner, performs better, in terms of the predictive power and the stability of the predictions, than the classical architecture used nowadays in actuarial data science, with separate entity embeddings for categorical features and random initialization of parameters. We hope that the results described in this paper will draw attention in actuarial data science to a new possible architecture of a neural network for supervised learning tasks and to the benefits of autoencoders in deriving representations of features for supervised learning tasks.
In fact, we are already aware of new (unpublished) experiments with autoencoders used for the initialization of neural networks for actuarial pricing, see Holvoet et al. (2022).

Beyond the results presented in the paper, there is one more advantage of our new architecture. As far as hyperparameter optimization is concerned, for the classical architecture we have to optimize the loss function with respect to multiple hyperparameters which describe the dimensions of the entity embeddings for the categorical features. In our new architecture, we only search for the optimal value of a single hyperparameter, which specifies the dimension of the joint embedding for all categorical features. Consequently, our new architecture enables faster and more convenient optimization of the dimension of the representation of the categorical features, see Section 4 in the Online Supplement. There is also a disadvantage of our approach. We lose the simple graphical interpretation of the joint embedding of the categorical features due to the higher dimension of the joint representation compared to the one- or two-dimensional representations of entity embeddings. We expect, however, that the rich methods of explainable AI can be used to interpret the impact of the categorical features, modelled with a joint embedding, on the response.

Finally, we could modify our approach by learning the autoencoders, used for pre-training layers of a network, jointly with the network for the target response, which uses the representations from the autoencoders as its input (at the additional cost of fine-tuning the weight between the unsupervised and the supervised loss). Such an approach is also postulated in the machine learning literature, see for example Ranzato and Szummer (2008) and Lei et al. (2018). This last remark reinforces the conclusion stated above that autoencoders should be included in the toolbox of data science actuaries who build predictive models.

The change in the architecture compared to the current approach is that all one-hot encoded categorical features are transformed with one embedding into a real-valued representation in a low-dimensional space, and the joint representation of the categorical features is then concatenated with the numerical features. Finally, we fine-tune the numerical representation of the categorical features from a proper autoencoder by training its weights together with all other parameters of the neural network. Hence, the autoencoder for the categorical data is only used to derive an initial representation of the categorical features. This approach is known in the machine learning literature as pre-training of layers with autoencoders, see Vincent et al. (2008), Erhan et al. (2010, 2009) and Vincent et al. (2010). Hence, we pre-train the joint embedding for categorical features in a neural network for a supervised learning task with our autoencoder for categorical data. From Erhan et al. (2009) and (2010), we know that pre-training other layers of the neural network is also beneficial, and this can be achieved with an autoencoder for numerical data. We use both autoencoders without noise and denoising autoencoders in this research.

The benefits of using (denoising) autoencoders for pre-training layers of neural networks have been demonstrated in the literature (see the papers referred to above). In comparison to these papers:

- We perform our experiments with categorical data, instead of binary data, which means that we use a different autoencoder and a different type of corruption process for the denoising autoencoder,
- We perform our experiments with Poisson distributed data and the Poisson deviance loss, which is the most common loss function in actuarial data science, instead of the mean square loss and the cross-entropy loss commonly used in the machine learning literature,
- We propose and validate a new architecture of a neural network, with a joint embedding for all categorical features, for supervised learning tasks. This has not been considered to date in actuarial data science,
- We propose to scale appropriately the representation of the categorical features from an autoencoder for categorical data before an autoencoder for numerical data is built to pre-train the first hidden layer of a neural network. This significantly improves the approach to supervised learning tasks with our new architecture,
- We compare various initialization techniques and we show that pre-training layers of a neural network with non-linear and over-complete/denoising autoencoders produces much better results than applications of classical linear and under-complete autoencoders without noise (MCA, PCA),
- We investigate the balance property, the bias and the stability of the predictions, which are crucial for actuarial pricing.

The main conclusion of this paper is that we can improve the current approach to modelling categorical features in supervised learning tasks, which uses separate entity embeddings, and the training algorithm, which randomly initializes the parameters of the neural network. The proposal is to change the architecture of the network by using a different numerical representation of the categorical features, learned with a joint embedding, and to initialize the layers of the network, in particular the joint embedding and the first hidden layer, with representations learned with (denoising) autoencoders in unsupervised learning tasks.

This paper is structured as follows.
In Section 2, we present the general setup for neural networks and our numerical experiments. In Section 3, we discuss autoencoders for categorical and numerical features. In Section 4, we focus on training neural networks with mixed categorical and numerical features. Details of our experiments and some additional results are presented in the Online Supplement. The R codes for training our categorical autoencoders are available on https://github.com/LukaszDelong/Autoencoders.

2. General setup

We assume that we have a data set consisting of $(y_i,{\boldsymbol{x}}_i)_{i=1}^n$, where $y_i$ describes the one-dimensional response for observation i and ${\boldsymbol{x}}_i=(x_{1,i},..,x_{j,i},...,x_{d,i})^{\prime}$ is a d-dimensional vector of features which characterizes the observation. We may omit the index i, which indicates the observation, and simply use $(y,{\boldsymbol{x}})$. The vector ${\boldsymbol{x}}$ consists of mixed categorical and numerical features. We assume that we have c categorical features and $d-c$ numerical features.

The categorical features are first one-hot encoded. Let $x_j$ denote a categorical feature with $m_j$ different labels $\left\{a^j_1,...,a^j_{m_j}\right\}$. This categorical feature is transformed into an $m_j$-dimensional vector with a single entry equal to one and all other entries equal to zero:
\begin{eqnarray*}
x_{j} \mapsto {\boldsymbol{x}}^{cat}_{j}=\big(x_{j_1},...,x_{j_{m_j}}\big)^{\prime}=\big(\mathbf{1}\{x_{j}=a^j_1\},...,\mathbf{1}\{x_{j}=a^j_{m_j}\}\big)^{\prime}\in {\mathbb{R}}^{m_j}.
\end{eqnarray*}
The dimension of the vector of features ${\boldsymbol{x}}=\big(({\boldsymbol{x}}^{cat})^{\prime},({\boldsymbol{x}}^{num})^{\prime}\big)^{\prime}=\big(({\boldsymbol{x}}_{1}^{cat})^{\prime},...,({\boldsymbol{x}}_{c}^{cat})^{\prime},x_{c+1},...,x_{d}\big)^{\prime}$ becomes $\sum_{j=1}^{c}m_j+d-c$. As far as the numerical features are concerned, we assume that each numerical feature takes its values in $[-1,1]$, that is min–max scaler transformations are applied to the numerical features on their original scale.
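The following base R snippet (my own illustration on hypothetical data) shows the two pre-processing steps just described: one-hot encoding of a categorical feature and min–max scaling of a numerical feature to $[-1,1]$.

```r
# One-hot encoding of a categorical feature x_j with labels a^j_1, ..., a^j_{m_j}
one_hot <- function(x) {
  labels <- sort(unique(x))
  X <- outer(x, labels, FUN = "==") * 1     # n x m_j matrix of 0/1 indicators
  colnames(X) <- as.character(labels)
  X
}

# Min-max scaler mapping a numerical feature to [-1, 1]
min_max <- function(x) 2 * (x - min(x)) / (max(x) - min(x)) - 1

# Hypothetical example
region  <- c("R11", "R24", "R11", "R53")
density <- c(120, 3500, 45, 800)
X_cat   <- one_hot(region)
x_num   <- min_max(log(density))            # Table 1 uses log-Density
```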
In general, we use neural networks with $M \in {\mathbb N}$ hidden layers and $q_m\in {\mathbb N}$ neurons in each hidden layer $m=1,\ldots,M$. The network layers are defined with the mappings:
(2.1)
\begin{eqnarray}
{\boldsymbol{z}} \in {\mathbb R}^{q_{m-1}}\;\mapsto\;\theta^{m}({\boldsymbol{z}})=\big(\theta^{m}_1({\boldsymbol{z}}),\ldots,\theta^{m}_{q_m}({\boldsymbol{z}})\big)^{\prime}\in {\mathbb R}^{q_m},\quad m=1,\ldots,M,
\end{eqnarray}
(2.2)
\begin{eqnarray}
{\boldsymbol{z}} \in {\mathbb R}^{q_{m-1}}\;\mapsto\;\theta^{m}_{r}({\boldsymbol{z}})=\chi^m\big(b_r^m+\langle{\boldsymbol{w}}^{m}_r,{\boldsymbol{z}}\rangle\big),\quad r=1,\ldots,q_m,
\end{eqnarray}
where $\chi^m\;:\;{\mathbb R} \to {\mathbb R}$ denotes an activation function, ${\boldsymbol{w}}^m_r \in {\mathbb R}^{q_{m-1}}$ denotes the network weights, $b^m_r \in {\mathbb R}$ denotes the bias term, and $\langle \cdot , \cdot \rangle$ denotes the scalar product in ${\mathbb R}^{q_{m-1}}$. By $q_0$ we denote the dimension of the input vector to the network. The mapping:
(2.3)
\begin{eqnarray}
{\boldsymbol{z}} \in {\mathbb R}^{q_{0}}\; \mapsto \Theta^{M+1}({\boldsymbol{z}})=\big(\Theta_1^{M+1}({\boldsymbol{z}}),\ldots,\Theta_{q_{M+1}}^{M+1}({\boldsymbol{z}})\big)^{\prime} \in {\mathbb R}^{q_{M+1}},
\end{eqnarray}
with a composition of the network layers $\theta^1, \ldots, \theta^M$, and the components:
\begin{eqnarray*}
{\boldsymbol{z}}\;\mapsto\;\Theta_r^{M+1}({\boldsymbol{z}})=b_r^{M+1}+\Big\langle{\boldsymbol{w}}_r^{M+1},\left(\theta^{M}\circ\cdots\circ\theta^1\right)({\boldsymbol{z}})\Big\rangle,\quad r=1,\ldots,q_{M+1},
\end{eqnarray*}
gives us the prediction from the network in the output layer $M+1$ of dimension $q_{M+1}$ based on the input vector ${\boldsymbol{z}}$. The output (2.3) returns the prediction with the linear activation function, and this prediction can be transformed with an appropriate non-trainable and non-linear mapping if this is required for an application. If we set $M=0$ in (2.1)–(2.2), then we assume that the input vector is just linearly transformed to give the prediction in the output layer of dimension $q_1$, and, in this case, the components in (2.3) are given by
\begin{eqnarray*}
{\boldsymbol{z}}\;\mapsto\;\Theta_r^{1}({\boldsymbol{z}})=b_r^{1}+\big\langle{\boldsymbol{w}}_r^{1},{\boldsymbol{z}}\big\rangle,\quad r=1,\ldots,q_{1}.
\end{eqnarray*}

In our numerical experiments, we use the data set freMTPL2freq, which is included in the R package CASdatasets. The data set has 678,013 observations from insurance policies. The response Y describes the number of claims per policy. Each policy has nine features and an exposure: $({\boldsymbol{x}},Exp)$. This data set is extensively studied by Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020) in the context of applications of generalized linear models and neural networks to modelling the number of claims. We perform the same data cleaning and feature pre-processing as in these papers. For the purpose of our experiments, we work with the features presented in Table 1.

Table 1. Features used in our experiments.

6 categorical features    2 numerical features          1 binary feature
Area — 6 levels           BonusMalus (capped at 150)    VehGas
VehPower — 6 levels       log-Density
VehAge — 3 levels
DrivAge — 7 levels
VehBrand — 11 levels
Region — 22 levels

We consider a supervised learning task where the goal is to predict the number of claims for a policyholder characterized by $({\boldsymbol{x}},Exp)$ by estimating the regression function ${\mathbb{E}}[Y|{\boldsymbol{x}},Exp]$. The prediction is constructed with the neural network described above, and the one-dimensional output from the network (2.3) is transformed with the non-trainable and non-linear exponential transformation:
(2.4)
\begin{eqnarray}
{\mathbb{E}}[Y|{\boldsymbol{x}},Exp]=e^{\log(Exp)+\Theta_1^{M+1}({\boldsymbol{x}})}.
\end{eqnarray}
The parameters of the network are trained by minimizing the Poisson deviance loss function, see for example Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020). Our supervised learning task is solved with the help of unsupervised learning tasks where autoencoders are used.
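As a small illustration of (2.4) and of the training criterion, the following base R sketch (my own, with hypothetical vector names) computes the claim-frequency prediction with the exposure offset and the average Poisson deviance between observed counts and predictions.

```r
# Prediction (2.4): the exposure enters as an offset on the log scale
predict_claims <- function(theta_output, exposure) {
  exp(log(exposure) + theta_output)          # E[Y | x, Exp]
}

# Average Poisson deviance between observed counts y and predictions lambda
# (the y * log(y / lambda) term is taken to be 0 when y = 0)
poisson_deviance <- function(y, lambda) {
  dev_i <- 2 * (lambda - y + ifelse(y > 0, y * log(y / lambda), 0))
  mean(dev_i)
}

# Hypothetical example
y        <- c(0, 1, 0, 2)
exposure <- c(0.5, 1, 0.25, 1)
theta    <- c(-2.3, -1.9, -2.6, -1.2)        # network outputs Theta_1^{M+1}(x)
lambda   <- predict_claims(theta, exposure)
poisson_deviance(y, lambda)
```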
3. Autoencoders

Let ${\boldsymbol{x}}$ denote a vector of (categorical, numerical, mixed) features of dimension p. An autoencoder consists of two functions:
\begin{eqnarray*}
\varphi\;:\; {\mathbb{R}}^p \mapsto {\mathbb{R}}^l,\quad \text{and} \quad \psi\;:\; {\mathbb{R}}^l \mapsto {\mathbb{R}}^p.
\end{eqnarray*}
The mapping $\varphi$ is called the encoder, and $\psi$ is called the decoder. The mapping ${\boldsymbol{x}}\mapsto\varphi({\boldsymbol{x}})$ from the encoder gives an l-dimensional representation of the p-dimensional vector ${\boldsymbol{x}}$. The mapping ${\boldsymbol{z}}\mapsto\psi({\boldsymbol{z}})$ from the decoder tries to reconstruct the p-dimensional vector ${\boldsymbol{x}}$ from its l-dimensional representation ${\boldsymbol{z}}=\varphi({\boldsymbol{x}})$. We define the reconstruction function as
\begin{eqnarray*}
\pi=\psi\circ\varphi \;:\;{\mathbb{R}}^p\mapsto{\mathbb{R}}^p.
\end{eqnarray*}
For a data set with observations $({\boldsymbol{x}}_i)_{i=1}^n$, the goal is to find the functions $\varphi$ and $\psi$ such that the reconstruction error
\begin{eqnarray*}
\frac{1}{n}\sum_{i=1}^{n}L(\pi({\boldsymbol{x}}_i),{\boldsymbol{x}}_i),
\end{eqnarray*}
measured with a loss function L, is minimized. If we can find an autoencoder for which the reconstruction error is small, then we can claim that the encoder extracts the most important information from a multi-dimensional vector of features. Consequently, we can use the representation $\varphi({\boldsymbol{x}})$, instead of ${\boldsymbol{x}}$, as input to predict the response in our supervised learning task. The observed response y is not used in this approach when we train an autoencoder. We train autoencoders in a fully unsupervised fashion, but we will improve the representation based on the target y when we solve the supervised learning task.

Linear autoencoders are well known in statistics. By a linear autoencoder, we mean an autoencoder where both the functions $\varphi$ and $\psi$ are linear. Classical examples of linear autoencoders include autoencoders built with Principal Component Analysis for numerical data and Multiple Correspondence Analysis for categorical data. We refer for example to Chapter 6.2 in Dixon et al. (2020) for the equivalence between the linear autoencoder built by minimizing the mean square reconstruction loss function and the representation built with the PCA algorithm. For MCA and its relation to PCA, we refer for example to Pagès (2015) and Chavent et al. (2017).

In this paper, we are interested in non-linear autoencoders where at least one of the functions $\varphi$ or $\psi$ is non-linear. We use the notation (2.1)–(2.2) from the previous section. To build an autoencoder for the input ${\boldsymbol{x}}$, we use a neural network with one hidden layer, that is $M=1$. The dimension of the single hidden layer is set to $q_1=l$, and the dimensions of the input and the output are set to $q_0=q_2=p$. The activation function for the hidden layer depends on the type of data and is discussed in the sequel. The vector $(\theta^1_1({\boldsymbol{x}}),...,\theta^1_l({\boldsymbol{x}}))^{\prime}$ gives us the representation of the input ${\boldsymbol{x}}$ from the encoder. The vector $(\Theta^2_1({\boldsymbol{x}}),...,\Theta^2_p({\boldsymbol{x}}))^{\prime}$, transformed with a non-trainable and non-linear function if required for the data, gives us the reconstruction of the input ${\boldsymbol{x}}$ predicted with the decoder.
This means that $\psi$ also includes a non-trainable and non-linear transformation of the output (2.3) from the network if such a transformation is required for the application. Clearly, we could also build deep autoencoders with more hidden layers, but shallow autoencoders with one hidden layer are sufficient for our main application in Section 4.

If $l<p$, we construct under-complete autoencoders and we reduce the dimension of the input ${\boldsymbol{x}}$. Linear autoencoders built with the PCA and MCA algorithms are examples of under-complete autoencoders. If we choose $l=p$, then we can achieve a zero reconstruction error by learning the identity mapping. Interestingly, we can also learn over-complete autoencoders with $l>p$, and denoising autoencoders are examples of over-complete autoencoders. In order to construct a denoising autoencoder with $l>p$, we corrupt the input to the network. The objective for training a denoising autoencoder is to find the functions $\varphi$ and $\psi$ such that the reconstruction error
\begin{eqnarray*}
\frac{1}{n}\sum_{i=1}^{n}L(\pi(\tilde{{\boldsymbol{x}}}_i),{\boldsymbol{x}}_i),
\end{eqnarray*}
measured with a loss function L, is minimized. This time, the input $\tilde{{\boldsymbol{x}}}$ is the corrupted version of ${\boldsymbol{x}}$, which is constructed by adding noise to ${\boldsymbol{x}}$. It has been demonstrated in the machine learning literature that denoising autoencoders are very good at extracting the most important information from a multi-dimensional vector of features, see for example Vincent et al. (2008) and (2010). We can also construct over-complete autoencoders using data without noise if a low number of epochs is used for training the autoencoder built with a neural network.

In the next two sections, we discuss autoencoders for numerical and categorical features.

3.1. Autoencoders for numerical features

As discussed in the Introduction, autoencoders without noise for numerical features have been investigated in various actuarial applications. In this paper, we adopt the approach from Rentzmann and Wüthrich (2019). We use the hyperbolic tangent activation function in the hidden layer ($\chi^1$), reconstruct the input using the linear prediction:
(3.1)
\begin{eqnarray}
{\boldsymbol{x}}\in{\mathbb{R}}^p\mapsto\hat{{\boldsymbol{x}}}=\pi({\boldsymbol{x}})=\left(\Theta_1^2({\boldsymbol{x}}),...,\Theta_p^2({\boldsymbol{x}})\right)^{\prime}\in{\mathbb{R}}^p,
\end{eqnarray}
and use the mean square error loss function L to measure the reconstruction error between the prediction $\hat{{\boldsymbol{x}}}=\pi({\boldsymbol{x}})$ and the input ${\boldsymbol{x}}$. We build a non-linear autoencoder since we use a non-linear activation function (the hyperbolic tangent) in the hidden layer. In contrast to Rentzmann and Wüthrich (2019), we allow for bias terms in the network since we use the min–max scaler transformation of the numerical features instead of the zero mean and unit variance standardization. An example of the architecture of a neural network used in this paper to build an autoencoder for numerical features is presented in Figure 1.

Figure 1. Architecture of the autoencoder for numerical features used in the paper.
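A minimal sketch of such a numerical autoencoder, written with the R keras interface (my own illustration, not the authors' published code; the dimensions p and l are placeholders), could look as follows: the hidden layer uses the hyperbolic tangent, the output layer is linear with bias terms, and the mean square error is minimized.

```r
library(keras)

p <- 11   # dimension of the numerical input (placeholder)
l <- 20   # dimension of the representation (placeholder; l > p gives an over-complete AE)

input   <- layer_input(shape = p)
encoded <- input %>% layer_dense(units = l, activation = "tanh", name = "encoder")
decoded <- encoded %>% layer_dense(units = p, activation = "linear")

num_ae <- keras_model(inputs = input, outputs = decoded)
num_ae %>% compile(optimizer = "adam", loss = "mse")

# X_num: an n x p matrix of numerical features scaled to [-1, 1];
# for an autoencoder without noise the input and the target coincide:
# num_ae %>% fit(X_num, X_num, epochs = 15, batch_size = 128)
```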
As far as denoising autoencoders are concerned, we apply two types of corruption processes to distort the input, see for example Vincent et al. (2008), (2010):

- Gaussian disturbance (gaussian): for each observation $i=1,...,n$ and each numerical feature in the vector ${\boldsymbol{x}}_i$, the original input is corrupted with the transformation $x_{j,i}\mapsto\tilde{x}_{j,i}\sim N(x_{j,i},\sigma^2)$.
- Masking to zero (zero): for each observation $i=1,...,n$ and a fraction v of the numerical features in the vector ${\boldsymbol{x}}_i$ chosen at random, the original input is corrupted with the transformation $x_{j,i}\mapsto\tilde{x}_{j,i}=0$.
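These two corruption processes can be sketched in a few lines of base R (my own illustration; sigma and v are the noise parameters mentioned above).

```r
# Gaussian disturbance: add independent N(0, sigma^2) noise to every numerical feature
corrupt_gaussian <- function(X_num, sigma) {
  X_num + matrix(rnorm(length(X_num), mean = 0, sd = sigma), nrow = nrow(X_num))
}

# Masking to zero: for each observation, set a random fraction v of the features to 0
corrupt_zero <- function(X_num, v) {
  X_tilde <- X_num
  n_mask  <- ceiling(v * ncol(X_num))
  for (i in seq_len(nrow(X_num))) {
    X_tilde[i, sample(ncol(X_num), n_mask)] <- 0
  }
  X_tilde
}
```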
3.2. Autoencoders for categorical features

We consider two types of architecture of autoencoders for categorical features. Neither has been explored in the actuarial literature, although they, and versions of them, appear in many applications of machine learning methods in various fields.

1. Separate autoencoders for each feature (Separate AEs): For a categorical feature $x_j$ with $m_j$ different labels and its one-hot representation ${\boldsymbol{x}}_j^{cat}=(x_{j_1},...,x_{j_{m_j}})^{\prime}$, we build a neural network (2.1)–(2.2) with $M=1, q_0=m_j, q_1=l_j, q_2=m_j$, where $l_j$ is the required dimension of the representation of the categorical feature. Since we use the one-hot representation of $x_j$ as the input to the network, there is no need to train bias terms in the hidden layer, so we set $b^1_r=0$ for $r=1,...,l_j$. However, it is still beneficial to train bias terms in the output layer in order to match the output expressed with probabilities (see below). The linear activation function for $\chi^1$ in the hidden layer is a natural choice here since the linear mappings $\langle{\boldsymbol{w}}^{1}_r,{\boldsymbol{x}}^{cat}_j\rangle$, for neurons $r=1,...,l_j$, yield unique constants for each label of the categorical feature, so there is no need to apply non-linear transformations to these constants. We reconstruct the input using the prediction:
(3.2)
\begin{eqnarray}
{\boldsymbol{x}}_j^{cat}\in{\mathbb{R}}^{m_j}\mapsto\hat{{\boldsymbol{x}}}^{cat}_j=\pi({\boldsymbol{x}}_j^{cat})=\big(\pi_1({\boldsymbol{x}}_j^{cat}),...,\pi_{m_j}({\boldsymbol{x}}_j^{cat})\big)^{\prime}\in{\mathbb{R}}^{m_j},
\end{eqnarray}
where
\begin{eqnarray*}
\pi_r({\boldsymbol{x}}_j^{cat})=\frac{e^{\Theta^2_r({\boldsymbol{x}}_j^{cat})}}{\sum_{u=1}^{m_j}e^{\Theta^2_u({\boldsymbol{x}}_j^{cat})}},\quad r=1,...,m_j.
\end{eqnarray*}
The soft-max activation function is applied to the output from the network (2.3) to derive the reconstructed input. The reconstruction function returns the probabilities that the reconstructed feature takes a particular label. The label with the highest predicted probability is the label predicted for the reconstructed feature. Since we now deal with a classification problem for the single categorical feature $x_j$, it is natural to use the cross-entropy loss function L to measure the reconstruction error between the prediction $\hat{{\boldsymbol{x}}}=\pi({\boldsymbol{x}})$ and the input ${\boldsymbol{x}}$:
(3.3)
\begin{eqnarray}
L(\pi({\boldsymbol{x}}^{cat}_{j,i}),{\boldsymbol{x}}^{cat}_{j,i})=-\sum_{r=1}^{m_j}x^{cat}_{j_{r},i}\log\!\big(\pi_r({\boldsymbol{x}}^{cat}_{j,i})\big),\quad i=1,...,n.
\end{eqnarray}
We build a non-linear autoencoder since we use a non-linear activation function (the soft-max function) in the output layer. The approach described above is applied to all categorical features in the data set. An example of the architecture of a neural network used in this paper to build the autoencoder of type Separate AEs for categorical features (with 2 and 3 labels) is presented in Figure 2.

2. Joint autoencoder for all features (Joint AE): We consider a vector of categorical features $(x_1,...,x_c)$ with $(m_1,...,m_c)$ different labels and their one-hot representations ${\boldsymbol{x}}^{cat}=\big(({\boldsymbol{x}}_1^{cat})^{\prime},...,({\boldsymbol{x}}_c^{cat})^{\prime}\big)^{\prime}$. Let $\bar{m}_0=0$ and $\bar{m}_j=\sum_{u=1}^j m_u$ for $j=1,...,c$. This time we build a neural network (2.1)–(2.2) with $M=1, q_0=\bar{m}_c, q_1=l, q_2=\bar{m}_c$, where l is the dimension of the required joint representation of all categorical features. We still set $b^1_r=0$ for $r=1,...,l$, train bias terms in the output layer and apply the linear activation function for $\chi^1$. We reconstruct the input using the prediction:
(3.4)
\begin{eqnarray}
{\boldsymbol{x}}^{cat}\in{\mathbb{R}}^{\bar{m}_c}\mapsto\hat{{\boldsymbol{x}}}^{cat}=\pi({\boldsymbol{x}}^{cat})&=&\big(\pi_1({\boldsymbol{x}}^{cat}),...,\pi_{\bar{m}_1}({\boldsymbol{x}}^{cat}), ...,\nonumber\\[5pt]
&&\pi_{\bar{m}_{j-1}+1}({\boldsymbol{x}}^{cat}),...,\pi_{\bar{m}_j}({\boldsymbol{x}}^{cat}),...,\nonumber\\[5pt]
&&\pi_{\bar{m}_{c-1}+1}({\boldsymbol{x}}^{cat}),...,\pi_{\bar{m}_c}({\boldsymbol{x}}^{cat})\big)^{\prime}\in{\mathbb{R}}^{\bar{m}_c},
\end{eqnarray}
where
\begin{eqnarray*}
\pi_r({\boldsymbol{x}}^{cat})=\frac{e^{\Theta^2_r({\boldsymbol{x}}^{cat})}}{\sum_{u=\bar{m}_{j-1}+1}^{\bar{m}_j}e^{\Theta^2_u({\boldsymbol{x}}^{cat})}},\quad r=\bar{m}_{j-1}+1,...,\bar{m}_j,\quad j=1,...,c,
\end{eqnarray*}
and $\pi_r({\boldsymbol{x}}^{cat})$, for $r=\bar{m}_{j-1}+1,...,\bar{m}_j$, return the probabilities that the categorical feature $x_j$ takes a particular label among its $m_j$ labels. The prediction of the label for $x_j$ is the label with the highest predicted probability among the $\pi_r({\boldsymbol{x}}^{cat})$. Clearly, we build a non-linear autoencoder. We remark that the soft-max activation functions are now applied to groups of neurons in the output layer of the network (2.3) which correspond to the labels of the categorical features. Hence, the decoder here returns probabilities in classification problems for all categorical features. This time all neurons in the layers of the autoencoder (before the soft-max transformations are applied) share the parameters of one neural network. By applying the Separate AEs, we independently solve multiple classification problems for our categorical features with separate autoencoders, whereas by applying the Joint AE, we jointly solve multiple classification problems for our categorical features with one autoencoder. Such an approach is called multi-task learning in machine learning, see for example Caruana (1997) and Ruder (2017).
We use the cross-entropy loss function L to measure the reconstruction error between the prediction $\hat{{\boldsymbol{x}}}=\pi({\boldsymbol{x}})$ and the input ${\boldsymbol{x}}$:
(3.5)
\begin{eqnarray}
L(\pi({\boldsymbol{x}}_i^{cat}),{\boldsymbol{x}}_i^{cat})=-\sum_{j=1}^c\sum_{r=1}^{m_j}x^{cat}_{j_{r},i}\log\!\big(\pi_r({\boldsymbol{x}}^{cat}_i)\big),\quad i=1,...,n.
\end{eqnarray}
An example of the architecture of type Joint AE is presented in Figure 3.

Figure 2. Architecture of the autoencoder of type Separate AEs for categorical features.

Figure 3. Architecture of the autoencoder of type Joint AE for categorical features.

In order to build denoising autoencoders, we apply the following corruption processes for categorical features. For each observation $i=1,...,n$ and a fraction v of the categorical features in the vector ${\boldsymbol{x}}_i$ chosen at random, the original input is corrupted with one of the transformations:

- Sampling a new label (sample): the original input is corrupted with the transformation $x_{j,i}\mapsto\tilde{x}_{j,i}\sim \hat{F}_{x_j}$ and one-hot encoded with $\tilde{x}_{j,i}\mapsto \tilde{{\boldsymbol{x}}}^{cat}_{j,i}$, where $\hat{F}_{x_j}$ is the empirical distribution of the feature $x_j$ in the data set. This corruption process can be seen as an extension of the salt-and-pepper noise for binary data to categorical data, see for example Vincent et al. (2008, 2010) for the salt-and-pepper noise for binary data.
- Masking to zero (zero): the original input and its one-hot encoding are corrupted with the transformation ${\boldsymbol{x}}^{cat}_{j,i}\mapsto\tilde{{\boldsymbol{x}}}^{cat}_{j,i}=\mathbf{0}^{\prime}$, where $\mathbf{0}$ is a vector of zeros. This corruption process is an analogue of the technique of masking applied to numerical features in Section 3.1.

We conclude this section with some remarks on the types of architecture of our autoencoders for categorical data:

(a) The approach with the Separate AEs has at least two disadvantages compared to the Joint AE. First, we have to train a number of autoencoders equal to the number of categorical features, which may be time-consuming. Secondly, and more importantly, we neglect possible dependencies between different categorical features when creating representations with separate and independent autoencoders. The second disadvantage is explored in Experiment 1 below. We consider the approach with the Separate AEs as a benchmark since it gives us a representation of categorical data which matches the representation of categorical data learned with entity embeddings in supervised learning tasks.

(b) If the categorical features are binary features, then our approach with the Joint AE is aligned with the approach for binary data used by Vincent et al. (2008, 2010) in their experiments. For binary data, autoencoders which coincide with our Joint AE are also used in Generative Adversarial Imputation Nets, see for example Yoon et al. (2018).

(c) Hespe (2020) recommends a multi-task learning autoencoder for categorical data which agrees with our Joint AE. He also describes single-task learning autoencoders learned with loss functions different from the cross-entropy.
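As an illustration of the Joint AE, the following sketch in the R keras interface is my own (it is not the authors' published code; the label counts are taken from Table 1 and the embedding dimension l is a placeholder). The group-wise soft-max in the output layer is implemented here as one soft-max head per categorical feature, all sharing the linear code layer, which is mathematically equivalent to a single output layer with soft-max applied to groups of neurons.

```r
library(keras)

# Label counts of the six categorical features in Table 1 (Area, VehPower,
# VehAge, DrivAge, VehBrand, Region) and a placeholder embedding dimension
m <- c(6, 6, 3, 7, 11, 22)
l <- 8

x_cat <- layer_input(shape = sum(m), name = "x_cat_onehot")

# encoder: linear, no bias terms in the hidden layer
code <- x_cat %>%
  layer_dense(units = l, activation = "linear", use_bias = FALSE, name = "encoder")

# decoder: one soft-max head per categorical feature (bias terms are trained)
heads <- lapply(seq_along(m), function(j) {
  code %>% layer_dense(units = m[j], activation = "softmax",
                       name = paste0("decoder_feature_", j))
})

joint_ae <- keras_model(inputs = x_cat, outputs = heads)
joint_ae %>% compile(optimizer = "adam",
                     loss = as.list(rep("categorical_crossentropy", length(m))))

# Training sketch: X_cat is the n x sum(m) one-hot matrix and X_blocks a list of
# the per-feature one-hot blocks; for a denoising AE the input would be a
# corrupted version of X_cat while the targets stay uncorrupted.
# joint_ae %>% fit(X_cat, X_blocks, epochs = 15, batch_size = 128)
```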
3.3. Experiment 1: the reconstruction ability of autoencoders

We compare the following four autoencoders for categorical data:

- Separate AEs,
- Joint AE,
- MCA — we build a linear autoencoder with the classical MCA algorithm, that is, instead of training a neural network, we apply the Generalized Singular Value Decomposition (GSVD) to a matrix with centered one-hot encoded categorical features, see Pagès (2015) and Chavent et al. (2017),
- MCA as non-linear PCA — we build a non-linear autoencoder for numerical data, the one described in Section 3.1, on linearly transformed one-hot encoded categorical features. From Pagès (2015) and Chavent et al. (2017), MCA is PCA on centered one-hot encoded categorical data transformed with linear mappings (GSVD). Instead of building a linear autoencoder, which is equivalent to the PCA algorithm, on linearly transformed centered one-hot encoded categorical features, we build a non-linear autoencoder with the hyperbolic tangent activation function in the single hidden layer by minimizing the mean square reconstruction error.

From the data set freMTPL2freq with 678,013 observations, we sample 100,000 observations. We work with a smaller data set to speed up the calculations. We limit our attention to categorical features and we consider the six categorical features from Table 1. Our data set with 100,000 observations is next split randomly into five data sets with 20,000 observations each. We build our autoencoders on each of these five sets and report the average metric over these five sets evaluated on the training set. As the metric, we use the cosine similarity measure, but the findings also hold, for example, for the number of correct predictions. In this experiment, we only build under-complete autoencoders without noise, as this is sufficient to derive the key conclusions. We train our autoencoders with 15, 100 and 500 epochs. We do not differentiate between a training, a validation and a test set (we do not discuss possible over-fitting) since we are only interested in evaluating the reconstruction errors of the autoencoders. Details are presented in Section 1 in the Online Supplement.

The dimension of the data matrix with the one-hot encoded categorical features is 54. We consider a range of dimensions of the representation of the categorical features: $q_1=l=6,8,10,12,15,20,30$. For the Separate AEs, we have to specify the number of neurons $l_j$ (the dimension of the representation) for each feature j. We assume that the number of neurons l, which defines the global dimension of the representation for all categorical features, is split across the individual categorical features evenly, if possible, and if not possible, a larger number of neurons is allocated to the features with larger numbers of labels. For example, if we choose $l=6$, then we build representations of dimension 1 for each feature; if we choose $l=12$, then we build representations of dimension 2 for each feature; but if we choose $l=8$, then we build representations of dimension 2 for Region and VehBrand (these two features have the two largest numbers of labels in the data set) and representations of dimension 1 for the remaining features.
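The paper does not spell out the exact form of the metric, so the following base R helper is my own sketch of the standard cosine similarity between an observation and its reconstruction, averaged over the data set.

```r
# Cosine similarity between an input vector and its reconstruction
cosine_similarity <- function(x, x_hat) {
  sum(x * x_hat) / (sqrt(sum(x^2)) * sqrt(sum(x_hat^2)))
}

# Average cosine similarity over all rows of the one-hot matrix X and its
# reconstruction X_hat (e.g. the probabilities returned by a categorical AE)
avg_cosine_similarity <- function(X, X_hat) {
  mean(vapply(seq_len(nrow(X)),
              function(i) cosine_similarity(X[i, ], X_hat[i, ]),
              numeric(1)))
}
```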
We present the results in Figure 4. As expected, the cosine similarity increases with the number of neurons and the number of epochs. For the large number of epochs (500), for which we achieve the smallest reconstruction errors for all our autoencoders in terms of the loss functions minimized in the training process, the autoencoders Separate AEs and Joint AE are very similar in terms of their reconstruction power measured with the cosine similarity, and they are much better than the remaining two autoencoders. The first conclusion confirms that categorical data have different intrinsic properties than numerical data, which are exploited when a low-dimensional representation is built with an autoencoder, and categorical data should not be compressed with algorithms derived for numerical data (MCA is just PCA on linearly transformed data). The second conclusion is that for the low and the medium numbers of epochs (15, 100), and especially for the low number of epochs, the performance of the Joint AE is superior in terms of its ability to reconstruct the input from a low-dimensional representation. In particular, our experiment shows that there are dependencies between the categorical features in the data set which are efficiently captured by the Joint AE in the initial epochs (15 epochs) of the learning process of the autoencoder, and which cannot be captured by learning independent Separate AEs. Intuitively, dependencies between categorical features should allow the Joint AE to learn more robust and informative representations of categorical features, and the Joint AE should lead to better reconstruction errors compared to the Separate AEs. For the low number of epochs (15) and a low dimension of the representation (6, 8, 10), the Joint AE is very similar to the MCA, but the performance of the Joint AE improves quickly when we increase the number of epochs. Clearly, we can benefit from non-linear autoencoders when the purpose is to derive informative representations of categorical features.

Figure 4. Cosine similarity measures for autoencoders for categorical data.

As discussed in Section 3, if we can find an autoencoder for which the reconstruction error is small, then we can say that the encoder extracts the most important information from a multi-dimensional vector of features. Our example points out that representations of categorical features built with the Separate AEs may not be optimal in terms of their robustness and informativeness, especially if we do not want to spend much time on training autoencoders with a large number of epochs. It is known that the predictive power of neural networks and their generalization properties in supervised learning tasks depend on providing a good representation of the available information for its efficient pre-processing in the hidden layers before the final prediction of the response is constructed with the output. Since the Joint AE performs better than the Separate AEs in terms of providing a more robust and informative representation of categorical features, we may prefer to use the numerical representation of categorical features implied by the Joint AE, rather than the Separate AEs, as the input to neural networks built for supervised learning tasks. However, in all practical examples in actuarial data science to date, the numerical representation of categorical features which is fed into the hidden layers of a neural network matches the representation from the Separate AEs. We have to use a different type of architecture of a neural network to use the representation from the Joint AE.
This experiment may serve as a motivating example for what we present in the sequel.

4. Training neural networks with mixed categorical and numerical features

We now move to the main topic of this paper. Below, we discuss different approaches to training neural networks with mixed categorical and numerical features in supervised learning tasks. These approaches differ in the architecture of the neural network and in the initialization of the parameters of the neural network.

4.1. Architecture A1 with separate entity embeddings

Let us start by recalling the concept of an entity embedding developed by Guo and Berkhahn (2016). An entity embedding for a categorical feature $x_j$ is a neural network which maps the categorical feature $x_j$, with its one-hot representation ${\boldsymbol{x}}_j^{cat}$, into a vector of dimension $l_j$:
\begin{eqnarray*}
{\boldsymbol{x}}^{cat}_{j}\in{\mathbb{R}}^{m_j} \mapsto {\boldsymbol{x}}^{ee}_{j}=(x^{ee}_{j_1},...,x^{ee}_{j_{l_j}})^{\prime}\in{\mathbb{R}}^{l_j},
\end{eqnarray*}
where
\begin{eqnarray*}
x^{ee}_{j_r}=\langle{\boldsymbol{w}}^{ee}_r,{\boldsymbol{x}}^{cat}_j\rangle,\quad r=1,...,l_j.
\end{eqnarray*}
With an entity embedding, each label from the set of $m_j$ possible labels $\{a_1,...,a_{m_j}\}$ of the categorical feature $x_j$ can be represented with a vector in the space ${\mathbb{R}}^{l_j}$. The parameter $l_j$ is the dimension of the embedding for the categorical feature $x_j$.

In Figure 5, we provide an example of the architecture of a neural network with mixed categorical and numerical features used in supervised learning tasks in actuarial data science. This architecture uses entity embeddings for categorical features and has been promoted by Richman (2021), Noll et al. (2019) and Ferrario et al. (2020). We present a simple example with two categorical features ${\boldsymbol{x}}^{cat}_1, {\boldsymbol{x}}^{cat}_2$, with 3 and 2 levels, and two numerical features $x_3$ and $x_4$. For ${\boldsymbol{x}}^{cat}_1$, we implement the entity embedding of dimension 2, and for ${\boldsymbol{x}}^{cat}_2$ the entity embedding of dimension 1. More generally, within (2.1)–(2.2), we define a neural network with Architecture 1 (A1), see also the sketch after this list:

- For each categorical feature $x_j$, $j=1,..,c$, we build an entity embedding — a sub-network without hidden layers, that is $M=0$, where the input ${\boldsymbol{z}}={\boldsymbol{x}}^{cat}_j$, $q_0=m_j$ and the output $q_1=l_j$,
- Once all one-hot encoded categorical features are transformed with the linear mappings of the entity embeddings, the outputs from the entity embeddings, that is the numerical representations of the categorical features, are concatenated with the numerical features to yield a new numerical vector of all features. This new vector is fed as the input into another sub-network with M hidden layers,
- We build a sub-network with M hidden layers with neurons $q_1,...,q_M$ and the hyperbolic tangent activation functions $\chi^1,...,\chi^M$ in the hidden layers, where the input ${\boldsymbol{z}}=\big(({\boldsymbol{x}}^{ee}_1)^{\prime},...,({\boldsymbol{x}}^{ee}_c)^{\prime},x_{c+1},...,x_d\big)^{\prime}$ and $q_0=\sum_{j=1}^cl_j+d-c$,
- All weights of the network (including the weights of the entity embeddings) are initialized with values sampled from uniform distributions with the Xavier initialization, see Glorot and Bengio (2010), and the bias terms are initialized with zero.

Figure 5. Architecture of type A1 with separate entity embeddings.
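A minimal sketch of A1 for the toy example of Figure 5 (two categorical features with 3 and 2 levels, two numerical features), written with the R keras interface; this is my own illustration, not the authors' code, and the numbers of hidden neurons are placeholders.

```r
library(keras)

# categorical inputs are passed as integer label indices (0-based for keras)
x1    <- layer_input(shape = 1, name = "x1_cat")   # 3 levels
x2    <- layer_input(shape = 1, name = "x2_cat")   # 2 levels
x_num <- layer_input(shape = 2, name = "x_num")    # two numerical features

ee1 <- x1 %>% layer_embedding(input_dim = 3, output_dim = 2) %>% layer_flatten()
ee2 <- x2 %>% layer_embedding(input_dim = 2, output_dim = 1) %>% layer_flatten()

net <- layer_concatenate(list(ee1, ee2, x_num)) %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 5,  activation = "tanh") %>%
  layer_dense(units = 1,  activation = "linear")

model_a1 <- keras_model(inputs = list(x1, x2, x_num), outputs = net)
# for brevity the exposure offset of (2.4) is omitted here; the A2 sketch further
# below shows one way it can be added
model_a1 %>% compile(optimizer = "adam", loss = "poisson")
```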
The goal of this paper is to challenge A1 with a new architecture and a new training process of a neural network. The results from Experiment 1 provide us with arguments regarding how we could change A1. We can now clearly observe that the numerical representations of the categorical features learned with the entity embeddings in A1 match, in their architectures, the numerical representations learned with the Separate AEs. From Section 3.3, we conclude that we could replace the numerical representations of the categorical features in A1 with the representation learned with the Joint AE. This leads us to introduce Architecture 2.

4.2. Architecture A2 with joint embedding

Instead of applying separate entity embeddings to each categorical feature, we now use a joint embedding for all categorical features. A joint embedding is understood here as a neural network with the following mapping:
\begin{eqnarray*}
{\boldsymbol{x}}^{cat}=(({\boldsymbol{x}}^{cat}_1)^{\prime},...,({\boldsymbol{x}}^{cat}_c)^{\prime})^{\prime}\in{\mathbb{R}}^{\bar{m}_c} \mapsto {\boldsymbol{x}}^{\tilde{ee}}=(x^{\tilde{ee}}_{1},...,x^{\tilde{ee}}_{l})^{\prime}\in{\mathbb{R}}^{l},
\end{eqnarray*}
where
\begin{eqnarray*}
x^{\tilde{ee}}_{r}=\langle{\boldsymbol{w}}^{\tilde{ee}}_r,{\boldsymbol{x}}^{cat}\rangle,\quad r=1,...,l.
\end{eqnarray*}
The parameter l is the dimension of the embedding for all categorical features $(x_1,...,x_c)$. We expect that $l<l_1+...+l_c$.

Our new architecture of a neural network with mixed categorical and numerical features, where the categorical features are modelled with a joint embedding, is presented in Figure 6. For ${\boldsymbol{x}}^{cat}_1,{\boldsymbol{x}}^{cat}_2$, we implement a joint embedding of dimension 3 — this is the only, but also a significant, difference between the architectures in Figures 5 and 6. Within (2.1) and (2.2), we define a neural network with Architecture 2 (A2):

- For all categorical features $(x_1,...,x_c)$, we build a joint embedding — a neural sub-network without hidden layers, that is $M=0$, where the input ${\boldsymbol{z}}={\boldsymbol{x}}^{cat}$, $q_0=\bar{m}_c$ and the output $q_1=l$,
- The next steps of building the network for predicting the response are the same as for A1,
- We initialize all weights with the Xavier initialization and set the bias terms equal to zero, as for A1.

Figure 6. Architecture of type A2 with joint embedding.

We can observe that the numerical representation of the categorical features learned with the joint embedding in A2 matches, in its architecture, the representation learned with the Joint AE. We have already discussed the advantages of this representation in unsupervised learning tasks, and they should also hold in supervised learning tasks. In addition, we can expect that by learning a joint embedding for all categorical features, we allow all categorical features, not only the labels of a single categorical feature, to share information about their impact on the response. As a result, we should be able to improve predictions of the response based on the experience collected from similar categorical features and their similar labels. Hence, the switch from A1 to A2 has intuitive foundations. To the best of our knowledge, A2 has not been considered to date in any actuarial data science problem.
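The following R keras sketch (again my own illustration with placeholder numbers of neurons, not the authors' code) shows A2 with the dimensions used later in Experiment 2: a 54-dimensional one-hot categorical input mapped by a joint embedding of dimension 8, concatenated with the three numerical features and fed into $M=3$ hidden layers; the exposure enters as a log-offset as in (2.4), and the model is trained with the Poisson loss.

```r
library(keras)

x_cat   <- layer_input(shape = 54, name = "x_cat_onehot")
x_num   <- layer_input(shape = 3,  name = "x_num")
log_exp <- layer_input(shape = 1,  name = "log_exposure")

# joint embedding for all categorical features (linear, no bias terms)
joint_ee <- x_cat %>%
  layer_dense(units = 8, activation = "linear", use_bias = FALSE,
              name = "joint_embedding")

# sub-network with M = 3 hidden layers (placeholder numbers of neurons)
net <- layer_concatenate(list(joint_ee, x_num)) %>%
  layer_dense(units = 20, activation = "tanh", name = "hidden_1") %>%
  layer_dense(units = 15, activation = "tanh", name = "hidden_2") %>%
  layer_dense(units = 10, activation = "tanh", name = "hidden_3") %>%
  layer_dense(units = 1,  activation = "linear")

# add the log-exposure offset and apply the exponential output transformation (2.4)
response <- layer_add(list(net, log_exp)) %>%
  layer_activation(activation = "exponential")

model_a2 <- keras_model(inputs = list(x_cat, x_num, log_exp), outputs = response)
model_a2 %>% compile(optimizer = "adam", loss = "poisson")
```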
In addition, we can expect that by learning a joint embedding for all categorical features, we allow all categorical features, not only the labels of a single categorical feature, to share information about their impact on the response. As a result, we should be able to improve predictions of the response based on the experience collected from similar categorical features and their similar labels. Hence, the switch from A1 to A2 has intuitive foundations. To the best of our knowledge, A2 has not been considered to date in any actuarial data science problem.

4.3. Initialization of A1 and A2

The issue of initialization of parameters of neural networks has been noticed in actuarial data science. Under A1, Merz and Wüthrich (2019) and Schelldorfer and Wüthrich (2019) propose the Combined Actuarial Neural Network (CANN) approach to initialize a neural network with predictions from a GLM; we call this architecture and training process A1_CANN. The idea is to add a skip connection to the output from the network with architecture A1. In mathematical terms, in A1 we use the prediction:
(4.1)
\begin{eqnarray}
\lambda_i=e^{\log(Exp_i)+\Theta_1^{M+1}\big((({\boldsymbol{{x}}}^{ee}_{1,i})^{\prime},...,({\boldsymbol{{x}}}^{ee}_{c,i})^{\prime},x_{c+1,i},...,x_{d,i})^{\prime}\big)},\quad i=1,...,n,
\end{eqnarray}
whereas in A1_CANN we use the prediction:
(4.2)
\begin{eqnarray}
\lambda_i=e^{\log(Exp_i)+\eta^{GLM}_i+\Theta_1^{M+1}\big((({\boldsymbol{{x}}}^{ee}_{1,i})^{\prime},...,({\boldsymbol{{x}}}^{ee}_{c,i})^{\prime},x_{c+1,i},...,x_{d,i})^{\prime}\big)},\quad i=1,...,n,
\end{eqnarray}
where $\eta^{GLM}_i$ denotes the prediction, on the linear scale, of the unit intensity (for exposure equal to one) from a Poisson GLM with a log link for observation $i$. A minimal code sketch of this skip connection is given below, before we describe our initialization procedure step by step.

In A1 and A2, we initialize the weights of the embeddings for the categorical features with the Xavier initialization. However, since autoencoders extract important information about features, we could initialize the weights of the embeddings with the weights from the encoder of the appropriate autoencoder and define the weights of the embeddings as non-trainable in the training process. This is reasonable, but may be sub-optimal for a supervised learning task, since the representation of the categorical features learned with an autoencoder, without any information about the response, would be kept fixed. To improve the representation from an autoencoder, we should fine-tune it in a supervised learning task with a target response. In the machine learning literature, Erhan et al. (2010, 2009) propose to pre-train layers of neural networks with denoising autoencoders, that is, to initialize neurons in layers of a neural network for a supervised learning task with representations of the neurons from denoising autoencoders built in unsupervised learning tasks for the input to the layers. We recover and modify their approach in this paper.

Apart from changing the architecture from A1 to A2, we initialize the weights and the bias terms in the joint embedding for the categorical features and in the first hidden layer of A2 with the weights and the bias terms from the representations of the neurons in the layers learned with autoencoders. From Erhan et al. (2010, 2009), we know that the initialization procedure with autoencoders gives the largest gains in predictive power of a neural network when it is applied to the initial layers of the network.
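Returning briefly to the CANN benchmark, the following minimal sketch illustrates the prediction (4.2). It is our own simplified illustration: the embedded input is treated as given, the hidden-layer sizes are illustrative, and $\log(Exp_i)$ and $\eta^{GLM}_i$ are assumed to be pre-computed and passed in as data columns.

# Sketch only: CANN prediction (4.2) with log-exposure and GLM offsets.
from tensorflow import keras

log_exp = keras.Input(shape=(1,), name="log_exposure")    # log(Exp_i)
eta_glm = keras.Input(shape=(1,), name="eta_glm")         # eta_i^GLM from the fitted Poisson GLM
z = keras.Input(shape=(11,), name="z")                     # embedded categorical + numerical features

h = z
for q in (20, 15, 10):                                     # illustrative sizes of the M = 3 hidden layers
    h = keras.layers.Dense(q, activation="tanh")(h)
nn_out = keras.layers.Dense(1, name="nn_linear_scale")(h)  # network contribution on the linear scale

lam = keras.layers.Activation("exponential", name="lambda")(
    keras.layers.Add()([log_exp, eta_glm, nn_out]))
cann = keras.Model(inputs=[log_exp, eta_glm, z], outputs=lam)
cann.compile(optimizer="adam", loss="poisson")
# Dropping the eta_glm term in the Add layer recovers the plain predictions
# (4.1) for A1 and (4.3) for A2.

Because $\eta^{GLM}$ enters as a fixed offset, training only has to learn a correction around the GLM prediction, which is the initialization idea behind A1_CANN.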
We proceed in the following way:

- We build an autoencoder of type Joint AE (denoted the 1st AE) for the categorical input $(x_1,...,x_c)$ using its one-hot representation ${\boldsymbol{{x}}}^{cat}=\big(({\boldsymbol{{x}}}^{cat}_1)^{\prime},...,({\boldsymbol{{x}}}^{cat}_c)^{\prime}\big)^{\prime}$. To build a denoising autoencoder, we corrupt the categorical input with the sample or the zero transformation, see Section 3.2,
- We take the weights from the encoder of the 1st AE, denoted by ${\boldsymbol{{w}}}^{enc}_r=(w^{enc}_{r,1},...,w^{enc}_{r,\bar{m}_c})$, $r=1,...,l$, and initialize the weights ${\boldsymbol{{w}}}^{\tilde{ee}}_r$, $r=1,...,l$, of the joint embedding in A2 with these weights,
- We take the representation of the categorical features predicted by the 1st AE, ${\boldsymbol{{x}}}^{enc}=(x^{enc}_1,...,x^{enc}_l)^{\prime}$ with $x^{enc}_{r}=\langle{\boldsymbol{{w}}}^{enc}_r,{\boldsymbol{{x}}}^{cat}\rangle$, $r=1,...,l$, concatenate this vector with the vector of the numerical features $(x_{c+1},...,x_d)^{\prime}$ and create a new vector of numerical features ${\boldsymbol{{z}}}=(({\boldsymbol{{x}}}^{enc})^{\prime},x_{c+1},...,x_d)^{\prime}$,
- We build an autoencoder from Section 3.1 (denoted the 2nd AE) for the numerical input ${\boldsymbol{{z}}}$. The dimension of the representation to be learned for the $(l+d-c)$-dimensional vector ${\boldsymbol{{z}}}$ is equal to $q_1$, where $q_1$ denotes the number of neurons used in the first hidden layer of the sub-network with $M$ hidden layers, which is built for the input constructed by concatenating the representation from the joint embedding with the numerical features. To build a denoising autoencoder, we corrupt the numerical input with the Gaussian or the zero transformation, see Section 3.1,
- We take the weights and the bias terms from the encoder of the 2nd AE and initialize the weights and the bias terms $b^1_r, {\boldsymbol{{w}}}^1_r$, $r=1,...,q_1$, of the first hidden layer in the sub-network with $M$ hidden layers with these weights and bias terms,
- All other weights are initialized with the Xavier initialization and the bias terms are initialized with zero.

The initialization procedure applied here also clarifies why we were only interested in building autoencoders with one hidden layer in Section 3. For A2, and any initialization of its layers, we use the predictions:
(4.3)
\begin{eqnarray}
\lambda_i=e^{\log(Exp_i)+\Theta_1^{M+1}\big((({\boldsymbol{{x}}}_i^{\tilde{ee}})^{\prime},x_{c+1,i},...,x_{d,i})^{\prime}\big)},\quad i=1,...,n.
\end{eqnarray}
The autoencoders, which are trained without the information about the response, are only used to derive initial values of the neurons in the two layers of A2. These initial values are next fine-tuned by training the whole neural network to predict the target response. When training an autoencoder, we are only interested in extracting the most important discriminatory factors in the multi-dimensional input vector, which are next improved and optimally transformed by taking into account the target response. Since in this application the autoencoders are trained for a low number of epochs, in Experiment 1 we should only look at the results for epochs 15 and 100, which show clear advantages of the representation of categorical features learned with the Joint AE compared to the Separate AEs. A minimal code sketch of this two-stage pre-training procedure is given below.
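To fix ideas, the following Python sketch outlines the two-stage pre-training described above. It is a simplified illustration, not the paper's implementation: the corruption rates, the stand-in sigmoid/cross-entropy decoder for the 1st AE (the paper's Joint AE uses a linear decoder with trainable bias followed by one soft-max per categorical feature, see Section 3.2), the tanh encoder of the 2nd AE, and the layer names of the A2 model are all our assumptions.

# Sketch only: pre-train the joint embedding and the first hidden layer of A2
# with two denoising autoencoders (1st AE: categorical block, 2nd AE: numerical block).
import numpy as np
from tensorflow import keras

def fit_first_ae(x_cat, l, epochs=15):
    """1st AE for the one-hot categorical input; linear, bias-free encoder of width l."""
    m_bar = x_cat.shape[1]
    x_noisy = x_cat * (np.random.rand(*x_cat.shape) > 0.1)           # 'zero' corruption (assumed rate)
    inp = keras.Input(shape=(m_bar,))
    code = keras.layers.Dense(l, use_bias=False, name="encoder")(inp)
    out = keras.layers.Dense(m_bar, activation="sigmoid", name="decoder")(code)  # stand-in decoder
    ae = keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="binary_crossentropy")
    ae.fit(x_noisy, x_cat, epochs=epochs, batch_size=256, verbose=0)
    return ae.get_layer("encoder").get_weights()[0]                   # shape (m_bar, l)

def fit_second_ae(z, q1, epochs=15):
    """2nd AE for the numerical vector z; tanh encoder of width q1 (over-complete)."""
    z_noisy = z + 0.1 * np.random.randn(*z.shape)                     # 'gaussian' corruption (assumed scale)
    inp = keras.Input(shape=(z.shape[1],))
    code = keras.layers.Dense(q1, activation="tanh", name="encoder")(inp)
    out = keras.layers.Dense(z.shape[1], name="decoder")(code)
    ae = keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(z_noisy, z, epochs=epochs, batch_size=256, verbose=0)
    return ae.get_layer("encoder").get_weights()                      # [kernel (l+d-c, q1), bias (q1,)]

def pretrain_A2(model_A2, x_cat, x_num, l=8, q1=20):
    w_enc = fit_first_ae(x_cat, l)                                    # steps 1 and 2
    # (the re-scaling of the codes to [-1, 1], discussed below and sketched there
    #  in numpy, is omitted in this sketch)
    z = np.concatenate([x_cat @ w_enc, x_num], axis=1)                # step 3
    w1, b1 = fit_second_ae(z, q1)                                     # step 4
    model_A2.get_layer("joint_embedding").set_weights([w_enc])        # copy into A2 (step 2)
    model_A2.get_layer("hidden_1").set_weights([w1, b1])              # copy into A2 (step 5)
    # all remaining layers keep their Xavier / zero initialization (step 6)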
The third step above, where we concatenate the numerical representation of the categorical features from the 1st AE with the other numerical features, deserves attention. We propose a modification of the pre-training strategy of layers with autoencoders which has not been considered by Erhan et al. (2010, 2009). It is known that the features fed into a neural network should live on the same scale in order to perform effective training of the network. We can easily control the numerical features and scale them to $[-1,1]$, which is done before the training process is started. However, we cannot expect the numerical representation of the categorical features learned with the 1st AE, that is the values given by $x_r^{enc}=\langle{\boldsymbol{{w}}}^{enc}_r,{\boldsymbol{{x}}}^{cat}\rangle$, $r=1,...,l$, to yield predictions in $[-1,1]$. If the predictions from the encoder of the 1st AE live on a scale different from $[-1,1]$, which is the scale where the numerical features live, then the input to the 2nd AE and the input to the first hidden layer of the sub-network with $M$ hidden layers will contain features on different scales, and the training process of the neural networks may suffer from this inconsistency in scales. Fortunately, we can modify the weights and the bias terms of the encoder and the decoder of the 1st AE so that the reconstruction error is kept unchanged and the representation of the categorical features lives on the desired scale. This is possible due to the linear activation functions assumed, and the bias terms chosen to be trained, in the output layer of the autoencoder of type Joint AE, before the soft-max functions are applied. In the encoder part of the Joint AE, we re-define the weights:
\begin{eqnarray*}
w^{enc}_{r,k}\mapsto w^{enc,*}_{r,k}=\frac{2}{\max_i\{x^{enc}_{r,i}\}-\min_i\{x^{enc}_{r,i}\}}w^{enc}_{r,k}-\frac{1}{c}\left(\frac{2\min_i\{x^{enc}_{r,i}\}}{\max_i\{x^{enc}_{r,i}\}-\min_i\{x^{enc}_{r,i}\}}+1\right),
\end{eqnarray*}
for $r=1,...,l$ and $k=1,...,\bar{m}_c$. We can deduce that for these new representations we have $\langle{\boldsymbol{{w}}}^{enc,*}_r,{\boldsymbol{{x}}}_i^{cat}\rangle\in[-1,1]$ for all $r=1,...,l$ and all observations $i=1,...,n$. Since ${\boldsymbol{{x}}}_i^{cat}$ is always a vector with $c$ elements equal to 1 and the remaining elements equal to zero, the constant term $-2\min_i\{x^{enc}_{r,i}\}/\big(\max_i\{x^{enc}_{r,i}\}-\min_i\{x^{enc}_{r,i}\}\big)-1$ from the min–max scaler transformation of the original predictions from the encoder $\langle{\boldsymbol{{w}}}^{enc}_r,{\boldsymbol{{x}}}_i^{cat}\rangle$, for each neuron $r$, can be absorbed by the new weights of the encoder by dividing the constant by $c$. Let $(b^{dec}_r, {\boldsymbol{{w}}}^{dec}_r)_{r=1}^{\bar{m}_c}$ denote the weights and the bias terms of the decoder. In the decoder part of the Joint AE, we now re-define:
\begin{eqnarray*}
w^{dec}_{r,k}&\mapsto& w^{dec,*}_{r,k}=\frac{\max_i\{x^{enc}_{k,i}\}-\min_i\{x^{enc}_{k,i}\}}{2}w^{dec}_{r,k},\\[5pt]
b^{dec}_{r}&\mapsto& b^{dec,*}_{r}=b^{dec}_r+\sum_{k=1}^l\Big(w^{dec,*}_{r,k}+\min_i\{x^{enc}_{k,i}\}w_{r,k}^{dec}\Big),
\end{eqnarray*}
for $r=1,...,\bar{m}_c$ and $k=1,...,l$. We can conclude that the predictions in the output layer of the autoencoder with the modified weights and bias terms remain exactly the same as in the original autoencoder, hence the reconstruction error remains unchanged. This re-scaling is sketched in code below.
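The re-scaling is straightforward to implement. The following numpy sketch uses our own array conventions, an encoder kernel of shape $(\bar{m}_c, l)$ and a decoder kernel of shape $(l, \bar{m}_c)$, and corresponds to the re-scaling step omitted in the pre-training sketch above.

# Sketch only: min-max re-scale the Joint AE codes to [-1, 1] and adjust the
# decoder so that the reconstruction, and hence the reconstruction error, is unchanged.
import numpy as np

def rescale_joint_ae(w_enc, w_dec, b_dec, x_cat, c):
    """w_enc: (m_bar, l) encoder kernel, w_dec: (l, m_bar) decoder kernel,
    b_dec: (m_bar,) decoder bias, x_cat: (n, m_bar) one-hot design matrix,
    c: number of categorical features, i.e. the number of ones in each row of x_cat."""
    x_enc = x_cat @ w_enc                                   # (n, l) original codes
    lo, hi = x_enc.min(axis=0), x_enc.max(axis=0)           # per-neuron min and max over observations
    # encoder: scale each code and spread the additive constant over the c ones of a row
    w_enc_star = w_enc * (2.0 / (hi - lo)) - (2.0 * lo / (hi - lo) + 1.0) / c
    # decoder: absorb the inverse affine map of the codes into the kernel and the bias
    w_dec_star = w_dec * ((hi - lo) / 2.0)[:, None]
    b_dec_star = b_dec + (w_dec_star + lo[:, None] * w_dec).sum(axis=0)
    return w_enc_star, w_dec_star, b_dec_star

# check (up to floating-point error): the new codes x_cat @ w_enc_star lie in [-1, 1]
# and the pre-soft-max reconstructions are unchanged:
#   np.allclose(x_cat @ w_enc @ w_dec + b_dec, x_cat @ w_enc_star @ w_dec_star + b_dec_star)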
Since the bias terms are needed in the decoder to adjust the representation, in Section 3.2 we decided to train the bias terms in the decoder of the autoencoder for categorical data.

Let us conclude with remarks on our architectures A1–A2:

(a) We could initialize the representations of the categorical features in A1 with the weights from the encoders of the Separate AEs. Based on the results from Experiment 1, we expect that this type of initialization of A1 would not be an efficient solution for improving the predictive power of neural networks, and we decided not to proceed with this approach in this paper. Moreover, training multiple autoencoders in unsupervised learning tasks for the initialization of a neural network for a supervised learning task would be time-consuming and would be unlikely to gain popularity in practical applications.

(b) Other architectures of neural networks are also possible. For example, we could consider Architecture 3. First, the one-hot encoded categorical features are centered and linearly transformed with non-trainable mappings defined as in the MCA algorithm before the PCA algorithm is applied. Then, they could be treated as numerical data together with the other numerical features. Such an approach is proposed in the Factor Analysis of Mixed Data, see Pagès (2015) and Chavent et al. (2017). In other words, we could define neurons in the first hidden layer of a neural network as linear transformations of linearly transformed one-hot encoded categorical features and numerical features. In Figure 6, we would remove the intermediate layer with grey neurons. We would only need the autoencoder for numerical data to pre-train the first hidden layer of the network, and the autoencoder for the categorical data would not be needed at all. Based on the results from Experiment 1, we reject such an architecture because we believe that categorical data should be treated differently from numerical data. This view is also supported by the experiments presented by Brouwer (2004) and Yuan et al. (2020).

4.4. Experiment 2: the predictive power of A1 and A2

We study architectures and training processes of neural networks denoted by A1, A1_CANN, A2, A2_MCA, A2_1AE and A2_2AEs, where A1, A1_CANN and A2 are defined above and we introduce:

- A2_MCA: we only pre-train the joint embedding of A2 with a linear autoencoder, that is, we initialize the weights of the joint embedding for the categorical features with the weights from the encoder of a linear autoencoder built with the MCA algorithm. In our experiment, and also in general, we cannot apply a linear autoencoder built with the PCA algorithm as the 2nd AE, since PCA only allows us to build under-complete autoencoders, whereas the number of neurons in the first hidden layer of a sub-network with $M$ hidden layers is usually much larger than the dimension of the input to that layer. This remark can serve as an additional argument for using over-complete autoencoders for pre-training layers of neural networks rather than classical under-complete autoencoders,
- A2_1AE: we only pre-train the joint embedding of A2 with a non-linear autoencoder, that is, we initialize the weights of the joint embedding for the categorical features with the weights from the encoder of an autoencoder of type Joint AE. We use only one non-linear autoencoder here because we want a direct comparison of MCA with a non-linear autoencoder,
- A2_2AEs: our main approach, in which we pre-train the joint embedding and the first hidden layer of A2 with two non-linear autoencoders, that is, we initialize the weights of the joint embedding for the categorical features with parameters from the encoder of an autoencoder for categorical data of type Joint AE, and we initialize the weights and bias terms of the first hidden layer in the sub-network with $M$ hidden layers with parameters from the encoder of an autoencoder for numerical data from Section 3.1.

A1_CANN is initialized with GLM1 from Schelldorfer and Wüthrich (2019), which is a Poisson GLM with a log link function, where the features in Table 1 are used as regressors and the categorical features are coded with dummy variables.

The dimension of the categorical input, which consists of the one-hot encoded categorical features, is equal to 54. We set the dimension of the representation of the categorical features to 8. For A1, we build separate representations of dimension 2 for Region and VehBrand and separate representations of dimension 1 for all other categorical features: Area, VehPower, VehAge and DrivAge. This choice is compatible with Experiment 1 and with the choice made by Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020). For A2, we build a joint representation of dimension 8 for all categorical features. The dimension of the input to the first hidden layer, which consists of the numerical features and the numerical representation of the categorical features, is equal to 11, since we concatenate the representation of the categorical features learned with the embeddings with the three numerical features: BonusMalus, Density and VehGas. The number of neurons in the single hidden layer of the 1st AE is 8, as this number must coincide with the dimension of the representation of the categorical features for our supervised learning task. The number of neurons in the single hidden layer of the 2nd AE is equal to the number of neurons in the first hidden layer of the sub-network with $M$ hidden layers. We consider sub-networks with $M=3$ hidden layers in our experiments below. We consider three possible choices for the numbers of neurons in the hidden layers of A2, similar to Noll et al. (2019), Schelldorfer and Wüthrich (2019) and Ferrario et al. (2020), and we define the numbers of neurons for A1 so that the numbers of trainable parameters in A1 and A2 are equal, see Table 2.1 in Section 2 in the Online Supplement for the numbers of neurons.

Experiment 2 is conducted on the same 100,000 observations as Experiment 1. Since the predictive power of neural networks depends on their hyperparameters, in the first step of this experiment we perform hyperparameter optimization. With hyperparameter optimization, we also control over-fitting of the networks. We try to identify the best hyperparameters for the two autoencoders (the 1st AE and the 2nd AE) trained in an unsupervised process and the best hyperparameters for the neural networks with Architectures 1 and 2 trained in a supervised process. The hyperparameters optimized in the experiment, together with their best values, are presented in Section 2 in the Online Supplement, where the hyperparameter optimization process is described in detail. As part of the hyperparameter optimization, we choose between denoising autoencoders and autoencoders without noise. We point out that we prefer a denoising autoencoder for pre-training the joint embedding of A2, see Table 2.3 in Section 2 in the Online Supplement. A short reference sketch of the Poisson loss is given below.
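Throughout Experiment 2, and in Figure 7 and Table 2 below, model fit is measured by the Poisson loss on a hold-out set. As a reference point, the following numpy sketch shows the average Poisson deviance we have in mind; the exact definition and scaling of the loss reported in Table 2 follow the paper's earlier definition and may differ from this sketch by constant factors or units.

# Sketch only: average Poisson deviance between observed claim counts y and
# fitted expected claim numbers lam_hat (exposure already included in lam_hat).
import numpy as np

def poisson_deviance(y, lam_hat):
    y = np.asarray(y, dtype=float)
    lam_hat = np.asarray(lam_hat, dtype=float)
    ratio = np.where(y > 0, y / lam_hat, 1.0)     # convention: y * log(y / lam) = 0 when y = 0
    dev = lam_hat - y + y * np.log(ratio)
    return 2.0 * dev.mean()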
In the second step of this experiment, we study in more detail the predictive power of the best neural networks identified in the first step of the experiment for each architecture and training process. The set of 100,000 observations is split into a training, a validation and a test set in the proportions 3:1:1. We perform 100 calibrations for each best approach for A1, A1_CANN, A2, A2_MCA, A2_1AE and A2_2AEs. In each calibration, we train the autoencoders in unsupervised learning tasks, if required, and then the neural network for the supervised learning task. The training process is the same as in the hyperparameter optimization. We train the networks on the training set by minimizing the Poisson loss, early stop the algorithm on the validation set and evaluate the predictive power of the trained networks by calculating the Poisson loss on the test set.

The box plots of the Poisson loss values on the test set in the 100 calibrations are presented in Figure 7, and their key characteristics in Table 2. By initializing A1 with GLM1, we gain on average a small improvement in predictive power of 0.0052, and we increase the standard deviation of the loss from 0.0384 to 0.0708. In general, the predictive power of A1_CANN depends on the GLM used for the initialization of A1, and here we use one of the simplest GLMs investigated by Noll et al. (2019). As discussed by Schelldorfer and Wüthrich (2019), A1_CANN could only benefit from a very good initial GLM. If we switch from A1 to A2, then the predictive power of the network increases slightly, on average by 0.0095, but at the same time A2 has a standard deviation of the loss twice as high as A1. A1, A1_CANN and A2 are all close in terms of their predictive power, and we do not find strong evidence that A1_CANN and A2 are better than A1. In fact, we observe that the Poisson loss values achieved in the calibrations are more dispersed under A1_CANN and A2 than under A1. If we improve the training process of A2 by initializing its parameters with the parameters from the autoencoders, then the performance of A2 improves in terms of the predictive power, the standard deviation and the quantiles of the Poisson loss. By initializing the joint embedding of A2 with the linear autoencoder, we gain on average a small amount of predictive power of 0.0087. If we replace the linear autoencoder of type MCA with the non-linear denoising autoencoder of type Joint AE, then we observe on average a significant gain in predictive power of 0.0256 (A2_1AE vs. A2). This shows that linear autoencoders are not sufficient for pre-training layers of neural networks for supervised learning tasks and that we have to rely on non-linear denoising autoencoders to initialize neural networks (it is not possible to use PCA for pre-training the first hidden layer of the sub-network with three hidden layers, and only the joint embedding can be pre-trained with MCA). If we pre-train A2 with the two autoencoders for the categorical and the numerical input, then we reduce the Poisson loss on average by 0.0620 (A2_2AEs vs. A2). When we move from A2_1AE to A2_2AEs, the improvement in predictive power, on average 0.0364, is possible only if we re-scale the weights from the autoencoder for the categorical input before we train the autoencoder for the numerical input (see Section 4.3 for details on this step). Without this re-scaling step, A2_2AEs would fail to provide superior results.
The size of the improvement in predictive power when we switch from A2_1AE to A2_2AEs also depends on the choice of the autoencoder for the categorical input. Hence, the choice of the 1st AE (recall that we choose a denoising autoencoder) is important, even though the 2nd AE leads to a larger decrease in the average value of the Poisson loss. By pre-training A2 with our two autoencoders, we also decrease the standard deviation of the loss. Most importantly, we finally compare A2_2AEs with A1. We achieve an improvement in the Poisson loss on average of 0.0525 for A2_2AEs compared to A1. All reported quantiles are lower for A2_2AEs than for A1, and the distribution of the loss from A2_2AEs is shifted to the left compared to A1, but the standard deviation of the loss from A2_2AEs is slightly larger than the standard deviation of the loss from A1. We can conclude that our new architecture, with a joint embedding for all categorical features and initialized with parameters from (denoising) autoencoders, is better, in terms of its predictive power, than the classical architecture nowadays commonly used in actuarial data science, with separate entity embeddings for categorical features and random initialization of parameters. The improvement of the predictive power from 30.3950 to 30.3425 can indeed be interpreted as significant for this data set: Schelldorfer and Wüthrich (2019) demonstrate, for example, that the Poisson loss can decrease from 31.5064 to 31.4532 by optimizing the dimensions of the entity embeddings, or from 32.1490 to 32.10286 by boosting a GLM with one regressor transformed with a neural network (for BonusMalus, which achieves the largest improvement).

Figure 7. Distributions of the Poisson loss on the test set (for each network the dotted line represents the average loss in 100 calibrations).

Table 2. Distributions of the Poisson loss on the test set, their quantiles (q), average values (avg) and standard deviations (SD).

Statistics   A1        A1_CANN   A2        A2_MCA    A2_1AE    A2_2AEs
q0.05        30.3362   30.2875   30.3133   30.3142   30.2863   30.2767
q0.25        30.3684   30.3417   30.3481   30.3495   30.3322   30.3138
avg          30.3950   30.3898   30.4045   30.3958   30.3789   30.3425
q0.75        30.4130   30.4261   30.4338   30.4386   30.4157   30.3719
q0.95        30.4688   30.5101   30.5444   30.5104   30.4833   30.4126
SD           0.0384    0.0708    0.0758    0.0597    0.0624    0.0433

The bias of the predictors and the performance of auto-calibrated predictors are investigated in Section 3 in the Online Supplement.

5. Conclusions

We have presented a new approach to training neural networks with mixed categorical and numerical features for supervised learning tasks. We have illustrated that our new architecture of a network, with a joint embedding for all categorical features and with the network parameters properly initialized with parameters from (denoising) autoencoders learned in an unsupervised manner, performs better, in terms of the predictive power and the stability of the predictions, than the classical architecture, used nowadays in actuarial data science, with separate entity embeddings for categorical features and random initialization of parameters. We hope that the results described in this paper will draw attention in actuarial data science to a new possible architecture of a neural network for supervised learning tasks and to the benefits of autoencoders in deriving representations of features for supervised learning tasks.
In fact, we are already aware of new (unpublished) experiments with autoencoders used for the initialization of neural networks for actuarial pricing, see Holvoet et al. (2022).

Beyond the results presented in this paper, our new architecture has one more advantage. For the classical architecture, hyperparameter optimization requires optimizing the loss function with respect to multiple hyperparameters which describe the dimensions of the entity embeddings for the categorical features. In our new architecture, we search for the optimal value of only one hyperparameter, which specifies the dimension of the joint embedding for all categorical features. Consequently, our new architecture enables faster and more convenient optimization of the dimension of the representation of categorical features, see Section 4 in the Online Supplement. There is also a disadvantage of our approach. We lose the simple graphical interpretation of the joint embedding of categorical features due to the higher dimension of the joint representation compared to the one- or two-dimensional representations of entity embeddings. We expect, however, that the rich methods of explainable AI can be used to interpret the impact of the categorical features, modelled with a joint embedding, on the response.

Finally, we could modify our approach by learning the autoencoders, used for pre-training layers of a network, jointly with the network for the target response, which uses the representations from the autoencoders as its input (at the additional cost of tuning the weight between the unsupervised and the supervised loss). Such an approach is also postulated in the machine learning literature, see for example Ranzato and Szummer (2008) and Lei et al. (2018); a minimal sketch of such a combined objective is given below. This last remark reinforces the conclusion stated above that autoencoders should be included in the toolbox of data science actuaries who build predictive models.
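To illustrate this last possibility, a combined objective could be sketched as follows; the two-headed model, the choice of reconstruction loss and the weight alpha are hypothetical and would have to be designed and tuned for the problem at hand.

# Sketch only: weighted combination of the supervised Poisson loss and an
# unsupervised reconstruction loss for joint training (all names are illustrative).
import tensorflow as tf
from tensorflow import keras

def combined_loss(model, x, y, x_cat, alpha=0.1):
    lam_hat, x_cat_rec = model(x)                 # assumed two-headed model: frequency + reconstruction
    supervised = tf.reduce_mean(keras.losses.poisson(y, lam_hat))
    unsupervised = tf.reduce_mean(keras.losses.binary_crossentropy(x_cat, x_cat_rec))
    return supervised + alpha * unsupervised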
