Generative model‐enhanced human motion prediction

INTRODUCTION

Human motion is naturally intelligible as a time‐varying graph of connected joints constrained by locomotor anatomy and physiology. Its prediction allows the anticipation of actions with applications across healthcare,1,2 physical rehabilitation and training,3,4 robotics,5‐7 navigation,8‐11 manufacture,12 entertainment,13‐15 and security.16‐18

The favoured approach to predicting movements over time has been purely inductive, relying on the history of a specific class of movement to predict its future. For example, state‐space models19 enjoyed early success for simple, common, or cyclic motions.20‐22 The range, diversity and complexity of human motion have encouraged a shift to more expressive, deep neural network architectures,23‐30 but still within a simple inductive framework.

This approach would be adequate were actions both sharply distinct and highly stereotyped. But their complex, compositional nature means that within one category of action the kinematics may vary substantially, while between two categories they may barely differ. Moreover, few real‐world tasks restrict the plausible repertoire to a small number of classes—distinct or otherwise—that could be explicitly learnt. Rather, any action may be drawn from a great diversity of possibilities—both kinematic and teleological—that shape the characteristics of the underlying movements. This has two crucial implications. First, any modelling approach that lacks awareness of the full space of motion possibilities will be vulnerable to poor generalisation and brittle performance in the face of kinematic anomalies. Second, the very notion of in‐distribution (ID) testing becomes moot, for the relations between different actions and their kinematic signatures are plausibly determinable only across the entire domain of action.
A test here arguably needs to be out‐of‐distribution (OoD) if it is to be considered a robust test at all.

These considerations are amplified by the nature of real‐world applications of kinematic modelling, such as anticipating arbitrary deviations from expected motor behaviour early enough for an automatic intervention to mitigate them. Most urgent in the domain of autonomous driving,9,11 such safety concerns are of the highest importance, and are best addressed within the fundamental modelling framework. Indeed, Amodei et al31 cite the ability to recognise our own ignorance as a safety mechanism that must be a core component of safe AI. Nonetheless, to our knowledge, current predictive models of human kinematics neither quantify OoD performance nor are designed with it in mind. There is therefore a need for two frameworks, applicable across the domain of action modelling: one for hardening a predictive model to anomalous cases, and another for quantifying OoD performance with established benchmark datasets. General frameworks are here desirable in preference to new models, for the field is evolving so rapidly that greater impact can be achieved by introducing mechanisms applicable to a breadth of candidate architectures, even if they are demonstrated in only a subset. Our approach is founded on combining a latent variable generative model with a standard predictive model, illustrated with the current state‐of‐the‐art discriminative architecture,29,32 a strategy that has produced state‐of‐the‐art results in the medical imaging domain.33 Our aim is to achieve robust performance within a realistic, low‐volume, high‐heterogeneity data regime by providing a general mechanism for enhancing a discriminative architecture with a generative model.

In short, our contributions to the problem of achieving robustness to distributional shift in human motion prediction are as follows:

1. We provide a framework to benchmark OoD performance on the most widely used open‐source motion capture datasets: Human3.6M,34 and Carnegie Mellon University (CMU)‐Mocap (http://mocap.cs.cmu.edu/), and evaluate state‐of‐the‐art models on it.

2. We present a framework for hardening deep feed‐forward models to OoD samples. We show that the hardened models are fast to train, and exhibit substantially improved OoD performance with minimal impact on ID performance.

We begin Section 2 with a brief review of human motion prediction with deep neural networks, and of OoD generalisation using generative models. In Section 3, we define a framework for benchmarking OoD performance using open‐source multi‐action datasets. In Section 4, we introduce the discriminative models that we harden using a generative branch to achieve a state‐of‐the‐art (SOTA) OoD benchmark. We then turn in Section 5 to the architecture of the generative model and the overall objective function. Section 6 presents our experiments and results. We conclude in Section 7 with a summary of our results, current limitations and caveats, and future directions for developing robust and reliable OoD performance and a quantifiable awareness of unfamiliar behaviour.

RELATED WORK

Deep‐network‐based human motion prediction

Historically, sequence‐to‐sequence prediction using recurrent neural networks (RNNs) has been the de facto standard for human motion prediction.26,28,30,35‐39 Currently, the SOTA is dominated by feed‐forward models,24,27,29,32 which are inherently faster and easier to train than RNNs. The jury is still out, however, on the optimal way to handle temporality for human motion prediction.
Meanwhile, recent trends have overwhelmingly shown that graph‐based approaches are an effective means to encode the spatial dependencies between joints,29,32 or sets of joints.28 In this study, we consider the SOTA models that combine graph‐based approaches with a feed‐forward mechanism, as presented by Mao et al,29 and the subsequent extension that leverages motion attention.32 Further attention‐based approaches may indicate an upcoming trend.40 We show that these may be augmented to improve robustness to OoD samples.

Generative models for out‐of‐distribution prediction and detection

Despite the power of deep neural networks for prediction in complex domains,41 they face several challenges that limit their suitability for safety‐critical applications. Amodei et al31 list robustness to distributional shift as one of the five major challenges to AI safety. Deep generative models have been used extensively for the detection of OoD inputs and have been shown to generalise well in such scenarios.42‐44 While recent work has shown some failures in simple OoD detection using density estimates from deep generative models,45,46 they remain a prime candidate for anomaly detection.45,47,48

Myronenko33 uses a variational autoencoder (VAE)49 to regularise an encoder‐decoder architecture with the specific aim of better generalisation. By simultaneously using the encoder as the recognition model of the VAE, the model is encouraged to base its segmentations on a complete picture of the data, rather than on a reductive representation that is more likely to be fitted to the training data. Furthermore, the original loss and the VAE's loss are combined as a weighted sum such that the discriminator's objective still dominates.
Further work may also reveal useful interpretability of behaviour (via visualisation of the latent space, as in Reference [50]), generation of novel motion,51 or reconstruction of missing joints, as in Reference [52].

QUANTIFYING OUT‐OF‐DISTRIBUTION PERFORMANCE OF HUMAN MOTION PREDICTORS

Even a very compact representation of the human body, such as OpenPose's 17‐joint parameterisation,53 explodes to unmanageable complexity when a temporal dimension is introduced of the scale and granularity necessary to distinguish between different kinds of action: typically many seconds, sampled at hundredths of a second. Moreover, though there are anatomical and physiological constraints on the space of licit joint configurations and their trajectories, the repertoire of possibility remains vast, and the kinematic demarcations of teleologically different actions remain indistinct. Thus, no practically obtainable dataset may realistically represent the possible distance between instances. To simulate OoD data, we first need ID data that can be varied in its quantity and heterogeneity, closely replicating cases where a particular kinematic morphology may be rare, and therefore undersampled, and cases where kinematic morphologies are both highly variable within a defined class and similar across classes. Such replication needs to accentuate the challenging aspects of each scenario.

We therefore propose to evaluate OoD performance where only a single action, drawn from a single action distribution, is available for training and hyperparameter search, and testing is carried out on the remaining classes. To determine which actions can be clearly separated from the others, we train a classifier of action category based on the motion inputs. We select the action “walking” from H3.6M, and “basketball” from CMU; the classifier identifies these actions with a precision and recall of 0.95 and 0.81, respectively, for walking in H3.6M, and of 1.0 and 1.0 for basketball in CMU.
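The split just described can be sketched as follows. This is an illustrative sketch of the benchmark protocol only, not the authors' code; the function and the toy action labels are our own placeholders.

```python
# Sketch of the OoD benchmark split: train (and tune) on a single
# well-separated action, test on all remaining actions.
ID_ACTION = {"h36m": "walking", "cmu": "basketball"}

def ood_split(dataset, sequences_by_action):
    """sequences_by_action: dict mapping action name -> list of sequences."""
    in_dist = sequences_by_action[ID_ACTION[dataset]]
    out_dist = {a: s for a, s in sequences_by_action.items()
                if a != ID_ACTION[dataset]}
    return in_dist, out_dist

# Toy example: train on "walking", hold out everything else as OoD.
seqs = {"walking": [1, 2, 3], "eating": [4], "smoking": [5]}
train, ood_test = ood_split("h36m", seqs)
assert train == [1, 2, 3]
assert set(ood_test) == {"eating", "smoking"}
```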
This is discussed further in Appendix A.

BACKGROUND

Here, we describe the current SOTA model proposed by Mao et al29 (graph convolutional network [GCN]). We then describe the extension by Mao et al32 (attention‐GCN), which antecedes the GCN prediction model with motion attention.

Problem formulation

We are given a motion sequence $X_{1:N} = [x_1, x_2, \dots, x_N]$ consisting of N consecutive human poses, where $x_i \in \mathbb{R}^K$, with K the number of parameters describing each pose. The goal is to predict the poses $X_{N+1:N+T}$ for the subsequent T time steps.

Discrete cosine transformation‐based temporal encoding

The input is transformed using discrete cosine transformations (DCT). In this way, each resulting coefficient encodes information of the entire sequence at a particular temporal frequency. Furthermore, the option to remove high or low frequencies is provided. Given a joint k, the position of k over N time steps is given by the trajectory vector $x_k = [x_{k,1}, \dots, x_{k,N}]$, which we convert to a DCT vector of the form $C_k = [C_{k,1}, \dots, C_{k,N}]$, where $C_{k,l}$ represents the lth DCT coefficient. With $\delta_{l1}$ the Kronecker delta ($\delta_{l1} = 1$ if $l = 1$, and 0 otherwise), these coefficients may be computed as

$$C_{k,l} = \sqrt{\frac{2}{N}} \sum_{n=1}^{N} x_{k,n} \frac{1}{\sqrt{1+\delta_{l1}}} \cos\left(\frac{\pi}{2N}(2n-1)(l-1)\right). \quad (1)$$

If no frequencies are cropped, the DCT is invertible via the inverse discrete cosine transform (IDCT):

$$x_{k,n} = \sqrt{\frac{2}{N}} \sum_{l=1}^{N} C_{k,l} \frac{1}{\sqrt{1+\delta_{l1}}} \cos\left(\frac{\pi}{2N}(2n-1)(l-1)\right). \quad (2)$$

Mao et al29 use the DCT transform with a GCN architecture to predict the output sequence. This is achieved by having an equal‐length input‐output sequence, where the input is the DCT transformation of $x_k = [x_{k,1}, \dots, x_{k,N}, x_{k,N+1}, \dots, x_{k,N+T}]$; here $x_{k,1}, \dots, x_{k,N}$ is the observed sequence and $x_{k,N+1}, \dots, x_{k,N+T}$ are replicas of $x_{k,N}$ (ie, $x_{k,n} = x_{k,N}$ for $n \geq N$). The target is now simply the ground truth $x_k$.

Graph convolutional network

Suppose $C \in \mathbb{R}^{K \times (N+T)}$ is defined on a graph with K nodes and N+T dimensions; then we define a GCN to respect this structure.
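The DCT pair of Equations (1) and (2) amounts to multiplying each joint's trajectory by an orthonormal cosine basis. A minimal NumPy sketch (ours, not the authors' implementation):

```python
import numpy as np

def dct_basis(N):
    # Orthonormal DCT basis of Equations (1) and (2):
    # D[n-1, l-1] = sqrt(2/N) / sqrt(1 + delta_{l1}) * cos(pi/(2N) (2n-1)(l-1))
    n = np.arange(1, N + 1)[:, None]   # time index n = 1..N
    l = np.arange(1, N + 1)[None, :]   # frequency index l = 1..N
    D = np.sqrt(2.0 / N) * np.cos(np.pi / (2 * N) * (2 * n - 1) * (l - 1))
    D[:, 0] /= np.sqrt(2.0)            # the 1/sqrt(1 + delta_{l1}) factor at l = 1
    return D

N = 20
x = np.random.randn(N)                 # one joint's trajectory x_k
D = dct_basis(N)
C = x @ D                              # Equation (1): DCT coefficients C_k
x_rec = C @ D.T                        # Equation (2): IDCT, since D is orthogonal
assert np.allclose(x, x_rec)           # exactly invertible when nothing is cropped
```

Cropping columns of the basis (removing high frequencies) gives the lossy variant mentioned above; only the full basis is exactly invertible.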
First, we define a graph convolutional layer (GCL) that takes as input the activation of the previous layer, $A^{(l-1)}$, where l is the current layer:

$$\mathrm{GCL}(A^{(l-1)}) = S A^{(l-1)} W + b, \quad (3)$$

where $A^{(0)} = C \in \mathbb{R}^{K \times (N+T)}$, $S \in \mathbb{R}^{K \times K}$ is a layer‐specific learnable normalised graph Laplacian that represents connections between joints, $W \in \mathbb{R}^{n_{l-1} \times n_l}$ are the learnable inter‐layer weightings, and $b \in \mathbb{R}^{n_l}$ are the learnable biases, where $n_l$ is the number of hidden units in layer l.

Network structure and loss

The network consists of 12 graph convolutional blocks (GCBs), each containing two GCLs with skip (or residual) connections; see Figures A6 and A7. In addition, there is one GCL at the beginning of the network, and one at the end. $n_l = 256$ for each layer l. There is one final skip connection from the DCT inputs to the DCT outputs, which greatly reduces training time. The model has around 2.6M parameters. Hyperbolic tangent functions are used as the activation function. Batch normalisation is applied before each activation.

The outputs are converted back to their original coordinate system using the IDCT (Equation (2)) to be compared to the ground truth. The loss used for joint angles is the average $\ell_1$ distance between the ground‐truth joint angles and the predicted ones. Thus, the joint angle loss is

$$\ell_a = \frac{1}{K(N+T)} \sum_{n=1}^{N+T} \sum_{k=1}^{K} \left| \hat{x}_{k,n} - x_{k,n} \right|, \quad (4)$$

where $\hat{x}_{k,n}$ is the predicted kth joint at timestep n and $x_{k,n}$ is the corresponding ground truth.

The model is separately trained on three‐dimensional (3D) joint coordinate prediction making use of the mean per joint position error (MPJPE), as proposed in Reference [34] and used in References [29,32]. This is defined, for each training example, as

$$\ell_m = \frac{1}{J(N+T)} \sum_{n=1}^{N+T} \sum_{j=1}^{J} \left\| \hat{p}_{j,n} - p_{j,n} \right\|_2, \quad (5)$$

where $\hat{p}_{j,n} \in \mathbb{R}^3$ denotes the predicted jth joint position in frame n.
Here $p_{j,n}$ is the corresponding ground truth, and J is the number of joints in the skeleton.

Motion attention extension

Mao et al32 extend this model by summing multiple DCT transformations from different sections of the motion history, with weightings learned via an attention mechanism. For this extension, the above model (the GCN), along with the anteceding motion attention, is trained end‐to‐end. We refer to this as the attention‐GCN.

OUR APPROACH

Myronenko33 augments an encoder‐decoder discriminative model by using the encoder as a recognition model for a VAE,49,54 and shows this to be a very effective regulariser. Here, we also use a VAE, but for conjugacy with the discriminator, we use graph convolutional layers in the decoder. This can be compared to the variational graph autoencoder (VGAE) proposed by Kipf and Welling.55 However, Kipf and Welling's application is a link prediction task in citation networks, where it is desired to model only connectivity in the latent space. Here we model connectivity, position and temporal frequency. To reflect this distinction, the layers immediately before and after the latent space are fully connected, creating a homogeneous latent space.

The generative model gives precedence to information that can be modelled causally, while leaving elements of the discriminative machinery, such as skip connections, to capture correlations that remain useful for prediction but are not necessarily pursuant to the objective of the generative model. In addition to performing the role of regularisation in general, we show that we gain robustness to distributional shift across similar, but different, actions that are likely to share generative properties. The architecture may be considered with the visual aid in Figure 1.

FIGURE 1: Graph convolutional network (GCN) network architecture with variational autoencoder (VAE) branch.
Here, $n_z = 16$ is the number of latent variables per joint.

VAE branch and loss

We define the first 6 GCBs as our VAE recognition model, with a latent variable $z \in \mathbb{R}^{K \times n_z}$ distributed as $\mathcal{N}(\mu_z, \sigma_z)$, where $\mu_z \in \mathbb{R}^{K \times n_z}$ and $\sigma_z \in \mathbb{R}^{K \times n_z}$. We set $n_z = 8$ or 32, depending on training stability.

The KL divergence between the latent space distribution and a spherical Gaussian $\mathcal{N}(0, I)$ is given by

$$\ell_l = \mathrm{KL}\left(q(Z \mid C) \,\|\, p(Z)\right) = \frac{1}{2} \sum_{1}^{n_z} \left( \mu_z^2 + \sigma_z^2 - 1 - \log \sigma_z^2 \right). \quad (6)$$

The decoder part of the VAE has the same structure as the discriminative branch: 6 GCBs. We parametrise the output neurons as $\mu \in \mathbb{R}^{K \times (N+T)}$ and $\log \sigma^2 \in \mathbb{R}^{K \times (N+T)}$. We can now model the reconstruction of inputs as samples of a maximum likelihood of a Gaussian distribution, which constitutes the second term of the negative variational lower bound (VLB) of the VAE:

$$\ell_G = \log p(C \mid Z) = -\frac{1}{2} \sum_{l=1}^{N+T} \sum_{k=1}^{K} \left( \log \sigma_{k,l}^2 + \log 2\pi + \frac{(C_{k,l} - \mu_{k,l})^2}{e^{\log \sigma_{k,l}^2}} \right), \quad (7)$$

where $C_{k,l}$ are the DCT coefficients of the ground truth.

Training

We train the entire network together with the addition of the negative VLB:

$$\ell = \underbrace{\frac{1}{K(N+T)} \sum_{n=1}^{N+T} \sum_{k=1}^{K} \left| \hat{x}_{k,n} - x_{k,n} \right|}_{\text{discriminative loss}} - \lambda \underbrace{\left( \ell_G - \ell_l \right)}_{\text{VLB}}. \quad (8)$$

Here, λ is a hyperparameter of the model. The overall network has ≈3.4M parameters. The number of parameters varies slightly with the number of joints, K, since this is reflected in the size of the graph in each layer (K=48 for H3.6M, K=64 for CMU joint angles, and K=J=75 for CMU Cartesian coordinates). Furthermore, once trained, the generative model is not required for prediction, and hence for this purpose the model is as compact as the original models.

EXPERIMENTS

Datasets and experimental setup

Human3.6M (H3.6M)

The H3.6M dataset,34,56 so called as it contains a selection of 3.6 million 3D human poses and corresponding images, consists of seven actors each performing 15 actions, such as walking, eating, discussion, sitting, and talking on the phone.
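As a concrete reference for the objective defined in Equations (6)-(8), the loss terms can be sketched numerically as follows. This is a minimal sketch with our own function names, not the authors' implementation.

```python
import numpy as np

def kl_term(mu, log_var):
    # Equation (6): KL( q(z|C) || N(0, I) ) for a diagonal Gaussian posterior
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def log_likelihood(C, mu, log_var):
    # Equation (7): log p(C|z) under a diagonal Gaussian over DCT coefficients
    return -0.5 * np.sum(log_var + np.log(2 * np.pi)
                         + (C - mu)**2 / np.exp(log_var))

def total_loss(x_hat, x, l_G, l_l, lam):
    # Equation (8): mean absolute error minus lambda times the VLB (l_G - l_l)
    disc = np.mean(np.abs(x_hat - x))
    return disc - lam * (l_G - l_l)

# Sanity check: a posterior equal to the prior contributes zero KL.
assert kl_term(np.zeros(4), np.zeros(4)) == 0.0
```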
Li et al,28 Mao et al,29 and Martinez et al30 all follow the same training and evaluation procedure: training their motion prediction model on six of the actors (five for training and one for cross‐validation), for each action, and evaluating metrics on the final actor, subject 5. For easy comparison to these ID baselines, we maintain the same train, cross‐validation, and test splits. However, we use the single most well‐defined action (see Appendix A), walking, for training and cross‐validation, and we report test error on all the remaining actions from subject 5. In this way, we conduct all parameter selections based on ID performance.

CMU motion capture (CMU‐mocap)

The CMU dataset consists of five general classes of actions. Similar to References [27,29,57], we use eight detailed actions from these classes: “basketball,” “basketball signal,” “directing traffic,” “jumping,” “running,” “soccer,” “walking,” and “window washing.” We use two representations: a 64‐dimensional vector that gives an exponential map representation58 of the joint angles, and a 75‐dimensional vector that gives the 3D Cartesian coordinates of 25 joints. We do not tune any hyperparameters on this dataset, and use only a train and test set with the same split as is common in the literature.29,30

Model configuration

We implemented the model in PyTorch59 using the Adam optimiser.60 The learning rate was set to 0.0005 for all experiments where, unlike Mao et al,29,32 we did not decay the learning rate, as it was hypothesised that the dynamic relationship between the discriminative and generative losses would make this redundant. The batch size was 16. For numerical stability, gradients were clipped to a maximum $\ell_2$‐norm of 1, and $\log \hat{\sigma}^2$ values were clamped between −20 and 3. Code for all experiments is available at https://github.com/bouracha/OoDMotion.

Baseline comparison

Both Mao et al29 (GCN) and Mao et al32 (attention‐GCN) use this same GCN architecture with DCT inputs.
In particular, Mao et al32 increase the amount of history accounted for by the GCN by adding a motion attention mechanism to weight the DCT coefficients from different sections of the history prior to their being input to the GCN. We compare against both of these baselines on OoD actions. For attention‐GCN, we leave the attention mechanism preceding the GCN unchanged, such that the generative branch of the model reconstructs the weighted DCT inputs to the GCN, and the whole network is end‐to‐end differentiable.

Hyperparameter search

Since a new term has been introduced to the loss function, it was necessary to determine a sensible weighting between the discriminative and generative models. In Reference [33], this weighting was arbitrarily set to 0.1. It is natural that the optimum value here will relate to the other regularisation parameters in the model. Thus, we conducted random hyperparameter search for $p_{drop}$ and λ, in the ranges $p_{drop} \in [0, 0.5]$ on a linear scale, and $\lambda \in [10^{-5}, 10]$ on a logarithmic scale. For fair comparison, we also conducted hyperparameter search on GCN, for values of the dropout probability ($p_{drop}$) between 0.1 and 0.9. For each model, 25 experiments were run, and the optimum values were selected on the lowest ID validation error. The hyperparameter search was conducted only for the GCN model on short‐term predictions for the H3.6M dataset, and the selected values were used for all subsequent experiments, demonstrating the generalisability of the architecture.

Results

Consistent with the literature, we report short‐term (<500 ms) and long‐term (>500 ms) predictions. In comparison to GCN, we take short‐term history into account (10 frames, 400 ms) for both datasets to predict both short‐ and long‐term motion. In comparison to attention‐GCN, we take long‐term history (50 frames, 2 seconds) to predict the next 10 frames, and predict further into the future by recursively applying the predictions as input to the model, as in Reference [32].
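The random search described above (25 draws, with $p_{drop}$ uniform on a linear scale and λ log-uniform) can be sketched as follows; the seed and generator choice are ours, for illustration only.

```python
import numpy as np

# Random hyperparameter search: p_drop uniform on [0, 0.5],
# lambda log-uniform on [1e-5, 10], 25 trials.
rng = np.random.default_rng(seed=0)
n_trials = 25
p_drop = rng.uniform(0.0, 0.5, size=n_trials)
lam = 10.0 ** rng.uniform(-5.0, 1.0, size=n_trials)  # exponent uniform on [-5, 1]

assert p_drop.min() >= 0.0 and p_drop.max() <= 0.5
assert lam.min() >= 1e-5 and lam.max() <= 10.0
```

Sampling the exponent uniformly spreads trials evenly across orders of magnitude, which is what a logarithmic scale for λ means here.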
In this way, a single short‐term prediction model may produce long‐term predictions.

We use the Euclidean distance between the predicted and ground‐truth joint angles for the Euler angle representation. For the 3D joint coordinate representation, we use the MPJPE as used for training (Equation (5)). Table 1 reports the joint angle error for short‐term predictions on the H3.6M dataset. Here, we found the optimum hyperparameters to be $p_{drop} = 0.5$ for GCN, and λ = 0.003 with $p_{drop} = 0.3$ for our augmentation of GCN; the latter values were used for all subsequent experiments, except that for our augmentation of attention‐GCN we removed dropout altogether. On average, our model performs convincingly better both ID and OoD. Here, the generative branch works well both as a regulariser for small datasets and by creating robustness to distributional shifts. We see similar and consistent results for long‐term predictions in Table 2.

TABLE 1: Short‐term prediction of Euclidean distance between predicted and ground truth joint angles on H3.6M

|              | Walking (ID) |       |      | Eating (OoD) |      |      | Smoking (OoD) |      |      | Average (of 14 for OoD) |      |      |
| Milliseconds | 160   | 320   | 400  | 160  | 320  | 400  | 160  | 320  | 400  | 160  | 320  | 400  |
| GCN (OoD)    | 0.37  | 0.60  | 0.65 | 0.38 | 0.65 | 0.79 | 0.55 | 1.08 | 1.10 | 0.69 | 1.09 | 1.27 |
| SD           | 0.008 | 0.008 | 0.01 | 0.01 | 0.03 | 0.04 | 0.01 | 0.02 | 0.02 | 0.02 | 0.04 | 0.04 |
| Ours (OoD)   | 0.37  | 0.59  | 0.64 | 0.37 | 0.59 | 0.72 | 0.54 | 1.01 | 0.99 | 0.68 | 1.07 | 1.21 |
| SD           | 0.004 | 0.03  | 0.03 | 0.01 | 0.03 | 0.04 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 |

Note: Each experiment was conducted three times; we report the mean and SD. Note that our results have lower variance. The full table is given in Table A1.
Bold values correspond to the best score for the respective simulation across the different models. Abbreviations: GCN, graph convolutional network; OoD, out‐of‐distribution.

TABLE 2: Long‐term prediction of Euclidean distance between predicted and ground truth joint angles on H3.6M

|              | Walking |      | Eating |      | Smoking |      | Discussion |      | Average |      |
| Milliseconds | 560  | 1000 | 560  | 1000 | 560  | 1000 | 560  | 1000 | 560  | 1000 |
| GCN (OoD)    | 0.80 | 0.80 | 0.89 | 1.20 | 1.26 | 1.85 | 1.45 | 1.88 | 1.10 | 1.43 |
| Ours (OoD)   | 0.66 | 0.72 | 0.90 | 1.19 | 1.17 | 1.78 | 1.44 | 1.90 | 1.04 | 1.40 |

Note: Bold values correspond to the lowest values. Abbreviations: GCN, graph convolutional network; OoD, out‐of‐distribution.

From Tables 3 and 4, we can see that the superior OoD performance generalises to the CMU dataset with the same hyperparameter settings, with a similar trend of the difference being larger for longer predictions, for both joint angles and 3D joint coordinates. For each of these experiments, $n_z = 8$.

TABLE 3: Euclidean distance between predicted and ground truth joint angles on CMU

|              | Basketball (ID) |      |      |      |      | Basketball signal (OoD) |      |      |      |      | Average (of 7 for OoD) |      |      |      |      |
| Milliseconds | 80   | 160  | 320  | 400  | 1000 | 80   | 160  | 320  | 400  | 1000 | 80   | 160  | 320  | 400  | 1000 |
| GCN          | 0.40 | 0.67 | 1.11 | 1.25 | 1.63 | 0.27 | 0.55 | 1.14 | 1.42 | 2.18 | 0.36 | 0.65 | 1.41 | 1.49 | 2.17 |
| Ours         | 0.40 | 0.66 | 1.12 | 1.29 | 1.76 | 0.28 | 0.57 | 1.15 | 1.43 | 2.07 | 0.34 | 0.62 | 1.35 | 1.41 | 2.10 |

Note: The full table is given in Table A2.
Bold values correspond to the best score for the respective simulation across the different models. Abbreviations: GCN, graph convolutional network; ID, in‐distribution; OoD, out‐of‐distribution.

TABLE 4: Mean per joint position error (MPJPE) between predicted and ground truth three‐dimensional Cartesian coordinates of joints on CMU

|              | Basketball |      |      |      |       | Basketball signal |      |      |      |       | Average (of 7 for OoD) |      |      |       |       |
| Milliseconds | 80   | 160  | 320  | 400  | 1000  | 80   | 160  | 320  | 400  | 1000  | 80   | 160  | 320  | 400   | 1000  |
| GCN (OoD)    | 15.7 | 28.9 | 54.1 | 65.4 | 108.4 | 14.4 | 30.4 | 63.5 | 78.7 | 114.8 | 20.0 | 43.8 | 86.3 | 105.8 | 169.2 |
| Ours (OoD)   | 16.0 | 30.0 | 54.5 | 65.5 | 98.1  | 12.8 | 26.0 | 53.7 | 67.6 | 103.2 | 21.6 | 42.3 | 84.2 | 103.8 | 164.3 |

Note: The full table is given in Table A3. Abbreviations: GCN, graph convolutional network; OoD, out‐of‐distribution.

Table 5 shows that the effectiveness of the generative branch generalises to the very recent motion attention architecture. For attention‐GCN we used $n_z = 32$. Interestingly, short‐term predictions are poorer here, but long‐term predictions are consistently better. This supports our assertion that information relevant to generative mechanisms is more intrinsic to the causal model, and thus, when the predicted output is used recursively, more useful information is available for future predictions.

TABLE 5: Long‐term prediction of three‐dimensional joint positions on H3.6M

|               | Walking (ID) |      |      |      | Eating (OoD) |       |       |       | Smoking (OoD) |      |       |       | Average (of 14 for OoD) |       |       |       |
| Milliseconds  | 560  | 720  | 880  | 1000 | 560  | 720   | 880   | 1000  | 560  | 720  | 880   | 1000  | 560   | 720   | 880   | 1000  |
| att‐GCN (OoD) | 55.4 | 60.5 | 65.2 | 68.7 | 87.6 | 103.6 | 113.2 | 120.3 | 81.7 | 93.7 | 102.9 | 108.7 | 112.1 | 129.6 | 140.3 | 147.8 |
| Ours (OoD)    | 58.7 | 60.6 | 65.5 | 69.1 | 81.7 | 94.4  | 102.7 | 109.3 | 80.6 | 89.9 | 99.2  | 104.1 | 113.1 | 127.7 | 137.9 | 145.3 |

Note: Here, ours is also trained with the attention‐GCN model. The full table is given in Table A4.
Bold values correspond to the best score for the respective simulation across the different models. Abbreviations: GCN, graph convolutional network; ID, in‐distribution; OoD, out‐of‐distribution.

CONCLUSION

We draw attention to the need for robustness to distributional shifts in predicting human motion, and propose a framework for its evaluation based on major open‐source datasets. We demonstrate that state‐of‐the‐art discriminative architectures can be hardened to extreme distributional shifts by augmentation with a generative model, combining low in‐distribution predictive error with maximal generalisability. Our investigation argues for wider use of generative models in behavioural modelling, and shows that this can be done with minimal or no performance penalty, within hybrid architectures of potentially diverse constitution. Further work could examine the surveyability of the latent space introduced by the VAE.

ACKNOWLEDGEMENTS

Anthony Bourached is funded by the UKRI UCL Centre for Doctoral Training in AI‐enabled Healthcare Systems. Robert Gray, Ashwani Jha and Parashkev Nachev are funded by the Wellcome Trust (213038) and the NIHR UCL Biomedical Research Centre.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available; instructions at https://github.com/bouracha/OoDMotion.

REFERENCES

1. Geertsema EE, Thijs RD, Gutter T, et al. Automated video‐based detection of nocturnal convulsive seizures in a residential care setting. Epilepsia. 2018;59:53‐60.
2. Kakar M, Nyström H, Aarup LR, Nøttrup TJ, Olsen DR. Respiratory motion prediction by using the adaptive neuro fuzzy inference system (ANFIS). Phys Med Biol. 2005;50(19):4721‐4728.
3. Chang C‐Y, Lange B, Zhang M, et al. Towards pervasive physical rehabilitation using Microsoft Kinect. 2012 6th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops. IEEE; 2012:159‐162. https://ieeexplore.ieee.org/abstract/document/6240377
4. Webster D, Celik O.
Systematic review of Kinect applications in elderly care and stroke rehabilitation. J Neuroeng Rehabil. 2014;11(1):108.
5. Gui L‐Y, Zhang K, Wang Y‐X, Liang X, Moura JM, Veloso M. Teaching robots to predict human motion. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE; 2018:562‐567. https://ieeexplore.ieee.org/abstract/document/8594452
6. Koppula H, Saxena A. Learning spatio‐temporal structure from RGB‐D videos for human activity detection and anticipation. International Conference on Machine Learning; 2013:792‐800. https://proceedings.mlr.press/v28/koppula13.html
7. Koppula HS, Saxena A. Anticipating human activities for reactive robotic response. Tokyo: IROS; 2013:2071.
8. Alahi A, Goel K, Ramanathan V, Robicquet A, Fei‐Fei L, Savarese S. Social LSTM: human trajectory prediction in crowded spaces. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:961‐971. https://openaccess.thecvf.com/content_cvpr_2016/html/Alahi_Social_LSTM_Human_CVPR_2016_paper.html
9. Bhattacharyya A, Fritz M, Schiele B. Long‐term on‐board prediction of people in traffic scenes under uncertainty. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018:4194‐4202. https://openaccess.thecvf.com/content_cvpr_2018/html/Bhattacharyya_Long-Term_On-Board_Prediction_CVPR_2018_paper.html
10. Paden B, Čáp M, Yong SZ, Yershov D, Frazzoli E. A survey of motion planning and control techniques for self‐driving urban vehicles. IEEE Trans Intell Veh. 2016;1(1):33‐55. https://ieeexplore.ieee.org/abstract/document/7490340
11. Wang Y, Liu Z, Zuo Z, Li Z, Wang L, Luo X. Trajectory planning and safety assessment of autonomous vehicles based on motion prediction and model predictive control. IEEE Trans Veh Technol. 2019;68(9):8546‐8556.
12. Švec P, Thakur A, Raboin E, Shah BC, Gupta SK. Target following with motion prediction for unmanned surface vehicle operating in cluttered environments. Autonomous Robots. 2014;36(4):383‐405.
13. Lau RW, Chan A.
Motion prediction for online gaming. International Workshop on Motion in Games. Berlin/Heidelberg, Germany: Springer; 2008:104‐114.
14. Rofougaran AR, Rofougaran M, Seshadri N, Ibrahim BB, Walley J, Karaoguz J. Game console and gaming object with motion prediction modeling and methods for use therewith. US Patent 9,943,760; 2018.
15. Shirai A, Geslin E, Richir S. WiiMedia: motion analysis methods and applications using a consumer video game controller. Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games. New York, NY: Association for Computing Machinery; 2007:133‐140.
16. Grant J, Boukouvalas A, Griffiths R‐R, Leslie D, Vakili S, De Cote EM. Adaptive sensor placement for continuous spaces. International Conference on Machine Learning. PMLR; 2019:2385‐2393. https://proceedings.mlr.press/v97/grant19a.html
17. Kim D, Paik J. Gait recognition using active shape model and motion prediction. IET Comput Vis. 2010;4(1):25‐36.
18. Ma Z, Wang X, Ma R, Wang Z, Ma J. Integrating gaze tracking and head‐motion prediction for mobile device authentication: a proof of concept. Sensors. 2018;18(9):2894.
19. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: MIT Press; 2009.
20. Lehrmann AM, Gehler PV, Nowozin S. Efficient nonlinear Markov models for human motion. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014:1314‐1321. https://openaccess.thecvf.com/content_cvpr_2014/html/Lehrmann_Efficient_Nonlinear_Markov_2014_CVPR_paper.html
21. Sutskever I, Hinton GE, Taylor GW. The recurrent temporal restricted Boltzmann machine. Advances in Neural Information Processing Systems; 2009:1601‐1608. https://www.cs.utoronto.ca/~hinton/absps/rtrbm.pdf
22. Taylor GW, Hinton GE, Roweis ST. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 2007:1345‐1352.
https://proceedings.neurips.cc/paper/2006/file/1091660f3dff84fd648efe31391c5524‐Paper.pdf
23. Aksan E, Kaufmann M, Hilliges O. Structured prediction helps 3D human motion modelling. Proceedings of the IEEE International Conference on Computer Vision; 2019:7144‐7153.
24. Butepage J, Black MJ, Kragic D, Kjellstrom H. Deep representation learning for human motion prediction and classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:6158‐6166.
25. Cai Y, Huang L, Wang Y, et al. Learning progressive joint propagation for human motion prediction. Proceedings of the European Conference on Computer Vision (ECCV); 2020.
26. Fragkiadaki K, Levine S, Felsen P, Malik J. Recurrent network models for human dynamics. Proceedings of the IEEE International Conference on Computer Vision; 2015:4346‐4354.
27. Li C, Zhang Z, Sun Lee W, Hee Lee G. Convolutional sequence to sequence model for human dynamics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018:5226‐5234.
28. Li M, Chen S, Zhao Y, Zhang Y, Wang Y, Tian Q. Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020:214‐223.
29. Mao W, Liu M, Salzmann M, Li H. Learning trajectory dependencies for human motion prediction. Proceedings of the IEEE International Conference on Computer Vision; 2019:9489‐9497.
30. Martinez J, Black MJ, Romero J. On human motion prediction using recurrent neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:2891‐2900.
31. Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. 2016.
32. Mao W, Liu M, Salzmann M. History repeats itself: human motion prediction via motion attention. ECCV; 2020.
33. Myronenko A. 3D MRI brain tumor segmentation using autoencoder regularization. International MICCAI Brainlesion Workshop.
Cham, Switzerland: Springer; 2018:311‐320.Ionescu C, Papava D, Olaru V, Sminchisescu C. Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell. 2013;36(7):1325‐1339.Gopalakrishnan A, Mali A, Kifer D, Giles L, Ororbia AG. A neural temporal model for human motion prediction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019:12116‐12125.Gui L‐Y, Wang Y‐X, Liang X, Moura JM. Adversarial geometry‐aware human motion prediction. Proceedings of the European Conference on Computer Vision (ECCV); 2018:786‐803.Guo X, Choi J. Human motion prediction via learning local structure representations and temporal dependencies. Proc AAAI Conf Artif Intel. 2019;33:2580‐2587.Jain A, Zamir AR, Savarese S, Saxena A. Structural‐rnn: deep learning on spatio‐temporal graphs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:5308‐5317.Pavllo D, Grangier D, Auli M. Quaternet: a quaternion‐based recurrent model for human motion. arXiv preprint arXiv:1805.06485. 2018.Gossner O, Steiner J, Stewart C. Attention please! Econometrica. 2021;89(4):1717‐1751.LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436‐444.Hendrycks D, Gimpel K. A baseline for detecting misclassified and out‐of‐distribution examples in neural networks. arXiv preprint arXiv: 1610.02136. 2016.Hendrycks D, Mazeika M, Dietterich T. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. 2018.Liang S, Li Y, Srikant R. Enhancing the reliability of out‐of‐distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. 2017.Daxberger E, Hernández‐Lobato JM. Bayesian variational autoencoders for unsupervised out‐of‐distribution detection. arXiv preprint arXiv:1912.05651. 2019.Nalisnick E, Matsukawa A, Teh YW, Gorur D, Lakshminarayanan B. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136. 
2018.Grathwohl W, Wang K‐C, Jacobsen J‐H, Duvenaud D, Norouzi M, Swersky K. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263. 2019.Kendall A, Gal Y. What uncertainties do we need in bayesian deep learning for computer vision? Advances in Neural Information Processing Systems; 2017:5574‐5584.Kingma DP, Welling M. Auto‐encoding variational bayes. arXiv preprint arXiv:1312.6114. 2013.Bourached A, Nachev P. Unsupervised videographic analysis of rodent behaviour. arXiv preprint arXiv:1910.11065. 2019.Motegi Y, Hijioka Y, Murakami M. Human motion generative model using variational autoencoder. Int J Model Optim. 2018;8(1):8‐12.Chen N, Bayer J, Urban S, P. Van Der Smagt. Efficient movement representation by embedding dynamic movement primitives in deep autoencoders. 2015 IEEE‐RAS 15th International Conference on Humanoid Robots (Humanoids). IEEE; 2015:434‐440.Cao Z, Hidalgo G, Simon T, Wei S‐E, Sheikh Y. Openpose: realtime multi‐person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008. 2018.Rezende DJ, Mohamed S, Wierstra D. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning; 2014:1278‐1286.Kipf TN, Welling M. Variational graph auto‐encoders. arXiv preprint arXiv:1611.07308. 2016.Ionescu C, Li F, Sminchisescu C. Latent structured models for human pose estimation. 2011 International Conference on Computer Vision. IEEE; 2011:2220‐2227.Li D, Rodriguez C, Yu X, Li H. Word‐level deep sign language recognition from video: a new large‐scale dataset and methods comparison. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2020.Grassia FS. Practical parameterization of rotations using the exponential map. J Graph Tools. 1998;3(3):29‐48.A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 
Automatic differentiation in PyTorch; 2017. https://openreview.net/forum?id=BJJsrmfCZ
Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.

APPENDIX

The appendix consists of four parts. We provide a brief summary of each section below.

Appendix A: We provide results from our experimentation to determine the optimal way of defining separable distributions on the H3.6M and CMU datasets.
Appendix B: We provide the full results of tables which are shown in part in the main text.
Appendix C: We inspect the generative model by examining its latent space, and use it to consider the role the generative model plays in learning, as well as possible directions of future work.
Appendix D: We provide larger diagrams of the architecture of the augmented GCN.

A.1. Appendix A: Discussion of the definition of out-of-distribution

Here, we describe in more detail the empirical motivation for our definition of out-of-distribution (OoD) on the H3.6M and CMU datasets. Figure A1 shows the distribution of actions for the H3.6M and CMU datasets. We want our ID data to be small in quantity and narrow in domain. Since these datasets are labelled by action, we have a natural choice of distribution: one of the actions. Moreover, it is desirable that the action be quantifiably distinct from the other actions.

To determine which action supports these properties, we train a simple classifier to determine which action is most easily distinguished from the others based on the DCT inputs: DCT(x⃗_k) = DCT([x_{k,1}, …, x_{k,N}, x_{k,N+1}, …, x_{k,N+T}]), where x_{k,n} = x_{k,N} for n ≥ N. We make no assumption about the architecture that would be optimal for determining the separation, and so use a simple fully connected model with four layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU).
The final layer uses a softmax to predict the class label, and cross entropy on these logits is used as the loss function during training. We used ReLU activations with a dropout probability of 0.5.

We trained this model using the last 10 historic frames (N=10, T=10) with 20 DCT coefficients for both the H3.6M and CMU datasets, and additionally (N=50, T=10) with 20 DCT coefficients for H3.6M (here we select only the 20 lowest-frequency DCT coefficients). We trained each model for 10 epochs with a batch size of 2048 and a learning rate of 0.00001. The confusion matrices for the H3.6M dataset are shown in Figures A2 and A3, respectively. Here, we use the same train set as outlined in Section 6.1; however, we report results on subject 11, which for motion prediction was used as the validation set. We did this because its number of instances is much greater than that of subject 5, and no hyperparameter tuning was necessary. For the CMU dataset, we used the same train and test split as for all other experiments.

In both cases, for the H3.6M dataset, the classifier achieves its highest precision (0.91 and 0.95, respectively) for the action walking, with a recall of 0.83 and 0.81, respectively. Furthermore, in both cases walking together dominates the false negatives for walking (50% and 44%, respectively) as well as the false positives (33% in each case).

FIGURE A1. (A) Distribution of short-term training instances for actions in H3.6M. (B) Distribution of training instances for actions in CMU.

The general increase in distinguishability seen in Figure A3 heightens the need to handle distributional shift robustly, since the distributions of values representing different actions only become more distinct as the time scale is increased.
This holds even with the naïve DCT transformation used to capture longer time scales without increasing the vector size.

As we can see from the confusion matrix in Figure A4, the actions in the CMU dataset are even more easily separable. In particular, our selected ID action in the paper, basketball, can be identified with 100% precision and recall on the test set.

FIGURE A2. Confusion matrix for a multiclass classifier for action labels. In each case, we use the same input convention x⃗_k = [x_{k,1}, …, x_{k,N}, x_{k,N+1}, …, x_{k,N+T}], where x_{k,n} = x_{k,N} for n ≥ N, such that the input to the classifier is 48 × 20 = 960 dimensional. The classifier has four fully connected layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU). The final layer uses a softmax to predict the class label. Cross-entropy loss is used for training, with ReLU activations and a dropout probability of 0.5. We used a batch size of 2048 and a learning rate of 0.00001. H3.6M dataset. N = 10, T = 10. Number of discrete cosine transform (DCT) coefficients = 20 (lossless transformation).

FIGURE A3. Confusion matrix for a multiclass classifier for action labels. In each case, we use the same input convention x⃗_k = [x_{k,1}, …, x_{k,N}, x_{k,N+1}, …, x_{k,N+T}], where x_{k,n} = x_{k,N} for n ≥ N, such that the input to the classifier is 48 × 20 = 960 dimensional. The classifier has four fully connected layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU). The final layer uses a softmax to predict the class label. Cross-entropy loss is used for training, with ReLU activations and a dropout probability of 0.5. We used a batch size of 2048 and a learning rate of 0.00001. H3.6M dataset. N = 50, T = 10. Number of discrete cosine transform (DCT) coefficients = 20, where the 40 highest-frequency DCT coefficients are culled.

FIGURE A4. Confusion matrix for a multiclass classifier for action labels.
In each case, we use the same input convention x⃗_k = [x_{k,1}, …, x_{k,N}, x_{k,N+1}, …, x_{k,N+T}], where x_{k,n} = x_{k,N} for n ≥ N, such that the input to the classifier is 48 × 20 = 960 dimensional. The classifier has four fully connected layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU). The final layer uses a softmax to predict the class label. Cross-entropy loss is used for training, with ReLU activations and a dropout probability of 0.5. We used a batch size of 2048 and a learning rate of 0.00001. CMU dataset. N = 10, T = 25. Number of discrete cosine transform (DCT) coefficients = 35 (lossless transformation).

A.2. Appendix B: Full results

TABLE A1. Short-term prediction of Euclidean distance between predicted and ground-truth joint angles on H3.6M. Columns are prediction horizons of 80 / 160 / 320 / 400 ms; values are means over three runs, with the standard deviation (SD) in parentheses.

Walking (ID)
  GCN:  0.22 (0.001)   0.37 (0.008)  0.60 (0.008)  0.65 (0.01)
  Ours: 0.23 (0.003)   0.37 (0.004)  0.59 (0.03)   0.64 (0.03)
Eating (OoD)
  GCN:  0.22 (0.003)   0.38 (0.01)   0.65 (0.03)   0.79 (0.04)
  Ours: 0.21 (0.008)   0.37 (0.01)   0.59 (0.03)   0.72 (0.04)
Smoking (OoD)
  GCN:  0.28 (0.01)    0.55 (0.01)   1.08 (0.02)   1.10 (0.02)
  Ours: 0.28 (0.005)   0.54 (0.01)   1.01 (0.01)   0.99 (0.02)
Discussion (OoD)
  GCN:  0.29 (0.004)   0.65 (0.01)   0.98 (0.04)   1.08 (0.04)
  Ours: 0.31 (0.005)   0.65 (0.009)  0.97 (0.02)   1.07 (0.01)
Directions (OoD)
  GCN:  0.38 (0.01)    0.59 (0.03)   0.82 (0.05)   0.92 (0.06)
  Ours: 0.38 (0.007)   0.58 (0.02)   0.79 (0.0)    0.90 (0.05)
Greeting (OoD)
  GCN:  0.48 (0.006)   0.81 (0.01)   1.25 (0.02)   1.44 (0.02)
  Ours: 0.49 (0.006)   0.81 (0.005)  1.24 (0.02)   1.43 (0.02)
Phoning (OoD)
  GCN:  0.58 (0.006)   1.12 (0.01)   1.52 (0.01)   1.61 (0.01)
  Ours: 0.57 (0.004)   1.10 (0.003)  1.52 (0.01)   1.61 (0.01)
Posing (OoD)
  GCN:  0.27 (0.01)    0.59 (0.05)   1.26 (0.1)    1.53 (0.1)
  Ours: 0.33 (0.02)    0.68 (0.05)   1.25 (0.03)   1.51 (0.03)
Purchases (OoD)
  GCN:  0.62 (0.001)   0.90 (0.001)  1.34 (0.02)   1.42 (0.03)
  Ours: 0.62 (0.001)   0.89 (0.002)  1.23 (0.005)  1.31 (0.01)
Sitting (OoD)
  GCN:  0.40 (0.003)   0.66 (0.007)  1.15 (0.02)   1.33 (0.03)
  Ours: 0.39 (0.001)   0.63 (0.001)  1.05 (0.004)  1.20 (0.005)
Sitting down (OoD)
  GCN:  0.46 (0.01)    0.94 (0.03)   1.52 (0.04)   1.69 (0.05)
  Ours: 0.40 (0.007)   0.79 (0.009)  1.19 (0.01)   1.33 (0.02)
Taking photo (OoD)
  GCN:  0.26 (0.005)   0.53 (0.01)   0.82 (0.01)   0.93 (0.02)
  Ours: 0.26 (0.005)   0.52 (0.01)   0.81 (0.01)   0.95 (0.01)
Waiting (OoD)
  GCN:  0.29 (0.01)    0.59 (0.03)   1.06 (0.05)   1.30 (0.05)
  Ours: 0.29 (0.0007)  0.58 (0.003)  1.06 (0.001)  1.29 (0.006)
Walking dog (OoD)
  GCN:  0.52 (0.01)    0.86 (0.02)   1.18 (0.02)   1.33 (0.03)
  Ours: 0.52 (0.006)   0.88 (0.01)   1.17 (0.008)  1.34 (0.01)
Walking together (OoD)
  GCN:  0.21 (0.005)   0.44 (0.02)   0.67 (0.03)   0.72 (0.03)
  Ours: 0.21 (0.01)    0.44 (0.01)   0.66 (0.01)   0.74 (0.01)
Average (of 14 for OoD)
  GCN:  0.38 (0.007)   0.69 (0.02)   1.09 (0.04)   1.27 (0.04)
  Ours: 0.38 (0.006)   0.68 (0.01)   1.07 (0.01)   1.21 (0.02)

Note: Each experiment was conducted three times; we report the mean and standard deviation. Note that our results have lower variance. Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

TABLE A2. Euclidean distance between predicted and ground-truth joint angles on CMU. Columns are prediction horizons of 80 / 160 / 320 / 400 / 1000 ms.

Basketball (ID)
  GCN:  0.40  0.67  1.11  1.25  1.63
  Ours: 0.40  0.66  1.12  1.29  1.76
Basketball signal (OoD)
  GCN:  0.27  0.55  1.14  1.42  2.18
  Ours: 0.28  0.57  1.15  1.43  2.07
Directing traffic (OoD)
  GCN:  0.31  0.62  1.05  1.24  2.49
  Ours: 0.28  0.56  0.96  1.10  2.33
Jumping (OoD)
  GCN:  0.42  0.73  1.72  1.98  2.66
  Ours: 0.38  0.72  1.74  2.03  2.70
Running (OoD)
  GCN:  0.46  0.84  1.50  1.72  1.57
  Ours: 0.46  0.81  1.36  1.53  2.09
Soccer (OoD)
  GCN:  0.29  0.54  1.15  1.41  2.14
  Ours: 0.28  0.53  1.07  1.27  1.99
Walking (OoD)
  GCN:  0.40  0.61  0.97  1.18  1.85
  Ours: 0.38  0.54  0.82  0.99  1.27
Washing window (OoD)
  GCN:  0.36  0.65  1.23  1.51  2.31
  Ours: 0.35  0.63  1.20  1.51  2.26
Average (of 7 for OoD)
  GCN:  0.36  0.65  1.41  1.49  2.17
  Ours: 0.34  0.62  1.35  1.41  2.10

Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

TABLE A3. Mean per joint position error (MPJPE) between predicted and ground-truth 3D Cartesian coordinates of joints on CMU. Columns are prediction horizons of 80 / 160 / 320 / 400 / 1000 ms.

Basketball (ID)
  GCN:  15.7  28.9   54.1  65.4  108.4
  Ours: 16.0  30.0   54.5  65.5   98.1
Basketball signal (OoD)
  GCN:  14.4  30.4   63.5  78.7  114.8
  Ours: 12.8  26.0   53.7  67.6  103.2
Directing traffic (OoD)
  GCN:  18.5  37.4   75.6  93.6  210.7
  Ours: 18.3  37.2   75.7  93.8  199.6
Jumping (OoD)
  GCN:  24.6  51.2  111.4  139.6  219.7
  Ours: 25.0  52.0  110.3  136.8  200.2
Running (OoD)
  GCN:  32.3  54.8   85.9  99.3   99.9
  Ours: 29.8  50.2   83.5  98.7  107.3
Soccer (OoD)
  GCN:  22.6  46.6   92.8  114.3  192.5
  Ours: 21.1  44.2   90.4  112.1  202.0
Walking (OoD)
  GCN:  10.8  20.7   42.9  53.4   86.5
  Ours: 10.5  18.9   39.2  48.6   72.2
Washing window (OoD)
  GCN:  17.1  36.4   77.6  96.0  151.6
  Ours: 17.6  37.3   82.0  103.4  167.5
Average (of 7 for OoD)
  GCN:  20.0  43.8   86.3  105.8  169.2
  Ours: 21.6  42.3   84.2  103.8  164.3

Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

TABLE A4. Long-term prediction of 3D joint positions on H3.6M. Columns are prediction horizons of 560 / 720 / 880 / 1000 ms.

Walking (ID)
  Attention-GCN:  55.4   60.5   65.2   68.7
  Ours:           58.7   60.6   65.5   69.1
Eating (OoD)
  Attention-GCN:  87.6  103.6  113.2  120.3
  Ours:           81.7   94.4  102.7  109.3
Smoking (OoD)
  Attention-GCN:  81.7   93.7  102.9  108.7
  Ours:           80.6   89.9   99.2  104.1
Discussion (OoD)
  Attention-GCN: 114.6  130.0  133.5  136.3
  Ours:          115.4  129.0  134.5  139.4
Directions (OoD)
  Attention-GCN: 107.0  123.6  132.7  138.4
  Ours:          107.1  120.6  129.2  136.6
Greeting (OoD)
  Attention-GCN: 127.4  142.0  153.4  158.6
  Ours:          128.0  140.3  150.8  155.7
Phoning (OoD)
  Attention-GCN:  98.7  117.3  129.9  138.4
  Ours:           95.8  111.0  122.7  131.4
Posing (OoD)
  Attention-GCN: 151.0  176.0  189.4  199.6
  Ours:          158.7  181.3  194.4  203.4
Purchases (OoD)
  Attention-GCN: 126.6  144.0  154.3  162.1
  Ours:          128.0  143.2  154.7  164.3
Sitting (OoD)
  Attention-GCN: 118.3  141.1  154.6  164.0
  Ours:          118.4  137.7  149.7  157.5
Sitting down (OoD)
  Attention-GCN: 136.8  162.3  177.7  189.9
  Ours:          136.8  157.6  170.8  180.4
Taking photo (OoD)
  Attention-GCN: 113.7  137.2  149.7  159.9
  Ours:          116.3  134.5  145.6  155.4
Waiting (OoD)
  Attention-GCN: 109.9  125.1  135.3  141.2
  Ours:          110.4  124.5  133.9  140.3
Walking dog (OoD)
  Attention-GCN: 131.3  146.9  161.1  171.4
  Ours:          138.3  151.2  165.0  175.5
Walking together (OoD)
  Attention-GCN:  64.5   71.1   76.8   80.8
  Ours:           67.7   71.9   77.1   80.8
Average (of 14 for OoD)
  Attention-GCN: 112.1  129.6  140.3  147.8
  Ours:          113.1  127.7  137.9  145.3

Note: Here "Ours" is also trained with the attention-GCN model. Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

A.3. Appendix C: Latent space of the VAE

One of the advantages of having a generative model is that we have a latent variable representing a distribution over deterministic encodings of the data. We considered whether the VAE was learning anything interpretable with its latent variable, as was the case in Reference [55].

The purpose of this investigation was twofold. First, to determine whether the generative model was learning a comprehensive internal state, or just a nonlinear average state, as is common in the training of VAE-like architectures; the result should suggest a key direction for future work. Second, an interpretable latent space may be of paramount usefulness for future applications of human motion prediction. Namely, if dimensionality reduction of the latent space to an inspectable number of dimensions places actions or behaviours close together when they are kinematically or teleologically similar, as in Reference [50], then human experts may find wide application for an interpretation that is both quantifiable and qualitatively comparable to all other classes within their domain of interest. For example, a medical doctor may consider a patient to have unusual symptoms for condition A; it may be useful to know that the patient's deviation from a classic case of A is in the direction of condition B.

We trained the augmented GCN model discussed in the main text with all actions, for both datasets. We use Uniform Manifold Approximation and Projection (UMAP)61 to project the latent space of the trained GCN models onto two dimensions for all samples in each dataset independently.
From Figure A5, we can see that for both models the 2D projection closely resembles a spherical Gaussian. Furthermore, we can see from Figure A5B that the action walking does not occupy a discernible domain of the latent space. This result is further verified by the same classifier used in Appendix A, which achieved no better than chance when given the latent variables as input rather than the raw data.

This implies that the benefit of the generative model observed in the main text is significant even when the generative model itself performs poorly: here we can be sure that the reconstructions are not even good enough to distinguish between actions. It is hence natural for future work to investigate whether the improvement in OoD performance is greater when the model is trained in such a way as to ensure that the generative model performs well. There are multiple avenues through which such an objective might be achieved, pre-training the generative model being one of the salient candidates.

FIGURE A5. Latent embedding of the trained model on the H3.6M and CMU datasets, independently projected into 2D using UMAP (default hyperparameters) from 384 dimensions for H3.6M and 512 dimensions for CMU. (A) H3.6M, all actions, opacity = 0.1. (B) H3.6M, all actions in blue (opacity = 0.1), walking in red (opacity = 1). (C) CMU, all actions in blue, opacity = 0.1.

A.4. Appendix D: Architecture diagrams

FIGURE A6. Network architecture with discriminative and variational autoencoder (VAE) branch.
FIGURE A7. Graph convolutional layer (GCL) and a residual graph convolutional block (GCB).

Generative model‐enhanced human motion prediction

 


Publisher: Wiley
Copyright: © 2022 The Authors. Applied AI Letters published by John Wiley & Sons Ltd
eISSN: 2689-5595
DOI: 10.1002/ail2.63


INTRODUCTION

Human motion is naturally intelligible as a time-varying graph of connected joints constrained by locomotor anatomy and physiology. Its prediction allows the anticipation of actions, with applications across healthcare,1,2 physical rehabilitation and training,3,4 robotics,5-7 navigation,8-11 manufacture,12 entertainment,13-15 and security.16-18

The favoured approach to predicting movements over time has been purely inductive, relying on the history of a specific class of movement to predict its future. For example, state-space models19 enjoyed early success for simple, common, or cyclic motions.20-22 The range, diversity and complexity of human motion has encouraged a shift to more expressive, deep neural network architectures,23-30 but still within a simple inductive framework.

This approach would be adequate were actions both sharply distinct and highly stereotyped. But their complex, compositional nature means that within one category of action the kinematics may vary substantially, while between two categories they may barely differ. Moreover, few real-world tasks restrict the plausible repertoire to a small number of classes, distinct or otherwise, that could be explicitly learnt. Rather, any action may be drawn from a great diversity of possibilities, both kinematic and teleological, that shape the characteristics of the underlying movements. This has two crucial implications. First, any modelling approach that lacks awareness of the full space of motion possibilities will be vulnerable to poor generalisation and brittle performance in the face of kinematic anomalies. Second, the very notion of in-distribution (ID) testing becomes moot, for the relations between different actions and their kinematic signatures are plausibly determinable only across the entire domain of action.
A test here arguably needs to be out-of-distribution (OoD) if it is to be considered a robust test at all.

These considerations are amplified by the nature of real-world applications of kinematic modelling, such as anticipating arbitrary deviations from expected motor behaviour early enough for an automatic intervention to mitigate them. Most urgent in the domain of autonomous driving,9,11 such safety concerns are of the highest importance, and are best addressed within the fundamental modelling framework. Indeed, Amodei et al31 cite the ability to recognise our own ignorance as a safety mechanism that must be a core component of safe AI. Nonetheless, to our knowledge, current predictive models of human kinematics neither quantify OoD performance nor are designed with it in mind. There is therefore a need for two frameworks, applicable across the domain of action modelling: one for hardening a predictive model to anomalous cases, and another for quantifying OoD performance with established benchmark datasets. General frameworks are here desirable in preference to new models, for the field is evolving so rapidly that greater impact can be achieved by introducing mechanisms that can be applied to a breadth of candidate architectures, even if they are demonstrated in only a subset. Our approach is founded on combining a latent variable generative model with a standard predictive model, illustrated with the current state-of-the-art discriminative architecture,29,32 a strategy that has produced state-of-the-art results in the medical imaging domain.33 Our aim is to achieve robust performance within a realistic, low-volume, high-heterogeneity data regime by providing a general mechanism for enhancing a discriminative architecture with a generative model.

In short, our contributions to the problem of achieving robustness to distributional shift in human motion prediction are as follows:

1. We provide a framework to benchmark OoD performance on the most widely used open-source motion capture datasets, Human3.6M34 and Carnegie Mellon University (CMU)-Mocap (http://mocap.cs.cmu.edu/), and evaluate state-of-the-art models on it.
2. We present a framework for hardening deep feed-forward models to OoD samples. We show that the hardened models are fast to train, and exhibit substantially improved OoD performance with minimal impact on ID performance.

We begin Section 2 with a brief review of human motion prediction with deep neural networks, and of OoD generalisation using generative models. In Section 3, we define a framework for benchmarking OoD performance using open-source multi-action datasets. We introduce in Section 4 the discriminative models that we harden using a generative branch to achieve a state-of-the-art (SOTA) OoD benchmark. We then turn in Section 5 to the architecture of the generative model and the overall objective function. Section 6 presents our experiments and results. We conclude in Section 7 with a summary of our results, current limitations and caveats, and future directions for developing robust and reliable OoD performance and a quantifiable awareness of unfamiliar behaviour.

RELATED WORK

Deep-network-based human motion prediction

Historically, sequence-to-sequence prediction using recurrent neural networks (RNNs) has been the de facto standard for human motion prediction.26,28,30,35-39 Currently, the SOTA is dominated by feed-forward models,24,27,29,32 which are inherently faster and easier to train than RNNs. The jury is still out, however, on the optimal way to handle temporality for human motion prediction.
Meanwhile, recent trends have overwhelmingly shown that graph-based approaches are an effective means to encode the spatial dependencies between joints29,32 or sets of joints.28 In this study, we consider the SOTA models that combine graph-based approaches with a feed-forward mechanism, as presented by Mao et al,29 and the subsequent extension that leverages motion attention.32 Further attention-based approaches may indicate an upcoming trend.40 We show that these may be augmented to improve robustness to OoD samples.

Generative models for out-of-distribution prediction and detection

Despite the power of deep neural networks for prediction in complex domains,41 they face several challenges that limit their suitability for safety-critical applications. Amodei et al31 list robustness to distributional shift as one of the five major challenges to AI safety. Deep generative models have been used extensively for the detection of OoD inputs and have been shown to generalise well in such scenarios.42-44 While recent work has shown some failures in simple OoD detection using density estimates from deep generative models,45,46 they remain a prime candidate for anomaly detection.45,47,48

Myronenko33 uses a variational autoencoder (VAE)49 to regularise an encoder-decoder architecture with the specific aim of better generalisation. By simultaneously using the encoder as the recognition model of the VAE, the model is encouraged to base its segmentations on a complete picture of the data, rather than on a reductive representation that is more likely to be fitted to the training data. Furthermore, the original loss and the VAE's loss are combined as a weighted sum such that the discriminator's objective still dominates.
Further work may also reveal useful interpretability of behaviour (via visualisation of the latent space, as in Reference [50]), generation of novel motion,51 or reconstruction of missing joints, as in Reference [52].

QUANTIFYING OUT-OF-DISTRIBUTION PERFORMANCE OF HUMAN MOTION PREDICTORS

Even a very compact representation of the human body, such as OpenPose's 17-joint parameterisation,53 explodes to unmanageable complexity when a temporal dimension is introduced of the scale and granularity necessary to distinguish between different kinds of action: typically many seconds, sampled at hundredths of a second. Moreover, though there are anatomical and physiological constraints on the space of licit joint configurations and their trajectories, the repertoire of possibility remains vast, and the kinematic demarcations of teleologically different actions remain indistinct. Thus, no practically obtainable dataset may realistically represent the possible distance between instances. To simulate OoD data, we first need ID data that can be varied in its quantity and heterogeneity, closely replicating cases where a particular kinematic morphology may be rare, and therefore undersampled, and cases where kinematic morphologies are both highly variable within a defined class and similar across classes. Such replication needs to accentuate the challenging aspects of each scenario.

We therefore propose to evaluate OoD performance where only a single action, drawn from a single action distribution, is available for training and hyperparameter search, and testing is carried out on the remaining classes. To determine which actions can be clearly separated from the others, we train a classifier of action category on the motion inputs. We select the action "walking" from H3.6M and "basketball" from CMU: the classifier identifies walking in H3.6M with a precision of 0.95 and a recall of 0.81, and basketball in CMU with a precision and recall of 1.0.
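The split behind this protocol is simple: a single action forms the ID training set, and every other action forms the OoD test set. The following is an illustrative sketch only, not the authors' released code; the record format and the helper name `ood_split` are assumptions:

```python
def ood_split(records, id_action):
    """Partition labelled motion samples into a single-action
    in-distribution (ID) training set and an out-of-distribution
    (OoD) test set drawn from every remaining action.

    records: iterable of (action_label, sequence) pairs.
    id_action: the sole action used for training and hyperparameter
        search (eg "walking" for H3.6M, "basketball" for CMU).
    """
    id_train = [seq for act, seq in records if act == id_action]
    ood_test = [(act, seq) for act, seq in records if act != id_action]
    if not id_train:
        raise ValueError(f"no samples found for ID action {id_action!r}")
    return id_train, ood_test


# Toy example with H3.6M-style labels: train on "walking" only,
# evaluate on all other actions.
toy = [("walking", "seq0"), ("eating", "seq1"),
       ("walking", "seq2"), ("smoking", "seq3")]
id_train, ood_test = ood_split(toy, "walking")
```

Under this split, any hyperparameter search sees only the ID action, so reported OoD numbers reflect genuine distributional shift rather than tuning leakage.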
This is discussed further in Appendix A.

BACKGROUND

Here, we describe the current SOTA model proposed by Mao et al29 (graph convolutional network [GCN]). We then describe the extension by Mao et al32 (attention-GCN), which antecedes the GCN prediction model with motion attention.

Problem formulation

We are given a motion sequence X_{1:N} = (x_1, x_2, x_3, …, x_N) consisting of N consecutive human poses, where x_i ∈ ℝ^K, with K the number of parameters describing each pose. The goal is to predict the poses X_{N+1:N+T} for the subsequent T time steps.

Discrete cosine transformation-based temporal encoding

The input is transformed using discrete cosine transformations (DCT). In this way, each resulting coefficient encodes information about the entire sequence at a particular temporal frequency. Furthermore, the option to remove high or low frequencies is provided. Given a joint k, its position over N time steps is given by the trajectory vector x_k = [x_{k,1}, …, x_{k,N}], which we convert to a DCT vector of the form C_k = [C_{k,1}, …, C_{k,N}], where C_{k,l} represents the lth DCT coefficient. With δ_{l1} the Kronecker delta (δ_{l1} = 1 for l = 1 and 0 otherwise), these coefficients may be computed as

(1)  C_{k,l} = √(2/N) ∑_{n=1}^{N} x_{k,n} (1/√(1+δ_{l1})) cos( (π/2N)(2n−1)(l−1) ).

If no frequencies are cropped, the DCT is invertible via the inverse discrete cosine transform (IDCT):

(2)  x_{k,n} = √(2/N) ∑_{l=1}^{N} C_{k,l} (1/√(1+δ_{l1})) cos( (π/2N)(2n−1)(l−1) ).

Mao et al29 use the DCT transform with a GCN architecture to predict the output sequence. This is achieved by having an equal-length input-output sequence, where the input is the DCT transformation of x_k = [x_{k,1}, …, x_{k,N}, x_{k,N+1}, …, x_{k,N+T}]; here x_{k,1}, …, x_{k,N} is the observed sequence and x_{k,N+1}, …, x_{k,N+T} are replicas of x_{k,N} (ie, x_{k,n} = x_{k,N} for n ≥ N). The target is now simply the ground truth x_k.

Graph convolutional network

Suppose C ∈ ℝ^{K×(N+T)} is defined on a graph with K nodes and N+T dimensions; then we define a GCN to respect this structure.
First, we define a graph convolutional layer (GCL) that takes as input the activation of the previous layer, A^{l−1}, where l is the current layer:

(3)  GCL(A^{l−1}) = S A^{l−1} W + b,

where A^0 = C ∈ ℝ^{K×(N+T)}, S ∈ ℝ^{K×K} is a layer-specific learnable normalised graph Laplacian that represents connections between joints, W ∈ ℝ^{n_{l−1}×n_l} are the learnable inter-layer weightings, and b ∈ ℝ^{n_l} are the learnable biases, where n_l is the number of hidden units in layer l.

Network structure and loss

The network consists of 12 graph convolutional blocks (GCBs), each containing two GCLs with skip (or residual) connections; see Figures A6 and A7. In addition, there is one GCL at the beginning of the network and one at the end, with n_l = 256 for each layer l. There is one final skip connection from the DCT inputs to the DCT outputs, which greatly reduces training time. The model has around 2.6M parameters. Hyperbolic tangent functions are used as the activation function, and batch normalisation is applied before each activation.

The outputs are converted back to their original coordinate system using the IDCT (Equation (2)) to be compared to the ground truth. The loss used for joint angles is the average ℓ1 distance between the ground-truth joint angles and the predicted ones. Thus, the joint angle loss is

(4)  ℓ_a = 1/(K(N+T)) ∑_{n=1}^{N+T} ∑_{k=1}^{K} |x̂_{k,n} − x_{k,n}|,

where x̂_{k,n} is the predicted kth joint at timestep n and x_{k,n} is the corresponding ground truth.

The model is separately trained on three-dimensional (3D) joint coordinate prediction using the mean per joint position error (MPJPE), as proposed in Reference [34] and used in References [29,32]. This is defined, for each training example, as

(5)  ℓ_m = 1/(J(N+T)) ∑_{n=1}^{N+T} ∑_{j=1}^{J} ‖p̂_{j,n} − p_{j,n}‖₂,

where p̂_{j,n} ∈ ℝ³ denotes the predicted jth joint position in frame n.
$p_{j,n}$ is the corresponding ground truth, and $J$ is the number of joints in the skeleton.

Motion attention extension

Mao et al.32 extend this model by summing multiple DCT transformations from different sections of the motion history, with weightings learned via an attention mechanism. For this extension, the above model (the GCN), along with the anteceding motion attention, is trained end-to-end. We refer to this as the attention-GCN.

OUR APPROACH

Myronenko33 augments an encoder-decoder discriminative model by using the encoder as the recognition model of a VAE,49,54 and shows this to be a very effective regulariser. Here, we also use a VAE, but for conjugacy with the discriminator, we use graph convolutional layers in the decoder. This can be compared to the variational graph autoencoder (VGAE) proposed by Kipf and Welling.55 However, Kipf and Welling's application is a link-prediction task in citation networks, for which it is desirable to model only connectivity in the latent space. Here we model connectivity, position and temporal frequency. To reflect this distinction, the layers immediately before and after the latent space are fully connected, creating a homogeneous latent space.

The generative model sets a precedent for information that can be modelled causally, while leaving elements of the discriminative machinery, such as skip connections, to capture correlations that remain useful for prediction but are not necessarily pursuant to the objective of the generative model. In addition to acting as a regulariser in general, we show that we gain robustness to distributional shift across similar, but different, actions that are likely to share generative properties. The architecture may be considered with the visual aid in Figure 1.

FIGURE 1 Graph convolutional network (GCN) architecture with variational autoencoder (VAE) branch.
Here, $n_z = 16$ is the number of latent variables per joint.

VAE branch and loss

We define the first six GCBs as our VAE recognition model, with a latent variable $z \in \mathbb{R}^{K \times n_z} \sim \mathcal{N}(\mu_z, \sigma_z)$, where $\mu_z \in \mathbb{R}^{K \times n_z}$ and $\sigma_z \in \mathbb{R}^{K \times n_z}$. We use $n_z = 8$ or $32$, depending on training stability.

The KL divergence between the latent-space distribution and a spherical Gaussian $\mathcal{N}(0, I)$ is given by

$$\ell_l = \mathrm{KL}\left(q(Z \mid C) \,\|\, p(Z)\right) = \frac{1}{2} \sum_{1}^{n_z} \left( \mu_z^2 + \sigma_z^2 - 1 - \log \sigma_z^2 \right). \quad (6)$$

The decoder part of the VAE has the same structure as the discriminative branch: six GCBs. We parametrise the output neurons as $\mu \in \mathbb{R}^{K \times (N+T)}$ and $\log\sigma^2 \in \mathbb{R}^{K \times (N+T)}$. We can now model the reconstruction of the inputs as maximum-likelihood samples of a Gaussian distribution, which constitutes the second term of the negative variational lower bound (VLB) of the VAE:

$$\ell_G = \log p(C \mid Z) = -\frac{1}{2} \sum_{n=1}^{N+T} \sum_{k=1}^{K} \left[ \log\sigma_{k,n}^2 + \log 2\pi + \frac{\left(C_{k,n} - \mu_{k,n}\right)^2}{\sigma_{k,n}^2} \right], \quad (7)$$

where $C_{k,n}$ are the DCT coefficients of the ground truth.

Training

We train the entire network together with the addition of the negative VLB:

$$\ell = \underbrace{\frac{1}{(N+T)K} \sum_{n=1}^{N+T} \sum_{k=1}^{K} \left| \hat{x}_{k,n} - x_{k,n} \right|}_{\text{discriminative loss}} \; - \; \lambda \underbrace{\left( \ell_G - \ell_l \right)}_{\text{VLB}}. \quad (8)$$

Here, $\lambda$ is a hyperparameter of the model. The overall network has ${\approx}3.4$M parameters. The number of parameters varies slightly with the number of joints, $K$, since this is reflected in the size of the graph in each layer ($K = 48$ for H3.6M, $K = 64$ for CMU joint angles, and $K = J = 75$ for CMU Cartesian coordinates). Furthermore, once trained, the generative model is not required for prediction, so for this purpose the model is as compact as the original ones.

EXPERIMENTS

Datasets and experimental setup

Human3.6M (H3.6M)

The H3.6M dataset,34,56 so called as it contains a selection of 3.6 million 3D human poses and corresponding images, consists of seven actors each performing 15 actions, such as walking, eating, discussion, sitting and talking on the phone.
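Stepping back to the training objective for a moment, the loss terms of Equations (6)-(8) can be sketched in PyTorch as follows. This is a sketch only; the tensor shapes and reduction order are our assumptions.

```python
import math
import torch

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|C) || N(0, I)) of Equation (6), summed over joints and
    latent dimensions; one value per batch element."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=(-2, -1))

def gaussian_nll(C, mu, logvar):
    """Negated reconstruction term -log p(C|Z) of Equation (7) under a
    diagonal Gaussian over the DCT coefficients."""
    return 0.5 * torch.sum(
        logvar + math.log(2 * math.pi) + (C - mu).pow(2) / logvar.exp(),
        dim=(-2, -1),
    )

def total_loss(x_hat, x, C, mu_dec, logvar_dec, mu_z, logvar_z, lam):
    """Equation (8): average l1 discriminative loss minus lambda times the VLB."""
    disc = torch.mean(torch.abs(x_hat - x))
    neg_vlb = gaussian_nll(C, mu_dec, logvar_dec) + kl_to_standard_normal(mu_z, logvar_z)
    return disc + lam * neg_vlb.mean()
```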
Li et al,28 Mao et al,29 and Martinez et al30 all follow the same training and evaluation procedure: training their motion prediction model on six of the actors (five for training and one for cross-validation) for each action, and evaluating metrics on the final actor, subject 5. For easy comparison to these ID baselines, we maintain the same train, cross-validation, and test splits. However, we use the single most well-defined action (see Appendix A), walking, for training and cross-validation, and we report test error on all the remaining actions from subject 5. In this way, we conduct all parameter selections based on ID performance.

CMU motion capture (CMU-mocap)

The CMU dataset consists of five general classes of actions. Similar to References [27,29,57], we use eight detailed actions from these classes: "basketball," "basketball signal," "directing traffic," "jumping," "running," "soccer," "walking," and "window washing." We use two representations: a 64-dimensional vector that gives an exponential-map representation58 of the joint angles, and a 75-dimensional vector that gives the 3D Cartesian coordinates of 25 joints. We do not tune any hyperparameters on this dataset and use only a train and test set, with the same split as is common in the literature.29,30

Model configuration

We implemented the model in PyTorch59 using the Adam optimiser.60 The learning rate was set to 0.0005 for all experiments where, unlike Mao et al.,29,32 we did not decay the learning rate, as it was hypothesised that the dynamic relationship between the discriminative and generative losses would make this redundant. The batch size was 16. For numerical stability, gradients were clipped to a maximum $\ell_2$-norm of 1, and values of $\log\hat{\sigma}^2$ were clamped between −20 and 3. Code for all experiments is available at: https://github.com/bouracha/OoDMotion

Baseline comparison

Both Mao et al29 (GCN) and Mao et al32 (attention-GCN) use this same GCN architecture with DCT inputs.
In particular, Mao et al32 increase the amount of history accounted for by the GCN by adding a motion attention mechanism that weights the DCT coefficients from different sections of the history before they are input to the GCN. We compare against both of these baselines on OoD actions. For attention-GCN, we leave the attention mechanism preceding the GCN unchanged, such that the generative branch of the model reconstructs the weighted DCT inputs to the GCN and the whole network remains end-to-end differentiable.

Hyperparameter search

Since a new term has been introduced to the loss function, it was necessary to determine a sensible weighting between the discriminative and generative models. In Reference [33], this weighting was arbitrarily set to 0.1. It is natural that the optimum value here will relate to the other regularisation parameters of the model. Thus, we conducted random hyperparameter search over $p_{\mathrm{drop}}$ and $\lambda$, with $p_{\mathrm{drop}} \in [0, 0.5]$ on a linear scale and $\lambda \in [10^{-5}, 10]$ on a logarithmic scale. For fair comparison, we also conducted hyperparameter search on GCN, for values of the dropout probability ($p_{\mathrm{drop}}$) between 0.1 and 0.9. For each model, 25 experiments were run and the optimum values were selected on the lowest ID validation error. The hyperparameter search was conducted only for the GCN model on short-term predictions for the H3.6M dataset, and the selected values were used for all subsequent experiments, hence demonstrating the generalisability of the architecture.

Results

Consistent with the literature, we report short-term (<500 ms) and long-term (>500 ms) predictions. In comparison to GCN, we take short-term history into account (10 frames, 400 ms) for both datasets to predict both short- and long-term motion. In comparison to attention-GCN, we take long-term history (50 frames, 2 seconds) to predict the next 10 frames, and predict further into the future by recursively applying the predictions as input to the model, as in Reference [32].
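The recursive rollout just described can be sketched as follows; the stand-in predictor below is hypothetical, purely for illustration, and not the authors' implementation.

```python
def recursive_predict(model, history, n_steps, pred_len=10):
    """Long-term prediction by recursive rollout: a fixed-horizon
    predictor is repeatedly fed its own output as new history.

    `model` maps a window of frames (same length as `history`) to
    `pred_len` future frames.
    """
    seq = list(history)
    while len(seq) - len(history) < n_steps:
        window = seq[-len(history):]          # most recent frames as input
        seq.extend(model(window)[:pred_len])  # append the next predictions
    return seq[len(history):len(history) + n_steps]


def toy_model(window):
    # hypothetical stand-in predictor: continues the sequence linearly
    return [window[-1] + i + 1 for i in range(10)]


# 50 observed frames in, 25 future frames out via three recursive steps
long_term = recursive_predict(toy_model, history=list(range(50)), n_steps=25)
```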
In this way, a single short-term prediction model may produce long-term predictions.

We use the Euclidean distance between the predicted and ground-truth joint angles for the Euler angle representation. For the 3D joint coordinate representation, we use the MPJPE as used for training (Equation (5)). Table 1 reports the joint angle error for the short-term predictions on the H3.6M dataset. Here, we found the optimum hyperparameters to be $p_{\mathrm{drop}} = 0.5$ for GCN, and $\lambda = 0.003$ with $p_{\mathrm{drop}} = 0.3$ for our augmentation of GCN; the latter values were used for all subsequent experiments, except that for our augmentation of attention-GCN we removed dropout altogether. On average, our model performs convincingly better both ID and OoD. The generative branch thus works well both as a regulariser for small datasets and as a source of robustness to distributional shift. We see similar and consistent results for long-term predictions in Table 2.

TABLE 1 Short-term prediction of Euclidean distance between predicted and ground truth joint angles on H3.6M

              Walking (ID)        Eating (OoD)        Smoking (OoD)       Average (of 14 for OoD)
Milliseconds  160    320    400   160    320    400   160    320    400   160    320    400
GCN (OoD)     0.37   0.60   0.65  0.38   0.65   0.79  0.55   1.08   1.10  0.69   1.09   1.27
SD            0.008  0.008  0.01  0.01   0.03   0.04  0.01   0.02   0.02  0.02   0.04   0.04
Ours (OoD)    0.37   0.59   0.64  0.37   0.59   0.72  0.54   1.01   0.99  0.68   1.07   1.21
SD            0.004  0.03   0.03  0.01   0.03   0.04  0.01   0.01   0.02  0.01   0.01   0.02

Note: Each experiment was conducted three times; we report the mean and SD. Note that our results have lower variance. The full table is given in Table A1.
Bold values correspond to the best score for the respective simulation across the different models.

Abbreviations: GCN, graph convolutional network; OoD, out-of-distribution.

TABLE 2 Long-term prediction of Euclidean distance between predicted and ground truth joint angles on H3.6M

              Walking      Eating       Smoking      Discussion   Average
Milliseconds  560   1000   560   1000   560   1000   560   1000   560   1000
GCN (OoD)     0.80  0.80   0.89  1.20   1.26  1.85   1.45  1.88   1.10  1.43
Ours (OoD)    0.66  0.72   0.90  1.19   1.17  1.78   1.44  1.90   1.04  1.40

Note: Bold values correspond to the lowest values.

Abbreviations: GCN, graph convolutional network; OoD, out-of-distribution.

From Tables 3 and 4, we can see that the superior OoD performance generalises to the CMU dataset with the same hyperparameter settings, with a similar trend of the difference being larger for longer predictions, for both joint angles and 3D joint coordinates. For each of these experiments $n_z = 8$.

TABLE 3 Euclidean distance between predicted and ground truth joint angles on CMU

              Basketball (ID)               Basketball signal (OoD)       Average (of 7 for OoD)
Milliseconds  80    160   320   400   1000  80    160   320   400   1000  80    160   320   400   1000
GCN           0.40  0.67  1.11  1.25  1.63  0.27  0.55  1.14  1.42  2.18  0.36  0.65  1.41  1.49  2.17
Ours          0.40  0.66  1.12  1.29  1.76  0.28  0.57  1.15  1.43  2.07  0.34  0.62  1.35  1.41  2.10

Note: The full table is given in Table A2.
Bold values correspond to the best score for the respective simulation across the different models.

Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

TABLE 4 Mean per joint position error (MPJPE) between predicted and ground truth three-dimensional Cartesian coordinates of joints on CMU

              Basketball                     Basketball signal              Average (of 7 for OoD)
Milliseconds  80    160   320   400   1000   80    160   320   400   1000   80    160   320   400    1000
GCN (OoD)     15.7  28.9  54.1  65.4  108.4  14.4  30.4  63.5  78.7  114.8  20.0  43.8  86.3  105.8  169.2
Ours (OoD)    16.0  30.0  54.5  65.5  98.1   12.8  26.0  53.7  67.6  103.2  21.6  42.3  84.2  103.8  164.3

Note: The full table is given in Table A3.

Abbreviations: GCN, graph convolutional network; OoD, out-of-distribution.

Table 5 shows that the effectiveness of the generative branch generalises to the very recent motion attention architecture. For attention-GCN we used $n_z = 32$. Interestingly, short-term predictions are poorer here, but long-term predictions are consistently better. This supports our assertion that information relevant to generative mechanisms is more intrinsic to the causal model; thus, when the predicted output is used recursively, more useful information is available for the future predictions.

TABLE 5 Long-term prediction of three-dimensional joint positions on H3.6M

               Walking (ID)             Eating (OoD)               Smoking (OoD)             Average (of 14 for OoD)
Milliseconds   560   720   880   1000   560   720    880    1000   560   720   880    1000   560    720    880    1000
att-GCN (OoD)  55.4  60.5  65.2  68.7   87.6  103.6  113.2  120.3  81.7  93.7  102.9  108.7  112.1  129.6  140.3  147.8
Ours (OoD)     58.7  60.6  65.5  69.1   81.7  94.4   102.7  109.3  80.6  89.9  99.2   104.1  113.1  127.7  137.9  145.3

Note: Here ours is also trained with the attention-GCN model. The full table is given in Table A4.
Bold values correspond to the best score for the respective simulation across the different models.

Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

CONCLUSION

We draw attention to the need for robustness to distributional shifts in predicting human motion, and propose a framework for its evaluation based on major open-source datasets. We demonstrate that state-of-the-art discriminative architectures can be hardened to extreme distributional shifts by augmentation with a generative model, combining low in-distribution predictive error with maximal generalisability. Our investigation argues for wider use of generative models in behavioural modelling, and shows it can be performed with minimal or no performance penalty, within hybrid architectures of potentially diverse constitution. Further work could examine the surveyability of the latent space introduced by the VAE.

ACKNOWLEDGEMENTS

Anthony Bourached is funded by the UKRI UCL Centre for Doctoral Training in AI-enabled Healthcare Systems. Robert Gray, Ashwani Jha and Parashkev Nachev are funded by the Wellcome Trust (213038) and the NIHR UCL Biomedical Research Centre.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available; instructions at https://github.com/bouracha/OoDMotion.

REFERENCES

Geertsema EE, Thijs RD, Gutter T, et al. Automated video-based detection of nocturnal convulsive seizures in a residential care setting. Epilepsia. 2018;59:53-60.
Kakar M, Nyström H, Aarup LR, Nøttrup TJ, Olsen DR. Respiratory motion prediction by using the adaptive neuro fuzzy inference system (anfis). Phys Med Biol. 2005;50(19):4721-4728.
Chang C-Y, Lange B, Zhang M, et al. Towards pervasive physical rehabilitation using microsoft kinect. 2012 6th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops. IEEE; 2012:159-162. https://ieeexplore.ieee.org/abstract/document/6240377
Webster D, Celik O.
Systematic review of kinect applications in elderly care and stroke rehabilitation. J Neuroeng Rehabil. 2014;11(1):108.
Gui L-Y, Zhang K, Wang Y-X, Liang X, Moura JM, Veloso M. Teaching robots to predict human motion. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE; 2018:562-567. https://ieeexplore.ieee.org/abstract/document/8594452
Koppula H, Saxena A. Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation. International Conference on Machine Learning; 2013:792-800. https://proceedings.mlr.press/v28/koppula13.html
Koppula HS, Saxena A. Anticipating human activities for reactive robotic response. Tokyo: IROS; 2013:2071.
Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S. Social LSTM: human trajectory prediction in crowded spaces. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:961-971. https://openaccess.thecvf.com/content_cvpr_2016/html/Alahi_Social_LSTM_Human_CVPR_2016_paper.html
Bhattacharyya A, Fritz M, Schiele B. Long-term on-board prediction of people in traffic scenes under uncertainty. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018:4194-4202. https://openaccess.thecvf.com/content_cvpr_2018/html/Bhattacharyya_Long-Term_On-Board_Prediction_CVPR_2018_paper.html
Paden B, Čáp M, Yong SZ, Yershov D, Frazzoli E. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans Intell Veh. 2016;1(1):33-55. https://ieeexplore.ieee.org/abstract/document/7490340
Wang Y, Liu Z, Zuo Z, Li Z, Wang L, Luo X. Trajectory planning and safety assessment of autonomous vehicles based on motion prediction and model predictive control. IEEE Trans Veh Technol. 2019;68(9):8546-8556.
Švec P, Thakur A, Raboin E, Shah BC, Gupta SK. Target following with motion prediction for unmanned surface vehicle operating in cluttered environments. Autonomous Robots. 2014;36(4):383-405.
Lau RW, Chan A.
Motion prediction for online gaming. International Workshop on Motion in Games. Berlin/Heidelberg, Germany: Springer; 2008:104-114.
Rofougaran AR, Rofougaran M, Seshadri N, Ibrahim BB, Walley J, Karaoguz J. Game console and gaming object with motion prediction modeling and methods for use therewith. US Patent 9,943,760; 2018.
Shirai A, Geslin E, Richir S. Wiimedia: motion analysis methods and applications using a consumer video game controller. Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games. New York, NY: Association for Computing Machinery; 2007:133-140.
Grant J, Boukouvalas A, Griffiths R-R, Leslie D, Vakili S, De Cote EM. Adaptive sensor placement for continuous spaces. International Conference on Machine Learning. PMLR; 2019:2385-2393. https://proceedings.mlr.press/v97/grant19a.html
Kim D, Paik J. Gait recognition using active shape model and motion prediction. IET Comput Vis. 2010;4(1):25-36.
Ma Z, Wang X, Ma R, Wang Z, Ma J. Integrating gaze tracking and head-motion prediction for mobile device authentication: a proof of concept. Sensors. 2018;18(9):2894.
Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: MIT Press; 2009.
Lehrmann AM, Gehler PV, Nowozin S. Efficient nonlinear markov models for human motion. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014:1314-1321. https://openaccess.thecvf.com/content_cvpr_2014/html/Lehrmann_Efficient_Nonlinear_Markov_2014_CVPR_paper.html
Sutskever I, Hinton GE, Taylor GW. The recurrent temporal restricted boltzmann machine. Advances in Neural Information Processing Systems; 2009:1601-1608. https://www.cs.utoronto.ca/~hinton/absps/rtrbm.pdf
Taylor GW, Hinton GE, Roweis ST. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 2007:1345-1352.
https://proceedings.neurips.cc/paper/2006/file/1091660f3dff84fd648efe31391c5524-Paper.pdf
Aksan E, Kaufmann M, Hilliges O. Structured prediction helps 3D human motion modelling. Proceedings of the IEEE International Conference on Computer Vision; 2019:7144-7153.
Butepage J, Black MJ, Kragic D, Kjellstrom H. Deep representation learning for human motion prediction and classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:6158-6166.
Cai Y, Huang L, Wang Y, et al. Learning progressive joint propagation for human motion prediction. Proceedings of the European Conference on Computer Vision (ECCV); 2020.
Fragkiadaki K, Levine S, Felsen P, Malik J. Recurrent network models for human dynamics. Proceedings of the IEEE International Conference on Computer Vision; 2015:4346-4354.
Li C, Zhang Z, Lee WS, Lee GH. Convolutional sequence to sequence model for human dynamics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018:5226-5234.
Li M, Chen S, Zhao Y, Zhang Y, Wang Y, Tian Q. Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020:214-223.
Mao W, Liu M, Salzmann M, Li H. Learning trajectory dependencies for human motion prediction. Proceedings of the IEEE International Conference on Computer Vision; 2019:9489-9497.
Martinez J, Black MJ, Romero J. On human motion prediction using recurrent neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:2891-2900.
Amodei D, Olah C, Steinhardt J, Christiano P, Schulman J, Mané D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565. 2016.
Mao W, Liu M, Salzmann M. History repeats itself: human motion prediction via motion attention. ECCV. 2020.
Myronenko A. 3D MRI brain tumor segmentation using autoencoder regularization. International MICCAI Brainlesion Workshop.
Cham, Switzerland: Springer; 2018:311-320.
Ionescu C, Papava D, Olaru V, Sminchisescu C. Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell. 2013;36(7):1325-1339.
Gopalakrishnan A, Mali A, Kifer D, Giles L, Ororbia AG. A neural temporal model for human motion prediction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019:12116-12125.
Gui L-Y, Wang Y-X, Liang X, Moura JM. Adversarial geometry-aware human motion prediction. Proceedings of the European Conference on Computer Vision (ECCV); 2018:786-803.
Guo X, Choi J. Human motion prediction via learning local structure representations and temporal dependencies. Proc AAAI Conf Artif Intell. 2019;33:2580-2587.
Jain A, Zamir AR, Savarese S, Saxena A. Structural-RNN: deep learning on spatio-temporal graphs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:5308-5317.
Pavllo D, Grangier D, Auli M. Quaternet: a quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485. 2018.
Gossner O, Steiner J, Stewart C. Attention please! Econometrica. 2021;89(4):1717-1751.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444.
Hendrycks D, Gimpel K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. 2016.
Hendrycks D, Mazeika M, Dietterich T. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. 2018.
Liang S, Li Y, Srikant R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. 2017.
Daxberger E, Hernández-Lobato JM. Bayesian variational autoencoders for unsupervised out-of-distribution detection. arXiv preprint arXiv:1912.05651. 2019.
Nalisnick E, Matsukawa A, Teh YW, Gorur D, Lakshminarayanan B. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136.
2018.
Grathwohl W, Wang K-C, Jacobsen J-H, Duvenaud D, Norouzi M, Swersky K. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263. 2019.
Kendall A, Gal Y. What uncertainties do we need in bayesian deep learning for computer vision? Advances in Neural Information Processing Systems; 2017:5574-5584.
Kingma DP, Welling M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 2013.
Bourached A, Nachev P. Unsupervised videographic analysis of rodent behaviour. arXiv preprint arXiv:1910.11065. 2019.
Motegi Y, Hijioka Y, Murakami M. Human motion generative model using variational autoencoder. Int J Model Optim. 2018;8(1):8-12.
Chen N, Bayer J, Urban S, van der Smagt P. Efficient movement representation by embedding dynamic movement primitives in deep autoencoders. 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids). IEEE; 2015:434-440.
Cao Z, Hidalgo G, Simon T, Wei S-E, Sheikh Y. Openpose: realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008. 2018.
Rezende DJ, Mohamed S, Wierstra D. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning; 2014:1278-1286.
Kipf TN, Welling M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. 2016.
Ionescu C, Li F, Sminchisescu C. Latent structured models for human pose estimation. 2011 International Conference on Computer Vision. IEEE; 2011:2220-2227.
Li D, Rodriguez C, Yu X, Li H. Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2020.
Grassia FS. Practical parameterization of rotations using the exponential map. J Graph Tools. 1998;3(3):29-48.
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A.
Automatic differentiation in PyTorch. 2017. https://openreview.net/forum?id=BJJsrmfCZ
Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.

APPENDIX

The appendix consists of four parts. We provide a brief summary of each section below.

Appendix A: We provide results from our experimentation to determine the optimum way of defining separable distributions on the H3.6M and CMU datasets.

Appendix B: We provide the full versions of the tables shown in part in the main text.

Appendix C: We inspect the generative model by examining its latent space, and use it to consider the role that the generative model plays in learning, as well as possible directions of future work.

Appendix D: We provide larger diagrams of the architecture of the augmented GCN.

A.1. Appendix A: Discussion of the definition of out-of-distribution

Here, we describe in more detail the empirical motivation for our definition of out-of-distribution (OoD) on the H3.6M and CMU datasets.

Figure A1 shows the distribution of actions for the H3.6M and CMU datasets. We want our ID data to be small in quantity and narrow in domain. Since these datasets are labelled by action, we are provided with a natural choice of distribution, namely one of these actions. Moreover, it is desirable that the chosen action be quantifiably distinct from the other actions.

To determine which action supports these properties, we train a simple classifier to determine which action is most easily distinguished from the others based on the DCT inputs: $\mathrm{DCT}(\vec{x}_k) = \mathrm{DCT}(x_{k,1}, \ldots, x_{k,N}, x_{k,N+1}, \ldots, x_{k,N+T})$, where $x_{k,n} = x_{k,N}$ for $n \geq N$. We make no assumption on the architecture that would be optimal for determining the separation, and so use a simple fully connected model with four layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU).
The final layer uses a softmax to predict the class label, and cross-entropy on these logits is used as the loss function during training. We used ReLU activations with a dropout probability of 0.5.

We trained this model using the last 10 historic frames (N=10, T=10) with 20 DCT coefficients for both the H3.6M and CMU datasets, and additionally with (N=50, T=10) and 20 DCT coefficients for H3.6M (here we select only the 20 lowest-frequency DCT coefficients). We trained each model for 10 epochs with a batch size of 2048 and a learning rate of 0.00001. The confusion matrices for the H3.6M dataset are shown in Figures A2 and A3, respectively. Here, we use the same train set as outlined in Section 6.1. However, we report results on subject 11, which for motion prediction was used as the validation set. We did this because the number of instances is much greater than for subject 5, and no hyperparameter tuning was necessary. For the CMU dataset, we used the same train and test split as for all other experiments.

In both cases, for the H3.6M dataset, the classifier achieves the highest precision score (0.91 and 0.95, respectively) for the action walking, as well as recall scores of 0.83 and 0.81, respectively. Furthermore, in both cases the action walking together dominates both the false negatives for walking (50% and 44%, respectively) and the false positives (33% in each case).

FIGURE A1 (A) Distribution of short-term training instances for actions in H3.6M. (B) Distribution of training instances for actions in CMU.

The general increase in distinguishability that can be seen in Figure A3 increases the demand for robust handling of distributional shifts, as the distributions of values that represent different actions only become more pronounced as the time scale is increased.
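Under the configuration just described, the classifier can be sketched in PyTorch as follows; this is a sketch only, and initialisation details are assumptions.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Four fully connected layers (input -> 1024 -> 512 -> 128 -> n_classes)
    with ReLU activations and dropout 0.5, trained with cross-entropy on
    the output logits, as described in Appendix A."""

    def __init__(self, input_dim, n_classes=15):
        super().__init__()
        widths = [input_dim, 1024, 512, 128]
        layers = []
        for d_in, d_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(0.5)]
        layers.append(nn.Linear(widths[-1], n_classes))  # logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, K * n_dct) flattened DCT inputs
        return self.net(x)


# H3.6M configuration: 48 joints x 20 DCT coefficients = 960 inputs,
# 15 action classes; nn.CrossEntropyLoss applies the softmax internally.
clf = ActionClassifier(input_dim=48 * 20, n_classes=15)
logits = clf(torch.randn(8, 960))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 15, (8,)))
```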
This is true even with the naïve DCT transformation used to capture longer time scales without increasing the vector size.

As we can see from the confusion matrix in Figure A4, the actions in the CMU dataset are even more easily separable. In particular, our selected ID action in the paper, basketball, can be identified with 100% precision and recall on the test set.

FIGURE A2 Confusion matrix for a multiclass classifier for action labels. In each case, we use the same input convention $\vec{x}_k = (x_{k,1}, \ldots, x_{k,N}, x_{k,N+1}, \ldots, x_{k,N+T})$, where $x_{k,n} = x_{k,N}$ for $n \geq N$, such that the input to the classifier is 48 × 20 = 960. The classifier has four fully connected layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU). The final layer uses a softmax to predict the class label. Cross-entropy loss is used for training, with ReLU activations and a dropout probability of 0.5. We used a batch size of 2048 and a learning rate of 0.00001. H3.6M dataset. N=10, T=10. Number of discrete cosine transformation (DCT) coefficients = 20 (lossless transformation).

FIGURE A3 Confusion matrix for a multiclass classifier for action labels. In each case, we use the same input convention $\vec{x}_k = (x_{k,1}, \ldots, x_{k,N}, x_{k,N+1}, \ldots, x_{k,N+T})$, where $x_{k,n} = x_{k,N}$ for $n \geq N$, such that the input to the classifier is 48 × 20 = 960. The classifier has four fully connected layers. Layer 1: input dimensions × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU). The final layer uses a softmax to predict the class label. Cross-entropy loss is used for training, with ReLU activations and a dropout probability of 0.5. We used a batch size of 2048 and a learning rate of 0.00001. H3.6M dataset. N=50, T=10. Number of discrete cosine transformation (DCT) coefficients = 20, where the 40 highest-frequency DCT coefficients are culled.

FIGURE A4 Confusion matrix for a multiclass classifier for action labels.
In each case, we use the same input convention x_k = (x_{k,1}, …, x_{k,N}, x_{k,N+1}, …, x_{k,N+T}), where x_{k,n} = x_{k,N} for n ≥ N, so that in each case the input to the classifier has dimension 48 × 20 = 960. The classifier has four fully connected layers. Layer 1: input dimension × 1024; layer 2: 1024 × 512; layer 3: 512 × 128; layer 4: 128 × 15 (or 128 × 8 for CMU). The final layer uses a softmax to predict the class label. Cross-entropy loss is used for training, with ReLU activations and a dropout probability of 0.5. We used a batch size of 2048 and a learning rate of 0.00001.

CMU dataset: N = 10, T = 25. Number of discrete cosine transform (DCT) coefficients = 35 (a lossless transformation, since N + T = 35).

A.2. Appendix B: Full results

TABLE A1: Short-term prediction of Euclidean distance between predicted and ground truth joint angles on H3.6M. Values are mean (SD) over three runs.

| Action | Method | 80 ms | 160 ms | 320 ms | 400 ms |
|---|---|---|---|---|---|
| Walking (ID) | GCN | 0.22 (0.001) | 0.37 (0.008) | 0.60 (0.008) | 0.65 (0.01) |
| Walking (ID) | Ours | 0.23 (0.003) | 0.37 (0.004) | 0.59 (0.03) | 0.64 (0.03) |
| Eating (OoD) | GCN | 0.22 (0.003) | 0.38 (0.01) | 0.65 (0.03) | 0.79 (0.04) |
| Eating (OoD) | Ours | 0.21 (0.008) | 0.37 (0.01) | 0.59 (0.03) | 0.72 (0.04) |
| Smoking (OoD) | GCN | 0.28 (0.01) | 0.55 (0.01) | 1.08 (0.02) | 1.10 (0.02) |
| Smoking (OoD) | Ours | 0.28 (0.005) | 0.54 (0.01) | 1.01 (0.01) | 0.99 (0.02) |
| Discussion (OoD) | GCN | 0.29 (0.004) | 0.65 (0.01) | 0.98 (0.04) | 1.08 (0.04) |
| Discussion (OoD) | Ours | 0.31 (0.005) | 0.65 (0.009) | 0.97 (0.02) | 1.07 (0.01) |
| Directions (OoD) | GCN | 0.38 (0.01) | 0.59 (0.03) | 0.82 (0.05) | 0.92 (0.06) |
| Directions (OoD) | Ours | 0.38 (0.007) | 0.58 (0.02) | 0.79 (0.00) | 0.90 (0.05) |
| Greeting (OoD) | GCN | 0.48 (0.006) | 0.81 (0.01) | 1.25 (0.02) | 1.44 (0.02) |
| Greeting (OoD) | Ours | 0.49 (0.006) | 0.81 (0.005) | 1.24 (0.02) | 1.43 (0.02) |
| Phoning (OoD) | GCN | 0.58 (0.006) | 1.12 (0.01) | 1.52 (0.01) | 1.61 (0.01) |
| Phoning (OoD) | Ours | 0.57 (0.004) | 1.10 (0.003) | 1.52 (0.01) | 1.61 (0.01) |
| Posing (OoD) | GCN | 0.27 (0.01) | 0.59 (0.05) | 1.26 (0.1) | 1.53 (0.1) |
| Posing (OoD) | Ours | 0.33 (0.02) | 0.68 (0.05) | 1.25 (0.03) | 1.51 (0.03) |
| Purchases (OoD) | GCN | 0.62 (0.001) | 0.90 (0.001) | 1.34 (0.02) | 1.42 (0.03) |
| Purchases (OoD) | Ours | 0.62 (0.001) | 0.89 (0.002) | 1.23 (0.005) | 1.31 (0.01) |
| Sitting (OoD) | GCN | 0.40 (0.003) | 0.66 (0.007) | 1.15 (0.02) | 1.33 (0.03) |
| Sitting (OoD) | Ours | 0.39 (0.001) | 0.63 (0.001) | 1.05 (0.004) | 1.20 (0.005) |
| Sitting down (OoD) | GCN | 0.46 (0.01) | 0.94 (0.03) | 1.52 (0.04) | 1.69 (0.05) |
| Sitting down (OoD) | Ours | 0.40 (0.007) | 0.79 (0.009) | 1.19 (0.01) | 1.33 (0.02) |
| Taking photo (OoD) | GCN | 0.26 (0.005) | 0.53 (0.01) | 0.82 (0.01) | 0.93 (0.02) |
| Taking photo (OoD) | Ours | 0.26 (0.005) | 0.52 (0.01) | 0.81 (0.01) | 0.95 (0.01) |
| Waiting (OoD) | GCN | 0.29 (0.01) | 0.59 (0.03) | 1.06 (0.05) | 1.30 (0.05) |
| Waiting (OoD) | Ours | 0.29 (0.0007) | 0.58 (0.003) | 1.06 (0.001) | 1.29 (0.006) |
| Walking dog (OoD) | GCN | 0.52 (0.01) | 0.86 (0.02) | 1.18 (0.02) | 1.33 (0.03) |
| Walking dog (OoD) | Ours | 0.52 (0.006) | 0.88 (0.01) | 1.17 (0.008) | 1.34 (0.01) |
| Walking together (OoD) | GCN | 0.21 (0.005) | 0.44 (0.02) | 0.67 (0.03) | 0.72 (0.03) |
| Walking together (OoD) | Ours | 0.21 (0.01) | 0.44 (0.01) | 0.66 (0.01) | 0.74 (0.01) |
| Average of 14 (OoD) | GCN | 0.38 (0.007) | 0.69 (0.02) | 1.09 (0.04) | 1.27 (0.04) |
| Average of 14 (OoD) | Ours | 0.38 (0.006) | 0.68 (0.01) | 1.07 (0.01) | 1.21 (0.02) |

Note: Each experiment was conducted three times; we report the mean and standard deviation. Note that our results have lower variance.
Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

TABLE A2: Euclidean distance between predicted and ground truth joint angles on CMU.

| Action | Method | 80 ms | 160 ms | 320 ms | 400 ms | 1000 ms |
|---|---|---|---|---|---|---|
| Basketball (ID) | GCN | 0.40 | 0.67 | 1.11 | 1.25 | 1.63 |
| Basketball (ID) | Ours | 0.40 | 0.66 | 1.12 | 1.29 | 1.76 |
| Basketball signal (OoD) | GCN | 0.27 | 0.55 | 1.14 | 1.42 | 2.18 |
| Basketball signal (OoD) | Ours | 0.28 | 0.57 | 1.15 | 1.43 | 2.07 |
| Directing traffic (OoD) | GCN | 0.31 | 0.62 | 1.05 | 1.24 | 2.49 |
| Directing traffic (OoD) | Ours | 0.28 | 0.56 | 0.96 | 1.10 | 2.33 |
| Jumping (OoD) | GCN | 0.42 | 0.73 | 1.72 | 1.98 | 2.66 |
| Jumping (OoD) | Ours | 0.38 | 0.72 | 1.74 | 2.03 | 2.70 |
| Running (OoD) | GCN | 0.46 | 0.84 | 1.50 | 1.72 | 1.57 |
| Running (OoD) | Ours | 0.46 | 0.81 | 1.36 | 1.53 | 2.09 |
| Soccer (OoD) | GCN | 0.29 | 0.54 | 1.15 | 1.41 | 2.14 |
| Soccer (OoD) | Ours | 0.28 | 0.53 | 1.07 | 1.27 | 1.99 |
| Walking (OoD) | GCN | 0.40 | 0.61 | 0.97 | 1.18 | 1.85 |
| Walking (OoD) | Ours | 0.38 | 0.54 | 0.82 | 0.99 | 1.27 |
| Washing window (OoD) | GCN | 0.36 | 0.65 | 1.23 | 1.51 | 2.31 |
| Washing window (OoD) | Ours | 0.35 | 0.63 | 1.20 | 1.51 | 2.26 |
| Average of 7 (OoD) | GCN | 0.36 | 0.65 | 1.41 | 1.49 | 2.17 |
| Average of 7 (OoD) | Ours | 0.34 | 0.62 | 1.35 | 1.41 | 2.10 |

Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

TABLE A3: Mean per joint position error (MPJPE) between predicted and ground truth 3D Cartesian coordinates of joints on CMU.

| Action | Method | 80 ms | 160 ms | 320 ms | 400 ms | 1000 ms |
|---|---|---|---|---|---|---|
| Basketball (ID) | GCN | 15.7 | 28.9 | 54.1 | 65.4 | 108.4 |
| Basketball (ID) | Ours | 16.0 | 30.0 | 54.5 | 65.5 | 98.1 |
| Basketball signal (OoD) | GCN | 14.4 | 30.4 | 63.5 | 78.7 | 114.8 |
| Basketball signal (OoD) | Ours | 12.8 | 26.0 | 53.7 | 67.6 | 103.2 |
| Directing traffic (OoD) | GCN | 18.5 | 37.4 | 75.6 | 93.6 | 210.7 |
| Directing traffic (OoD) | Ours | 18.3 | 37.2 | 75.7 | 93.8 | 199.6 |
| Jumping (OoD) | GCN | 24.6 | 51.2 | 111.4 | 139.6 | 219.7 |
| Jumping (OoD) | Ours | 25.0 | 52.0 | 110.3 | 136.8 | 200.2 |
| Running (OoD) | GCN | 32.3 | 54.8 | 85.9 | 99.3 | 99.9 |
| Running (OoD) | Ours | 29.8 | 50.2 | 83.5 | 98.7 | 107.3 |
| Soccer (OoD) | GCN | 22.6 | 46.6 | 92.8 | 114.3 | 192.5 |
| Soccer (OoD) | Ours | 21.1 | 44.2 | 90.4 | 112.1 | 202.0 |
| Walking (OoD) | GCN | 10.8 | 20.7 | 42.9 | 53.4 | 86.5 |
| Walking (OoD) | Ours | 10.5 | 18.9 | 39.2 | 48.6 | 72.2 |
| Washing window (OoD) | GCN | 17.1 | 36.4 | 77.6 | 96.0 | 151.6 |
| Washing window (OoD) | Ours | 17.6 | 37.3 | 82.0 | 103.4 | 167.5 |
| Average of 7 (OoD) | GCN | 20.0 | 43.8 | 86.3 | 105.8 | 169.2 |
| Average of 7 (OoD) | Ours | 21.6 | 42.3 | 84.2 | 103.8 | 164.3 |

Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

TABLE A4: Long-term prediction of 3D joint positions on H3.6M.

| Action | Method | 560 ms | 720 ms | 880 ms | 1000 ms |
|---|---|---|---|---|---|
| Walking (ID) | Attention-GCN | 55.4 | 60.5 | 65.2 | 68.7 |
| Walking (ID) | Ours | 58.7 | 60.6 | 65.5 | 69.1 |
| Eating (OoD) | Attention-GCN | 87.6 | 103.6 | 113.2 | 120.3 |
| Eating (OoD) | Ours | 81.7 | 94.4 | 102.7 | 109.3 |
| Smoking (OoD) | Attention-GCN | 81.7 | 93.7 | 102.9 | 108.7 |
| Smoking (OoD) | Ours | 80.6 | 89.9 | 99.2 | 104.1 |
| Discussion (OoD) | Attention-GCN | 114.6 | 130.0 | 133.5 | 136.3 |
| Discussion (OoD) | Ours | 115.4 | 129.0 | 134.5 | 139.4 |
| Directions (OoD) | Attention-GCN | 107.0 | 123.6 | 132.7 | 138.4 |
| Directions (OoD) | Ours | 107.1 | 120.6 | 129.2 | 136.6 |
| Greeting (OoD) | Attention-GCN | 127.4 | 142.0 | 153.4 | 158.6 |
| Greeting (OoD) | Ours | 128.0 | 140.3 | 150.8 | 155.7 |
| Phoning (OoD) | Attention-GCN | 98.7 | 117.3 | 129.9 | 138.4 |
| Phoning (OoD) | Ours | 95.8 | 111.0 | 122.7 | 131.4 |
| Posing (OoD) | Attention-GCN | 151.0 | 176.0 | 189.4 | 199.6 |
| Posing (OoD) | Ours | 158.7 | 181.3 | 194.4 | 203.4 |
| Purchases (OoD) | Attention-GCN | 126.6 | 144.0 | 154.3 | 162.1 |
| Purchases (OoD) | Ours | 128.0 | 143.2 | 154.7 | 164.3 |
| Sitting (OoD) | Attention-GCN | 118.3 | 141.1 | 154.6 | 164.0 |
| Sitting (OoD) | Ours | 118.4 | 137.7 | 149.7 | 157.5 |
| Sitting down (OoD) | Attention-GCN | 136.8 | 162.3 | 177.7 | 189.9 |
| Sitting down (OoD) | Ours | 136.8 | 157.6 | 170.8 | 180.4 |
| Taking photo (OoD) | Attention-GCN | 113.7 | 137.2 | 149.7 | 159.9 |
| Taking photo (OoD) | Ours | 116.3 | 134.5 | 145.6 | 155.4 |
| Waiting (OoD) | Attention-GCN | 109.9 | 125.1 | 135.3 | 141.2 |
| Waiting (OoD) | Ours | 110.4 | 124.5 | 133.9 | 140.3 |
| Walking dog (OoD) | Attention-GCN | 131.3 | 146.9 | 161.1 | 171.4 |
| Walking dog (OoD) | Ours | 138.3 | 151.2 | 165.0 | 175.5 |
| Walking together (OoD) | Attention-GCN | 64.5 | 71.1 | 76.8 | 80.8 |
| Walking together (OoD) | Ours | 67.7 | 71.9 | 77.1 | 80.8 |
| Average of 14 (OoD) | Attention-GCN | 112.1 | 129.6 | 140.3 | 147.8 |
| Average of 14 (OoD) | Ours | 113.1 | 127.7 | 137.9 | 145.3 |

Note: Here ours is also trained with the attention-GCN model.
Abbreviations: GCN, graph convolutional network; ID, in-distribution; OoD, out-of-distribution.

A.3. Appendix C: Latent space of the VAE

One of the advantages of involving a generative model is that we obtain a latent variable representing a distribution over deterministic encodings of the data. We considered whether the VAE was learning anything interpretable with its latent variable, as was the case in Reference [55]. The purpose of this investigation was twofold. First, to determine whether the generative model was learning a comprehensive internal state, or merely a nonlinear average state, as is commonly seen when training VAE-like architectures; the outcome should suggest a key direction for future work. Second, an interpretable latent space could be of great value for future applications of human motion prediction. If reducing the latent space to an inspectable number of dimensions places actions or behaviours close together when they are kinematically or teleologically similar, as in Reference [50], then human experts may find wide application for an interpretation that is both quantifiable and qualitatively comparable to all other classes within their domain of interest. For example, a medical doctor may consider a patient to have unusual symptoms for some condition, say A; it may then be useful to know that the patient's deviation from a classic case of A is in the direction of another condition, say B.

We trained the augmented GCN model discussed in the main text with all actions, for both datasets. We use Uniform Manifold Approximation and Projection (UMAP) [61] to project the latent space of the trained GCN models onto two dimensions, over all samples in the dataset, for each dataset independently.
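The projection step just described can be sketched as follows. This is a minimal illustration, not the authors' code: the paper uses UMAP (via the umap-learn package) with default hyperparameters, while this sketch substitutes scikit-learn's PCA as a dependency-light stand-in for the 2D projection, and uses random stand-in data in place of the real VAE latent codes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the VAE latent codes: one 384-dim vector per sample
# (the H3.6M setting; CMU uses 512 latent dimensions).
rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 384))

# The paper projects with UMAP (default hyperparameters), roughly:
#   import umap
#   embedding = umap.UMAP(n_components=2).fit_transform(latents)
# PCA stands in here so the sketch runs without umap-learn installed.
embedding = PCA(n_components=2).fit_transform(latents)
print(embedding.shape)  # (1000, 2)
```

The resulting (n_samples, 2) array is what is scattered in Figure A5, with one point per sample, coloured by action.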
From Figure A5, we can see that for both models the 2D projection closely resembles a spherical Gaussian. Furthermore, Figure A5B shows that the action walking does not occupy a discernible region of the latent space. This result is further verified by the same classifier described above, which achieved no better than chance when given the latent variables as input rather than the raw data. This implies that the benefit of using the generative model observed in the main text is significant even when the generative model itself performs poorly: in this case, the reconstructions are evidently not good enough to distinguish between actions. It is hence natural for future work to investigate whether the improvement in OoD performance is greater when training is arranged so that the generative model performs well. There are multiple avenues through which such an objective might be achieved; pre-training the generative model is one salient candidate.

FIGURE A5: Latent embedding of the trained model on both the H3.6M and CMU datasets, independently projected into 2D using UMAP (default hyperparameters) from 384 dimensions for H3.6M and 512 dimensions for CMU. (A) H3.6M, all actions, opacity = 0.1. (B) H3.6M, all actions in blue (opacity = 0.1), walking in red (opacity = 1). (C) CMU, all actions in blue, opacity = 0.1.

A.4. Appendix D: Architecture diagrams

FIGURE A6: Network architecture with discriminative and variational autoencoder (VAE) branches.

FIGURE A7: Graph convolutional layer (GCL) and a residual graph convolutional block (GCB).
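As a rough indication of what a GCL and a residual GCB compute, the following numpy sketch implements a graph convolutional layer that mixes node features through a learnable adjacency matrix, stacked into a residual block. The layer sizes, the tanh nonlinearity, and the random (untrained) weights are assumptions for illustration, common in GCN-based motion-prediction architectures, not the authors' exact implementation.

```python
import numpy as np

def gcl(X, A, W, activation=np.tanh):
    """Graph convolutional layer: A mixes information across the K
    graph nodes, W mixes across the F feature channels."""
    return activation(A @ X @ W)

def gcb(X, A, W1, W2):
    """Residual graph convolutional block: two stacked GCLs plus a
    skip connection; the feature width F is kept constant."""
    return X + gcl(gcl(X, A, W1), A, W2)

K, F = 48, 256  # joint-angle nodes, hidden feature width (assumed)
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(K, K))   # adjacency: learnable in training, random here
W1 = rng.normal(scale=0.1, size=(F, F))
W2 = rng.normal(scale=0.1, size=(F, F))
X = rng.normal(size=(K, F))              # per-node feature matrix

Y = gcb(X, A, W1, W2)
print(Y.shape)  # (48, 256)
```

Because the block preserves the (K, F) shape, many such blocks can be chained, which is what the residual stacking in Figure A7 depicts.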

Journal: Applied AI Letters (Wiley)

Published: Apr 1, 2022

Keywords: deep learning; generative models; human motion prediction; variational autoencoders
