Models Genesis

Medical Image Analysis (2020)
Journal homepage: www.elsevier.com/locate/media

Zongwei Zhou^a, Vatsal Sodha^b, Jiaxuan Pang^b, Michael B. Gotway^c, Jianming Liang^{a,*}

^a Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
^b School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281, USA
^c Department of Radiology, Mayo Clinic, Scottsdale, AZ 85259, USA

ARTICLE INFO

Article history: Received 1 May 2019
2000 MSC: 41A05, 41A10, 65D05, 65D17
Keywords: 3D Deep Learning, Representation Learning, Transfer Learning, Self-supervised Learning

ABSTRACT

Transfer learning from natural images to medical images has been established as one of the most practical paradigms in deep learning for medical image analysis. To fit this paradigm, however, 3D imaging tasks in the most prominent imaging modalities (e.g., CT and MRI) have to be reformulated and solved in 2D, losing rich 3D anatomical information and thereby inevitably compromising performance. To overcome this limitation, we have built a set of models, called Generic Autodidactic Models, nicknamed Models Genesis, because they are created ex nihilo (with no manual labeling), self-taught (learnt by self-supervision), and generic (serving as source models for generating application-specific target models). Our extensive experiments demonstrate that our Models Genesis significantly outperform learning from scratch and existing pre-trained 3D models in all five target 3D applications covering both segmentation and classification. More importantly, learning a model from scratch simply in 3D may not necessarily yield performance better than transfer learning from ImageNet in 2D, but our Models Genesis consistently top any 2D/2.5D approaches, including fine-tuning the models pre-trained from ImageNet as well as fine-tuning the 2D versions of our Models Genesis, confirming the importance of 3D anatomical information and the significance of Models Genesis for 3D medical imaging. This performance is attributed to our unified self-supervised learning framework, built on a simple yet powerful observation: the sophisticated and recurrent anatomy in medical images can serve as strong yet free supervision signals for deep models to learn common anatomical representation automatically via self-supervision. As open science, all code and pre-trained Models Genesis are available at https://github.com/MrGiovanni/ModelsGenesis.

© 2020 Elsevier B. V. All rights reserved.

1. Introduction

Transfer learning from natural images to medical images has become the de facto standard in deep learning for medical image analysis (Tajbakhsh et al., 2016; Shin et al., 2016), but given the marked differences between natural images and medical images, we hypothesize that transfer learning can yield more powerful (application-specific) target models from the source models built directly using medical images. To test this hypothesis, we have chosen chest imaging because the chest contains several critical organs, which are prone to a number of diseases that result in substantial morbidity and mortality and are hence associated with significant health-care costs. In this research, we focus on Chest CT, because of its prominent role in diagnosing lung diseases, and because our research community has accumulated several Chest CT image databases, for instance, LIDC-IDRI (Armato III et al., 2011) and NLST (NLST, 2011), containing a large number of Chest CT images.
However, systematically annotating Chest CT scans is not only tedious, laborious, and time-consuming, but it also demands costly, specialty-oriented skills, which are not easily accessible. Therefore, we seek to answer the following question: Can we utilize the large number of available Chest CT images without systematic annotation to train source models that can yield high-performance target models via transfer learning?

* Corresponding author: Jianming.Liang@asu.edu (Jianming Liang)
arXiv:2004.07882v4 [cs.CV] 16 Dec 2020. Zongwei Zhou et al. / Medical Image Analysis (2020)

Table 1: Pre-trained models with proxy tasks and target tasks. This paper uses transfer learning in a broader sense, where a source model is first trained to learn image representation via full supervision or self-supervision by solving a problem, called a proxy task (general or application-specific), on a source dataset with expert-provided or automatically generated labels, and then this pre-trained source model is fine-tuned (transferred) through full supervision to yield a target model to solve application-specific problems (target tasks) in the same or different datasets (target datasets). We refer to transfer learning as same-domain transfer learning when the models are pre-trained and fine-tuned within the same domain (modality, organ, disease, or dataset), and as cross-domain transfer learning when the models are pre-trained in one domain and fine-tuned for a different domain.

| Pre-trained model | Modality | Source dataset | Superv. / Annot. | Proxy task |
| Genesis Chest CT 2D | CT | LUNA 2016 (Setio et al., 2017) | Self / 0 | Image restoration on 2D Chest CT slices |
| Genesis Chest CT (3D) | CT | LUNA 2016 (Setio et al., 2017) | Self / 0 | Image restoration on 3D Chest CT volumes |
| Genesis Chest X-ray (2D) | X-ray | ChestX-ray8 (Wang et al., 2017b) | Self / 0 | Image restoration on 2D Chest Radiographs |
| Models ImageNet | Natural | ImageNet (Deng et al., 2009) | Full / 14M images | Image classification on 2D ImageNet |
| Inflated 3D (I3D) | Natural | Kinetics (Carreira and Zisserman, 2017) | Full / 240K videos | Action recognition on human action videos |
| NiftyNet | CT | Pancreas-CT & BTCV (Gibson et al., 2018a) | Full / 90 cases | Organ segmentation on abdominal CT |
| MedicalNet | CT, MRI | 3DSeg-8 (Chen et al., 2019b) | Full / 1,638 cases | Disease/organ segmentation on 8 datasets |

| Code | Object | Modality | Target dataset | Target task |
| NCC | Lung Nodule | CT | LUNA 2016 (Setio et al., 2017) | Lung nodule false positive reduction |
| NCS | Lung Nodule | CT | LIDC-IDRI (Armato III et al., 2011) | Lung nodule segmentation |
| ECC | Pulmonary Emboli | CT | PE-CAD (Tajbakhsh et al., 2015) | Pulmonary embolism false positive reduction |
| LCS | Liver | CT | LiTS 2017 (Bilic et al., 2019) | Liver segmentation |
| BMS | Brain Tumor | MRI | BraTS 2018 (Menze et al., 2015; Bakas et al., 2018) | Brain tumor segmentation |

In each code, the first letter denotes the object of interest ("N" for lung nodule, "E" for pulmonary embolism, "L" for liver, etc.); the second letter denotes the modality ("C" for CT, "M" for MRI, etc.); the last letter denotes the task ("C" for classification, "S" for segmentation).

To answer this question, we have developed a framework that trains generic source models for 3D medical imaging. Our framework is autodidactic, eliminating the need for labeled data by self-supervision; robust, learning comprehensive image representation from a mixture of self-supervised tasks; scalable, consolidating a variety of self-supervised tasks into a single image restoration task with the same encoder-decoder architecture; and generic, benefiting a range of 3D medical imaging tasks through transfer learning. We call the models trained with our framework Generic Autodidactic Models, nicknamed Models Genesis, and refer to the model trained using Chest CT images as Genesis Chest CT. As an ablation study, we have also trained a downgraded 2D version using 2D Chest CT slices, called Genesis Chest CT 2D. For thorough performance comparisons, we have trained a 2D model using Chest X-ray images, named Genesis Chest X-ray (detailed in Table 1).

Naturally, 3D imaging tasks in the most prominent medical imaging modalities (e.g., CT and MRI) should be solved directly in 3D, but 3D models generally have significantly more parameters than their 2D counterparts, thus demanding more labeled data for training. As a result, learning from scratch simply in 3D may not necessarily yield performance better than fine-tuning Models ImageNet (i.e., pre-trained models on ImageNet), as revealed in Fig. 7. However, as demonstrated by our extensive experiments in Sec. 3, our Genesis Chest CT not only significantly outperforms learning 3D models from scratch (see Fig. 4), but also consistently tops any 2D/2.5D approaches, including fine-tuning Models ImageNet as well as fine-tuning our Genesis Chest X-ray and Genesis Chest CT 2D (see Fig. 7 and Table 4). Furthermore, Genesis Chest CT surpasses publicly available, pre-trained, (fully) supervised 3D models (see Table 3). Our results confirm the importance of 3D anatomical information and demonstrate the significance of Models Genesis for 3D medical imaging.

This performance is attributable to the following key observation: medical imaging protocols typically focus on particular parts of the body for specific clinical purposes, resulting in images of similar anatomy. The sophisticated yet recurrent anatomy offers consistent patterns for self-supervised learning to discover a common representation of a particular body part (the lungs in our case). As illustrated in Fig. 1, the fundamental idea behind our self-supervised learning method is to recover anatomical patterns from images transformed in various ways within a unified framework.

In summary, we make the following three contributions:

1. A collection of generic pre-trained 3D models, performing effectively across diseases, organs, and modalities.
2. A scalable self-supervised learning framework, offering an encoder for classification and an encoder-decoder for segmentation.
3. A set of self-supervised training schemes, learning robust representation from multiple perspectives.

In the remainder of this paper, we first introduce in Sec. 2 our self-supervised learning framework for training Models Genesis, covering our four proposed image transformations with their learning perspectives, and describing the four unique properties of our Models Genesis. Sec. 3 details the training process of Models Genesis and the five target tasks for evaluating Models Genesis, while Sec. 4 summarizes the five major observations from our extensive experiments, demonstrating that our Models Genesis can serve as a primary source of transfer learning for 3D medical imaging. In Sec. 5, we discuss various aspects of Models Genesis, including their relationship with automated data augmentation, their impact on the creation of a medical ImageNet, and their capabilities for same- and cross-domain transfer learning, followed in Sec. 6 by a thorough review of existing supervised and self-supervised representation learning approaches in medical imaging. Finally, Sec. 7 concludes and outlines future extensions of Models Genesis.

Fig. 1: [Better viewed on-line, in color, and zoomed in for details] Our self-supervised learning framework aims to learn general-purpose image representation by recovering the original sub-volumes of images from their transformed ones. We first crop an arbitrarily sized sub-volume $x_i$ at a random location from an unlabeled CT image. Each sub-volume $x_i$ can undergo at most three out of four transformations: non-linear, local-shuffling, outer-cutout, and inner-cutout, resulting in a transformed sub-volume $\tilde{x}_i$. It should be noted that outer-cutout and inner-cutout are considered mutually exclusive. Therefore, in addition to the four original individual transformations, this process yields eight more transformations, including one identity mapping ($\phi$, meaning none of the four individual transformations is selected) and seven combined transformations. A Model Genesis, an encoder-decoder architecture with skip connections in between, is trained to learn a common image representation by restoring the original sub-volume $x_i$ (as ground truth) from the transformed one $\tilde{x}_i$ (as input), in which the reconstruction loss (MSE) is computed between the model prediction $x'_i$ and the ground truth $x_i$. Once trained, the encoder alone can be fine-tuned for target classification tasks, while the encoder and decoder together can be fine-tuned for target segmentation tasks.

2. Models Genesis

The objective of Models Genesis is to learn a common image representation that is transferable and generalizable across diseases, organs, and modalities. Fig. 1 depicts our self-supervised learning framework, which enables training 3D models from scratch using unlabeled images and consists of three steps: (1) cropping sub-volumes from patient CT images, (2) deforming the sub-volumes, and (3) training a model to restore the original sub-volume. In the following sections, we first introduce the notation of our self-supervised learning framework and then detail each of the training schemes with its learning objectives and perspectives, followed by a summary of the four unique properties of our Models Genesis.

2.1. Models Genesis learn by image restoration

Given a raw dataset consisting of $N$ patient volumes, in theory we can crop an unlimited number of sub-volumes from the dataset. In practice, we randomly generate a subset $X = \{x_1, x_2, \ldots, x_n\}$, which includes $n$ sub-volumes, and then apply an image transformation function to these sub-volumes, yielding

$$\tilde{X} = f(X), \tag{1}$$

where $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n\}$ and $f(\cdot)$ denotes a transformation function. Subsequently, a Model Genesis, being an encoder-decoder network with skip connections in between, will learn to approximate the function $g(\cdot)$, which aims to map the transformed sub-volumes $\tilde{X}$ back to their original ones $X$, that is,

$$g(\tilde{X}) = X = f^{-1}(\tilde{X}). \tag{2}$$

To avoid heavyweight dedicated towers for each proxy task and to maximize parameter sharing in Models Genesis, we consolidate four self-supervised schemes into a single image restoration task, enabling models to learn robust image representation by restoring from various sets of image transformations. Our proposed framework includes four transformations: (1) non-linear, (2) local-shuffling, (3) outer-cutout, and (4) inner-cutout. Each transformation is independently applied to a sub-volume with a predefined probability, while outer-cutout and inner-cutout are considered mutually exclusive. Consequently, each sub-volume can undergo at most three of the above transformations, resulting in twelve possible transformed sub-volumes (see step 2 in Fig. 1). For clarity, we further define a training scheme as the process that (1) transforms sub-volumes using any of the aforementioned transformations, and (2) trains a model to restore the original sub-volumes from the transformed ones. For convenience, we refer to an individual training scheme as the scheme using one particular individual transformation. We should emphasize that our ultimate goal is not the task of image restoration per se. While restoring images is advocated and investigated as a training scheme for models to learn image representation, the usefulness of the learned representation must be assessed objectively based on its generalizability and transferability to various target tasks.

Fig. 2: Illustration of the proposed image transformations and their learning perspectives. For simplicity and clarity, we illustrate the transformation on a 2D CT slice, but our Genesis Chest CT is trained directly using 3D sub-volumes, which are transformed in a 3D manner. For ease of understanding, in (a) non-linear transformation, we have displayed an image undergoing different transformation functions in Columns 2-7; in (b) local-shuffling, (c) outer-cutout, and (d) inner-cutout transformation, we have illustrated each of the processes step by step in Columns 2-6, where the first and last columns denote the original images and the final transformed images, respectively. In local-shuffling, a different window $W$ is automatically generated and used in each step. We provide the implementation details in Sec. 2.2 and more visualizations in Fig. D.13.

2.2. Models Genesis learn from multiple perspectives

1) Learning appearance via non-linear transformation. We propose a novel self-supervised training scheme based on non-linear intensity transformation, with which the model learns to restore the intensity values of an input image transformed with a set of non-linear functions. The rationale is that the absolute intensity values (i.e., Hounsfield units) in CT scans or relative intensity values in other imaging modalities convey important information about the underlying structures and organs (Buzug, 2011; Forbes, 2012). Hence, this training scheme enables the model to learn the appearance of the anatomic structures present in the images. In order to keep the appearance of the anatomic structures perceivable, we intentionally keep the non-linear intensity transformation function monotonic, so that pixels of different values are assigned new distinct values. To realize this idea, we use the Bézier curve (Mortenson, 1999), a smooth and monotonic transformation function, which is generated from two end points ($P_0$ and $P_3$) and two control points ($P_1$ and $P_2$), defined as:

$$B(t) = (1-t)^3 P_0 + 3(1-t)^2 t P_1 + 3(1-t) t^2 P_2 + t^3 P_3, \quad t \in [0, 1], \tag{3}$$

where $t$ is a fractional value along the length of the line. In Fig. 2(a), we illustrate the original CT sub-volume (the left-most column) and its transformed ones based on different transformation functions. The corresponding transformation functions are shown in the top row. Notice that, when $P_0 = P_1$ and $P_2 = P_3$, the Bézier curve is a linear function (shown in Columns 2, 5). Besides, we set $P_0 = (0, 0)$ and $P_3 = (1, 1)$ to get an increasing function (shown in Columns 2-4) and the opposite to get a decreasing function (shown in Columns 5-7). The control points are randomly generated for more variance (shown in Columns 3, 4, 6, 7). Before applying the transformation functions, in Genesis CT, we first clip the Hounsfield unit values to the range $[-1000, 1000]$ and then normalize each CT scan to $[0, 1]$, while in Genesis X-ray, we directly normalize each X-ray to $[0, 1]$ without intensity clipping.

2) Learning texture via local pixel shuffling. We propose local pixel shuffling to enrich local variations of a sub-volume without dramatically compromising its global structures, which encourages the model to learn the local boundaries and textures of objects. To be specific, for each input sub-volume, we randomly select 1,000 windows and then shuffle the pixels inside each window sequentially. Mathematically, let us consider a small window $W$ with a size of $m \times n$. The local-shuffling acts on each window and can be formulated as

$$\tilde{W} = P \times W \times P', \tag{4}$$

where $\tilde{W}$ is the transformed window, and $P$ and $P'$ denote permutation matrices of size $m \times m$ and $n \times n$, respectively. Pre-multiplying $W$ with $P$ permutes the rows of the window $W$, whereas post-multiplying $W$ with $P'$ permutes the columns of the window $W$. The size of the local window determines the difficulty of the proxy task. In practice, to preserve the global content of the image, we keep the window sizes smaller than the receptive field of the network, so that the network can learn much more robust image representation by "resetting" the original pixel positions. Note that our method is quite different from PatchShuffling (Kang et al., 2017), which is a regularization technique to avoid over-fitting. Unlike denoising (Vincent et al., 2010) and in-painting (Pathak et al., 2016; Iizuka et al., 2017), our local-shuffling transformation does not intend to replace the pixel values with noise, and therefore preserves a global distribution identical to that of the original sub-volume. In addition, local-shuffling within an extent keeps the objects perceivable, as shown in Fig. 2(b), benefiting the deep neural network in learning local invariant image representations, which serves as a complementary perspective to global patch shuffling (Chen et al., 2019a) (discussed in depth in Appendix C.1).

3) Learning context via outer and inner cutouts. We devise outer-cutout as a new training scheme for self-supervised learning. To realize it, we generate an arbitrary number ($\leq 10$) of windows, with various sizes and aspect ratios, and superimpose them on top of each other, resulting in a single window of a complex shape. When applying this merged window to a sub-volume, we leave the sub-volume region inside the window exposed and mask its surrounding (i.e., outer-cutout) with a random number. Moreover, to prevent the task from being too difficult or even unsolvable, we extensively search for the optimal size of cutout regions spanning from 0% to 90%, incremented by 10% (detailed study presented in Appendix C.3). In the end, we limit the outer-cutout region to be less than 1/4 of the whole sub-volume. By restoring the outer-cutouts, the model will learn the global geometry and spatial layout of organs in medical images via extrapolating within each sub-volume. We have illustrated this process step by step in Fig. 2(c). The first and last columns denote the original sub-volumes and the final transformed sub-volumes, respectively.

Our self-supervised learning framework also utilizes inner-cutout as a training scheme, where we mask the inner window regions (i.e., inner-cutouts) and leave their surroundings exposed. By restoring the inner-cutouts, the model will learn local continuities of organs in medical images via interpolating within each sub-volume. Unlike Pathak et al. (2016), where in-painting is proposed as a proxy task by restoring only the central region of the image, we restore the entire sub-volume as the model output. Examples of inner-cutout are illustrated in Fig. 2(d). Following the suggestion from Pathak et al. (2016), the inner-cutout areas are limited to be less than 1/4 of the whole sub-volume, in order to keep the task reasonably difficult.

2.3. Models Genesis have several unique properties

1) Autodidactic: requiring no manual labeling. Models Genesis are trained in a self-supervised manner with abundant unlabeled image datasets, demanding zero expert annotation effort. Consequently, Models Genesis are fundamentally different from traditional (fully) supervised transfer learning from ImageNet (Shin et al., 2016; Tajbakhsh et al., 2016), which offers modest benefits to 3D medical imaging applications, as well as from the existing pre-trained, fully supervised models including I3D (Carreira and Zisserman, 2017), NiftyNet (Gibson et al., 2018b), and MedicalNet (Chen et al., 2019b), which demand a large volume of annotation effort to obtain the source models (statistics given in Table 1). To the best of our knowledge, this work represents the first effort to establish publicly available, autodidactic models for 3D medical image analysis.

2) Robust: learning from multiple perspectives. Our combined approach trains Models Genesis from multiple perspectives (appearance, texture, context, etc.), leading to more robust models across all target tasks, as evidenced in Fig. 3, where our combined approach is compared with our individual schemes. This eclectic approach, incorporating multiple tasks into a single image restoration task, empowers Models Genesis to learn more comprehensive representation. Most self-supervised methods devise isolated training schemes to learn from specific perspectives (learning intensity value via colorization, context information via Jigsaw, orientation via rotation, etc.), and these methods are reported with mixed results on different tasks in review papers such as Goyal et al. (2019), Kolesnikov et al. (2019), Taleb et al. (2020), and Jing and Tian (2020). This matters because a multitude of state-of-the-art results in the literature show the importance of using compositions of more than one transformation per image (Graham, 2014; Dosovitskiy et al., 2015; Wu et al., 2020), which has also been experimentally confirmed in our image restoration task.

3) Scalable: accommodating many training schemes. Consolidated into a single image restoration task, our novel self-supervised schemes share the same encoder and decoder during training. Had each task required its own decoder, due to limited memory on GPUs, our framework would have failed to accommodate a large number of self-supervised tasks. By unifying all tasks as a single image restoration task, any favorable transformation can be easily added to our framework, overcoming the scalability issue associated with multi-task learning (Doersch and Zisserman, 2017; Noroozi et al., 2018; Standley et al., 2019; Chen et al., 2019b), where the network heads are subject to the specific proxy tasks.

4) Generic: yielding diverse applications. Models Genesis, trained via a diverse set of self-supervised schemes, learn a general-purpose image representation that can be leveraged for a wide range of target tasks. Specifically, Models Genesis can be utilized to initialize the encoder for the target classification tasks and to initialize the encoder-decoder for the target segmentation tasks, while the existing self-supervised approaches are largely focused on providing encoder models only (Jing and Tian, 2020). As shown in Table 3, Models Genesis can be generalized across diseases (e.g., nodule, embolism, tumor), organs (e.g., lung, liver, brain), and modalities (e.g., CT and MRI), a generic behavior that sets us apart from all previous works in the literature, where the representation is learned via a specific self-supervised task and thus lacks generality.

3. Experiments

3.1. Pre-training Models Genesis

Our Models Genesis are pre-trained from 623 Chest CT scans in LUNA 2016 (Setio et al., 2017) in a self-supervised manner. We decided not to use all 888 scans provided by this dataset in order to avoid test-image leaks between proxy and target tasks, so that we can confidently use the rest of the images solely for testing Models Genesis as well as the target models, even though Models Genesis are trained from only unlabeled images, involving no annotation shipped with the dataset. We first randomly crop sub-volumes, sized 64 × 64 × 32 pixels, from different locations. To extract more informative sub-volumes for training, we then intentionally exclude those which are empty (air) or contain full tissues. Our Models Genesis 2D are self-supervised pre-trained from LUNA 2016 (Setio et al., 2017) and ChestX-ray14 (Wang et al., 2017b) using 2D CT slices in an axial view and X-ray images, respectively. For all proxy tasks and target tasks, the raw image intensities were normalized to the [0, 1] range before training. We use the mean square error (MSE) between the restored output and the original image as the objective function for the proxy task of image restoration. As suggested by Pathak et al. (2016) and Chen et al. (2019a), the MSE loss is sufficient for representation learning, although the restored images may be blurry.

When pre-training Models Genesis, we apply each of the transformations on sub-volumes with a pre-defined probability. As a result, the model will encounter not only the transformed sub-volumes as input, but also the original sub-volumes. This design offers two advantages:

• First, the model must distinguish original versus transformed images, discriminate transformation type(s), and restore images if transformed. Our self-supervised learning framework, therefore, results in pre-trained models that are capable of handling versatile tasks.
• Second, since original images are presented in the proxy task, the semantic difference of input images between the proxy and target task becomes smaller. As a result, the pre-trained model can be transferred to process regular/normal images in a broad variety of target tasks.

3.2. Fine-tuning Models Genesis

The pre-trained Models Genesis can be adapted to new imaging tasks through transfer learning or fine-tuning. There are three major transfer learning scenarios: (1) employing the encoder as a fixed feature extractor for a new dataset and following up with a linear classifier (e.g., Linear SVM or Softmax classifier), (2) taking the pre-trained encoder and appending a sequence of fully-connected (fc) layers for target classification tasks, and (3) taking the pre-trained encoder and decoder and replacing the last layer with a 1 × 1 × 1 convolutional layer for target segmentation tasks. For scenarios (2) and (3), it is possible to fine-tune all the layers of the model or to keep some of the earlier layers fixed, only fine-tuning some higher-level portion of the model. We have evaluated the performance of our self-supervised representation for transfer learning by fine-tuning all layers in the network. In the following, we examine Models Genesis on five distinct medical applications, covering classification and segmentation tasks in CT and MRI images with varying levels of semantic distance from the source (Chest CT) to the targets in terms of organs, diseases, and modalities (see Table 2), to investigate the transferability of Models Genesis.

Table 2: Genesis CT is pre-trained on only the LUNA 2016 dataset (i.e., the source) and then fine-tuned for five distinct medical image applications (i.e., the targets). These target tasks are selected such that they show varying levels of semantic distance from the source, in terms of organs, diseases, and modalities, allowing us to investigate the transferability of the pre-trained weights of Genesis CT with respect to the domain distance. The cells checked by ✗ denote the properties that are different between the source and target datasets.

| Task | Disease | Organ | Dataset | Modality |
| NCC | | | | |
| NCS | | | | |
| ECC | ✗ | | ✗ | |
| LCS | ✗ | ✗ | ✗ | |
| BMS | ✗ | ✗ | ✗ | ✗ |

3.2.1. Lung nodule false positive reduction (NCC)

The dataset is provided by LUNA 2016 (Setio et al., 2017) and consists of 888 low-dose lung CTs with slice thickness less than 2.5 mm. Patients are randomly assigned into a training set (445 cases), a validation set (178 cases), and a test set (265 cases). The dataset offers the annotations for a set of 5,510,166 candidate locations for the false positive reduction task, wherein true positives are labeled as "1" and false positives are labeled as "0". Following prior works (Setio et al., 2016; Sun et al., 2017c), we evaluate performance via the Area Under the Curve (AUC) score on classifying true positives and false positives.

3.2.2. Lung nodule segmentation (NCS)

The dataset is provided by the Lung Image Database Consortium image collection (LIDC-IDRI) (Armato III et al., 2011) and consists of 1,018 cases collected by seven academic centers and eight medical imaging companies. The cases were split into training (510), validation (100), and test (408) sets. Each case is a 3D CT scan and the nodules have been marked as volumetric binary masks. We have re-sampled the volumes to 1-1-1 spacing and then extracted a 64 × 64 × 32 crop around each nodule. These 3D crops are used for model training and evaluation. As in prior works (Aresta et al., 2019; Tang et al., 2019; Zhou et al., 2018), we adopt Intersection over Union (IoU) and Dice coefficient scores to evaluate performance. Note that for this particular application, we calculate the mean of the IoUs at thresholds ranging from 0.5 to 0.95 with a step size of 0.05.

3.2.3. Pulmonary embolism false positive reduction (ECC)

We utilize a database consisting of 121 computed tomography pulmonary angiography (CTPA) scans with a total of 326 emboli. Following prior work (Liang and Bi, 2007), we utilize their PE candidate generator based on the toboggan algorithm, resulting in a total of 687 true positives and 5,568 false positives. The dataset is then divided at the patient level into a training set with 434 true positive PE candidates and 3,406 false positive PE candidates, and a test set with 253 true positive PE candidates and 2,162 false positive PE candidates. To conduct a fair comparison with prior studies (Zhou et al., 2017; Tajbakhsh et al., 2016, 2019b), we compute candidate-level AUC on classifying true positives and false positives.

3.2.4. Liver segmentation (LCS)

The dataset is provided by the MICCAI 2017 LiTS Challenge and consists of 130 labeled CT scans, which we split into training (100 patients), validation (15 patients), and test (15 patients) subsets. The ground truth segmentation provides two different labels: liver and lesion. For our experiments, we only consider liver as the positive class and others as the negative class, and evaluate segmentation performance using Intersection over Union (IoU) and Dice coefficient scores.

3.2.5. Brain tumor segmentation (BMS)

The dataset is provided by the BraTS 2018 challenge (Menze et al., 2015; Bakas et al., 2018) and consists of 285 patients (210 HGG and 75 LGG), each with four 3D MRI modalities (T1, T1c, T2, and Flair) rigidly aligned. We adopt 3-fold cross validation, in which two folds (190 patients) are for training and one fold (95 patients) for test. Annotations include background (label 0) and three tumor subregions: GD-enhancing tumor (label 4), the peritumoral edema (label 2), and the necrotic and non-enhancing tumor core (label 1). We consider those with label 0 as negatives and others as positives, and evaluate segmentation performance using Intersection over Union (IoU) and Dice coefficient scores.

3.3. Benchmarking Models Genesis

For a thorough comparison, we used three different techniques to randomly initialize the weights of models: (1) a basic random initialization method based on Gaussian distributions, (2) a method commonly known as Xavier, which was suggested in Glorot and Bengio (2010), and (3) a revised version of Xavier called MSRA, which was suggested in He et al. (2015). They are implemented as uniform, glorot_uniform, and he_uniform, respectively, following the Initializers in Keras. We compare Models Genesis with Rubik's cube (Zhuang et al., 2019), the most recent multi-task and self-supervised learning method for 3D medical imaging. Considering that most of the self-supervised learning methods are initially proposed and implemented [...] 3D versions for a fair comparison (see detailed implementation in Appendix A). To promote 3D self-supervised learning research, we make our own implementation of the 3D extended methods and their corresponding pre-trained weights publicly available as an open-source tool that can effectively be used out-of-the-box. In addition, we have examined publicly available pre-trained models for 3D transfer learning in medical imaging, including NiftyNet (Gibson et al., 2018b), MedicalNet (Chen et al., 2019b), and the most influential 2D weights initialization, Models ImageNet. We also fine-tune I3D (Carreira and Zisserman, 2017) in our five target tasks because it has been shown to successfully initialize 3D models for lung nodule detection in Ardila et al. (2019). The detailed configurations of these models can be found in Appendix B.

The 3D U-Net architecture is used in 3D applications; the U-Net architecture is used in 2D applications. Batch normalization (Ioffe and Szegedy, 2015) is utilized in all 3D/2D deep models. For proxy tasks, the SGD method (Zhang, 2004) with an initial learning rate of 1e0 is used for optimization. We use ReduceLROnPlateau to schedule the learning rate: if no improvement is seen in the validation set for a certain number of epochs, the learning rate is reduced. For target tasks, the Adam method (Kinga and Adam, 2015) with a learning rate of 1e-3 is used for optimization, where $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We use an early-stop mechanism on the validation set to avoid over-fitting. Simple yet heavy 3D data augmentation techniques are employed in all five target tasks, including random flipping, transposing, rotating, and adding Gaussian noise. We run each method ten times on all of the target tasks, report the average and standard deviation, and further present statistical analysis based on an independent two-sample t-test.

In the proxy task, we pre-train the model using 3D sub-volumes sized 64 × 64 × 32, whereas in target tasks, the input is not limited to sub-volumes of a certain size. In other words, our pre-trained models can be fine-tuned in tasks with CT sub-volumes, entire CT volumes, or even MRI volumes as input, as the user requires. The flexibility of input size is attributed to two reasons: (1) our pre-trained models learn generic image representation such as appearance, texture, and context features, and (2) the encoder-decoder architecture is able to process images with arbitrary sizes.

4. Results

In this section, we begin with an ablation study to compare the combined approach with each individual scheme, concluding that the combined approach tends to achieve more robust results and consistently exceeds any other training schemes. We then take our pre-trained model from the combined approach and present results on five 3D medical applications, comparing them against the state-of-the-art approaches found in recent supervised and self-supervised learning literature.
plemented in 2D, we have extended five most representative ones (Vincent et al., 2010; Pathak et al., 2016; Noroozi and NiftyNet Model Zoo: github.com/NifTK/NiftyNetModelZoo Favaro, 2016; Chen et al., 2019a; Caron et al., 2018) into their MedicalNet: github.com/Tencent/MedicalNet I3D: github.com/deepmind/kinetics-i3d 3D U-Net: github.com/ellisdg/3DUnetCNN 1 6 Initializers: faroit.com/keras-docs/1.2.2/initializations Segmentation Models: github.com/qubvel/segmentation models 8 Zongwei Zhou et al. / Medical Image Analysis (2020) Fig. 3: Comparing the combined training scheme with each of the proposed individual training schemes, we conduct statistical analyses between the top two training schemes as well as between the bottom two. Although some of the individual training schemes could be favorable for certain target tasks, there is no such clear clue to guarantee that any one of the individual training schemes would consistently o er the best performance on every target task. On the contrary, our combined training scheme consistently achieves the best results across all five target tasks. Fig. 4: Models Genesis, as presented with the red vertical lines, achieve higher and more stable performance compared with three popular types of random initialization methods, including MSRA, Xavier, and Uniform. Among three out of the five applications, three di erent types of random distribution reveal no significant di erence with respect to each other. 4.1. The combined learning scheme exceeds each individual the combination of di erent transformations is advantageous because, as discussed, we cannot rely on one single training We have devised four individual training schemes by apply- scheme to achieve the most robust and compelling results across ing each of the transformations (i.e., non-linear, local-shuing, multiple target tasks. 
It is our novel representation learning outer-cutout, and inner-cutout) individually to a sub-volume framework based on image restoration that allows integrating and training the model to restore the original one. We compare various training schemes into a single training scheme. Our each of these training schemes with identical-mapping, which qualitative assessment of image restoration quality, provided does not involve any image transformation. In three out of the in Fig. D.14, further indicates that the combined scheme is su- five target tasks, as shown in Figs. 3—4, the model pre-trained perior over all four individual schemes in restoring the images by identical-mapping scheme does not perform as well as ran- that have been undergone multiple transformations. In sum- dom initialization. This undesired representation obtained via mary, our combined scheme pre-trains a model from multiple identical-mapping suggests that without any image transforma- perspectives (appearance, texture, context, etc.), empowering tion, the model would not benefit much from the proxy image models to learn a more comprehensive representation, thereby restoration task. On the contrary, nearly all of the individual leading to more robust target models. Based on the above ab- schemes o er higher target task performances than identical- lation studies, in the following sections, we refer the models mapping, demonstrating the significance of the four devised im- pre-trained by the combined scheme to Models Genesis and, in age transformations in learning image representation. particular, refer the model pre-trained on LUNA 2016 dataset Although each of the individual schemes has established the to Genesis Chest CT. capability in learning image representation, its empirical perfor- mance varies from task to task. That being said, given a target 4.2. 
Models Genesis outperform learning from scratch task, there is no clear winner among the four individual schemes that can always guarantee the highest performance. As a re- Transfer learning accelerates training and boosts perfor- sult, we have further devised a combined scheme, which applies mance, only if the image representation learned from the origi- transformations to a sub-volume with a predefined probability nal (proxy) task is general and transferable to target tasks. Fine- for each transformation and trains a model to restore the origi- tuning models trained on ImageNet has been a great success nal one. To demonstrate the importance of combining these im- story in 2D (Tajbakhsh et al., 2016; Shin et al., 2016), but for age transformations together, we examine the combined train- 3D representation learning, there is no such a massive labeled ing scheme against each of the individual ones. Fig. 3 shows dataset like ImageNet. As a result, it is still common prac- that the combined scheme consistently exceeds any other in- tice to train 3D model from scratch in 3D medical imaging. dividual schemes in all five target tasks. We have found that Therefore, to establish the 3D baselines, we have trained 3D Zongwei Zhou et al. / Medical Image Analysis (2020) 9 Fig. 5: Models Genesis enable better optimization than learning from scratch, evident by the learning curves for the target tasks of reducing false positives in detecting lung nodules (NCC) and pulmonary embolism (ECC) as well as segmenting lung nodule (NCS), liver (LCS), and brain tumor (BMS). We have plotted the validation performance averaged by ten trials for each application, in which accuracy and dice-coecient scores are reported for classification and segmentation tasks, respectively. As seen, initializing with our pre-trained Models Genesis demonstrates benefits in the convergence speed. 
models with three representative random initialization methods, including naive uniform initialization, Xavier/Glorot initialization proposed by Glorot and Bengio (2010), and He normal (MSRA) initialization proposed by He et al. (2015). When comparing deep model initialization by transfer learning and by controlling mathematical distribution, the former learns more sophisticated image representation but suffers from a domain gap, whereas the latter is task independent yet provides relatively less benefit than the former. The hypothesis underneath transfer learning is that transferring deep features across visual tasks can obtain a semantically more powerful representation, compared with simply initializing weights using different distributions. From our comprehensive experiments in Fig. 4, we have observed the following:

• Within each method, random initialization of weights has shown large variance in the results of ten trials; it is in large part due to the difficulty of adequately initializing these networks from scratch. A small miscalibration of the initial weights can lead to vanishing or exploding gradients, as well as poor convergence properties.

• In three out of the five 3D medical applications, the results reveal no significant difference among these random initialization methods. Although the behavior of randomly initialized weights can vary across applications, He normal (MSRA), in which the weights are initialized with a specific ReLU-aware initialization, generally works the most reliably among all five target tasks.

• On the other hand, initialization with our pre-trained Genesis Chest CT stabilizes the overall performance and, more importantly, elevates the average performance over all three random initialization methods by a large margin. Our statistical analysis shows that the performance gain is significant for all the target tasks under study. This suggests that, owing to the representation learning scheme, our initial weights provide a better starting point than the ones generated under particular statistical distributions, while being over 13% faster (see Fig. 5). This observation has also been widely obtained in 2D model initialization (Tajbakhsh et al., 2016; Shin et al., 2016; Rawat and Wang, 2017; Zhou et al., 2017; Voulodimos et al., 2018).

Altogether, in contrast to 3D scratch models, we believe Models Genesis can potentially serve as a primary source of transfer learning for 3D medical imaging applications. Besides contrasting with the three random initialization methods, we further examine our Models Genesis against the existing pre-trained 3D models in the coming section.

4.3. Models Genesis surpass existing pre-trained 3D models

We have evaluated our Models Genesis with existing publicly available pre-trained 3D models on five distinct medical target tasks. As shown in Table 3, Genesis Chest CT noticeably contrasts with any other existing 3D models, which have been pre-trained by full supervision. Note that, in the liver segmentation task (LCS), Genesis Chest CT is slightly outperformed by MedicalNet because of the benefit that MedicalNet gained from its (fully) supervised pre-training on the LiTS dataset directly. Further statistical tests reveal that Genesis Chest CT still yields comparable performance with MedicalNet at the p = 0.05 level. For the remaining four target tasks, Genesis Chest CT achieves superior performance against all its counterparts by a large margin, demonstrating the effectiveness and transferability of the learned features of Models Genesis, which are beneficial for both classification and segmentation tasks.

Table 3: Our Models Genesis achieve the best or comparable performance on five distinct medical target tasks over six self-supervised learning approaches (revised in 3D) and three competing publicly available (fully) supervised pre-trained 3D models. For ease of comparison, we evaluate AUC score for the two classification tasks (i.e., NCC and ECC) and IoU score for the three segmentation tasks (i.e., NCS, LCS, and BMS). All of the results, including the mean and standard deviation (mean ± s.d.) across ten trials, reported in the table are evaluated using our dataset splitting, elaborated in Sec. 3.2. For every target task, we have further performed an independent two-sample t-test between the best (bolded) vs. others and highlighted boxes in blue when they are not statistically significantly different at the p = 0.05 level. The footnotes compare our results with the state-of-the-art performance for each target task, using the evaluation metric for the data acquired from competitions.

| Pre-training | Approach | NCC¹ (%) | NCS² (%) | ECC³ (%) | LCS⁴ (%) | BMS⁵ (%) |
|---|---|---|---|---|---|---|
| No | Random with Uniform Init | 94.74±1.97 | 75.48±0.43 | 80.36±3.58 | 78.68±4.23 | 60.79±1.60 |
| No | Random with Xavier Init (Glorot and Bengio, 2010) | 94.25±5.07 | 74.05±1.97 | 79.99±8.06 | 77.82±3.87 | 58.52±2.61 |
| No | Random with MSRA Init (He et al., 2015) | 96.03±1.82 | 76.44±0.45 | 78.24±3.60 | 79.76±5.43 | 63.00±1.73 |
| (Fully) supervised | I3D (Carreira and Zisserman, 2017) | 98.26±0.27 | 71.58±0.55 | 80.55±1.11 | 70.65±4.26 | 67.83±0.75 |
| (Fully) supervised | NiftyNet (Gibson et al., 2018b) | 94.14±4.57 | 52.98±2.05 | 77.33±8.05 | 83.23±1.05 | 60.78±1.60 |
| (Fully) supervised | MedicalNet (Chen et al., 2019b) | 95.80±0.49 | 75.68±0.32 | 86.43±1.44 | 85.52±0.58 | 66.09±1.35 |
| Self-supervised | De-noising (Vincent et al., 2010) | 95.92±1.83 | 73.99±0.62 | 85.14±3.02 | 84.36±0.96 | 57.83±1.57 |
| Self-supervised | In-painting (Pathak et al., 2016) | 91.46±2.97 | 76.02±0.55 | 79.79±3.55 | 81.36±4.83 | 61.38±3.84 |
| Self-supervised | Jigsaw (Noroozi and Favaro, 2016) | 95.47±1.24 | 70.90±1.55 | 81.79±1.04 | 82.04±1.26 | 63.33±1.11 |
| Self-supervised | DeepCluster (Caron et al., 2018) | 97.22±0.55 | 74.95±0.46 | 84.82±0.62 | 82.66±1.00 | 65.96±0.85 |
| Self-supervised | Patch shuffling (Chen et al., 2019a) | 91.93±2.32 | 75.74±0.51 | 82.15±3.30 | 82.82±2.35 | 52.95±6.92 |
| Self-supervised | Rubik's Cube (Zhuang et al., 2019) | 96.24±1.27 | 72.87±0.16 | 80.49±4.64 | 75.59±0.20 | 62.75±1.93 |
| Self-supervised | Genesis Chest CT (ours) | 98.34±0.44 | 77.62±0.64 | 87.20±2.87 | 85.10±2.15 | 67.96±1.29 |

¹ The winner in LUNA (2016) holds an official score of 0.968 vs. 0.971 (ours).
² Wu et al. (2018) holds a Dice of 74.05% vs. 75.86%±0.90% (ours).
³ Zhou et al. (2017) holds an AUC of 87.06% vs. 87.20%±2.87% (ours).
⁴ The winner in LiTS (2017) with post-processing holds a Dice of 96.60% vs. 93.19%±0.46% (ours without post-processing).
⁵ MRI Flair images are only utilized for segmenting brain tumors, so the results are not submitted to BraTS 2018.
Genesis Chest CT is slightly outperformed by MedicalNet in LCS because the latter has been (fully) supervised pre-trained on the LiTS dataset.

More importantly, although Genesis Chest CT is pre-trained on Chest CT only, it can generalize to different organs, diseases, datasets, and even modalities. For instance, the target task of pulmonary embolism false positive reduction is performed in Contrast-Enhanced CT scans that can appear different from the normal CT scans of the proxy tasks; yet, Genesis Chest CT achieves a remarkable improvement over training from scratch, increasing the AUC by 7 points. Moreover, Genesis Chest CT continues to yield a significant IoU gain in liver segmentation even though the proxy task and target task are significantly different in both the diseases affecting the organs (lung vs. liver) and the dataset itself (LUNA 2016 vs. LiTS 2017). We have further examined Genesis Chest CT and other existing pre-trained models using MRI Flair images, which represent the widest domain distance between the proxy and target tasks. As reported in Table 3 (BMS), Genesis Chest CT yields nearly a 5-point improvement in comparison with random initialization. The increased performance on the MRI imaging task is a particularly strong demonstration of the transfer learning capabilities of our Genesis Chest CT. To further investigate the behavior of Genesis Chest CT when encountering medical images from different modalities, we have provided extensive visualization in Fig. D.15, including example images from CT, X-ray, Ultrasound, and MRI modalities.

Considering the model footprint, our Models Genesis take the basic 3D U-Net as the backbone, carrying far fewer parameters than the existing open-source pre-trained 3D models. For example, we have adopted MedicalNet with resnet-101 as the backbone, which offers the highest performance based on Chen et al. (2019b) but comprises 85.75M parameters; the pre-trained I3D (Carreira and Zisserman, 2017) contains 25.35M parameters in the encoder; the pre-trained NiftyNet uses Dense V-Networks (Gibson et al., 2018a) as backbone, comprising only 2.60M parameters, but it does not perform as well as its counterparts in all five target tasks. Taken together, these results indicate that our Models Genesis, with only 16.32M parameters, surpass all existing pre-trained 3D models in terms of generalizability, transferability, and parameter efficiency.

4.4. Models Genesis reduce annotation efforts by at least 30%

While critics often stress the need for sufficiently large amounts of labeled data to train a deep model, transfer learning leverages the knowledge about medical images already learned by pre-trained models and therefore requires considerably fewer annotated data and training iterations than learning from scratch. We have simulated the scenarios of using a handful of labeled data, which allows investigating the power of our Models Genesis in transfer learning. Fig. 6 displays the results of training with a partial dataset, demonstrating that fine-tuning Models Genesis saturates quickly on the target tasks since it can achieve similar performance compared with the full dataset training. Specifically, the performance of learning 3D models from scratch with entire datasets can be approximated using Models Genesis with only 50%, 5%, 30%, 5%, and 30% of datasets for NCC, NCS, ECC, LCS, and BMS, respectively. This shows that our Models Genesis can mitigate the lack of labeled images, resulting in more annotation-efficient deep learning in the end.

Furthermore, the performance gap between fine-tuning and learning from scratch is significant and steady over training models with each partial data point. For the lung nodule false positive reduction target task (NCC in Fig. 6), using only 49% training data, Models Genesis equal the performance of learning from scratch with 70% training data. Therefore, about 30% of the annotation cost (1 − 49/70) associated with learning from scratch in NCC is recovered by initializing with Models Genesis. For the lung nodule segmentation target task (NCS in Fig. 6), with 5% training data, Models Genesis can achieve the performance equivalent to learning from scratch using 10% training data. Based on this analysis, the cost of annotation in NCS can be reduced by half using Models Genesis compared with learning from scratch. For the pulmonary embolism false positive reduction target task (ECC), Fig. 6 suggests that with only 30% training samples, Models Genesis achieve performance equivalent to learning from scratch using 70% training samples. Therefore, nearly 57% of the labeling cost associated with the use of learning from scratch for ECC could be recovered with our Models Genesis. For the liver segmentation target task (LCS) in Fig. 6, using 8% training data, Models Genesis equal the performance of learning from scratch using 50% training samples. Therefore, about 84% of the annotation cost associated with learning from scratch in LCS is recovered by initializing with Models Genesis. For the brain tumor segmentation target task (BMS) in Fig. 6, with less than 28% training data, Models Genesis achieve the performance equivalent to learning from scratch using 50% training data. Therefore, nearly 44% of the annotation effort can be saved using Models Genesis compared with learning from scratch. Overall, at least 30% of the annotation effort has been saved by Models Genesis, in comparison with learning a 3D model from scratch, in all five target tasks. With such an annotation-efficient 3D transfer learning paradigm, computer-aided diagnosis of rare diseases or rapid response to global pandemics, which are severely underrepresented owing to the difficulty of collecting a sizeable amount of labeled data, could eventually be actualized.

Fig. 6: Initializing with our Models Genesis, the annotation cost can be reduced by 30%, 50%, 57%, 84%, and 44% for target tasks NCC, NCS, ECC, LCS, and BMS, respectively. With decreasing amounts of labeled data, Models Genesis (red) retain a much higher performance on all five target tasks, whereas learning from scratch (grey) fails to generalize. Note that the horizontal red and gray lines refer to the performances that can eventually be achieved by Models Genesis and learning from scratch, respectively, when using the entire dataset.

Fig. 7: When solving problems in volumetric medical modalities, such as CT and MRI images, 3D volume-based approaches consistently offer superior performance to 2D slice-based approaches empowered by transfer learning. We conduct statistical analyses (circled in blue) between the highest performance achieved by 3D and 2D solutions. Training 3D models from scratch does not necessarily outperform their 2D counterparts (see NCC). However, training the same 3D models from Genesis Chest CT outperforms all their 2D counterparts, including fine-tuning Models ImageNet as well as fine-tuning our Genesis Chest X-ray and Genesis Chest CT 2D. It confirms the effectiveness of Genesis Chest CT in unlocking the power of 3D models. In addition, we have also provided statistical analyses between the highest and the second highest performances achieved by 2D models, finding that Models Genesis (2D) offer equivalent performances (n.s.) to Models ImageNet in four out of the five applications.

4.5. Models Genesis consistently top any 2D/2.5D approaches

We have thus far presented the power of 3D models in processing volumetric data, in particular, with limited annotation. Besides adopting 3D models, another common strategy to handle limited data in volumetric medical imaging is to reformat 3D data into a 2D image representation followed by fine-tuning pre-trained Models ImageNet (Shin et al., 2016; Tajbakhsh et al., 2016). This approach increases the training examples by an order of magnitude, but it sacrifices the 3D context. It is interesting to note how Genesis Chest CT compares with this de facto standard in 2D. We have thus implemented two different methods to reformat 3D data into 2D input: the regular 2D representation obtained by extracting adjacent axial slices (Ben-Cohen et al., 2016; Sun et al., 2017a), and the 2.5D representation (Prasoon et al., 2013; Roth et al., 2014, 2015) composed of axial, coronal, and sagittal slices from volumetric data. Both of these 2D approaches seek to use 2D representation to emulate something three dimensional, in order to fit the paradigm of fine-tuning Models ImageNet. In the inference, classification and segmentation tasks are evaluated differently in 2D: for
classification, the model predicts labels of slices extracted from the center locations because other slices are not guaranteed to include objects; for segmentation, the model predicts the segmentation mask slice by slice and forms the 3D segmentation volume by simply stacking the 2D segmentation maps.

Fig. 7 exposes the comparison between 3D and 2D models on five 3D target tasks. Additionally, Table 4 compares 2D slice-based, 2.5D orthogonal, and 3D volume-based approaches on lung nodule and pulmonary embolism false positive reduction tasks. As evidenced by our statistical analyses, the 3D models trained from Genesis Chest CT achieve significantly higher average performance and lower standard deviation than 2D models fine-tuned from ImageNet using either 2D or 2.5D image representation. Nonetheless, the same conclusion does not apply to the models trained from scratch: 3D scratch models are outperformed by 2D models in one out of the five target tasks (i.e., NCC in Fig. 7 and Table 4) and also exhibit an undesirably larger standard deviation. We attribute the mixed results of 3D scratch models to the larger number of model parameters and limited sample size in the target tasks, which together impede the full utilization of 3D context. In fact, the undesirable performance of the 3D scratch models highlights the effectiveness of Genesis Chest CT, which unlocks the power of 3D models for medical imaging. To summarize, we believe that 3D problems in medical imaging should be solved in 3D directly.

Table 4: Our 3D approach, initialized by Models Genesis, significantly elevates the classification performance compared with 2.5D and 2D approaches in reducing lung nodule and pulmonary embolism false positives. The entries in bold highlight the best results achieved by different approaches. For the 2D slice-based approach, we extract input consisting of three adjacent axial views of the lung nodule or pulmonary embolism and some of their surroundings. For the 2.5D orthogonal approach, each input is composed of an axial, coronal, and sagittal slice and centered at a lung nodule or pulmonary embolism candidate.

| Task: NCC | Random | ImageNet | Genesis |
|---|---|---|---|
| 2D slice-based input | 96.03±0.86 | 97.79±0.71 | 97.45±0.61 |
| 2.5D orthogonal input | 95.76±1.05 | 97.24±1.01 | 97.07±0.92 |
| 3D volume-based input | 96.03±1.82 | n/a | 98.34±0.44 |

| Task: ECC | Random | ImageNet | Genesis |
|---|---|---|---|
| 2D slice-based input | 60.33±8.61 | 62.57±8.04 | 62.84±8.78 |
| 2.5D orthogonal input | 71.27±4.64 | 78.61±3.73 | 78.58±3.67 |
| 3D volume-based input | 80.36±3.58 | n/a | 88.04±1.40 |

5. Discussions

5.1. Do we still need a medical ImageNet?

In computer vision, at the time this paper is written, no self-supervised learning method outperforms fine-tuning models pre-trained on ImageNet (Jing and Tian, 2020; Chen et al., 2019a; Kolesnikov et al., 2019; Zhou et al., 2019b; Hendrycks et al., 2019; Zhang et al., 2019; Caron et al., 2019). Therefore, it may seem surprising to observe from our results in Table 3 that (fully) supervised representation learning methods do not necessarily offer higher performances in some 3D target tasks than self-supervised representation learning methods. We ascribe this phenomenon to the limited amount of supervision used in their pre-training (90 cases for NiftyNet (Gibson et al., 2018b) and 1,638 cases for MedicalNet (Chen et al., 2019b)) or the domain distance (from videos to CT/MRI for I3D (Carreira and Zisserman, 2017)). As evidenced by a prior study (Sun et al., 2017b) on ImageNet pre-training, a large amount of supervision is required to foster a generic, comprehensive image representation. Back in 2009, when ImageNet had not been established, it was challenging to empower a deep model with generic image representation using a small or even medium-sized set of labeled data, the same situation, we believe, that 3D medical image analysis faces today. Therefore, despite the outstanding performance of Models Genesis, there is no doubt that a large, strongly annotated dataset for medical image analysis, like ImageNet (Deng et al., 2009) for computer vision, is still in high demand. One of our goals for developing Models Genesis is to help create such a medical ImageNet. Based on a small set of expert annotations, models fine-tuned from Models Genesis will be able to help quickly generate initial rough annotations of unlabeled images for expert review, thus reducing the annotation efforts and accelerating the creation of a large, strongly annotated, medical ImageNet. In summary, Models Genesis are not designed to replace such a large, strongly annotated dataset for medical image analysis, as ImageNet for computer vision, but rather to help create one.

5.2. Same-domain or cross-domain transfer learning?

Same-domain transfer learning is always preferred whenever possible because a relatively smaller domain gap makes the learned image representation more beneficial for target tasks. Even the most recent self-supervised learning approaches in medical imaging were solely evaluated within the same dataset, such as Chen et al. (2019a); Tajbakhsh et al. (2019a); Zhu et al. (2020). Same-domain transfer learning appears to be the preferred choice in terms of performance; however, most of the existing medical datasets, with fewer than a hundred cases, are usually too small for deep models to learn reliable image representation. Therefore, for our future work, we plan to combine the publicly available datasets from similar domains together to train modality-oriented models, including Genesis CT, Genesis MRI, Genesis X-ray, and Genesis Ultrasound, as well as organ-oriented models, including Genesis Brain, Genesis Lung, Genesis Heart, and Genesis Liver.

Cross-domain transfer learning in medical imaging is the Holy Grail. Retrieving a large number of unlabeled images from a PACS system requires an IRB approval, often a long process; the retrieved images must be de-identified; organizing the de-identified images in a way suitable for deep learning is tedious and laborious. Therefore, large quantities of unlabeled datasets may not be readily available to many target domains. As evidenced by our results in Table 3 (BMS), Models Genesis have a great potential for cross-domain transfer learning; particularly, our distortion-based approaches (such as non-linear and local-shuffling) take advantage of relative intensity values (in all modalities) to learn shapes and appearances of various organs. Therefore, as our future work, we will be focusing on methods that generalize well across domains.

5.3. Is any data augmentation suitable as a transformation?

We propose a self-supervised learning framework to learn image representation by discriminating and restoring images undergoing different transformations. One might argue that our image transformations can be interchangeable with existing data augmentation techniques (Gan et al., 2015; Wong et al., 2016; Perez and Wang, 2017; Shorten and Khoshgoftaar, 2019), while we would like to make the distinction between these two concepts clearer. It is critical to assess whether a specific augmentation is practical and feasible for the image restoration task when designing image transformations. Simply introducing data augmentation can make a task ambiguous and lead to degenerate learning. To this end, we choose image transformations based on two principles:

• First, the transformed sub-volume should not be found in the original CT scan. A sub-volume that has undergone augmentations such as rotation, flip, zoom in/out, or translation, however, may be found as an alternative sub-volume in the original CT scan. In this scenario, without additional spatial information, the model would not be able to "recover" the original sub-volume by seeing the transformed one. As a result, we only elect the augmentations that can be applied to sub-volumes at the pixel level rather than the spatial level.

• Second, a transformation should be applicable for specific image properties. The augmentations that manipulate RGB channels, such as color shift and channel dropping, have little effect on CT/MRI images without the availability of color information. Instead, we promote brightness and contrast into monotonic color curves, resulting in a novel non-linear transformation, explicitly enabling the model to learn intensity distribution from medical images.

After filtering with the above two principles, fewer data augmentation techniques remain than expected. We have endeavored to produce learning-perspective-driven transformations rather than inviting any type of data augmentation into our framework. A recent study from Chen et al. (2020) has also discovered a similar phenomenon: carefully designed augmentations are superior to autonomously discovered augmentations. This suggests a criterion of transformations driven

transformations for automated data augmentation, while preserving class labels or null class for all data points. Dao et al. (2019) introduced a fast kernel alignment metric for augmentation selection. It requires image labels for computing the kernel target alignment (as the reward) between the feature kernel and the label kernel. Cubuk et al. (2019) used reinforcement learning to form an algorithm that autonomously searches for preferred augmentation policies, magnitude, and probability for specific classification tasks, wherein the resultant accuracy of predictions and labels is treated as the reward signal to train the recurrent network controller. Wu et al. (2020) proposed uncertainty-based sampling to select the most effective augmentation, but it is based on the highest loss that is computed between predictions and labels. While the reward is well-defined in the aforementioned works, unfortunately, there is no available metric to determine the power of image representation directly; hence, no reward is readily established for representation learning. Rather than constrain the representation directly, our paper aims to design an image restoration task to let the model learn generic image representation from 3D medical images. To achieve this, inspired by Vincent et al. (2010), we modify the definition of a good representation into the following: "a good representation is one that can be obtained robustly from a transformed input, and that will be useful for restoring the corresponding original input." Consequently, mean square error (MSE) between the model's input and output is defined as the objective function in our framework. However, if we adopt MSE as the reward function, the existing automated augmentation strategies will end up selecting identical-mapping. This is because restoring images without any transformation is expected to give a lower error than restoring those with transformations. As evidenced by Fig. 3, identical-mapping results in a poor image representation. To summarize, the key challenge when employing automated augmentation strategies in our framework is how to define a proper reward for restoring images, and fundamentally, for learning image representation.

5.5. How to assess restoration quality and its relationship to model transferability?

Our transfer learning results in Sec. 4 suggest that image restoration is a promising task to learn generic 3D image representation.
This also means that image restoration quality has an by learning perspectives, in capturing a compelling, robust rep- implicit correlation with model transferability to some extent. resentation for 3D transfer learning in medical imaging. To assess restoration quality, we compare the Mean Square Er- ror (MSE) loss with other commonly used loss functions for 5.4. Can algorithms autonomously search for transformations? image restoration, such as Mean Absolute Error (MAE) and Structural Similarity Index (SSIM) (Wang et al., 2004). All We follow two principles when designing suitable image of them compute the distance between input and output im- transformations for our self-supervised learning framework (see ages, while SSIM concentrates more on the restoration quality Sec. 5.3). Potentially, “automated data augmentation” can be in terms of structural similarity than MSE and MAE. Since the considered as an ecient alternative because this line of re- publicly available 3D SSIM loss was implemented in PyTorch , search seeks to strip researchers from the burden of finding to make the comparisons fair, we have adapted our five target good parameterizations and compositions of transformations tasks into PyTorch as well. Fig. 8 shows mixed performances of manually. Specifically, existing automated augmentation strate- the five target tasks among the three alternative loss functions. gies reinforce models to learn an optimal set of augmentation policies by calculating the reward between predictions and im- age labels. To name a few, Ratner et al. (2017) proposed a method for learning how to parameterize and composite the SSIM loss in 3D: github.com/jinh0park/pytorch-ssim-3D 14 Zongwei Zhou et al. / Medical Image Analysis (2020) Fig. 8: We compare three di erent losses for the task of image restoration. There is no evidence that the three losses have a decisive impact on the transfer learning results of five target tasks. 
Note that for this ablation study, all the proxy and target tasks are implemented in PyTorch. 265 CT images from the dataset and present examples in Fig. 9. Specifically, we pass the original CT images to the pre-trained Genesis Chest CT. To visualize the modifications, we have fur- ther plotted the di erence maps by subtracting the input and output. Since the input images involve no image transforma- tion, most of the restored CT scans (see Rows 1—2) can pre- serve the texture and structures of the input images, only en- countering few changes thanks to the identical-mapping train- ing scheme and the skip connections between encoder and de- coder. Nonetheless, we observe some failed cases (see Row 3), especially when the input CT image contains di use disease, which appears as an opacity in the lung. Genesis Chest CT happens to “remove” those opaque regions and restore a much clearer lung. This may be due to the fact that the majority of cropped sub-volumes are normal and are being used as ground truth, which empowers the pre-trained model with capabilities of detecting and restoring “novelties” in the CT scans. More specifically, in our work, these novelties include abnormal in- tensity distribution injected by non-linear transformation, atyp- ical texture and boundary injected by local-shuing, and dis- continuity injected by both inner and outer cutout. Based on Fig. 9: [Better viewed on-line and zoomed in for details] Examples of the surrounding anatomical structure, the model predicts the image restoration using Genesis Chest CT. We pass unseen CT images (Column 1) to the pre-trained model, obtaining the restored images opaque area to be air, therefore restoring darker intensity val- (Column 2). The di erence between input and output has been shown ues. This behavior is certainly a “mistake” in terms of image in Column 3. 
In most of the normal cases, such as those in Rows 1—2, restoration, but it can also be thought of as an attempt to detect Genesis Chest CT can perform a fairly reasonable identical-mapping. di use diseases in the lung, which is challenging to annotate Meanwhile, for some cases that contain opacity in the lung, as illus- due to their unclear boundary. By training an image restoration trated in Row 3, Genesis Chest CT tends to restore a clearer lung. As task, the diseased area will be revealed by simple subtraction of a result, the di use region is revealed in the di erence map automat- the input and output. More importantly, this suggested detec- ically. We have zoomed in the region for a better visualization and tion approach requires zero human annotation, neither image- comparison. level label nor pixel-level contour, contrasting from the existing weakly supervised disease detection approaches (Zhou et al., 2016; Baumgartner et al., 2018; Cai et al., 2018; Siddiquee As discussed in Sec. 5.4, the ideal loss function for represen- et al., 2019). tation learning is one that can explicitly determine the power of image representation. However, the three losses explored in this section are implicit, based on the premise that the image 6. Related Work restoration quality can indicate a good representation. Further With the splendid success of deep neural networks, trans- studies with restoration quality assessment and its relationship fer learning (Pan and Yang, 2010; Weiss et al., 2016; Yosinski to model transferability are therefore suggested. et al., 2014) has become integral to many applications, espe- cially medical imaging (Greenspan et al., 2016; Litjens et al., 5.6. Could Models Genesis detect infected regions from images 2017; Lu et al., 2017; Shen et al., 2017; Wang et al., 2017a; autonomously? Zhou et al., 2017, 2019b). This immense popularity of transfer As referenced from Sec. 
3.1, Genesis Chest CT has been pre- learning is attributed to the learned image representation, which trained using 623 CT images in the LUNA 2016 dataset. To o ers convergence speedups and performance gains for most assess the image restoration quality, we utilize the rest of the target tasks, in particular, with limited annotated data. In the Zongwei Zhou et al. / Medical Image Analysis (2020) 15 following sections, we review the works related to supervised allows models to learn image representation from abundant un- and self-supervised representation learning. labeled medical image data with zero human annotation e ort. 6.1. Supervised representation learning 6.2. Self-supervised representation learning ImageNet contains more than fourteen million images that Aiming at learning image representation from unlabeled have been manually annotated to indicate which objects are data, self-supervised learning research has recently experienced present in each image; and more than one million of the im- a surge in computer vision (Caron et al., 2018; Chen et al., ages have actually been annotated with the bounding boxes of 2019c; Doersch et al., 2015; Goyal et al., 2019; Jing and Tian, the objects in the image. Pre-training a model on ImageNet and 2020; Mahendran et al., 2018; Mundhenk et al., 2018; Noroozi then fine-tuning it on di erent medical imaging tasks has seen et al., 2018; Noroozi and Favaro, 2016; Pathak et al., 2016; the most practical adoption in medical image analysis (Shin Sayed et al., 2018; Zhang et al., 2016, 2017), but it is a rela- et al., 2016; Tajbakhsh et al., 2016). To classify the com- tively new trend in modern medical imaging. The key challenge mon thoracic diseases from ChestX-ray14 dataset, as evidenced for self-supervised learning is identifying a suitable self super- in Irvin et al. 
(2019), nearly all the leading methods (Guan vision task, i.e., generating input and output instance pairs from and Huang, 2018; Guendel et al., 2018; Ma et al., 2019; Tang the data. Two of the preliminary studies include predicting the et al., 2018) follow the paradigm of “fine-tuning Models Ima- distance and 3D coordinates of two patches randomly sampled geNet” by adopting di erent architectures, such as ResNet (He from the same brain (Spitzer et al., 2018), identifying whether et al., 2016) and DenseNet (Huang et al., 2017), along with two scans belong to the same person, and predicting the level their pre-trained weights. Other representative medical applica- of vertebral bodies (Jamaludin et al., 2017). Nevertheless, these tions include identifying skin cancer from dermatologist level two works are incapable of learning representation from “self- photographs (Esteva et al., 2017), o ering early detection of supervision” because they demand auxiliary information and Alzheimer’s Disease (Ding et al., 2018), and performing e ec- specialized data collection such as paired and registered images. tive detection of pulmonary embolism (Tajbakhsh et al., 2019b). By utilizing only the original pixel/voxel information shipped Despite the remarkable transferability of Models ImageNet, with data, several self-supervised learning schemes have been pre-trained 2D models o er little benefits towards 3D medi- developed for di erent medical applications: Ross et al. (2018) cal imaging tasks in the most prominent medical modalities adopted colorization as proxy task, wherein color colonoscopy (e.g., CT and MRI). To fit this paradigm, 3D imaging tasks images are converted to gray-scale and then recovered using a have to be reformulated and solved in 2D or 2.5D (Roth et al., conditional Generative Adversarial Network (GAN); Alex et al. 
2015, 2014; Tajbakhsh et al., 2015), thus losing rich 3D anatom- (2017) pre-trained a stack of denoising auto-encoders, wherein ical information and inevitably compromising the performance. the self-supervision was created by mapping the patches with Annotating 3D medical images at the similar scale with Ima- the injected noise to the original patches; Chen et al. (2019a) geNet requires a significant research e ort and budget. It is designed image restoration as proxy task, wherein small regions currently not feasible to create annotated datasets comparable were shued within images and then let models learn to restore to this size for every 3D medical application. Consequently, for the original ones; Zhuang et al. (2019) and Zhu et al. (2020) in- lung cancer risk malignancy estimation, Ardila et al. (2019) re- troduced a 3D representation learning proxy task by recovering sorted to incorporate 3D spatial information by using Inflated the rearranged and rotated Rubik’s cube; and finally Tajbakhsh 3D (I3D) (Carreira and Zisserman, 2017), trained from the Ki- et al. (2019a) individualized self-supervised schemes for a set of netics dataset, as the feature extractor. Evidenced by Table 3, it target tasks. As seen, the previously discussed self-supervised is not the most favorable choice owing to the large domain gap learning schemes, both in computer vision and medical imag- between the temporal video and medical volume. This limita- ing, are developed individually for specific target tasks, there- tion has led to the development of model zoo in NiftyNet (Gib- fore, the generalizability and robustness of the learned image son et al., 2018b). However, they were trained with small representation have yet to be examined across multiple target datasets for specific applications (e.g., brain parcellation and or- tasks. 
To our knowledge, we are the first to investigate cross- gan segmentation), and were never intended as source models domain self-supervised learning in medical imaging. for transfer learning. Our experimental results in Table 3 indi- cate that NiftyNet models o er limited benefits to the five target 6.3. Our previous work medical applications via transfer learning. More recently, Chen et al. (2019b) have pre-trained 3D residual network by jointly Zhou et al. (2019b) first presented generic autodidactic mod- segmenting the objects annotated in a collection of eight med- els for 3D medical imaging, which obtain common image repre- ical datasets, resulting in MedicalNet for 3D transfer learning. sentation that is transferable and generalizable across diseases, In Table 3, we have examined the pre-trained MedicalNet on organs and modalities, overcoming the scalablity issue associ- five target tasks in comparison with our Models Genesis. As ated with multiple tasks. This paper extends the preliminary reviewed, each and every aforementioned pre-trained model re- version substantially with the following improvements. quires massive, high-quality annotated datasets. However, sel- dom do we have a perfectly-sized and systematically-labeled 1. We have introduced notations, formulas, and diagrams, as dataset to pre-train a deep model in medical imaging, where well as detailed methodology descriptions along with their both data and annotations are expensive to acquire. We over- learning objectives, for a succinct framework overview in come the above limitation via self-supervised learning, which Sec. 2. 16 Zongwei Zhou et al. / Medical Image Analysis (2020) 2. We have extended the brain tumor segmentation experi- Number R01HL128785. The content is solely the responsibil- ment using MRI Flair images in Sec. 3.2, highlighting the ity of the authors and does not necessarily represent the o- transfer learning capabilities of Models Genesis from CT cial views of the NIH. 
This work has utilized the GPUs pro- to MRI Flair domains. vided partially by the ASU Research Computing and partially by the Extreme Science and Engineering Discovery Environ- 3. We have conducted comprehensive ablation studies be- ment (XSEDE) funded by the National Science Foundation tween the combined scheme and each of the individual (NSF) under grant number ACI-1548562. We thank Z. Guo for learning schemes in Sec. 4.1, demonstrating that learning implementing Rubik’s Cube (Zhuang et al., 2019) and the 3D from multiple perspectives leads to a more robust target version of Jigsaw (Noroozi and Favaro, 2016) and DeepClus- task performance. ter (Caron et al., 2018); F. Haghighi and M. R. Hosseinzadeh Taher for implementing the 3D version of in-painting (Pathak 4. We have investigated three di erent random initialization et al., 2016), patch-shuing (Chen et al., 2019a), and work- methods for 3D models in Sec. 4.2, suggesting that initial- ing with Z. Guo in evaluating the performance of Medical- izing with Models Genesis can o er much higher perfor- Net (Chen et al., 2019b); M. M. Rahman Siddiquee for exam- mances and faster convergences. ining NiftyNet (Gibson et al., 2018b) with our Models Gen- 5. We have examined Models Genesis with the existing pre- esis; P. Zhang for comparing two additional random initializa- trained 3D models on five distinct medical target tasks tion methods with our Models Genesis; S. Bajpai for comparing in Sec. 4.3, showing that with fewer parameters, Models three loss functions of the proxy task; N. Tajbakhsh for revis- Genesis surpass all publicly available 3D models in both ing our conference paper; R. Feng for valuable discussions; and generalizability and transferability. S. Tatapudi for helping improve the writing of this paper. The content of this paper is covered by patents pending. 6. We have provided experimental results on five target tasks using limited annotated data in Sec. 
4.4, indicating that transfer learning from our Models Genesis can reduce an- References notation e orts by at least 30%. Alex, V., Vaidhya, K., Thirunavukkarasu, S., Kesavadas, C., Krishnamurthi, 7. We have investigated 3D sub-volume based approaches G., 2017. Semisupervised learning using denoising autoencoders for brain compared with 2D approaches fine-tuning from Models lesion detection and segmentation. Journal of Medical Imaging 4, 041311. Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L., Tse, ImageNet using 2D/2.5D representation, underlining the D., Etemadi, M., Ye, W., Corrado, G., et al., 2019. End-to-end lung cancer power of pre-trained 3D models in Sec. 4.5. screening with three-dimensional deep learning on low-dose chest computed tomography. Nature medicine 25, 954–961. Aresta, G., Jacobs, C., Araujo, ´ T., Cunha, A., Ramos, I., van Ginneken, B., 7. Conclusion Campilho, A., 2019. iw-net: an automatic and minimalistic interactive lung nodule segmentation deep network. Scientific reports 9, 1–9. A key contribution of ours is a collection of generic source Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., models, nicknamed Models Genesis, built directly from unla- Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Ho man, E.A., et al., beled 3D imaging data with our novel unified self-supervised 2011. The lung image database consortium (lidc) and image database re- source initiative (idri): a completed reference database of lung nodules on ct method, for generating powerful application-specific target scans. Medical physics 38, 915–931. models through transfer learning. While the empirical results Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shino- are strong, surpassing state-of-the-art performances in most of hara, R.T., Berger, C., Ha, S.M., Rozycki, M., et al., 2018. 
Identifying the the applications, our goal is to extend our Models Genesis to best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv modality-oriented models, such as Genesis MRI and Genesis preprint arXiv:1811.02629 . Ultrasound, as well as organ-oriented models, such as Gene- Baumgartner, C.F., Koch, L.M., Can Tezcan, K., Xi Ang, J., Konukoglu, E., sis Brain and Genesis Heart. We envision that Models Genesis 2018. Visual feature attribution using wasserstein gans, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. may serve as a primary source of transfer learning for 3D med- 8309–8319. ical imaging applications, in particular, with limited annotated Ben-Cohen, A., Diamant, I., Klang, E., Amitai, M., Greenspan, H., 2016. Fully data. To benefit the research community, we make the develop- convolutional network for liver segmentation and lesions detection, in: Deep ment of Models Genesis open science, releasing our codes and learning and data labeling for medical applications. Springer, pp. 77–85. Bilic, P., Christ, P.F., Vorontsov, E., Chlebus, G., Chen, H., Dou, Q., Fu, C.W., models to the public. Creating all Models Genesis, an ambi- Han, X., Heng, P.A., Hesser, J., et al., 2019. The liver tumor segmentation tious undertaking, takes a village; therefore, we would like to benchmark (lits). arXiv preprint arXiv:1901.04056 . invite researchers around the world to contribute to this e ort, Buzug, T.M., 2011. Computed tomography, in: Springer Handbook of Medical and hope that our collective e orts will lead to the Holy Grail Technology. Springer, pp. 311–342. Cai, J., Lu, L., Harrison, A.P., Shi, X., Chen, P., Yang, L., 2018. Iterative of Models Genesis, all powerful across diseases, organs, and attention mining for weakly supervised thoracic disease pattern localization modalities. 
in chest x-rays, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 589–598. Caron, M., Bojanowski, P., Joulin, A., Douze, M., 2018. Deep clustering for Acknowledgments unsupervised learning of visual features, in: Proceedings of the European Conference on Computer Vision, pp. 132–149. This research has been supported partially by ASU and Mayo Caron, M., Bojanowski, P., Mairal, J., Joulin, A., 2019. Unsupervised pre- Clinic through a Seed Grant and an Innovation Grant, and par- training of image features on non-curated data, in: Proceedings of the IEEE tially by the National Institutes of Health (NIH) under Award International Conference on Computer Vision, pp. 2959–2968. Zongwei Zhou et al. / Medical Image Analysis (2020) 17 Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? a new model He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpass- and the kinetics dataset, in: Proceedings of the IEEE Conference on Com- ing human-level performance on imagenet classification, in: Proceedings puter Vision and Pattern Recognition, pp. 6299–6308. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D., 2019a. 1026–1034. Self-supervised learning for medical image analysis using image context He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image restoration. Medical image analysis 58, 101539. recognition, in: Proceedings of the IEEE Conference on Computer Vision Chen, S., Ma, K., Zheng, Y., 2019b. Med3d: Transfer learning for 3d medical and Pattern Recognition, pp. 770–778. image analysis. arXiv preprint arXiv:1904.00625 . Hendrycks, D., Mazeika, M., Kadavath, S., Song, D., 2019. Using self- Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020. 
A simple frame- supervised learning can improve model robustness and uncertainty, in: Ad- work for contrastive learning of visual representations. arXiv preprint vances in Neural Information Processing Systems, pp. 15637–15648. arXiv:2002.05709 . Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2017. Densely Chen, T., Zhai, X., Ritter, M., Lucic, M., Houlsby, N., 2019c. Self-supervised connected convolutional networks, in: Proceedings of the IEEE Conference gans via auxiliary rotation loss, in: Proceedings of the IEEE Conference on on Computer Vision and Pattern Recognition, p. 3. Computer Vision and Pattern Recognition, pp. 12154–12163. Hurst, R.T., Burke, R.F., Wissner, E., Roberts, A., Kendall, C.B., Lester, S.J., Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V., 2019. Autoaugment: Somers, V., Goldman, M.E., Wu, Q., Khandheria, B., 2010. Incidence of Learning augmentation strategies from data, in: Proceedings of the IEEE subclinical atherosclerosis as a marker of cardiovascular risk in retired pro- conference on computer vision and pattern recognition, pp. 113–123. fessional football players. The American journal of cardiology 105, 1107– Dao, T., Gu, A., Ratner, A.J., Smith, V., De Sa, C., Re, ´ C., 2019. A kernel theory 1111. of modern data augmentation. Proceedings of machine learning research 97, Iizuka, S., Simo-Serra, E., Ishikawa, H., 2017. Globally and locally consistent 1528. image completion. ACM Transactions on Graphics (ToG) 36, 107. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A Io e, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network large-scale hierarchical image database, in: Proceedings of the IEEE Con- training by reducing internal covariate shift, in: Bach, F., Blei, D. (Eds.), ference on Computer Vision and Pattern Recognition, IEEE. pp. 248–255. 
Proceedings of the 32nd International Conference on Machine Learning, Ding, Y., Sohn, J.H., Kawczynski, M.G., Trivedi, H., Harnish, R., Jenkins, PMLR, Lille, France. pp. 448–456. URL: http://proceedings.mlr. N.W., Lituiev, D., Copeland, T.P., Aboian, M.S., Mari Aparici, C., et al., press/v37/ioffe15.html. 2018. A deep learning model to predict a diagnosis of alzheimer disease by Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Mark- using 18f-fdg pet of the brain. Radiology 290, 456–464. lund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al., 2019. Chexpert: Doersch, C., Gupta, A., Efros, A.A., 2015. Unsupervised visual representation A large chest radiograph dataset with uncertainty labels and expert compar- learning by context prediction, in: Proceedings of the IEEE International ison. arXiv preprint arXiv:1901.07031 . Conference on Computer Vision, pp. 1422–1430. Jamaludin, A., Kadir, T., Zisserman, A., 2017. Self-supervised learning for Doersch, C., Zisserman, A., 2017. Multi-task self-supervised visual learning, spinal mris, in: Deep Learning in Medical Image Analysis and Multimodal in: Proceedings of the IEEE International Conference on Computer Vision, Learning for Clinical Decision Support. Springer, pp. 294–302. pp. 2051–2060. Jing, L., Tian, Y., 2020. Self-supervised visual feature learning with deep neural Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T., 2015. networks: A survey. IEEE Transactions on Pattern Analysis and Machine Discriminative unsupervised feature learning with exemplar convolutional Intelligence . neural networks. IEEE transactions on pattern analysis and machine intelli- Kang, G., Dong, X., Zheng, L., Yang, Y., 2017. Patchshue regularization. gence 38, 1734–1747. arXiv preprint arXiv:1707.07103 . Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, Kinga, D., Adam, J.B., 2015. Adam: A method for stochastic optimization, in: S., 2017. 
Dermatologist-level classification of skin cancer with deep neural International Conference on Learning Representations (ICLR). networks. Nature 542, 115. Kolesnikov, A., Zhai, X., Beyer, L., 2019. Revisiting self-supervised visual rep- Forbes, G.B., 2012. Human body composition: growth, aging, nutrition, and resentation learning, in: Proceedings of the IEEE conference on Computer activity. Springer Science & Business Media. Vision and Pattern Recognition, pp. 1920–1929. Gan, Z., Henao, R., Carlson, D., Carin, L., 2015. Learning deep sigmoid belief Liang, J., Bi, J., 2007. Computer aided detection of pulmonary embolism with networks with data augmentation, in: Artificial Intelligence and Statistics, tobogganing and mutiple instance classification in ct pulmonary angiogra- pp. 268–276. phy, in: Biennial International Conference on Information Processing in Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., David- Medical Imaging, Springer. pp. 630–641. son, B., Pereira, S.P., Clarkson, M.J., Barratt, D.C., 2018a. Automatic multi- Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, organ segmentation on abdominal ct with dense v-networks. IEEE transac- M., Van Der Laak, J.A., Van Ginneken, B., Sanchez, ´ C.I., 2017. A survey tions on medical imaging 37, 1822–1834. on deep learning in medical image analysis. Medical image analysis 42, Gibson, E., Li, W., Sudre, C., Fidon, L., Shakir, D.I., Wang, G., Eaton-Rosen, 60–88. Z., Gray, R., Doel, T., Hu, Y., et al., 2018b. Niftynet: a deep-learning plat- LiTS, 2017. Results of all submissions for liver segmentation. URL: https: form for medical imaging. Computer methods and programs in biomedicine //competitions.codalab.org/competitions/17094#results. 158, 113–122. Lu, L., Zheng, Y., Carneiro, G., Yang, L., 2017. Deep learning and convolu- Glorot, X., Bengio, Y., 2010. 
Understanding the diculty of training deep tional neural networks for medical image computing: precision medicine, feedforward neural networks, in: Proceedings of the Thirteenth International high performance and large-scale datasets. Springer. Conference on Artificial Intelligence and Statistics, pp. 249–256. LUNA, 2016. Results of all submissions for nodule false positive reduction. Goyal, P., Mahajan, D., Gupta, A., Misra, I., 2019. Scaling and bench- URL: https://luna16.grand-challenge.org/results/. marking self-supervised visual representation learning. arXiv preprint Ma, Y., Zhou, Q., Chen, X., Lu, H., Zhao, Y., 2019. Multi-attention network arXiv:1905.01235 . for thoracic disease classification and localization, in: ICASSP 2019-2019 Graham, B., 2014. Fractional max-pooling. arXiv preprint arXiv:1412.6071 . IEEE International Conference on Acoustics, Speech and Signal Processing Greenspan, H., van Ginneken, B., Summers, R.M., 2016. Guest editorial deep (ICASSP), IEEE. pp. 1378–1382. learning in medical imaging: Overview and future promise of an exciting Mahendran, A., Thewlis, J., Vedaldi, A., 2018. Cross pixel optical-flow similar- new technique. IEEE Transactions on Medical Imaging 35, 1153–1159. ity for self-supervised learning, in: Asian Conference on Computer Vision, Guan, Q., Huang, Y., 2018. Multi-label chest x-ray image classification via Springer. pp. 99–116. category-wise residual attention learning. Pattern Recognition Letters . Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Guendel, S., Grbic, S., Georgescu, B., Liu, S., Maier, A., Comaniciu, D., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al., 2015. The multimodal 2018. Learning to recognize abnormalities in chest x-rays with location- brain tumor image segmentation benchmark (brats). IEEE transactions on aware dense networks, in: Iberoamerican Congress on Pattern Recognition, medical imaging 34, 1993. Springer. pp. 757–765. Mortenson, M.E., 1999. 
Mathematics for computer graphics applications. Industrial Press Inc.
Hara, K., Kataoka, H., Satoh, Y., 2018. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555.
Mundhenk, T.N., Ho, D., Chen, B.Y., 2018. Improvements to context based self-supervised learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9339–9348.
NLST, 2011. Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine 365, 395–409.
Noroozi, M., Favaro, P., 2016. Unsupervised learning of visual representations by solving jigsaw puzzles, in: European Conference on Computer Vision, Springer. pp. 69–84.
Noroozi, M., Vinjimoor, A., Favaro, P., Pirsiavash, H., 2018. Boosting self-supervised learning via knowledge transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367.
Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 1345–1359.
Sun, C., Guo, S., Zhang, H., Li, J., Chen, M., Ma, S., Jin, L., Liu, X., Li, X., Qian, X., 2017a. Automatic segmentation of liver tumors from multi-phase contrast-enhanced ct images based on fcns. Artificial intelligence in medicine 83, 58–66.
Sun, C., Shrivastava, A., Singh, S., Gupta, A., 2017b. Revisiting unreasonable effectiveness of data in deep learning era, in: Proceedings of the IEEE international conference on computer vision, pp. 843–852.
Sun, W., Zheng, B., Qian, W., 2017c. Automatic feature learning using multi-channel roi based on deep structured algorithms for computerized lung cancer diagnosis. Computers in biology and medicine 89, 530–539.
Tajbakhsh, N., Gotway, M.B., Liang, J., 2015.
Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 62–69.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A., 2016. Context encoders: Feature learning by inpainting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544.
Perez, L., Wang, J., 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.
Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., Nielsen, M., 2013. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 246–253.
Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Ré, C., 2017. Learning to compose domain-specific transformations for data augmentation, in: Advances in neural information processing systems, pp. 3236–3246.
Rawat, W., Wang, Z., 2017. Deep convolutional neural networks for image classification: A comprehensive review. Neural computation 29, 2352–2449.
Tajbakhsh, N., Hu, Y., Cao, J., Yan, X., Xiao, Y., Lu, Y., Liang, J., Terzopoulos, D., Ding, X., 2019a. Surrogate supervision for medical image analysis: Effective deep learning from limited quantities of labeled data. arXiv preprint arXiv:1901.08707.
Tajbakhsh, N., Shin, J.Y., Gotway, M.B., Liang, J., 2019b. Computer-aided detection and visualization of pulmonary embolism using a novel, compact, and discriminative image representation. Medical image analysis 58, 101541.
Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Hurst, R.T., Kendall, C.B., Gotway, M.B., Liang, J., 2016.
Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging 35, 1299–1312.
Ross, T., Zimmerer, D., Vemuri, A., Isensee, F., Wiesenfarth, M., Bodenstedt, S., Both, F., Kessler, P., Wagner, M., Müller, B., et al., 2018. Exploiting the potential of unlabeled endoscopic video data with self-supervised learning. International journal of computer assisted radiology and surgery 13, 925–933.
Roth, H.R., Lu, L., Liu, J., Yao, J., Seff, A., Cherry, K., Kim, L., Summers, R.M., 2015. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE transactions on medical imaging 35, 1170–1181.
Roth, H.R., Lu, L., Seff, A., Cherry, K.M., Hoffman, J., Wang, S., Liu, J., Turkbey, E., Summers, R.M., 2014. A new 2.5d representation for lymph node detection using random sets of deep convolutional neural network observations, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 520–527.
Taleb, A., Loetzsch, W., Danz, N., Severin, J., Gaertner, T., Bergner, B., Lippert, C., 2020. 3d self-supervised methods for medical imaging. arXiv preprint arXiv:2006.03829.
Tang, H., Zhang, C., Xie, X., 2019. Nodulenet: Decoupled false positive reduction for pulmonary nodule detection and segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 266–274.
Tang, Y., Wang, X., Harrison, A.P., Lu, L., Xiao, J., Summers, R.M., 2018. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs, in: International Workshop on Machine Learning in Medical Imaging, Springer. pp. 249–258.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A., 2010.
Sayed, N., Brattoli, B., Ommer, B., 2018.
Cross and learn: Cross-modal self-supervision, in: German Conference on Pattern Recognition, Springer. pp. 228–243.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11, 3371–3408.
Setio, A.A.A., Ciompi, F., Litjens, G., Gerke, P., Jacobs, C., Van Riel, S.J., Wille, M.M.W., Naqibullah, M., Sánchez, C.I., van Ginneken, B., 2016. Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE transactions on medical imaging 35, 1160–1169.
Setio, A.A.A., Traverso, A., De Bel, T., Berens, M.S., van den Bogaard, C., Cerello, P., Chen, H., Dou, Q., Fantacci, M.E., Geurts, B., et al., 2017. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge. Medical image analysis 42, 1–13.
Shen, D., Wu, G., Suk, H.I., 2017. Deep learning in medical image analysis. Annual review of biomedical engineering 19, 221–248.
Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E., 2018. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience 2018.
Wang, H., Zhou, Z., Li, Y., Chen, Z., Lu, P., Wang, W., Liu, W., Yu, L., 2017a. Comparison of machine learning methods for classifying mediastinal lymph node metastasis of non-small cell lung cancer from 18 f-fdg pet/ct images. EJNMMI research 7, 11.
Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M., 2017b. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106.
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality
Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M., 2016.
Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35, 1285–1298.
assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 600–612.
Shorten, C., Khoshgoftaar, T.M., 2019. A survey on image data augmentation for deep learning. Journal of Big Data 6, 60.
Siddiquee, M.M.R., Zhou, Z., Tajbakhsh, N., Feng, R., Gotway, M.B., Bengio, Y., Liang, J., 2019. Learning fixed points in generative adversarial networks: From image-to-image translation to disease detection and localization, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 191–200.
Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S., Dickscheid, T., 2018. Improving cytoarchitectonic segmentation of human brain areas with self-supervised siamese networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 663–671.
Weiss, K., Khoshgoftaar, T.M., Wang, D., 2016. A survey of transfer learning. Journal of Big Data 3, 9.
Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D., 2016. Understanding data augmentation for classification: when to warp?, in: 2016 international conference on digital image computing: techniques and applications (DICTA), IEEE. pp. 1–6.
Wu, B., Zhou, Z., Wang, J., Wang, Y., 2018. Joint learning for pulmonary nodule segmentation, attributes and malignancy prediction, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE. pp. 1109–1113.
Wu, S., Zhang, H.R., Valiant, G., Ré, C., 2020. On the generalization effects of linear transformations in data augmentation. arXiv preprint arXiv:2005.00695.
Standley, T., Zamir, A.R., Chen, D., Guibas, L., Malik, J., Savarese, S., 2019. Which tasks should be learned together in multi-task learning? arXiv
Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014.
How transferable are features in deep neural networks?, in: Advances in neural information processing systems, pp. 3320–3328.
preprint arXiv:1905.07553.
Zhang, L., Qi, G.J., Wang, L., Luo, J., 2019. Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2555.
Zhang, R., Isola, P., Efros, A.A., 2016. Colorful image colorization, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 649–666.
Zhang, R., Isola, P., Efros, A.A., 2017. Split-brain autoencoders: Unsupervised learning by cross-channel prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058–1067.
Zhang, T., 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the twenty-first international conference on Machine learning, ACM. p. 116.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A., 2016. Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.
Zhou, Z., Shin, J., Feng, R., Hurst, R.T., Kendall, C.B., Liang, J., 2019a. Integrating active learning and transfer learning for carotid intima-media thickness video interpretation. Journal of digital imaging 32, 290–299.
Zhou, Z., Shin, J., Zhang, L., Gurudu, S., Gotway, M., Liang, J., 2017. Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7340–7349.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2018. Unet++: A nested u-net architecture for medical image segmentation, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 3–11.
Zhou, Z., Sodha, V., Rahman Siddiquee, M.M., Feng, R., Tajbakhsh, N., Gotway, M.B., Liang, J., 2019b. Models genesis: Generic autodidactic models for 3d medical image analysis, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Springer International Publishing, Cham. pp. 384–393. URL: https://link.springer.com/chapter/10.1007/978-3-030-32251-9_42.
Zhu, J., Li, Y., Hu, Y., Ma, K., Zhou, S.K., Zheng, Y., 2020. Rubik’s cube+: A self-supervised feature learning framework for 3d medical image analysis. Medical Image Analysis, 101746.
Zhuang, X., Li, Y., Hu, Y., Ma, K., Yang, Y., Zheng, Y., 2019. Self-supervised feature learning for 3d medical images by playing a rubik’s cube, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 420–428.

Appendix A. Implementation details of revised baselines

This work is among the first efforts to create a comprehensive benchmark for existing self-supervised learning methods for 3D medical image analysis. We have extended the six most representative self-supervised learning methods into their 3D versions, including De-noising (Vincent et al., 2010), In-painting (Pathak et al., 2016), Jigsaw (Noroozi and Favaro, 2016), and Patch-shuffling (Chen et al., 2019a). These methods were originally introduced for 2D imaging. On the other hand, the most recent 3D self-supervised method (Zhuang et al., 2019) learns representation by playing a Rubik’s cube. We have reimplemented it because the official implementation was not publicly available at the time of writing. All of the models are pre-trained using the LUNA 2016 dataset (Setio et al., 2017) with the same sub-volumes extracted from CT scans as our models (see Sec. 3.1). The detailed implementations of the baselines are elaborated in the following sections.

Appendix A.1. Extended 3D De-noising

In our 3D De-noising, which is inspired by its 2D counterpart (Vincent et al., 2010), the model is trained to restore the original sub-volume from its transformed one with additive Gaussian noise (randomly sampling $\sigma \in [0, 0.1]$). To correctly restore the original sub-volume, models are required to learn Gabor-like edge detectors when denoising transformed sub-volumes. Following the proposed image restoration training scheme, the auto-encoder network is replaced with a 3D U-Net, wherein the input is a $64 \times 64 \times 32$ sub-volume that has undergone Gaussian noise and the output is the restored sub-volume. The L2 distance between input and output is used as the loss function.

Appendix A.2. Extended 3D In-painting

In our 3D In-painting, which is inspired by its 2D counterpart (Pathak et al., 2016), the model is trained to in-paint arbitrary cutout regions based on the rest of the sub-volume. A qualitative illustration of the image in-painting task is shown in the right panel of Fig. A.11(a). To correctly predict missing regions, networks are required to learn local continuities of organs in medical images via interpolation. Unlike the original in-painting, the adversarial loss and discriminator are excluded from our implementation of the 3D version because our primary goal is to empower models with generic representation, rather than generating sharper and realistic sub-volumes. The generator is a 3D U-Net, consisting of an encoder and a decoder. The input of the encoder is a $64 \times 64 \times 32$ sub-volume that needs to be in-painted. Their decoder works differently than our inner-cutout because it predicts the missing region only, and therefore, the loss is computed on the cutout region only; an ablation study on the loss is presented in Appendix C.2.

Appendix A.3. Extended 3D Jigsaw

In our 3D Jigsaw, which is inspired by its 2D counterpart (Noroozi and Favaro, 2016), we utilize the implementation by Taleb et al. (2020) (Self-Supervised 3D Tasks: github.com/HealthML/self-supervised-3d-tasks), wherein the puzzles are created by sampling a $3 \times 3 \times 3$ grid of 3D patches. Then, these patches are shuffled according to an arbitrary permutation, selected from a set of predefined permutations. This set with size $P = 100$ is chosen out of the $(3 \times 3 \times 3)!$ possible permutations, by following the Hamming distance based algorithm, and each permutation is assigned an index. As a result, the problem is cast as a $P$-way classification task, i.e., the model is trained to recognize the applied permutation index, allowing us to solve the 3D puzzles efficiently. We build the classification model by taking the encoder of 3D U-Net and appending a sequence of fc layers. In the implementation, we minimize the cross-entropy loss of the list of extracted puzzles.

Appendix A.4. Extended 3D Patch-shuffling

In our 3D Patch-shuffling, which is inspired by its 2D counterpart (Chen et al., 2019a), the model learns image representation by restoring the image context. Given a sub-volume, we randomly select two isolated small 3D patches and swap their context. We set the length, width, and height of the 3D patch to be proportional to those in the entire sub-volume by 25% to 50%. Repeating this process for $T = 10$ times can generate the transformed sub-volume (see examples in Fig. A.10(a)). The model is trained to restore the original sub-volume, where L2 distance between input and output is used as the loss function. To process volumetric input and ensure a fair comparison with other baselines, we replace their U-Net with 3D U-Net architecture, where the encoder and decoder serve as analysis and restoration parts, respectively.

Fig. A.10: A direct comparison between global patch shuffling (Chen et al., 2019a) and our local pixel shuffling. (a) illustrates ten example images that have undergone local-shuffling and patch-shuffling independently. As seen, the overall anatomical structure such as individual organs, blood vessels, lymph nodes, and other soft tissue structures are preserved in the transformed image through local-shuffling. (b) presents the performance on five target tasks, showing that models pre-trained by our local-shuffling noticeably outperform those pre-trained by patch-shuffling for cross-domain transfer learning (BMS).

Appendix A.5. Extended 3D DeepCluster

In our 3D DeepCluster, which is inspired by its 2D counterpart (Caron et al., 2018), we iteratively cluster deep features extracted from sub-volumes by k-means and use the subsequent assignments as supervision to update the weights of the model. Through clustering, the model can obtain useful general-purpose visual features, requiring little domain knowledge and no specific signal from the inputs. We replaced the original AlexNet/VGG architecture with the encoder of 3D U-Net to process 3D input sub-volumes. The number of clusters that works best for 2D tasks may not be a good choice for 3D tasks. To ensure a fair comparison, we extensively tune this hyper-parameter in $\{10, 20, 40, 80, 160, 320\}$ and finally set it to 260 from the narrowed down search space of $\{240, 260, 280\}$. Unlike ImageNet models for 2D imaging tasks, there is no available pre-trained 3D feature extractor for medical imaging tasks; therefore, we randomly initialize the model weights at the beginning. Our Models Genesis, the first generic 3D pre-trained models, could potentially be used as the 3D feature extractor and co-trained with 3D DeepCluster.

Appendix A.6. Rubik’s cube

We implement Rubik’s Cube with respect to Zhuang et al. (2019), which consists of cube rearrangement and cube rotation. Like playing a Rubik’s cube, this proxy task enforces models to learn translational and rotational invariant features from raw 3D data. Given a sub-volume, we partition it into a $2 \times 2 \times 2$

Fig. A.11: A direct comparison between image in-painting (Pathak et al., 2016) and our inner-cutout.
(a) contrasts our inner-cutout with in-painting, wherein the model in the former scheme computes loss on the entire image and the model in the latter scheme computes loss only for the cutout area. (b) presents the performance on five target tasks, showing that inner-cutout is better suited for target classification tasks (e.g., NCC and ECC), while in-painting is more helpful for target segmentation tasks (e.g., NCS, LCS, and BMS).

grid of cubes. In addition to predicting orders (3D Jigsaw), this proxy task permutes the cubes with random rotations, forcing models to predict the orientation. Following the original paper, we limit the directions for cube rotation, i.e., only allowing 180° horizontal and vertical rotations, to reduce the complexity of the task. The eight cubes are then fed into a Siamese network with eight branches sharing the same weight to extract features. The feature maps from the last fully-connected or convolution layer of all branches are concatenated and given as input to the fully-connected layer of separate tasks, i.e., cube ordering and orienting, which are supervised by permutation loss and rotation loss, respectively, with equal weights.

Appendix B. Configurations of publicly available models

For publicly available models, we do not re-train their proxy tasks and instead simply endeavor to find the best hyper-parameters for each of them in target tasks. We compare them with our Models Genesis from a user perspective, which might seem unfair from a research perspective because many variables are asymmetric among the competitors, such as programming platform, model architecture, number of parameters, etc. However, the goal of this section is to experiment with existing ready-to-use pre-trained models under different medical tasks; therefore, we presume that all of the publicly available models and their configurations have been carefully composed to the optimal setting.

Appendix B.1. NiftyNet

We examine the effectiveness of fine-tuning from NiftyNet in five target tasks. We should note that NiftyNet is not initially designed for transfer learning but is one of the few publicly available supervised pre-trained 3D models. The model from Gibson et al. (2018a) has been considered as the baseline in our experiments because it has also been pre-trained on the chest region in CT modality and applied an encoder-decoder architecture that is similar to our work. We directly adopt the pre-trained weights of the dense V-Net architecture provided by NiftyNet, so it carries a smaller number of parameters than our 3D U-Net (2.60M vs. 16.32M). For target classification tasks, we use the dense V-Net encoder by appending a sequence of fc layers; for target segmentation tasks, we use the entire dense V-Net. Since NiftyNet is developed in TensorFlow, all five target tasks are re-implemented using their built-in configuration. For each target task, we have tuned hyper-parameters (e.g., learning rate and optimizer) and applied extensive data augmentations (e.g., rotation and scaling).

Appendix B.2. Inflated 3D

We download the Inflated 3D (I3D) model pre-trained from Flow streams in the Kinetics dataset (Hara et al., 2018) and fine-tune it on our five target tasks. The input sub-volume is copied into two channels to align with the required input shape. For target classification tasks, we take the pre-trained I3D and append a sequence of randomly initialized fully-connected layers. For target segmentation tasks, we take the pre-trained I3D as the encoder and expand a decoder to predict the segmentation map, resulting in a U-Net like architecture. The decoder is the same as that implemented in our 3D U-Net, consisting of up-sampling layers followed by a sequence of convolutional layers, batch normalization, and ReLU activation. Besides, four skip connections are built between the encoder and decoder, wherein feature maps before each pooling layer in the encoder are concatenated with same-scale feature maps in the decoder. All of the layers in the model are trainable during transfer learning. Adam method (Kinga and Adam, 2015) with a learning rate of 1e-4 is used for optimization.

Appendix B.3. MedicalNet

We download MedicalNet models (Chen et al., 2019b) that have been pre-trained on eight publicly available 3D segmentation datasets. ResNet-50 and ResNet-101 backbones are chosen because they are reported by Chen et al. (2019b) as the most compelling backbones for target segmentation and classification tasks, respectively. Like I3D, we append a decoder at the end of the pre-trained encoder, randomly initialize its weights, and link the encoder with the decoder using skip connections. Owing to the 3D ResNet backbones, the resultant segmentation network for MedicalNet is much heavier than our 3D U-Net. To be consistent with the original programming platform of MedicalNet, we re-implement all five target tasks in PyTorch, using the same data separation and augmentation. We report the highest results achieved by either of the two backbones in Table 3.

Appendix C. Ablation experiments

Appendix C.1. Local pixel shuffling vs. global patch shuffling

In the main paper, we have reported results of patch-shuffling (Chen et al., 2019a) as a baseline in Table 3 and our local-shuffling in Fig. 3. To underline the value of preserving local and global structural consistency in the proxy task, we provide an explicit comparison between the two counterparts in Fig. A.10, arriving at three findings:

• Global patch shuffling preserves local information while distorting global structure; local pixel shuffling maintains global structure but loses local details.
• For same-domain transfer learning (e.g., pre-training and fine-tuning in CT images), global-shuffling and local-shuffling reveal no significant difference in terms of target task performance. Note that local-shuffling is preferable when recognizing small objects in target tasks (e.g., pulmonary nodule and embolism), whereas patch-shuffling is beneficial for large objects (e.g., brain tumor and liver).
• For cross-domain transfer learning (e.g., pre-training in CT and fine-tuning in MRI images), models pre-trained by our local-shuffling noticeably outperform those pre-trained by patch-shuffling.

Appendix C.2. Compute loss on cutouts vs. entire images

The results of our ablation study for in-painting and inner-cutout on five target tasks are presented in Fig. A.11. We set all the hyper-parameters the same except for one factor: where to compute MSE loss, only cutout areas or the entire image. In general, there is a marginal difference in target segmentation tasks, but inner-cutout is superior to in-painting in target classification tasks. These results are in line with our hypothesis in Sec. 3.1: the model must distinguish original versus transformed parts within the image, preserving the context if it is original and, otherwise, in-painting the context. Seemingly, in-painting that only computes loss on cutouts can fail to learn comprehensive representation as it is unable to leverage advancements from both ends.

Appendix C.3. Masked area size in outer-cutout

When applying cutout transformations to our self-supervised learning framework, we have one hyper-parameter to evaluate, i.e., the size of cutout regions. Intuitively, it can influence the difficulty of the image restoration task. To explore the impact of this parameter on the performance of target tasks, we have conducted an ablation study to extensively search for the optimal value, spanning from 0% to 90%, incremented by intervals of 10%. Fig. C.12 shows the performance of all five target tasks under different settings, suggesting that outer-cutout is robust to hyper-parameter changes to some extent. This finding is also consistent with that recommended in the original in-painting paper, where Pathak et al. (2016) removed a number of smaller possibly overlapping masks, covering up to 1/4 of the image. Altogether, we finally cut out less than 1/4 of the entire sub-volume in both outer and inner cutout implementations.

Fig. C.12: We extensively search for the optimal size of cutout regions spanning from 0% to 90%, incremented by 10%. The points plotted within the red shade denote no significant difference ($p > 0.05$) from the pinnacle of the curve. The horizontal red and gray lines refer to the performances achieved by Models Genesis and learning from scratch, respectively. This ablation study reveals that cutting 20% to 40% of regions out could produce the most robust performance of target tasks. As a result, in our implementation, we cut out around 25% of the regions from each sub-volume.

Appendix D. Qualitative assessment of image restoration

Since there is no such metric to directly determine the power of image representation, rather than constrain the representation, our paper aims to design an image restoration task to let the model learn generic image representation from 3D medical images. In doing so, as seen in Sec. 5.4, we have modified the definition of a good representation. As presented in Sec. 3.1, Genesis CT and Genesis X-ray are pre-trained on LUNA 2016 (Setio et al., 2017) and ChestX-ray8 (Wang et al., 2017b), respectively, using a series of self-supervised learning schemes with different image transformations. In this section, we have (1) illustrated more examples of our four individual transformations (i.e., non-linear, local-shuffling, outer-cutout, and inner-cutout) in Fig. D.13; (2) evaluated the power of the pre-trained model by assessing restoration quality on previously unseen patients’ images not only from the LUNA 2016 dataset (see Fig. D.14), but also from different modalities, covering CT, X-ray, and MRI (see Fig. D.15). The qualitative assessment shows that our pre-trained model is not merely overfitting on anatomical patterns in specific patients, but indeed can be robustly used for restoring images, thus can be generalized to many target tasks.

To assess the image restoration quality at the time of inference, we pass the transformed images to the models that have been trained with different self-supervised learning schemes, including four individual schemes and one combined scheme. In our visualization, the input images have undergone four individual transformations as well as eight different combined transformations, including the identity mapping (i.e., no transformation). As shown in Fig. D.14, the combined scheme can restore the unseen image by handling a variety of transformations (framed in red), whereas the models trained with the individual scheme can only restore unseen images that have undergone the same transformation that they were trained on (framed in green). This qualitative observation is consistent with our experimental finding in Sec. 4.1: the combined learning scheme achieves superior and more robust results over the individual scheme in transfer learning.

In Fig. D.15, we have further provided a qualitative assessment of image restoration quality by Genesis CT and Genesis X-ray, across medical imaging modalities. In our visualization, the input images are selected from four different medical modalities, covering X-ray, CT, Ultrasound, and MRI. It is clear from the figure that even though the models are only trained on a single image modality, they can largely maintain the texture and structures during restoration, not only within the same modality but also across different ones. These observations suggest that Models Genesis are of great potential in transferring learned image representation across diseases, organs, datasets, and modalities.
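The count of twelve inputs used in the visualization (the identity mapping, four individual transformations, and seven combinations of them) can be reproduced with a short enumeration. The sketch below is illustrative only. It assumes, and this appendix does not state it, that outer-cutout and inner-cutout are never applied to the same image, while non-linear and local-shuffling can each be switched on or off independently. The abbreviations NL, LS, OC, and IC follow the figure labels; the function name `enumerate_transformations` is hypothetical.

```python
from itertools import product


def enumerate_transformations():
    """List every admissible combination of the four transformations.

    Assumption (not stated in this appendix): outer-cutout (OC) and
    inner-cutout (IC) are mutually exclusive, so the cutout slot takes
    exactly one of: no cutout, OC, or IC.
    """
    combos = []
    for use_nl, use_ls, cutout in product(
        (False, True), (False, True), (None, "OC", "IC")
    ):
        names = []
        if use_nl:
            names.append("NL")  # non-linear intensity transformation
        if use_ls:
            names.append("LS")  # local pixel shuffling
        if cutout is not None:
            names.append(cutout)  # either outer- or inner-cutout
        combos.append(tuple(names))
    return combos


combos = enumerate_transformations()
identity_only = [c for c in combos if len(c) == 0]  # no transformation
individual = [c for c in combos if len(c) == 1]     # one transformation
combined = [c for c in combos if len(c) >= 2]       # two or more

print(len(combos), len(identity_only), len(individual), len(combined))
for c in combos:
    print("+".join(c) if c else "identity")
```

Under this assumption the enumeration yields 2 x 2 x 3 = 12 cases, which agrees with the four individual transformations plus eight combined ones (identity included) described in the text.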
Fig. D.13: Illustration of the proposed four image transformations. For simplicity and clarity, we illustrate the transformation on a 2D CT slice, but our Genesis Chest CT is trained using 3D sub-volumes directly, transformed in a 3D manner; our 3D image transformations, with an exception of non-linear transformation, cannot be approximated in 2D. For ease of understanding, in (a) non-linear transformation, we have displayed an image undergoing different translating functions in Columns 2–7; in (b) local-shuffling, (c) outer-cutout, and (d) inner-cutout transformation, we have illustrated each of the processes step by step in Columns 2–6, where the first and last columns denote the original images and the final transformed images, respectively. In local-shuffling, a different window (b) is automatically generated and used in each step.

Fig. D.14: The left and right panels show the qualitative assessment of image restoration quality using Genesis CT and Genesis X-ray, respectively. These models are trained with different training schemes, including four individual schemes (Columns 3–6) and a combined scheme (Column 7). As discussed in Fig. 1, each original image $x$ can possibly undergo twelve different transformations. We test the models with all these possible twelve transformed images $\tilde{x}$. We specify types of the image transformation $f(\cdot)$ for each row and the training scheme $g(\cdot)$ for each column. First of all, it can be seen that the models trained with individual schemes can restore previously unseen images that have undergone the same transformation very well (framed in green), but fail to handle other transformations. Taking non-linear transformation $f_{(\mathrm{NL})}(\cdot)$ as an example, any individual training scheme besides non-linear transformation itself cannot invert the pixel intensity from transformed whitish to the original blackish. As expected, the model trained with the combined scheme successfully restores original images from various transformations (framed in red). Second, the model trained with the combined scheme shows it is superior to other models even if they are trained with and tested on the same transformation. For example, in the local-shuffling case $f_{(\mathrm{LS})}(\cdot)$, the image recovered from the local-shuffling pre-trained model $g_{(\mathrm{LS})}(\cdot)$ is noisy and lacks texture. However, the model trained with the combined scheme $g_{(\mathrm{NL,LS,OC,IC})}(\cdot)$ generates an image with more underlying structures, which demonstrates that learning with augmented tasks can even improve the performance on each of the individual tasks. Third, the model trained with the combined scheme significantly outperforms models trained with individual training schemes when restoring images that have undergone seven different combined transformations (Rows 6–12). For example, the model trained with non-linear transformation $g_{(\mathrm{NL})}(\cdot)$ can only recover the intensity distribution in the transformed image that has undergone $f_{(\mathrm{NL,IC})}(\cdot)$ but leaves the inner cutouts unchanged. These observations suggest that the model trained with the proposed unified self-supervised learning framework can successfully learn general anatomical structures and yield promising transferability on different target tasks. The quality assessment of image restoration further confirms our experimental observation, provided in Sec. 4.1, that the combined learning scheme exceeds each individual in transfer learning.

Fig. D.15: The left and right panels visualize the qualitative assessment of image restoration quality by Genesis CT and Genesis X-ray, respectively, across medical imaging modalities.
For testing, we use the pre-trained model to directly restore images from LUNA 2016 (CT) (Setio et al., 2017), ChestX-ray8 (X-ray) (Wang et al., 2017b), CIMT (Ultrasound) (Hurst et al., 2010; Zhou et al., 2019a), and BraTS (MRI) (Menze et al., 2015). Though the models are only trained on single image modality, they can largely maintain the texture and structures during restoration not only within the same modality (framed in red), but also across di erent modalities. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)
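The caption of Fig. D.14 counts twelve possible transformed inputs per image: the four individual transformations, the identity mapping, and seven combinations. The sketch below reproduces that count from the one stated rule that outer-cutout and inner-cutout never co-occur. The abbreviations NL, LS, OC, and IC follow the caption; the function names, the probability `prob`, and the way `sample_combination` resolves a double cutout are illustrative assumptions rather than the authors' implementation.

```python
import random
from itertools import combinations

# Abbreviations follow the caption of Fig. D.14.
TRANSFORMS = ("NL", "LS", "OC", "IC")


def valid_combinations():
    """List every subset of the four transformations that never applies
    outer-cutout (OC) and inner-cutout (IC) together."""
    subsets = []
    for size in range(len(TRANSFORMS) + 1):
        for subset in combinations(TRANSFORMS, size):
            if "OC" in subset and "IC" in subset:
                continue  # the two cutouts are mutually exclusive
            subsets.append(subset)
    return subsets


def sample_combination(rng, prob=0.5):
    """Draw each transformation independently with probability `prob`
    (a hypothetical value), then drop one cutout at random if both were
    drawn (an assumed way of enforcing the exclusivity rule)."""
    chosen = [name for name in TRANSFORMS if rng.random() < prob]
    if "OC" in chosen and "IC" in chosen:
        chosen.remove(rng.choice(["OC", "IC"]))
    return tuple(chosen)


combos = valid_combinations()
print(len(combos))                           # all valid subsets
print(sum(len(c) == 1 for c in combos))      # individual transformations
print(sum(len(c) >= 2 for c in combos))      # combined transformations
print(sample_combination(random.Random(0)))  # one random draw
```

The first three lines print 12, 4, and 7, matching the twelve inputs, four individual schemes, and seven combined transformations named in the caption; the empty tuple stands for the identity mapping.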


References (113)

ISSN
1361-8415
eISSN
ARCH-3348
DOI
10.1016/j.media.2020.101840

Abstract

Medical Image Analysis (2020)
Contents lists available at ScienceDirect
Medical Image Analysis
journal homepage: www.elsevier.com/locate/media

Zongwei Zhou (a), Vatsal Sodha (b), Jiaxuan Pang (b), Michael B. Gotway (c), Jianming Liang (a,*)
(a) Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
(b) School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA
(c) Department of Radiology, Mayo Clinic, Scottsdale, AZ 85259, USA

ARTICLE INFO
Article history: Received 1 May 2019; Received in final form ***; Accepted ***; Available online ***
Communicated by ***
2000 MSC: 41A05, 41A10, 65D05, 65D17
Keywords: 3D Deep Learning, Representation Learning, Transfer Learning, Self-supervised Learning

ABSTRACT
Transfer learning from natural images to medical images has been established as one of the most practical paradigms in deep learning for medical image analysis. To fit this paradigm, however, 3D imaging tasks in the most prominent imaging modalities (e.g., CT and MRI) have to be reformulated and solved in 2D, losing rich 3D anatomical information, thereby inevitably compromising its performance. To overcome this limitation, we have built a set of models, called Generic Autodidactic Models, nicknamed Models Genesis, because they are created ex nihilo (with no manual labeling), self-taught (learnt by self-supervision), and generic (served as source models for generating application-specific target models). Our extensive experiments demonstrate that our Models Genesis significantly outperform learning from scratch and existing pre-trained 3D models in all five target 3D applications covering both segmentation and classification. More importantly, learning a model from scratch simply in 3D may not necessarily yield performance better than transfer learning from ImageNet in 2D, but our Models Genesis consistently top any 2D/2.5D approaches including fine-tuning the models pre-trained from ImageNet as well as fine-tuning the 2D versions of our Models Genesis, confirming the importance of 3D anatomical information and significance of Models Genesis for 3D medical imaging. This performance is attributed to our unified self-supervised learning framework, built on a simple yet powerful observation: the sophisticated and recurrent anatomy in medical images can serve as strong yet free supervision signals for deep models to learn common anatomical representation automatically via self-supervision. As open science, all codes and pre-trained Models Genesis are available at https://github.com/MrGiovanni/ModelsGenesis.

© 2020 Elsevier B. V. All rights reserved.

1. Introduction

Transfer learning from natural images to medical images has become the de facto standard in deep learning for medical image analysis (Tajbakhsh et al., 2016; Shin et al., 2016), but given the marked differences between natural images and medical images, we hypothesize that transfer learning can yield more powerful (application-specific) target models from the source models built directly using medical images. To test this hypothesis, we have chosen chest imaging because the chest contains several critical organs, which are prone to a number of diseases that result in substantial morbidity and mortality, hence associated with significant health-care costs. In this research, we focus on Chest CT, because of its prominent role in diagnosing lung diseases, and our research community has accumulated several Chest CT image databases, for instance, LIDC-IDRI (Armato III et al., 2011) and NLST (NLST, 2011), containing a large number of Chest CT images.
However, system- Corresponding author: Jianming.Liang@asu.edu (Jianming Liang) arXiv:2004.07882v4 [cs.CV] 16 Dec 2020 2 Zongwei Zhou et al. / Medical Image Analysis (2020) Table 1: Pre-trained models with proxy tasks and target tasks. This paper uses transfer learning in a broader sense, where a source model is first trained to learn image presentation via full supervision or self supervision by solving a problem, called proxy task (general or application-specific), on a source dataset with expert-provided or automatically-generated labels, and then this pre-trained source model is fine tuned (transferred) through full supervision to yield a target model to solve application-specific problems (target tasks) in the same or di erent datasets (target datasets). We refer transfer learning to same-domain transfer learning when the models are pre-trained and fine-tuned within the same domain (modality, organ, disease, or dataset), and to cross-domain when the models are pre-trained in one domain and fine-tuned for a di erent domain. Pre-trained model Modality Source dataset Superv. / Annot. 
Proxy task Genesis Chest CT 2D CT LUNA 2016 (Setio et al., 2017) Self / 0 Image restoration on 2D Chest CT slices Genesis Chest CT (3D) CT LUNA 2016 (Setio et al., 2017) Self / 0 Image restoration on 3D Chest CT volumes Genesis Chest X-ray (2D) X-ray ChestX-ray8 (Wang et al., 2017b) Self / 0 Image restoration on 2D Chest Radiographs Models ImageNet Natural ImageNet (Deng et al., 2009) Full / 14M images Image classification on 2D ImageNet Inflated 3D (I3D) Natural Kinetics (Carreira and Zisserman, 2017) Full / 240K videos Action recognition on human action videos NiftyNet CT Pancreas-CT & BTCV (Gibson et al., 2018a) Full / 90 cases Organ segmentation on abdominal CT MedicalNet CT, MRI 3DSeg-8 (Chen et al., 2019b) Full / 1,638 cases Disease/organ segmentation on 8 datasets Code Object Modality Target dataset Target task NCC Lung Nodule CT LUNA 2016 (Setio et al., 2017) Lung nodule false positive reduction NCS Lung Nodule CT LIDC-IDRI (Armato III et al., 2011) Lung nodule segmentation ECC Pulmonary Emboli CT PE-CAD (Tajbakhsh et al., 2015) Pulmonary embolism false positive reduction LCS Liver CT LiTS 2017 (Bilic et al., 2019) Liver segmentation BMS Brain Tumor MRI BraTS 2018 (Menze et al., 2015; Bakas et al., 2018) Brain tumor segmentation The first letter denotes the object of interest (“N” for lung nodule, “E” for pulmonary embolism, “L” for liver, etc); the second letter denotes the modality (“C” for CT, “M” for MRI, etc); the last letter denotes the task (“C” for classification, “S” for segmentation). atically annotating Chest CT scans is not only tedious, labori- available, pre-trained, (fully) supervised 3D models (see Ta- ous, and time-consuming, but it also demands costly, specialty- ble 3). Our results confirm the importance of 3D anatomical oriented skills, which are not easily accessible. Therefore, we information and demonstrate the significance of Models Gene- seek to answer the following question: Can we utilize the large sis for 3D medical imaging. 
number of available Chest CT images without systematic anno- This performance is attributable to the following key obser- tation to train source models that can yield high-performance vation: medical imaging protocols typically focus on partic- target models via transfer learning? ular parts of the body for specific clinical purposes, resulting in images of similar anatomy. The sophisticated yet recurrent To answer this question, we have developed a framework that anatomy o ers consistent patterns for self-supervised learning trains generic source models for 3D medical imaging. Our to discover common representation of a particular body part framework is autodidactic—eliminating the need for labeled (the lungs in our case). As illustrated in Fig. 1, the fundamental data by self-supervision; robust—learning comprehensive im- idea behind our self-supervised learning method is to recover age representation from a mixture of self-supervised tasks; scalable—consolidating a variety of self-supervised tasks into anatomical patterns from images transformed via various ways a single image restoration task with the same encoder-decoder in a unified framework. architecture; and generic—benefiting a range of 3D medical In summary, we make the following three contributions: imaging tasks through transfer learning. We call the models 1. A collection of generic pre-trained 3D models, performing trained with our framework Generic Autodidactic Models, nick- e ectively across diseases, organs, and modalities. named Models Genesis, and refer to the model trained using Chest CT images as Genesis Chest CT. As ablation studies, we 2. A scalable self-supervised learning framework, o ering have also trained a downgraded 2D version using 2D Chest CT encoder for classification and encoder-decoder for seg- slices, called Genesis Chest CT 2D. For thorough performance mentation. comparisons, we have trained a 2D model using Chest X-ray images, named as Genesis Chest X-ray (detailed in Table 1). 
3. A set of self-supervised training schemes, learning robust Naturally, 3D imaging tasks in the most prominent medical representation from multiple perspectives. imaging modalities (e.g., CT and MRI) should be solved di- rectly in 3D, but 3D models generally have significantly more In the remainder of this paper, we first in Sec. 2 introduce our parameters than their 2D counterparts, thus demanding more self-supervised learning framework for training Models Gen- labeled data for training. As a result, learning from scratch sim- esis, covering our four proposed image transformations with ply in 3D may not necessarily yield performance better than their learning perspectives, and describing the four unique prop- fine-tuning Models ImageNet (i.e., pre-trained models on Ima- erties of our Models Genesis. Sec. 3 details the training pro- geNet), as revealed in Fig. 7. However, as demonstrated by our cess of Models Genesis and the five target tasks for evaluating extensive experiments in Sec. 3, our Genesis Chest CT not only Models Genesis, while Sec. 4 summarizes the five major ob- significantly outperforms learning 3D models from scratch (see servations from our extensive experiments, demonstrating that Fig. 4), but also consistently tops any 2D/2.5D approaches in- our Models Genesis can serve as a primary source of trans- cluding fine-tuning Models ImageNet as well as fine-tuning our fer learning for 3D medical imaging. In Sec. 5, we discuss Genesis Chest X-ray and Genesis Chest CT 2D (see Fig. 7 and various aspects of Models Genesis, including their relationship Table 4). Furthermore, Genesis Chest CT surpasses publicly- with automated data augmentation, their impact on the creation Zongwei Zhou et al. / Medical Image Analysis (2020) 3 Fig. 
1: [Better viewed on-line, in color, and zoomed in for details] Our self-supervised learning framework aims to learn general-purpose image representation by recovering the original sub-volumes of images from their transformed ones. We first crop arbitrarily-size sub-volume x at a ran- dom location from an unlabeled CT image. Each sub-volume x can undergo at most three out of four transformations: non-linear, local-shuing, outer-cutout, and inner-cutout, resulting in a transformed sub-volume x . It should be noted that outer-cutout and inner-cutout are considered mu- tually exclusive. Therefore, in addition to the four original individual transformations, this process yields eight more transformations, including one identity mapping ( meaning none of the four individual transformations is selected) and seven combined transformations. A Model Genesis, an encoder-decoder architecture with skip connections in between, is trained to learn a common image representation by restoring the original sub-volume x (as ground truth) from the transformed one x ˜ (as input), in which the reconstruction loss (MSE) is computed between the model i i prediction x and ground truth x . Once trained, the encoder alone can be fine-tuned for target classification tasks; while the encoder and decoder together can be fine-tuned for target segmentation tasks. of a medical ImageNet, and their capabilities for same- and then detail each of the training schemes with its learning ob- cross-domain transfer learning followed by a thorough review jectives and perspectives, followed by a summary of the four of existing supervised and self-supervised representation learn- unique properties of our Models Genesis. ing approaches in medical imaging in Sec. 6. Finally, Sec. 7 concludes and outlines future extensions of Models Genesis. 2.1. Models Genesis learn by image restoration Given a raw dataset consisting of N patient volumes, the- oretically we can crop infinite number of sub-volumes from 2. 
Models Genesis the dataset. In practice, we randomly generate a subset X = fx ; x ; :::; x g, which includes n number of sub-volumes and 1 2 n The objective of Models Genesis is to learn a common image then apply image transformation function to these sub-volumes, representation that is transferable and generalizable across dis- yielding eases, organs, and modalities. Fig. 1 depicts our self-supervised X = f (X); (1) learning framework, which enables training 3D models from scratch using unlabeled images, consisting of three steps: (1) where X = fx ˜ ; x ˜ ; :::; x ˜ g and f () denotes a transformation 1 2 n cropping sub-volumes from patient CT images, (2) deforming function. Subsequently, a Model Genesis, being an encoder- the sub-volumes, and (3) training a model to restore the orig- decoder network with skip connections in between, will learn inal sub-volume. In the following sections, we first introduce to approximate the function g() which aims to map the trans- the denotations of our self-supervised learning framework and formed sub-volumes X back to their original ones X, that is, 4 Zongwei Zhou et al. / Medical Image Analysis (2020) Fig. 2: Illustration of the proposed image transformations and their learning perspectives. For simplicity and clarity, we illustrate the transfor- mation on a 2D CT slice, but our Genesis Chest CT is trained directly using 3D sub-volumes, which are transformed in a 3D manner. For ease of understanding, in (a) non-linear transformation, we have displayed an image undergoing di erent translating functions in Columns 2—7; in (b) local-shuing, (c) outer-cutout, and (d) inner-cutout transformation, we have illustrated each of the processes step by step in Columns 2—6, where the first and last columns denote the original images and the final transformed images, respectively. In local-shuing, a di erent window W is automatically generated and used in each step. We provide the implementation details in Sec. 
2.2 and more visualizations in Fig. D.13. Forbes, 2012). Hence, this training scheme enables the model ˜ ˜ g(X) = X = f (X): (2) to learn the appearance of the anatomic structures present in the images. In order to keep the appearance of the anatomic To avoid heavy weight dedicated towers for each proxy task structures perceivable, we intentionally retain the non-linear in- and to maximize parameter sharing in Models Genesis, we tensity transformation function as monotonic, allowing pixels consolidate four self-supervised schemes into a single image of di erent values to be assigned with new distinct values. To restoration task, enabling models to learn robust image repre- realize this idea, we use Bezier ´ Curve (Mortenson, 1999), a sentation by restoring from various sets of image transforma- smooth and monotonic transformation function, which is gen- tions. Our proposed framework includes four transformations: erated from two end points (P and P ) and two control points 0 3 (1) non-linear, (2) local-shuing, (3) outer-cutout, and (4) (P and P ), defined as: 1 2 inner-cutout. Each transformation is independently applied to a sub-volume with a predefined probability, while outer-cutout 3 2 2 3 B(t) = (1t) P +3(1t) tP +3(1t)t P +t P ; t 2 [0; 1]; (3) 0 1 2 3 and inner-cutout are considered mutually exclusive. Conse- quently, each sub-volume can undergo at most three of the where t is a fractional value along the length of the line. In above transformations, resulting in twelve possible transformed Fig. 2(a), we illustrate the original CT sub-volume (the left- sub-volume (see step 2 in Fig. 1). For clarity, we further de- most column) and its transformed ones based on di erent trans- fine a training scheme as the process that (1) transforms sub- formation functions. The corresponding transformation func- volumes using any of the aforementioned transformations, and tions are shown in the top row. 
Notice that, when P = P 0 1 (2) trains a model to restore the original sub-volumes from the and P = P the Bezier ´ Curve is a linear function (shown in 2 3 transformed ones. For convenience, we refer to an individual Columns 2, 5). Besides, we set P = (0; 0) and P = (1; 1) 0 3 training scheme as the scheme using one particular individual to get an increasing function (shown in Columns 2—4) and the transformation. We should emphasize that our ultimate goal is opposite to get a decreasing function (shown in Columns 5—7). not the task of image restoration per se. While restoring images The control points are randomly generated for more variances is advocated and investigated as a training scheme for models (shown in Columns 3, 4, 6, 7). Before applying the transforma- to learn image representation, the usefulness of the learned rep- tion functions, in Genesis CT, we first clip the Hounsfield units resentation must be assessed objectively based on its generaliz- values within the range of [1000; 1000] and then normalize ability and transferability to various target tasks. each CT scan to [0; 1], while in Genesis X-ray, we directly nor- malize each X-ray to [0; 1] without intensity clipping. 2.2. Models Genesis learn from multiple perspectives 1) Learning appearance via non-linear transformation. We 2) Learning texture via local pixel shuing. We propose lo- propose a novel self-supervised training scheme based on non- cal pixel shuing to enrich local variations of a sub-volume linear translation, with which the model learns to restore the without dramatically compromising its global structures, which intensity values of an input image transformed with a set of encourages the model to learn the local boundaries and textures non-linear functions. The rationale is that the absolute intensity of objects. 
To be specific, for each input sub-volume, we ran- values (i.e., Hounsfield units) in CT scans or relative intensity domly select 1,000 windows and then shue the pixels inside values in other imaging modalities convey important informa- each window sequentially. Mathematically, let us consider a tion about the underlying structures and organs (Buzug, 2011; small window W with a size of m n. The local-shuing acts Zongwei Zhou et al. / Medical Image Analysis (2020) 5 on each window and can be formulated as 2.3. Models Genesis have several unique properties 1) Autodidactic—requiring no manual labeling. Models W = P W P ; (4) Genesis are trained in a self-supervised manner with abundant where W is the transformed window, P and P denote permu- unlabeled image datasets, demanding zero expert annotation ef- tation metrics with the size of m  m and n  n, respectively. fort. Consequently, Models Genesis are fundamentally di erent Pre-multiplying W with P permutes the rows of the window from traditional (fully) supervised transfer learning from Ima- W, whereas post-multiplying W with P results in the permu- geNet (Shin et al., 2016; Tajbakhsh et al., 2016), which o ers tation of the columns of the window W. The size of the local modest benefits to 3D medical imaging applications as well as window determines the diculty of proxy task. In practice, to that from the existing pre-trained, full-supervised models in- preserve the global content of the image, we keep the window cluding I3D (Carreira and Zisserman, 2017), NiftyNet (Gibson sizes smaller than the receptive field of the network, so that the et al., 2018b), and MedicalNet (Chen et al., 2019b), which de- network can learn much more robust image representation by mand a volume of annotation e ort to obtain the source models “resetting” the original pixels positions. Note that our method (statistics given in Table 1). 
To our best knowledge, this work is quite di erent from PatchShuing (Kang et al., 2017), which represents the first e ort to establish publicly-available, autodi- is a regularization technique to avoid over-fitting. Unlike de- dactic models for 3D medical image analysis. noising (Vincent et al., 2010) and in-painting (Pathak et al., 2) Robust—learning from multiple perspectives. Our com- 2016; Iizuka et al., 2017), our local-shuing transformation bined approach trains Models Genesis from multiple perspec- does not intend to replace the pixel values with noise, which tives (appearance, texture, context, etc.), leading to more ro- therefore preserves the identical global distributions to the orig- bust models across all target tasks, as evidenced in Figure 3, inal sub-volume. In addition, local-shuing within an extent where our combined approach is compared with our individ- keeps the objects perceivable, as shown in Fig. 2(b), benefiting ual schemes. This eclectic approach, incorporating multiple the deep neural network in learning local invariant image repre- tasks into a single image restoration task, empowers Models sentations, which serves as a complementary perspective with Genesis to learn more comprehensive representation. While global patch shuing (Chen et al., 2019a) (discussed in-depth most self-supervised methods devise isolated training schemes in Appendix C.1). to learn from specific perspectives—learning intensity value 3) Learning context via outer and inner cutouts. We devise via colorization, context information via Jigsaw, orientation outer-cutout as a new training scheme for self-supervised learn- via rotation, etc—these methods are reported with mixed re- ing. To realize it, we generate an arbitrary number ( 10) of sults on di erent tasks, in review papers such as Goyal et al. windows, with various sizes and aspect ratios, and superim- (2019), Kolesnikov et al. (2019), Taleb et al. 
(2020), and Jing pose them on top of each other, resulting in a single window and Tian (2020). It is critical as a multitude of state-of-the-art of a complex shape. When applying this merged window to a results in the literature show the importance of using compo- sub-volume, we leave the sub-volume region inside the window sitions of more than one transformations per image (Graham, exposed and mask its surrounding (i.e., outer-cutout) with a ran- 2014; Dosovitskiy et al., 2015; Wu et al., 2020), which has also dom number. Moreover, to prevent the task from being too dif- been experimentally confirmed in our image restoration task. ficult or even unsolvable, we extensively search for the optimal 3) Scalable—accommodating many training schemes. Con- size of cutout regions spanning from 0% to 90%, incremented solidated into a single image restoration task, our novel self- by 10% (detailed study presented in Appendix C.3). In the supervised schemes share the same encoder and decoder during end, we limit the outer-cutout region to be less than 1/4 of the training. Had each task required its own decoder, due to limited whole sub-volume. By restoring the outer-cutouts, the model memory on GPUs, our framework would have failed to accom- will learn the global geometry and spatial layout of organs in modate a large number of self-supervised tasks. By unifying all medical images via extrapolating within each sub-volume. We tasks as a single image restoration task, any favorable transfor- have illustrated this process step by step in Fig. 2(c). The first mation can be easily amended into our framework, overcoming and last columns denote the original sub-volumes and the final the scalability issue associated with multi-task learning (Doer- transformed sub-volumes, respectively. 
sch and Zisserman, 2017; Noroozi et al., 2018; Standley et al., Our self-supervised learning framework also utilizes inner- 2019; Chen et al., 2019b), where the network heads are subject cutout as a training scheme, where we mask the inner win- to the specific proxy tasks. dow regions (i.e., inner-cutouts) and leave their surroundings exposed. By restoring the inner-cutouts, the model will learn 4) Generic—yielding diverse applications. Models Genesis, local continuities of organs in medical images via interpolat- trained via a diverse set of self-supervised schemes, learn a ing within each sub-volume. Unlike Pathak et al. (2016), where general-purpose image representation that can be leveraged for in-painting is proposed as a proxy task by restoring only the a wide range of target tasks. Specifically, Models Genesis can central region of the image, we restore the entire sub-volume be utilized to initialize the encoder for the target classification as the model output. Examples of inner-cutout are illustrated in tasks and to initialize the encoder-decoder for the target seg- Fig. 2(d). Following the suggestion from Pathak et al. (2016), mentation tasks, while the existing self-supervised approaches the inner-cutout areas are limited to be less than 1=4 of the are largely focused on providing encoder models only (Jing and whole sub-volume, in order to keep the task reasonably di- Tian, 2020). As shown in Table 3, Models Genesis can be gen- cult. eralized across diseases (e.g., nodule, embolism, tumor), organs 6 Zongwei Zhou et al. / Medical Image Analysis (2020) Table 2: Genesis CT is pre-trained on only LUNA 2016 dataset (e.g., lung, liver, brain), and modalities (e.g., CT and MRI), a (i.e., the source) and then fine-tuned for five distinct medical image generic behavior that sets us apart from all previous works in applications (i.e., the targets). 
the literature where the representation is learned via a specific self-supervised task, and thus lack generality.

These target tasks are selected such that they show varying levels of semantic distance from the source, in terms of organs, diseases, and modalities, allowing us to investigate the transferability of the pre-trained weights of Genesis CT with respect to the domain distance. The cells checked by ✗ denote the properties that are different between the source and target datasets.

Task   Disease   Organ   Dataset   Modality
NCC    -         -       -         -
NCS    -         -       -         -
ECC    ✗         -       ✗         -
LCS    ✗         ✗       ✗         -
BMS    ✗         ✗       ✗         ✗

3. Experiments

3.1. Pre-training Models Genesis

Our Models Genesis are pre-trained from 623 Chest CT scans in LUNA 2016 (Setio et al., 2017) in a self-supervised manner. We decided not to use all 888 scans provided by this dataset in order to avoid test-image leaks between proxy and target tasks, so that we can confidently use the rest of the images solely for testing Models Genesis as well as the target models, although Models Genesis are trained from only unlabeled images, involving no annotation shipped with the dataset. We first randomly crop sub-volumes, sized 64 × 64 × 32 pixels, from different locations. To extract more informative sub-volumes for training, we then intentionally exclude those which are empty (air) or contain full tissues. Our Models Genesis 2D are pre-trained in a self-supervised manner from LUNA 2016 (Setio et al., 2017) and ChestX-ray14 (Wang et al., 2017b), using 2D CT slices in an axial view and X-ray images, respectively. For all proxy tasks and target tasks, the raw image intensities were normalized to the [0, 1] range before training. We use the mean square error (MSE) between input and output images as the objective function for the proxy task of image restoration. As suggested by Pathak et al. (2016) and Chen et al. (2019a), the MSE loss is sufficient for representation learning, although the restored images may be blurry.

When pre-training Models Genesis, we apply each of the transformations on sub-volumes with a pre-defined probability. In other words, the model encounters not only the transformed sub-volumes as input, but also the original sub-volumes. This design offers two advantages:

• First, the model must distinguish original versus transformed images, discriminate transformation type(s), and restore images if transformed. Our self-supervised learning framework, therefore, results in pre-trained models that are capable of handling versatile tasks.

• Second, since original images are presented in the proxy task, the semantic difference of input images between the proxy and target task becomes smaller. As a result, the pre-trained model can be transferred to process regular/normal images in a broad variety of target tasks.

3.2. Fine-tuning Models Genesis

The pre-trained Models Genesis can be adapted to new imaging tasks through transfer learning or fine-tuning. There are three major transfer learning scenarios: (1) employing the encoder as a fixed feature extractor for a new dataset and following up with a linear classifier (e.g., Linear SVM or Softmax classifier), (2) taking the pre-trained encoder and appending a sequence of fully-connected (fc) layers for target classification tasks, and (3) taking the pre-trained encoder and decoder and replacing the last layer with a 1 × 1 × 1 convolutional layer for target segmentation tasks. For scenarios (2) and (3), it is possible to fine-tune all the layers of the model or to keep some of the earlier layers fixed, only fine-tuning some higher-level portion of the model. We have evaluated the performance of our self-supervised representation for transfer learning by fine-tuning all layers in the network. In the following, we examine Models Genesis on five distinct medical applications, covering classification and segmentation tasks in CT and MRI images with varying levels of semantic distance from the source (Chest CT) to the targets in terms of organs, diseases, and modalities (see Table 2) for investigating the transferability of Models Genesis.

3.2.1. Lung nodule false positive reduction (NCC)

The dataset is provided by LUNA 2016 (Setio et al., 2017) and consists of 888 low-dose lung CTs with slice thickness less than 2.5 mm. Patients are randomly assigned into a training set (445 cases), a validation set (178 cases), and a test set (265 cases). The dataset offers the annotations for a set of 5,510,166 candidate locations for the false positive reduction task, wherein true positives are labeled as "1" and false positives are labeled as "0". Following the prior works (Setio et al., 2016; Sun et al., 2017c), we evaluate performance via Area Under the Curve (AUC) score on classifying true positives and false positives.

3.2.2. Lung nodule segmentation (NCS)

The dataset is provided by the Lung Image Database Consortium image collection (LIDC-IDRI) (Armato III et al., 2011) and consists of 1,018 cases collected by seven academic centers and eight medical imaging companies. The cases were split into training (510), validation (100), and test (408) sets. Each case is a 3D CT scan and the nodules have been marked as volumetric binary masks. We have re-sampled the volumes to 1-1-1 spacing and then extracted a 64 × 64 × 32 crop around each nodule. These 3D crops are used for model training and evaluation. As in prior works (Aresta et al., 2019; Tang et al., 2019; Zhou et al., 2018), we adopt Intersection over Union (IoU) and Dice coefficient scores to evaluate performance. Note that for this particular application, we calculate the mean of the IoUs at thresholds ranging from 0.5 to 0.95 with a step size of 0.05.
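The thresholded mean IoU used for lung nodule segmentation can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: it assumes that each threshold binarizes the predicted probability map, the function names `iou` and `mean_iou_over_thresholds` are hypothetical, and the convention of scoring two empty masks as 1.0 is an assumption.

```python
import numpy as np


def iou(pred_mask, true_mask):
    """Intersection over Union of two binary masks of identical shape."""
    pred_mask = np.asarray(pred_mask).astype(bool)
    true_mask = np.asarray(true_mask).astype(bool)
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        # Both masks are empty: treated as perfect agreement (assumption).
        return 1.0
    return float(np.logical_and(pred_mask, true_mask).sum() / union)


def mean_iou_over_thresholds(prob_map, true_mask, start=0.5, stop=0.95, step=0.05):
    """Average the IoU over thresholds start, start + step, ..., stop.

    Assumption: each threshold is applied to the predicted probability map
    to obtain a binary mask, which is then compared with the ground truth.
    """
    n_thresholds = int(round((stop - start) / step)) + 1  # 10 for the defaults
    thresholds = np.linspace(start, stop, n_thresholds)
    scores = [iou(np.asarray(prob_map) > t, true_mask) for t in thresholds]
    return float(np.mean(scores))


if __name__ == "__main__":
    # Toy 3D example: a small cuboid "nodule" inside an 8 x 8 x 4 volume.
    truth = np.zeros((8, 8, 4), dtype=bool)
    truth[2:6, 2:6, 1:3] = True
    confident = truth.astype(float)   # probability 1.0 inside the nodule
    hesitant = 0.72 * confident       # probability 0.72 inside the nodule
    print(mean_iou_over_thresholds(confident, truth))
    print(mean_iou_over_thresholds(hesitant, truth))
```

Under this reading, a prediction that is correct but only moderately confident is penalized, because it disappears at the stricter thresholds; the metric therefore rewards well-separated probability maps as well as correct shapes.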
3.2.3. Pulmonary embolism false positive reduction (ECC)

We utilize a database consisting of 121 computed tomography pulmonary angiography (CTPA) scans with a total of 326 emboli. Following the prior works (Liang and Bi, 2007), we utilize their PE candidate generator based on the toboggan algorithm, resulting in a total of 687 true positives and 5,568 false positives. The dataset is then divided at the patient level into a training set with 434 true positive PE candidates and 3,406 false positive PE candidates, and a test set with 253 true positive PE candidates and 2,162 false positive PE candidates. To conduct a fair comparison with the prior study (Zhou et al., 2017; Tajbakhsh et al., 2016, 2019b), we compute candidate-level AUC on classifying true positives and false positives.

3.2.4. Liver segmentation (LCS)

The dataset is provided by MICCAI 2017 LiTS Challenge and consists of 130 labeled CT scans, which we split into training (100 patients), validation (15 patients), and test (15 patients) subsets. The ground truth segmentation provides two different labels: liver and lesion. For our experiments, we only consider liver as positive class and others as negative class and evaluate segmentation performance using Intersection over Union (IoU) and Dice coefficient scores.

3.2.5. Brain tumor segmentation (BMS)

The dataset is provided by BraTS 2018 challenge (Menze et al., 2015; Bakas et al., 2018) and consists of 285 patients (210 HGG and 75 LGG), each with four 3D MRI modalities (T1, T1c, T2, and Flair) rigidly aligned. We adopt 3-fold cross validation, in which two folds (190 patients) are for training and one fold (95 patients) for test. Annotations include background (label 0) and three tumor subregions: GD-enhancing tumor (label 4), the peritumoral edema (label 2), and the necrotic and non-enhancing tumor core (label 1). We consider those with label 0 as negatives and others as positives and evaluate segmentation performance using Intersection over Union (IoU) and Dice coefficient scores.

3.3. Benchmarking Models Genesis

For a thorough comparison, we used three different techniques to randomly initialize the weights of models: (1) a basic random initialization method based on Gaussian distributions, (2) a method commonly known as Xavier, which was suggested in Glorot and Bengio (2010), and (3) a revised version of Xavier called MSRA, which was suggested in He et al. (2015). They are implemented as uniform, glorot_uniform, and he_uniform, respectively, following the Initializers¹ in Keras. We compare Models Genesis with Rubik's cube (Zhuang et al., 2019), the most recent multi-task and self-supervised learning method for 3D medical imaging. Considering that most of the self-supervised learning methods are initially proposed and implemented in 2D, we have extended the five most representative ones (Vincent et al., 2010; Pathak et al., 2016; Noroozi and Favaro, 2016; Chen et al., 2019a; Caron et al., 2018) into their 3D versions for a fair comparison (see detailed implementation in Appendix A). To promote the 3D self-supervised learning research, we make our own implementation of the 3D extended methods and their corresponding pre-trained weights publicly available as an open-source tool that can effectively be used out-of-the-box. In addition, we have examined publicly available pre-trained models for 3D transfer learning in medical imaging, including NiftyNet² (Gibson et al., 2018b), MedicalNet³ (Chen et al., 2019b), and, the most influential 2D weights initialization, Models ImageNet. We also fine-tune I3D⁴ (Carreira and Zisserman, 2017) in our five target tasks because it has been shown to successfully initialize 3D models for lung nodule detection in Ardila et al. (2019). The detailed configurations of these models can be found in Appendix B.

3D U-Net architecture⁵ is used in 3D applications; U-Net architecture⁶ is used in 2D applications. Batch normalization (Ioffe and Szegedy, 2015) is utilized in all 3D/2D deep models. For proxy tasks, SGD method (Zhang, 2004) with an initial learning rate of 1e0 is used for optimization. We use ReduceLROnPlateau to schedule the learning rate, in which if no improvement is seen in the validation set for a certain number of epochs, the learning rate is reduced. For target tasks, Adam method (Kinga and Adam, 2015) with a learning rate of 1e-3 is used for optimization, where β1 = 0.9, β2 = 0.999, ε = 1e-8. We use an early-stop mechanism on the validation set to avoid over-fitting. Simple yet heavy 3D data augmentation techniques are employed in all five target tasks, including random flipping, transposing, rotating, and adding Gaussian noise. We run each method ten times on all of the target tasks and report the average, standard deviation, and further present statistical analysis based on an independent two-sample t-test.

In the proxy task, we pre-train the model using 3D sub-volumes sized 64 × 64 × 32, whereas in target tasks, the input is not limited to sub-volumes of a certain size. In other words, our pre-trained models can be fine-tuned in the tasks with CT sub-volumes, entire CT volumes, or even MRI volumes as input, depending on the user's need. The flexibility of input size is attributed to two reasons: (1) our pre-trained models learn generic image representation such as appearance, texture, and context feature, and (2) the encoder-decoder architecture is able to process images with arbitrary sizes.

¹ Initializers: faroit.com/keras-docs/1.2.2/initializations
² NiftyNet Model Zoo: github.com/NifTK/NiftyNetModelZoo
³ MedicalNet: github.com/Tencent/MedicalNet
⁴ I3D: github.com/deepmind/kinetics-i3d
⁵ 3D U-Net: github.com/ellisdg/3DUnetCNN
⁶ Segmentation Models: github.com/qubvel/segmentation_models

4. Results

In this section, we begin with an ablation study to compare the combined approach with each individual scheme, concluding that the combined approach tends to achieve more robust results and consistently exceeds any other training schemes. We then take our pre-trained model from the combined approach and present results on five 3D medical applications, comparing them against the state-of-the-art approaches found in recent supervised and self-supervised learning literature.

Fig. 3: Comparing the combined training scheme with each of the proposed individual training schemes, we conduct statistical analyses between the top two training schemes as well as between the bottom two. Although some of the individual training schemes could be favorable for certain target tasks, there is no clear indication that any one of the individual training schemes would consistently offer the best performance on every target task. On the contrary, our combined training scheme consistently achieves the best results across all five target tasks.

Fig. 4: Models Genesis, as presented with the red vertical lines, achieve higher and more stable performance compared with three popular types of random initialization methods, including MSRA, Xavier, and Uniform. In three out of the five applications, the three random initialization methods show no significant difference with respect to each other.

4.1. The combined learning scheme exceeds each individual

We have devised four individual training schemes by applying each of the transformations (i.e., non-linear, local-shuffling, outer-cutout, and inner-cutout) individually to a sub-volume and training the model to restore the original one. We compare each of these training schemes with identical-mapping, which does not involve any image transformation. In three out of the five target tasks, as shown in Figs. 3 and 4, the model pre-trained by the identical-mapping scheme does not perform as well as random initialization. This undesired representation obtained via identical-mapping suggests that without any image transformation, the model would not benefit much from the proxy image restoration task. On the contrary, nearly all of the individual schemes offer higher target task performances than identical-mapping, demonstrating the significance of the four devised image transformations in learning image representation.

Although each of the individual schemes has established the capability in learning image representation, its empirical performance varies from task to task. That is, given a target task, there is no clear winner among the four individual schemes that can always guarantee the highest performance. As a result, we have further devised a combined scheme, which applies transformations to a sub-volume with a predefined probability for each transformation and trains a model to restore the original one. To demonstrate the importance of combining these image transformations together, we examine the combined training scheme against each of the individual ones. Fig. 3 shows that the combined scheme consistently exceeds any other individual schemes in all five target tasks. We have found that the combination of different transformations is advantageous because, as discussed, we cannot rely on one single training scheme to achieve the most robust and compelling results across multiple target tasks. It is our novel representation learning framework based on image restoration that allows integrating various training schemes into a single training scheme. Our qualitative assessment of image restoration quality, provided in Fig. D.14, further indicates that the combined scheme is superior over all four individual schemes in restoring the images that have undergone multiple transformations. In summary, our combined scheme pre-trains a model from multiple perspectives (appearance, texture, context, etc.), empowering models to learn a more comprehensive representation, thereby leading to more robust target models. Based on the above ablation studies, in the following sections, we refer to the models pre-trained by the combined scheme as Models Genesis and, in particular, refer to the model pre-trained on the LUNA 2016 dataset as Genesis Chest CT.

Fig. 5: Models Genesis enable better optimization than learning from scratch, evident by the learning curves for the target tasks of reducing false positives in detecting lung nodules (NCC) and pulmonary embolism (ECC) as well as segmenting lung nodule (NCS), liver (LCS), and brain tumor (BMS). We have plotted the validation performance averaged over ten trials for each application, in which accuracy and Dice coefficient scores are reported for classification and segmentation tasks, respectively. As seen, initializing with our pre-trained Models Genesis demonstrates benefits in the convergence speed.

4.2. Models Genesis outperform learning from scratch

Transfer learning accelerates training and boosts performance, only if the image representation learned from the original (proxy) task is general and transferable to target tasks. Fine-tuning models trained on ImageNet has been a great success story in 2D (Tajbakhsh et al., 2016; Shin et al., 2016), but for 3D representation learning, there is no massive labeled dataset comparable to ImageNet. As a result, it is still common practice to train 3D models from scratch in 3D medical imaging. Therefore, to establish the 3D baselines, we have trained 3D
models with three representative random initialization methods, including naive uniform initialization, Xavier/Glorot initialization proposed by Glorot and Bengio (2010), and He normal (MSRA) initialization proposed by He et al. (2015). When comparing deep model initialization by transfer learning and by controlling mathematical distribution, the former learns more sophisticated image representation but suffers from a domain gap, whereas the latter is task independent yet provides relatively less benefit than the former. The hypothesis underneath transfer learning is that transferring deep features across visual tasks can obtain a semantically more powerful representation, compared with simply initializing weights using different distributions. From our comprehensive experiments in Fig. 4, we have observed the following:

• Within each method, random initialization of weights has shown large variance in results of ten trials; it is in large part due to the difficulty of adequately initializing these networks from scratch. A small miscalibration of the initial weights can lead to vanishing or exploding gradients, as well as poor convergence properties.

• In three out of the five 3D medical applications, the results reveal no significant difference among these random initialization methods. Although the behavior of randomly initialized weights can vary across applications, He normal (MSRA), in which the weights are initialized with a specific ReLU-aware initialization, generally works the most reliably among all five target tasks.

• On the other hand, initialization with our pre-trained Genesis Chest CT stabilizes the overall performance and, more importantly, elevates the average performance over all three random initialization methods by a large margin. Our statistical analysis shows that the performance gain is significant for all the target tasks under study. This suggests that, owing to the representation learning scheme, our initial weights provide a better starting point than the ones generated under particular statistical distributions, while being over 13% faster (see Fig. 5). This observation has also been widely obtained in 2D model initialization (Tajbakhsh et al., 2016; Shin et al., 2016; Rawat and Wang, 2017; Zhou et al., 2017; Voulodimos et al., 2018).

Altogether, in contrast to 3D scratch models, we believe Models Genesis can potentially serve as a primary source of transfer learning for 3D medical imaging applications. Besides contrasting with the three random initialization methods, we further examine our Models Genesis against the existing pre-trained 3D models in the coming section.

Table 3: Our Models Genesis achieve the best or comparable performance on five distinct medical target tasks over six self-supervised learning approaches (revised in 3D) and three competing publicly available (fully) supervised pre-trained 3D models. For ease of comparison, we evaluate AUC score for the two classification tasks (i.e., NCC and ECC) and IoU score for the three segmentation tasks (i.e., NCS, LCS, and BMS). All of the results, including the mean and standard deviation (mean ± s.d.) across ten trials, reported in the table are evaluated using our dataset splitting, elaborated in Sec. 3.2. For every target task, we have further performed an independent two-sample t-test between the best (bolded) vs. others and highlighted boxes in blue when they are not statistically significantly different at p = 0.05 level. The footnotes compare our results with the state-of-the-art performance for each target task, using the evaluation metric for the data acquired from competitions.

Pre-training        Approach                                        NCC¹ (%)     NCS² (%)     ECC³ (%)     LCS⁴ (%)     BMS⁵ (%)
No                  Random with Uniform Init                        94.74±1.97   75.48±0.43   80.36±3.58   78.68±4.23   60.79±1.60
No                  Random with Xavier Init (Glorot and Bengio, 2010)  94.25±5.07   74.05±1.97   79.99±8.06   77.82±3.87   58.52±2.61
No                  Random with MSRA Init (He et al., 2015)         96.03±1.82   76.44±0.45   78.24±3.60   79.76±5.43   63.00±1.73
(Fully) supervised  I3D (Carreira and Zisserman, 2017)              98.26±0.27   71.58±0.55   80.55±1.11   70.65±4.26   67.83±0.75
(Fully) supervised  NiftyNet (Gibson et al., 2018b)                 94.14±4.57   52.98±2.05   77.33±8.05   83.23±1.05   60.78±1.60
(Fully) supervised  MedicalNet (Chen et al., 2019b)                 95.80±0.49   75.68±0.32   86.43±1.44   85.52±0.58   66.09±1.35
Self-supervised     De-noising (Vincent et al., 2010)               95.92±1.83   73.99±0.62   85.14±3.02   84.36±0.96   57.83±1.57
Self-supervised     In-painting (Pathak et al., 2016)               91.46±2.97   76.02±0.55   79.79±3.55   81.36±4.83   61.38±3.84
Self-supervised     Jigsaw (Noroozi and Favaro, 2016)               95.47±1.24   70.90±1.55   81.79±1.04   82.04±1.26   63.33±1.11
Self-supervised     DeepCluster (Caron et al., 2018)                97.22±0.55   74.95±0.46   84.82±0.62   82.66±1.00   65.96±0.85
Self-supervised     Patch shuffling (Chen et al., 2019a)            91.93±2.32   75.74±0.51   82.15±3.30   82.82±2.35   52.95±6.92
Self-supervised     Rubik's Cube (Zhuang et al., 2019)              96.24±1.27   72.87±0.16   80.49±4.64   75.59±0.20   62.75±1.93
Self-supervised     Genesis Chest CT (ours)                         98.34±0.44   77.62±0.64   87.20±2.87   85.10±2.15   67.96±1.29

¹ The winner in LUNA (2016) holds an official score of 0.968 vs. 0.971 (ours).
² Wu et al. (2018) holds a Dice of 74.05% vs. 75.86%±0.90% (ours).
³ Zhou et al. (2017) holds an AUC of 87.06% vs. 87.20%±2.87% (ours).
⁴ The winner in LiTS (2017) with post-processing holds a Dice of 96.60% vs. 93.19%±0.46% (ours without post-processing).
⁵ MRI Flair images are only utilized for segmenting brain tumors, so the results are not submitted to BraTS 2018.
Genesis Chest CT is slightly outperformed by MedicalNet in LCS because the latter has been (fully) supervised pre-trained on the LiTS dataset.

4.3. Models Genesis surpass existing pre-trained 3D models

We have evaluated our Models Genesis with existing publicly available pre-trained 3D models on five distinct medical target tasks. As shown in Table 3, Genesis Chest CT stands out from the other existing 3D models, which have been pre-trained by full supervision. Note that, in the liver segmentation task (LCS), Genesis Chest CT is slightly outperformed by MedicalNet because of the benefit that MedicalNet gained from its (fully) supervised pre-training on the LiTS dataset directly. Further statistical tests reveal that Genesis Chest CT still yields comparable performance with MedicalNet at p = 0.05 level. For the remaining four target tasks, Genesis Chest CT achieves superior performance against all its counterparts by a large margin, demonstrating the effectiveness and transferability of the learned features of Models Genesis, which are beneficial for both classification and segmentation tasks.

More importantly, although Genesis Chest CT is pre-trained on Chest CT only, it can generalize to different organs, diseases, datasets, and even modalities. For instance, the target task of pulmonary embolism false positive reduction is performed in Contrast-Enhanced CT scans that can appear different from the normal CT scans used in the proxy tasks; yet, Genesis Chest CT achieves a remarkable improvement over training from scratch, increasing the AUC by 7 points. Moreover, Genesis Chest CT continues to yield a significant IoU gain in liver segmentation even though the proxy task and target task are significantly different in both the diseases affecting the organs (lung vs. liver) and the dataset itself (LUNA 2016 vs. LiTS 2017). We have further examined Genesis Chest CT and other existing pre-trained models using MRI Flair images, which represent the widest domain distance between the proxy and target tasks. As reported in Table 3 (BMS), Genesis Chest CT yields nearly a 5-point improvement in comparison with random initialization. The increased performance on the MRI imaging task is a particularly strong demonstration of the transfer learning capabilities of our Genesis Chest CT. To further investigate the behavior of Genesis Chest CT when encountering medical images from different modalities, we have provided extensive visualization in Fig. D.15, including example images from CT, X-ray, Ultrasound, and MRI modalities.

Considering the model footprint, our Models Genesis take the basic 3D U-Net as the backbone, carrying far fewer parameters than the existing open-source pre-trained 3D models. For example, we have adopted MedicalNet with resnet-101 as the backbone, which offers the highest performance based on Chen et al. (2019b) but comprises 85.75M parameters; the pre-trained I3D (Carreira and Zisserman, 2017) contains 25.35M parameters in the encoder; the pre-trained NiftyNet uses Dense V-Networks (Gibson et al., 2018a) as backbone, comprising only 2.60M parameters, but it does not perform as well as its counterparts in all five target tasks. Taken together, these results indicate that our Models Genesis, with only 16.32M parameters, surpass all existing pre-trained 3D models in terms of generalizability, transferability, and parameter efficiency.

4.4. Models Genesis reduce annotation efforts by at least 30%

While critics often stress the need for sufficiently large amounts of labeled data to train a deep model, transfer learning leverages the knowledge about medical images already learned by pre-trained models and therefore requires considerably fewer annotated data and training iterations than learning from scratch. We have simulated the scenarios of using a handful of labeled data, which allows investigating the power of our Models Genesis in transfer learning. Fig. 6 displays the results of training with a partial dataset, demonstrating that fine-tuning Models Genesis saturates quickly on the target tasks since it can achieve similar performance compared with the full dataset training. Specifically, the performance of learning 3D models from scratch with entire datasets can be approximated using Models Genesis with only 50%, 5%, 30%, 5%, and 30% of datasets for NCC, NCS, ECC, LCS, and BMS, respectively. This shows that our Models Genesis can mitigate the lack of labeled images, resulting in more annotation-efficient deep learning in the end.

Furthermore, the performance gap between fine-tuning and learning from scratch is significant and steady over training models with each partial data point. For the lung nodule false positive reduction target task (NCC in Fig. 6), using only 49% training data, Models Genesis equal the performance of learning from scratch with 70% training data. Therefore, about 30% of the annotation cost associated with learning from scratch in NCC is recovered by initializing with Models Genesis. For the lung nodule segmentation target task (NCS in Fig. 6), with 5% training data, Models Genesis can achieve the performance equivalent to learning from scratch using 10% training data. Based on this analysis, the cost of annotation in NCS can be reduced by half using Models Genesis compared with learning from scratch. For the pulmonary embolism false positive reduction target task (ECC), Fig. 6 suggests that with only 30% training samples, Models Genesis achieve performance equivalent to learning from scratch using 70% training samples. Therefore, nearly 57% of the labeling cost associated with the use of learning from scratch for ECC could be recovered with our Models Genesis. For the liver segmentation target task (LCS) in Fig. 6, using 8% training data, Models Genesis equal the performance of learning from scratch using 50% training samples. Therefore, about 84% of the annotation cost associated with learning from scratch in LCS is recovered by initializing with Models Genesis. For the brain tumor segmentation target task (BMS) in Fig. 6, with less than 28% training data, Models Genesis achieve the performance equivalent to learning from scratch using 50% training data. Therefore, nearly 44% of the annotation effort can be saved using Models Genesis compared with learning from scratch. Overall, Models Genesis reduce the annotation effort by at least 30% in comparison with learning a 3D model from scratch in the five target tasks. With such an annotation-efficient 3D transfer learning paradigm, computer-aided diagnosis of rare diseases or rapid response to global pandemics, which are severely underrepresented owing to the difficulty of collecting a sizeable amount of labeled data, could eventually be actualized.

Fig. 6: Initializing with our Models Genesis, the annotation cost can be reduced by 30%, 50%, 57%, 84%, and 44% for target tasks NCC, NCS, ECC, LCS, and BMS, respectively. With decreasing amounts of labeled data, Models Genesis (red) retain a much higher performance on all five target tasks, whereas learning from scratch (grey) fails to generalize. Note that the horizontal red and gray lines refer to the performances that can eventually be achieved by Models Genesis and learning from scratch, respectively, when using the entire dataset.

Fig. 7: When solving problems in volumetric medical modalities, such as CT and MRI images, 3D volume-based approaches consistently offer performance superior to that of 2D slice-based approaches empowered by transfer learning. We conduct statistical analyses (circled in blue) between the highest performance achieved by 3D and 2D solutions. Training 3D models from scratch does not necessarily outperform their 2D counterparts (see NCC). However, training the same 3D models from Genesis Chest CT outperforms all their 2D counterparts, including fine-tuning Models ImageNet as well as fine-tuning our Genesis Chest X-ray and Genesis Chest CT 2D. It confirms the effectiveness of Genesis Chest CT in unlocking the power of 3D models. In addition, we have also provided statistical analyses between the highest and the second highest performances achieved by 2D models, finding that Models Genesis (2D) offer equivalent performances (n.s.) to Models ImageNet in four out of the five applications.

4.5. Models Genesis consistently top any 2D/2.5D approaches

We have thus far presented the power of 3D models in processing volumetric data, in particular, with limited annotation. Besides adopting 3D models, another common strategy to handle limited data in volumetric medical imaging is to reformat 3D data into a 2D image representation followed by fine-tuning pre-trained Models ImageNet (Shin et al., 2016; Tajbakhsh et al., 2016). This approach increases the training examples by an order of magnitude, but it sacrifices the 3D context. It is interesting to note how Genesis Chest CT compares with this de facto standard in 2D. We have thus implemented two different methods to reformat 3D data into 2D input: the regular 2D representation obtained by extracting adjacent axial slices (Ben-Cohen et al., 2016; Sun et al., 2017a), and the 2.5D representation (Prasoon et al., 2013; Roth et al., 2014, 2015) composed of axial, coronal, and sagittal slices from volumetric data. Both of these 2D approaches seek to use 2D representation to emulate something three dimensional, in order to fit the paradigm of fine-tuning Models ImageNet. In the inference, classification and segmentation tasks are evaluated differently in 2D: for
classification, the model predicts labels of slices extracted from the center locations because other slices are not guaranteed to include objects; for segmentation, the model predicts the segmentation mask slice by slice and forms the 3D segmentation volume by simply stacking the 2D segmentation maps.

Table 4: Our 3D approach, initialized by Models Genesis, significantly elevates the classification performance compared with 2.5D and 2D approaches in reducing lung nodule and pulmonary embolism false positives. The entries in bold highlight the best results achieved by different approaches. For the 2D slice-based approach, we extract input consisting of three adjacent axial views of the lung nodule or pulmonary embolism and some of their surroundings. For the 2.5D orthogonal approach, each input is composed of an axial, coronal, and sagittal slice and centered at a lung nodule or pulmonary embolism candidate.

Task: NCC               Random       ImageNet     Genesis
2D slice-based input    96.03±0.86   97.79±0.71   97.45±0.61
2.5D orthogonal input   95.76±1.05   97.24±1.01   97.07±0.92
3D volume-based input   96.03±1.82   n/a          98.34±0.44

Task: ECC               Random       ImageNet     Genesis
2D slice-based input    60.33±8.61   62.57±8.04   62.84±8.78
2.5D orthogonal input   71.27±4.64   78.61±3.73   78.58±3.67
3D volume-based input   80.36±3.58   n/a          88.04±1.40

Fig. 7 presents the comparison between 3D and 2D models on five 3D target tasks. Additionally, Table 4 compares 2D slice-based, 2.5D orthogonal, and 3D volume-based approaches on lung nodule and pulmonary embolism false positive reduction tasks. As evidenced by our statistical analyses, the 3D models trained from Genesis Chest CT achieve significantly higher average performance and lower standard deviation than 2D models fine-tuned from ImageNet using either 2D or 2.5D image representation. Nonetheless, the same conclusion does not apply to the models trained from scratch: 3D scratch models are outperformed by 2D models in one out of the five target tasks (i.e., NCC in Fig. 7 and Table 4) and also exhibit an undesirably larger standard deviation. We attribute the mixed results of 3D scratch models to the larger number of model parameters and limited sample size in the target tasks, which together impede the full utilization of 3D context. In fact, the undesirable performance of the 3D scratch models highlights the effectiveness of Genesis Chest CT, which unlocks the power of 3D models for medical imaging. To summarize, we believe that 3D problems in medical imaging should be solved in 3D directly.

5. Discussions

5.1. Do we still need a medical ImageNet?

In computer vision, at the time this paper is written, no self-supervised learning method outperforms fine-tuning models pre-trained on ImageNet (Jing and Tian, 2020; Chen et al., 2019a; Kolesnikov et al., 2019; Zhou et al., 2019b; Hendrycks et al., 2019; Zhang et al., 2019; Caron et al., 2019). Therefore, it may seem surprising to observe from our results in Table 3 that (fully) supervised representation learning methods do not necessarily offer higher performances in some 3D target tasks than self-supervised representation learning methods. We ascribe this phenomenon to the limited amount of supervision used in their pre-training (90 cases for NiftyNet (Gibson et al., 2018b) and 1,638 cases for MedicalNet (Chen et al., 2019b)) or the domain distance (from videos to CT/MRI for I3D (Carreira and Zisserman, 2017)). As evidenced by a prior study (Sun et al., 2017b) on ImageNet pre-training, a large amount of supervision is required to foster a generic, comprehensive image representation. Back in 2009, when ImageNet had not been established, it was challenging to empower a deep model with generic image representation using a small or even medium size of labeled data, the same situation, we believe, that 3D medical image analysis faces today. Therefore, despite the outstanding performance of Models Genesis, there is no doubt that a large, strongly annotated dataset for medical image analysis, like ImageNet (Deng et al., 2009) for computer vision, is still in high demand. One of our goals for developing Models Genesis is to help create such a medical ImageNet. Based on a small set of expert annotations, models fine-tuned from Models Genesis will be able to help quickly generate initial rough annotations of unlabeled images for expert review, thus reducing the annotation efforts and accelerating the creation of a large, strongly annotated, medical ImageNet. In summary, Models Genesis are not designed to replace such a large, strongly annotated dataset for medical image analysis, as ImageNet for computer vision, but rather to help create one.

5.2. Same-domain or cross-domain transfer learning?

Same-domain transfer learning is always preferred whenever possible because a relatively smaller domain gap makes the learned image representation more beneficial for target tasks. Even the most recent self-supervised learning approaches in medical imaging were solely evaluated within the same dataset, such as Chen et al. (2019a); Tajbakhsh et al. (2019a); Zhu et al. (2020). Same-domain transfer learning thus appears to be the preferred choice in terms of performance; however, most of the existing medical datasets, with fewer than a hundred cases, are usually too small for deep models to learn reliable image representation. Therefore, for our future work, we plan to combine the publicly available datasets from similar domains together to train modality-oriented models, including Genesis CT, Genesis MRI, Genesis X-ray, and Genesis Ultrasound, as well as organ-oriented models, including Genesis Brain, Genesis Lung, Genesis Heart, and Genesis Liver.

Cross-domain transfer learning in medical imaging is the Holy Grail. Retrieving a large number of unlabeled images from a PACS system requires an IRB approval, often a long process; the retrieved images must be de-identified; organizing the de-identified images in a way suitable for deep learning is tedious and laborious. Therefore, large quantities of unlabeled datasets may not be readily available to many target domains. As evidenced by our results in Table 3 (BMS), Models Genesis have a great potential for cross-domain transfer learning; particularly, our distortion-based approaches (such as non-linear and local-shuffling) take advantage of relative intensity values (in all modalities) to learn shapes and appearances of various organs. Therefore, as our future work, we will be focusing on methods that generalize well across domains.

5.3. Is any data augmentation suitable as a transformation?

We propose a self-supervised learning framework to learn image representation by discriminating and restoring images undergoing different transformations. One might argue that our image transformations are interchangeable with existing data augmentation techniques (Gan et al., 2015; Wong et al., 2016; Perez and Wang, 2017; Shorten and Khoshgoftaar, 2019); we would like to make the distinction between these two concepts clearer.

transformations for automated data augmentation, while preserving class labels or null class for all data points. Dao et al. (2019) introduced a fast kernel alignment metric for augmentation selection. It requires image labels for computing the kernel target alignment (as the reward) between the feature kernel and the label kernel. Cubuk et al. (2019) used reinforcement learning to form an algorithm that autonomously searches for preferred augmentation policies, magnitude, and probability for specific classification tasks, wherein the resultant accu-
It is critical to assess whether a specific aug- racy of predictions and labels is treated as the reward signal mentation is practical and feasible for the image restoration to train the recurrent network controller. Wu et al. (2020) pro- task when designing image transformations. Simply introduc- posed uncertainty-based sampling to select the most e ective ing data augmentation can make a task ambiguous and lead to augmentation, but it is based on the highest loss that is com- degenerate learning. To this end, we choose image transforma- puted between predictions and labels. While the reward is well- tions based on two principles: defined in the aforementioned works, unfortunately, there is no ˆ First, the transformed sub-volume should not be found in available metric to determine the power of image representation the original CT scan. But it is possible to find a trans- directly; hence, no reward is readily established for representa- formed sub-volume that has undergone such augmenta- tion learning. Rather than constrain the representation directly, tions as rotation, flip, zoom in/out, or translation, as an our paper aims to design an image restoration task to let the alternative sub-volume in the original CT scan. In this model learn generic image representation from 3D medical im- scenario, without additional spatial information, the model ages. To achieve this, inspired by Vincent et al. (2010), we would not be able to “recover” the original sub-volume by modify the definition of a good representation into the follow- seeing the transformed one. As a result, we only elect the ing: “a good representation is one that can be obtained robustly from a transformed input, and that will be useful for restoring augmentations that can be applied to sub-volumes at the the corresponding original input.” Consequently, mean square pixel level rather than the spatial level. 
error (MSE) between the model’s input and output is defined as ˆ Second, a transformation should be applicable for spe- the objective function in our framework. However, if we adopt cific image properties. The augmentations that manipulate MSE as the reward function, the existing automated augmen- RGB channels, such as color shift and channel dropping, tation strategies will end up selecting identical-mapping. This have little e ect on CT/MRI images without the avail- is because restoring images without any transformation is ex- ability of color information. Instead, we promote bright- pected to give a lower error than restoring those with transfor- ness and contrast into monotonic color curves, resulting in mations. Evidenced by Fig. 3, identical-mapping results in a a novel non-linear transformation, explicitly enabling the poor image representation. To summarize, the key challenge model to learn intensity distribution from medical images. when employing automated augmentation strategies into our framework is how to define a proper reward for restoring im- After filtering out using the above two principles, the remaining ages, and fundamentally, for learning image representation. data augmentation techniques are not as many as expected. We have endeavored to produce learning perspective driven trans- 5.5. How to assess restoration quality and its relationship to formations rather than inviting any types of data augmentation model transferability? into our framework. A recent study from Chen et al. (2020) has also discovered a similar phenomenon: carefully designed Our transfer learning results in Sec. 4 suggest that image augmentations are superior to autonomously discovered aug- restoration is a promising task to learn generic 3D image repre- mentations. This suggests a criterion of transformations driven sentation. 
This also means that image restoration quality has an by learning perspectives, in capturing a compelling, robust rep- implicit correlation with model transferability to some extent. resentation for 3D transfer learning in medical imaging. To assess restoration quality, we compare the Mean Square Er- ror (MSE) loss with other commonly used loss functions for 5.4. Can algorithms autonomously search for transformations? image restoration, such as Mean Absolute Error (MAE) and Structural Similarity Index (SSIM) (Wang et al., 2004). All We follow two principles when designing suitable image of them compute the distance between input and output im- transformations for our self-supervised learning framework (see ages, while SSIM concentrates more on the restoration quality Sec. 5.3). Potentially, “automated data augmentation” can be in terms of structural similarity than MSE and MAE. Since the considered as an ecient alternative because this line of re- publicly available 3D SSIM loss was implemented in PyTorch , search seeks to strip researchers from the burden of finding to make the comparisons fair, we have adapted our five target good parameterizations and compositions of transformations tasks into PyTorch as well. Fig. 8 shows mixed performances of manually. Specifically, existing automated augmentation strate- the five target tasks among the three alternative loss functions. gies reinforce models to learn an optimal set of augmentation policies by calculating the reward between predictions and im- age labels. To name a few, Ratner et al. (2017) proposed a method for learning how to parameterize and composite the SSIM loss in 3D: github.com/jinh0park/pytorch-ssim-3D 14 Zongwei Zhou et al. / Medical Image Analysis (2020) Fig. 8: We compare three di erent losses for the task of image restoration. There is no evidence that the three losses have a decisive impact on the transfer learning results of five target tasks. 
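To make the three restoration losses compared in Fig. 8 concrete, the following is a minimal sketch in plain Python rather than the PyTorch code used for the ablation. Everything in it is an illustrative assumption: the function names (`mse`, `mae`, `global_ssim`, `ssim_loss`) are hypothetical, volumes are flattened to lists of intensities normalized to [0, 1], and SSIM is computed once from global statistics over the whole volume, whereas Wang et al. (2004) and the referenced 3D SSIM loss average SSIM over local windows.

```python
def _mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def mse(x, y):
    """Mean square error between two equally sized intensity sequences."""
    return _mean((a - b) ** 2 for a, b in zip(x, y))


def mae(x, y):
    """Mean absolute error between two equally sized intensity sequences."""
    return _mean(abs(a - b) for a, b in zip(x, y))


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM from global statistics (simplification: no local windows).

    k1 and k2 are the stabilizing constants suggested by Wang et al. (2004).
    """
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = _mean(x), _mean(y)
    var_x = _mean((a - mu_x) ** 2 for a in x)
    var_y = _mean((b - mu_y) ** 2 for b in y)
    cov_xy = _mean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(x, y, **kwargs):
    """Restoration loss derived from SSIM: zero for a perfect restoration."""
    return 1.0 - global_ssim(x, y, **kwargs)


# Toy "original" and "restored" sub-volumes, flattened to 1D for brevity.
original = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
restored = [0.0, 0.25, 0.35, 0.6, 0.85, 1.0]

for name, loss in (("MSE", mse), ("MAE", mae), ("1 - SSIM", ssim_loss)):
    print(name, loss(original, restored))
```

All three functions return zero when the restored volume equals the original, so each can serve as a restoration objective; they differ in how they weight intensity errors against structural agreement.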
Note that for this ablation study, all the proxy and target tasks are implemented in PyTorch.

As discussed in Sec. 5.4, the ideal loss function for representation learning is one that can explicitly determine the power of image representation. However, the three losses explored in this section are implicit, based on the premise that the image restoration quality can indicate a good representation. Further studies with restoration quality assessment and its relationship to model transferability are therefore suggested.

5.6. Could Models Genesis detect infected regions from images autonomously?

As referenced from Sec. 3.1, Genesis Chest CT has been pre-trained using 623 CT images in the LUNA 2016 dataset. To assess the image restoration quality, we utilize the rest of the 265 CT images from the dataset and present examples in Fig. 9. Specifically, we pass the original CT images to the pre-trained Genesis Chest CT. To visualize the modifications, we have further plotted the difference maps by subtracting the input and output. Since the input images involve no image transformation, most of the restored CT scans (see Rows 1–2) can preserve the texture and structures of the input images, only encountering few changes thanks to the identical-mapping training scheme and the skip connections between encoder and decoder. Nonetheless, we observe some failed cases (see Row 3), especially when the input CT image contains diffuse disease, which appears as an opacity in the lung. Genesis Chest CT happens to “remove” those opaque regions and restore a much clearer lung. This may be due to the fact that the majority of cropped sub-volumes are normal and are being used as ground truth, which empowers the pre-trained model with capabilities of detecting and restoring “novelties” in the CT scans. More specifically, in our work, these novelties include abnormal intensity distribution injected by non-linear transformation, atypical texture and boundary injected by local-shuffling, and discontinuity injected by both inner and outer cutout. Based on the surrounding anatomical structure, the model predicts the opaque area to be air, therefore restoring darker intensity values. This behavior is certainly a “mistake” in terms of image restoration, but it can also be thought of as an attempt to detect diffuse diseases in the lung, which is challenging to annotate due to their unclear boundary. By training an image restoration task, the diseased area will be revealed by simple subtraction of the input and output. More importantly, this suggested detection approach requires zero human annotation, neither image-level label nor pixel-level contour, contrasting from the existing weakly supervised disease detection approaches (Zhou et al., 2016; Baumgartner et al., 2018; Cai et al., 2018; Siddiquee et al., 2019).

Fig. 9: [Better viewed on-line and zoomed in for details] Examples of image restoration using Genesis Chest CT. We pass unseen CT images (Column 1) to the pre-trained model, obtaining the restored images (Column 2). The difference between input and output has been shown in Column 3. In most of the normal cases, such as those in Rows 1–2, Genesis Chest CT can perform a fairly reasonable identical-mapping. Meanwhile, for some cases that contain opacity in the lung, as illustrated in Row 3, Genesis Chest CT tends to restore a clearer lung. As a result, the diffuse region is revealed in the difference map automatically. We have zoomed in the region for a better visualization and comparison.

6. Related Work

With the splendid success of deep neural networks, transfer learning (Pan and Yang, 2010; Weiss et al., 2016; Yosinski et al., 2014) has become integral to many applications, especially medical imaging (Greenspan et al., 2016; Litjens et al., 2017; Lu et al., 2017; Shen et al., 2017; Wang et al., 2017a; Zhou et al., 2017, 2019b). This immense popularity of transfer learning is attributed to the learned image representation, which offers convergence speedups and performance gains for most target tasks, in particular, with limited annotated data. In the following sections, we review the works related to supervised and self-supervised representation learning.

6.1. Supervised representation learning

ImageNet contains more than fourteen million images that have been manually annotated to indicate which objects are present in each image; and more than one million of the images have actually been annotated with the bounding boxes of the objects in the image. Pre-training a model on ImageNet and then fine-tuning it on different medical imaging tasks has seen the most practical adoption in medical image analysis (Shin et al., 2016; Tajbakhsh et al., 2016). To classify the common thoracic diseases from ChestX-ray14 dataset, as evidenced in Irvin et al. (2019), nearly all the leading methods (Guan and Huang, 2018; Guendel et al., 2018; Ma et al., 2019; Tang et al., 2018) follow the paradigm of “fine-tuning Models ImageNet” by adopting different architectures, such as ResNet (He et al., 2016) and DenseNet (Huang et al., 2017), along with their pre-trained weights. Other representative medical applications include identifying skin cancer from dermatologist level photographs (Esteva et al., 2017), offering early detection of Alzheimer’s Disease (Ding et al., 2018), and performing effective detection of pulmonary embolism (Tajbakhsh et al., 2019b).

Despite the remarkable transferability of Models ImageNet, pre-trained 2D models offer little benefits towards 3D medical imaging tasks in the most prominent medical modalities (e.g., CT and MRI). To fit this paradigm, 3D imaging tasks have to be reformulated and solved in 2D or 2.5D (Roth et al., 2015, 2014; Tajbakhsh et al., 2015), thus losing rich 3D anatomical information and inevitably compromising the performance. Annotating 3D medical images at the similar scale with ImageNet requires a significant research effort and budget. It is currently not feasible to create annotated datasets comparable to this size for every 3D medical application. Consequently, for lung cancer risk malignancy estimation, Ardila et al. (2019) resorted to incorporate 3D spatial information by using Inflated 3D (I3D) (Carreira and Zisserman, 2017), trained from the Kinetics dataset, as the feature extractor. Evidenced by Table 3, it is not the most favorable choice owing to the large domain gap between the temporal video and medical volume. This limitation has led to the development of model zoo in NiftyNet (Gibson et al., 2018b). However, they were trained with small datasets for specific applications (e.g., brain parcellation and organ segmentation), and were never intended as source models for transfer learning. Our experimental results in Table 3 indicate that NiftyNet models offer limited benefits to the five target medical applications via transfer learning. More recently, Chen et al. (2019b) have pre-trained 3D residual network by jointly segmenting the objects annotated in a collection of eight medical datasets, resulting in MedicalNet for 3D transfer learning. In Table 3, we have examined the pre-trained MedicalNet on five target tasks in comparison with our Models Genesis. As reviewed, each and every aforementioned pre-trained model requires massive, high-quality annotated datasets. However, seldom do we have a perfectly-sized and systematically-labeled dataset to pre-train a deep model in medical imaging, where both data and annotations are expensive to acquire. We overcome the above limitation via self-supervised learning, which allows models to learn image representation from abundant unlabeled medical image data with zero human annotation effort.

6.2. Self-supervised representation learning

Aiming at learning image representation from unlabeled data, self-supervised learning research has recently experienced a surge in computer vision (Caron et al., 2018; Chen et al., 2019c; Doersch et al., 2015; Goyal et al., 2019; Jing and Tian, 2020; Mahendran et al., 2018; Mundhenk et al., 2018; Noroozi et al., 2018; Noroozi and Favaro, 2016; Pathak et al., 2016; Sayed et al., 2018; Zhang et al., 2016, 2017), but it is a relatively new trend in modern medical imaging. The key challenge for self-supervised learning is identifying a suitable self supervision task, i.e., generating input and output instance pairs from the data. Two of the preliminary studies include predicting the distance and 3D coordinates of two patches randomly sampled from the same brain (Spitzer et al., 2018), identifying whether two scans belong to the same person, and predicting the level of vertebral bodies (Jamaludin et al., 2017). Nevertheless, these two works are incapable of learning representation from “self-supervision” because they demand auxiliary information and specialized data collection such as paired and registered images. By utilizing only the original pixel/voxel information shipped with data, several self-supervised learning schemes have been developed for different medical applications: Ross et al. (2018) adopted colorization as proxy task, wherein color colonoscopy images are converted to gray-scale and then recovered using a conditional Generative Adversarial Network (GAN); Alex et al. (2017) pre-trained a stack of denoising auto-encoders, wherein the self-supervision was created by mapping the patches with the injected noise to the original patches; Chen et al. (2019a) designed image restoration as proxy task, wherein small regions were shuffled within images and then let models learn to restore the original ones; Zhuang et al. (2019) and Zhu et al. (2020) introduced a 3D representation learning proxy task by recovering the rearranged and rotated Rubik’s cube; and finally Tajbakhsh et al. (2019a) individualized self-supervised schemes for a set of target tasks. As seen, the previously discussed self-supervised learning schemes, both in computer vision and medical imaging, are developed individually for specific target tasks, therefore, the generalizability and robustness of the learned image representation have yet to be examined across multiple target tasks. To our knowledge, we are the first to investigate cross-domain self-supervised learning in medical imaging.

6.3. Our previous work

Zhou et al. (2019b) first presented generic autodidactic models for 3D medical imaging, which obtain common image representation that is transferable and generalizable across diseases, organs and modalities, overcoming the scalability issue associated with multiple tasks. This paper extends the preliminary version substantially with the following improvements.

1. We have introduced notations, formulas, and diagrams, as well as detailed methodology descriptions along with their learning objectives, for a succinct framework overview in Sec. 2.

2. We have extended the brain tumor segmentation experiment using MRI Flair images in Sec. 3.2, highlighting the transfer learning capabilities of Models Genesis from CT to MRI Flair domains.

3. We have conducted comprehensive ablation studies between the combined scheme and each of the individual learning schemes in Sec. 4.1, demonstrating that learning from multiple perspectives leads to a more robust target task performance.

4. We have investigated three different random initialization methods for 3D models in Sec. 4.2, suggesting that initializing with Models Genesis can offer much higher performances and faster convergences.

5. We have examined Models Genesis with the existing pre-trained 3D models on five distinct medical target tasks in Sec. 4.3, showing that with fewer parameters, Models Genesis surpass all publicly available 3D models in both generalizability and transferability.

6. We have provided experimental results on five target tasks using limited annotated data in Sec. 4.4, indicating that transfer learning from our Models Genesis can reduce annotation efforts by at least 30%.

7. We have investigated 3D sub-volume based approaches compared with 2D approaches fine-tuning from Models ImageNet using 2D/2.5D representation, underlining the power of pre-trained 3D models in Sec. 4.5.

7. Conclusion

A key contribution of ours is a collection of generic source models, nicknamed Models Genesis, built directly from unlabeled 3D imaging data with our novel unified self-supervised method, for generating powerful application-specific target models through transfer learning. While the empirical results are strong, surpassing state-of-the-art performances in most of the applications, our goal is to extend our Models Genesis to modality-oriented models, such as Genesis MRI and Genesis Ultrasound, as well as organ-oriented models, such as Genesis Brain and Genesis Heart. We envision that Models Genesis may serve as a primary source of transfer learning for 3D medical imaging applications, in particular, with limited annotated data. To benefit the research community, we make the development of Models Genesis open science, releasing our codes and models to the public. Creating all Models Genesis, an ambitious undertaking, takes a village; therefore, we would like to invite researchers around the world to contribute to this effort, and hope that our collective efforts will lead to the Holy Grail of Models Genesis, all powerful across diseases, organs, and modalities.

Acknowledgments

This research has been supported partially by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant, and partially by the National Institutes of Health (NIH) under Award Number R01HL128785. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. This work has utilized the GPUs provided partially by the ASU Research Computing and partially by the Extreme Science and Engineering Discovery Environment (XSEDE) funded by the National Science Foundation (NSF) under grant number ACI-1548562. We thank Z. Guo for implementing Rubik’s Cube (Zhuang et al., 2019) and the 3D version of Jigsaw (Noroozi and Favaro, 2016) and DeepCluster (Caron et al., 2018); F. Haghighi and M. R. Hosseinzadeh Taher for implementing the 3D version of in-painting (Pathak et al., 2016), patch-shuffling (Chen et al., 2019a), and working with Z. Guo in evaluating the performance of MedicalNet (Chen et al., 2019b); M. M. Rahman Siddiquee for examining NiftyNet (Gibson et al., 2018b) with our Models Genesis; P. Zhang for comparing two additional random initialization methods with our Models Genesis; S. Bajpai for comparing three loss functions of the proxy task; N. Tajbakhsh for revising our conference paper; R. Feng for valuable discussions; and S. Tatapudi for helping improve the writing of this paper. The content of this paper is covered by patents pending.

References

Alex, V., Vaidhya, K., Thirunavukkarasu, S., Kesavadas, C., Krishnamurthi, G., 2017. Semisupervised learning using denoising autoencoders for brain lesion detection and segmentation. Journal of Medical Imaging 4, 041311.

Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L., Tse, D., Etemadi, M., Ye, W., Corrado, G., et al., 2019. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature medicine 25, 954–961.

Aresta, G., Jacobs, C., Araújo, T., Cunha, A., Ramos, I., van Ginneken, B., Campilho, A., 2019. iw-net: an automatic and minimalistic interactive lung nodule segmentation deep network. Scientific reports 9, 1–9.

Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., et al., 2011. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics 38, 915–931.

Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R.T., Berger, C., Ha, S.M., Rozycki, M., et al., 2018. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv preprint arXiv:1811.02629.

Baumgartner, C.F., Koch, L.M., Can Tezcan, K., Xi Ang, J., Konukoglu, E., 2018. Visual feature attribution using wasserstein gans, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8309–8319.

Ben-Cohen, A., Diamant, I., Klang, E., Amitai, M., Greenspan, H., 2016. Fully convolutional network for liver segmentation and lesions detection, in: Deep learning and data labeling for medical applications. Springer, pp. 77–85.

Bilic, P., Christ, P.F., Vorontsov, E., Chlebus, G., Chen, H., Dou, Q., Fu, C.W., Han, X., Heng, P.A., Hesser, J., et al., 2019. The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056.

Buzug, T.M., 2011. Computed tomography, in: Springer Handbook of Medical Technology. Springer, pp. 311–342.

Cai, J., Lu, L., Harrison, A.P., Shi, X., Chen, P., Yang, L., 2018. Iterative attention mining for weakly supervised thoracic disease pattern localization in chest x-rays, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 589–598.

Caron, M., Bojanowski, P., Joulin, A., Douze, M., 2018. Deep clustering for unsupervised learning of visual features, in: Proceedings of the European Conference on Computer Vision, pp. 132–149.

Caron, M., Bojanowski, P., Mairal, J., Joulin, A., 2019. Unsupervised pre-training of image features on non-curated data, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2959–2968.

Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? a new model and the kinetics dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.

Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D., 2019a. Self-supervised learning for medical image analysis using image context restoration. Medical image analysis 58, 101539.

Chen, S., Ma, K., Zheng, Y., 2019b. Med3d: Transfer learning for 3d medical image analysis. arXiv preprint arXiv:1904.00625.

Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

Chen, T., Zhai, X., Ritter, M., Lucic, M., Houlsby, N., 2019c. Self-supervised gans via auxiliary rotation loss, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12154–12163.

Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V., 2019. Autoaugment: Learning augmentation strategies from data, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123.

Dao, T., Gu, A., Ratner, A.J., Smith, V., De Sa, C., Ré, C., 2019. A kernel theory of modern data augmentation. Proceedings of machine learning research 97, 1528.

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 248–255.

Ding, Y., Sohn, J.H., Kawczynski, M.G., Trivedi, H., Harnish, R., Jenkins, N.W., Lituiev, D., Copeland, T.P., Aboian, M.S., Mari Aparici, C., et al., 2018. A deep learning model to predict a diagnosis of alzheimer disease by using 18f-fdg pet of the brain. Radiology 290, 456–464.

Doersch, C., Gupta, A., Efros, A.A., 2015. Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.

Doersch, C., Zisserman, A., 2017. Multi-task self-supervised visual learning, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060.

Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T., 2015. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 38, 1734–1747.

Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S., 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115.

Forbes, G.B., 2012. Human body composition: growth, aging, nutrition, and activity. Springer Science & Business Media.

Gan, Z., Henao, R., Carlson, D., Carin, L., 2015. Learning deep sigmoid belief networks with data augmentation, in: Artificial Intelligence and Statistics, pp. 268–276.

Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B., Pereira, S.P., Clarkson, M.J., Barratt, D.C., 2018a. Automatic multi-organ segmentation on abdominal ct with dense v-networks. IEEE transactions on medical imaging 37, 1822–1834.

Gibson, E., Li, W., Sudre, C., Fidon, L., Shakir, D.I., Wang, G., Eaton-Rosen, Z., Gray, R., Doel, T., Hu, Y., et al., 2018b. Niftynet: a deep-learning platform for medical imaging. Computer methods and programs in biomedicine 158, 113–122.

Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.

Goyal, P., Mahajan, D., Gupta, A., Misra, I., 2019. Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235.

Graham, B., 2014. Fractional max-pooling. arXiv preprint arXiv:1412.6071.

Greenspan, H., van Ginneken, B., Summers, R.M., 2016. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging 35, 1153–1159.

Guan, Q., Huang, Y., 2018. Multi-label chest x-ray image classification via category-wise residual attention learning. Pattern Recognition Letters.

Guendel, S., Grbic, S., Georgescu, B., Liu, S., Maier, A., Comaniciu, D., 2018. Learning to recognize abnormalities in chest x-rays with location-aware dense networks, in: Iberoamerican Congress on Pattern Recognition, Springer. pp. 757–765.

He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1026–1034.

He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

Hendrycks, D., Mazeika, M., Kadavath, S., Song, D., 2019. Using self-supervised learning can improve model robustness and uncertainty, in: Advances in Neural Information Processing Systems, pp. 15637–15648.

Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2017. Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p. 3.

Hurst, R.T., Burke, R.F., Wissner, E., Roberts, A., Kendall, C.B., Lester, S.J., Somers, V., Goldman, M.E., Wu, Q., Khandheria, B., 2010. Incidence of subclinical atherosclerosis as a marker of cardiovascular risk in retired professional football players. The American journal of cardiology 105, 1107–1111.

Iizuka, S., Simo-Serra, E., Ishikawa, H., 2017. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36, 107.

Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Bach, F., Blei, D. (Eds.), Proceedings of the 32nd International Conference on Machine Learning, PMLR, Lille, France. pp. 448–456. URL: http://proceedings.mlr.press/v37/ioffe15.html.

Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al., 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. arXiv preprint arXiv:1901.07031.

Jamaludin, A., Kadir, T., Zisserman, A., 2017. Self-supervised learning for spinal mris, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 294–302.

Jing, L., Tian, Y., 2020. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Kang, G., Dong, X., Zheng, L., Yang, Y., 2017. Patchshuffle regularization. arXiv preprint arXiv:1707.07103.

Kinga, D., Adam, J.B., 2015. Adam: A method for stochastic optimization, in: International Conference on Learning Representations (ICLR).

Kolesnikov, A., Zhai, X., Beyer, L., 2019. Revisiting self-supervised visual representation learning, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1920–1929.

Liang, J., Bi, J., 2007. Computer aided detection of pulmonary embolism with tobogganing and mutiple instance classification in ct pulmonary angiography, in: Biennial International Conference on Information Processing in Medical Imaging, Springer. pp. 630–641.

Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical image analysis 42, 60–88.

LiTS, 2017. Results of all submissions for liver segmentation. URL: https://competitions.codalab.org/competitions/17094#results.

Lu, L., Zheng, Y., Carneiro, G., Yang, L., 2017. Deep learning and convolutional neural networks for medical image computing: precision medicine, high performance and large-scale datasets. Springer.

LUNA, 2016. Results of all submissions for nodule false positive reduction. URL: https://luna16.grand-challenge.org/results/.

Ma, Y., Zhou, Q., Chen, X., Lu, H., Zhao, Y., 2019. Multi-attention network for thoracic disease classification and localization, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 1378–1382.

Mahendran, A., Thewlis, J., Vedaldi, A., 2018. Cross pixel optical-flow similarity for self-supervised learning, in: Asian Conference on Computer Vision, Springer. pp. 99–116.

Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al., 2015. The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34, 1993.

Mortenson, M.E., 1999.
Mathematics for computer graphics applications. In- Hara, K., Kataoka, H., Satoh, Y., 2018. Can spatiotemporal 3d cnns retrace the dustrial Press Inc. history of 2d cnns and imagenet?, in: Proceedings of the IEEE Conference Mundhenk, T.N., Ho, D., Chen, B.Y., 2018. Improvements to context based on Computer Vision and Pattern Recognition, pp. 6546–6555. self-supervised learning., in: Proceedings of the IEEE Conference on Com- 18 Zongwei Zhou et al. / Medical Image Analysis (2020) puter Vision and Pattern Recognition, pp. 9339–9348. Sun, C., Guo, S., Zhang, H., Li, J., Chen, M., Ma, S., Jin, L., Liu, X., Li, NLST, 2011. Reduced lung-cancer mortality with low-dose computed tomo- X., Qian, X., 2017a. Automatic segmentation of liver tumors from multi- graphic screening. New England Journal of Medicine 365, 395–409. phase contrast-enhanced ct images based on fcns. Artificial intelligence in Noroozi, M., Favaro, P., 2016. Unsupervised learning of visual representations medicine 83, 58–66. by solving jigsaw puzzles, in: European Conference on Computer Vision, Sun, C., Shrivastava, A., Singh, S., Gupta, A., 2017b. Revisiting unreason- Springer. pp. 69–84. able e ectiveness of data in deep learning era, in: Proceedings of the IEEE Noroozi, M., Vinjimoor, A., Favaro, P., Pirsiavash, H., 2018. Boosting self- international conference on computer vision, pp. 843–852. supervised learning via knowledge transfer, in: Proceedings of the IEEE Sun, W., Zheng, B., Qian, W., 2017c. Automatic feature learning using multi- Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. channel roi based on deep structured algorithms for computerized lung can- Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Transactions on cer diagnosis. Computers in biology and medicine 89, 530–539. knowledge and data engineering 22, 1345–1359. Tajbakhsh, N., Gotway, M.B., Liang, J., 2015. 
Computer-aided pulmonary em- Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A., 2016. Con- bolism detection using a novel vessel-aligned multi-planar image represen- text encoders: Feature learning by inpainting, in: Proceedings of the IEEE tation and convolutional neural networks, in: International Conference on Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Medical Image Computing and Computer-Assisted Intervention, Springer. Perez, L., Wang, J., 2017. The e ectiveness of data augmentation in image pp. 62–69. classification using deep learning. arXiv preprint arXiv:1712.04621 . Tajbakhsh, N., Hu, Y., Cao, J., Yan, X., Xiao, Y., Lu, Y., Liang, J., Terzopoulos, Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., Nielsen, M., 2013. D., Ding, X., 2019a. Surrogate supervision for medical image analysis: Ef- Deep feature learning for knee cartilage segmentation using a triplanar con- fective deep learning from limited quantities of labeled data. arXiv preprint volutional neural network, in: International conference on medical image arXiv:1901.08707 . computing and computer-assisted intervention, Springer. pp. 246–253. Tajbakhsh, N., Shin, J.Y., Gotway, M.B., Liang, J., 2019b. Computer-aided Ratner, A.J., Ehrenberg, H., Hussain, Z., Dunnmon, J., Re, ´ C., 2017. Learn- detection and visualization of pulmonary embolism using a novel, com- ing to compose domain-specific transformations for data augmentation, in: pact, and discriminative image representation. Medical image analysis 58, Advances in neural information processing systems, pp. 3236–3246. 101541. Rawat, W., Wang, Z., 2017. Deep convolutional neural networks for image clas- Tajbakhsh, N., Shin, J.Y., Gurudu, S.R., Hurst, R.T., Kendall, C.B., Gotway, sification: A comprehensive review. Neural computation 29, 2352–2449. M.B., Liang, J., 2016. 
Convolutional neural networks for medical image Ross, T., Zimmerer, D., Vemuri, A., Isensee, F., Wiesenfarth, M., Bodenstedt, analysis: Full training or fine tuning? IEEE transactions on medical imaging S., Both, F., Kessler, P., Wagner, M., Muller ¨ , B., et al., 2018. Exploiting the 35, 1299–1312. potential of unlabeled endoscopic video data with self-supervised learning. Taleb, A., Loetzsch, W., Danz, N., Severin, J., Gaertner, T., Bergner, B., Lip- International journal of computer assisted radiology and surgery 13, 925– pert, C., 2020. 3d self-supervised methods for medical imaging. arXiv 933. preprint arXiv:2006.03829 . Roth, H.R., Lu, L., Liu, J., Yao, J., Se , A., Cherry, K., Kim, L., Summers, Tang, H., Zhang, C., Xie, X., 2019. Nodulenet: Decoupled false positive re- R.M., 2015. Improving computer-aided detection using convolutional neu- duction for pulmonary nodule detection and segmentation, in: International ral networks and random view aggregation. IEEE transactions on medical Conference on Medical Image Computing and Computer-Assisted Interven- imaging 35, 1170–1181. tion, Springer. pp. 266–274. Roth, H.R., Lu, L., Se , A., Cherry, K.M., Ho man, J., Wang, S., Liu, J., Tang, Y., Wang, X., Harrison, A.P., Lu, L., Xiao, J., Summers, R.M., 2018. Turkbey, E., Summers, R.M., 2014. A new 2.5 d representation for lymph Attention-guided curriculum learning for weakly supervised classification node detection using random sets of deep convolutional neural network ob- and localization of thoracic diseases on chest radiographs, in: International servations, in: International conference on medical image computing and Workshop on Machine Learning in Medical Imaging, Springer. pp. 249–258. computer-assisted intervention, Springer. pp. 520–527. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A., 2010. Sayed, N., Brattoli, B., Ommer, B., 2018. 
Cross and learn: Cross-modal self- Stacked denoising autoencoders: Learning useful representations in a deep supervision, in: German Conference on Pattern Recognition, Springer. pp. network with a local denoising criterion. Journal of machine learning re- 228–243. search 11, 3371–3408. Setio, A.A.A., Ciompi, F., Litjens, G., Gerke, P., Jacobs, C., Van Riel, S.J., Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E., 2018. Deep Wille, M.M.W., Naqibullah, M., Sanchez, ´ C.I., van Ginneken, B., 2016. Pul- learning for computer vision: A brief review. Computational intelligence monary nodule detection in ct images: false positive reduction using multi- and neuroscience 2018. view convolutional networks. IEEE transactions on medical imaging 35, Wang, H., Zhou, Z., Li, Y., Chen, Z., Lu, P., Wang, W., Liu, W., Yu, L., 2017a. 1160–1169. Comparison of machine learning methods for classifying mediastinal lymph Setio, A.A.A., Traverso, A., De Bel, T., Berens, M.S., van den Bogaard, C., node metastasis of non-small cell lung cancer from 18 f-fdg pet/ct images. Cerello, P., Chen, H., Dou, Q., Fantacci, M.E., Geurts, B., et al., 2017. Val- EJNMMI research 7, 11. idation, comparison, and combination of algorithms for automatic detection Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M., 2017b. of pulmonary nodules in computed tomography images: the luna16 chal- Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on lenge. Medical image analysis 42, 1–13. weakly-supervised classification and localization of common thorax dis- Shen, D., Wu, G., Suk, H.I., 2017. Deep learning in medical image analysis. eases, in: Proceedings of the IEEE Conference on Computer Vision and Annual review of biomedical engineering 19, 221–248. Pattern Recognition, pp. 2097–2106. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mol- Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality lura, D., Summers, R.M., 2016. 
Deep convolutional neural networks for assessment: from error visibility to structural similarity. IEEE transactions computer-aided detection: CNN architectures, dataset characteristics and on image processing 13, 600–612. transfer learning. IEEE transactions on medical imaging 35, 1285–1298. Weiss, K., Khoshgoftaar, T.M., Wang, D., 2016. A survey of transfer learning. Shorten, C., Khoshgoftaar, T.M., 2019. A survey on image data augmentation Journal of Big Data 3, 9. for deep learning. Journal of Big Data 6, 60. Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D., 2016. Understand- Siddiquee, M.M.R., Zhou, Z., Tajbakhsh, N., Feng, R., Gotway, M.B., Bengio, ing data augmentation for classification: when to warp?, in: 2016 interna- Y., Liang, J., 2019. Learning fixed points in generative adversarial networks: tional conference on digital image computing: techniques and applications From image-to-image translation to disease detection and localization, in: (DICTA), IEEE. pp. 1–6. Proceedings of the IEEE International Conference on Computer Vision, pp. Wu, B., Zhou, Z., Wang, J., Wang, Y., 2018. Joint learning for pulmonary 191–200. nodule segmentation, attributes and malignancy prediction, in: 2018 IEEE Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S., Dickscheid, T., 2018. 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE. Improving cytoarchitectonic segmentation of human brain areas with self- pp. 1109–1113. supervised siamese networks, in: International Conference on Medical Im- Wu, S., Zhang, H.R., Valiant, G., Re, ´ C., 2020. On the generalization age Computing and Computer-Assisted Intervention, Springer. pp. 663–671. e ects of linear transformations in data augmentation. arXiv preprint Standley, T., Zamir, A.R., Chen, D., Guibas, L., Malik, J., Savarese, S., 2019. arXiv:2005.00695 . Which tasks should be learned together in multi-task learning? arXiv Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. 
How transferable are preprint arXiv:1905.07553 . features in deep neural networks?, in: Advances in neural information pro- Zongwei Zhou et al. / Medical Image Analysis (2020) 19 cessing systems, pp. 3320–3328. Appendix A. Implementation details of revised baselines Zhang, L., Qi, G.J., Wang, L., Luo, J., 2019. Aet vs. aed: Unsupervised representation learning by auto-encoding transformations rather than data, This work is among the first e ort to create a comprehen- in: Proceedings of the IEEE Conference on Computer Vision and Pattern sive benchmark for existing self-supervised learning methods Recognition, pp. 2547–2555. Zhang, R., Isola, P., Efros, A.A., 2016. Colorful image colorization, in: Pro- for 3D medical image analysis. We have extended the six ceedings of the European Conference on Computer Vision, Springer. pp. most representative self-supervised learning methods into their 649–666. 3D versions, including De-noising (Vincent et al., 2010), In- Zhang, R., Isola, P., Efros, A.A., 2017. Split-brain autoencoders: Unsupervised painting (Pathak et al., 2016), Jigsaw (Noroozi and Favaro, learning by cross-channel prediction, in: Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition, pp. 1058–1067. 2016), and Patch-shuing (Chen et al., 2019a). These meth- Zhang, T., 2004. Solving large scale linear prediction problems using stochastic ods were originally introduced for the purpose of 2D imag- gradient descent algorithms, in: Proceedings of the twenty-first international ing. On the other hand, the most recent 3D self-supervised conference on Machine learning, ACM. p. 116. method (Zhuang et al., 2019) learns representation by playing a Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A., 2016. Learning deep features for discriminative localization, in: Proceedings of the IEEE Rubik’s cube. We have reimplemented it because their ocial Conference on Computer Vision and Pattern Recognition, pp. 2921–2929. 
implementation is not publicly available at the time this paper is Zhou, Z., Shin, J., Feng, R., Hurst, R.T., Kendall, C.B., Liang, J., 2019a. Inte- written. All of the models are pre-trained using the LUNA 2016 grating active learning and transfer learning for carotid intima-media thick- dataset (Setio et al., 2017) with the same sub-volumes extracted ness video interpretation. Journal of digital imaging 32, 290–299. Zhou, Z., Shin, J., Zhang, L., Gurudu, S., Gotway, M., Liang, J., 2017. Fine- from CT scans as our models (see Sec. 3.1). The detailed im- tuning convolutional neural networks for biomedical image analysis: ac- plementations of the baselines are elaborated in the following tively and incrementally, in: Proceedings of the IEEE Conference on Com- sections. puter Vision and Pattern Recognition, pp. 7340–7349. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2018. Unet++: A nested u-net architecture for medical image segmentation, in: Deep Learning Appendix A.1. Extended 3D De-noising in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 3–11. In our 3D De-noising, which is inspired by its 2D coun- Zhou, Z., Sodha, V., Rahman Siddiquee, M.M., Feng, R., Tajbakhsh, N., Got- terpart (Vincent et al., 2010), the model is trained to restore way, M.B., Liang, J., 2019b. Models genesis: Generic autodidactic models the original sub-volume from its transformed one with addi- for 3d medical image analysis, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Springer International Publishing, tive Gaussian noise (randomly sampling  2 [0; 0:1]). To cor- Cham. pp. 384–393. URL: https://link.springer.com/chapter/ rectly restore the original sub-volume, models are required to 10.1007/978-3-030-32251-9_42. learn Gabor-like edge detectors when denoising transformed Zhu, J., Li, Y., Hu, Y., Ma, K., Zhou, S.K., Zheng, Y., 2020. Rubik’s cube+: A sub-volumes. 
Following the proposed image restoration train- self-supervised feature learning framework for 3d medical image analysis. Medical Image Analysis , 101746. ing scheme, the auto-encoder network is replaced with a 3D Zhuang, X., Li, Y., Hu, Y., Ma, K., Yang, Y., Zheng, Y., 2019. Self-supervised U-Net, wherein the input is a 64 64 32 sub-volume that has feature learning for 3d medical images by playing a rubik’s cube, in: Inter- undergone Gaussian noise and the output is the restored sub- national Conference on Medical Image Computing and Computer-Assisted volume. The L2 distance between input and output is used as Intervention, Springer. pp. 420–428. the loss function. Appendix A.2. Extended 3D In-painting In our 3D In-painting, which is inspired by its 2D counter- part (Pathak et al., 2016), the model is trained to in-paint ar- bitrary cutout regions based on the rest of the sub-volume. A qualitative illustration of the image in-painting task is shown in the right panel of Fig. A.11(a). To correctly predict missing regions, networks are required to learn local continuities of or- gans in medical images via interpolation. Unlike the original in-painting, the adversarial loss and discriminator are excluded from our implementation of the 3D version because our primary goal is to empower models with generic representation, rather than generating sharper and realistic sub-volumes. The genera- tor is a 3D U-Net, consisting of an encoder and a decoder. The input of the encoder is a 64 64 32 sub-volume that needs to be in-painted. Their decoder works di erently than our inner- cutout because it predicts the missing region only, and there- fore, the loss is just computed on the cutout region—an ablation study on the loss has been further presented in Appendix C.2. Appendix A.3. Extended 3D Jigsaw In our 3D Jigsaw, which is inspired by its 2D counter- part (Noroozi and Favaro, 2016), we utilize the implementation 20 Zongwei Zhou et al. / Medical Image Analysis (2020) Fig. 
A.10: A direct comparison between global patch shuing (Chen et al., 2019a) and our local pixel shuing. (a) illustrates ten example images undergone local-shuing and patch-shuing independently. As seen, the overall anatomical structure such as individual organs, blood vessels, lymph nodes, and other soft tissue structures are preserved in the transformed image through local-shuing. (b) presents the performance on five target tasks, showing that models pre-trained by our local-shuing noticeably outperform those pre-trained by patch-shuing for cross-domain transfer learning (BMS). by Taleb et al. (2020) , wherein the puzzles are created by sam- tecture, where the encoder and decoder serve as analysis and pling a 3  3  3 grid of 3D patches. Then, these patches are restoration parts, respectively. shued according to an arbitrary permutation, selected from a set of predefined permutations. This set with size P = 100 is Appendix A.5. Extended 3D DeepCluster chosen out of the (333)! possible permutations, by following In our 3D DeepCluster, which is inspired by its 2D coun- the Hamming distance based algorithm, and each permutation terpart (Caron et al., 2018), we iteratively cluster deep fea- is assigned an index. As a result, the problem is cast as a P-way tures extracted from sub-volumes by k-means and use the sub- classification task, i.e., the model is trained to recognize the ap- sequent assignments as supervision to update the weights of plied permutation index, allowing us to solve the 3D puzzles the model. Through clustering, the model can obtain useful eciently. We build the classification model by taking the en- general-purpose visual features, requiring little domain knowl- coder of 3D U-Net and appending a sequence of f c layers. In edge and no specific signal from the inputs. We replaced orig- the implementation, we minimize the cross-entropy loss of the inal AlexNet/VGG architecture with the encoder of 3D U-Net list of extracted puzzles. 
to process 3D input sub-volumes. The number of clusters that works best for 2D tasks may not be a good choice for 3D tasks. Appendix A.4. Extended 3D Patch-shuing To ensure a fair comparison, we extensively tune this hyper- parameter in f10; 20; 40; 80; 160; 320g and finally set to 260 In our 3D Patch-shuing, which is inspired by its 2D coun- from the narrowed down search space of f240; 260; 280g. Un- terpart (Chen et al., 2019a), the model learns image represen- like ImageNet models for 2D imaging tasks, there is no avail- tation by restoring the image context. Given a sub-volume, we able pre-trained 3D feature extractor for medical imaging tasks; randomly select two isolated small 3D patches and swap their therefore, we randomly initialize the model weights at the be- context. We set the length, width, and height of the 3D patch ginning. Our Models Genesis, the first generic 3D pre-trained to be proportional to those in the entire sub-volume by 25% to models, could potentially be used as the 3D feature extractor 50%. Repeating this process for T = 10 times can generate the and co-trained with 3D DeepCluster. transformed sub-volume (see examples in Fig. A.10(a)). The model is trained to restore the original sub-volume, where L2 Appendix A.6. Rubik’s cube distance between input and output is used as the loss function. To process volumetric input and ensure a fair comparison with We implement Rubik’s Cube with respect to Zhuang et al. other baselines, we replace their U-Net with 3D U-Net archi- (2019), which consists of cube rearrangement and cube rota- tion. Like playing a Rubik’s cube, this proxy task enforces mod- els to learn translational and rotational invariant features from Self-Supervised 3D Tasks: github.com/HealthML/self-supervised-3d-tasks raw 3D data. Given a sub-volume, we partition it into a 2 2 2 Zongwei Zhou et al. / Medical Image Analysis (2020) 21 Fig. A.11: A direct comparison between image in-painting (Pathak et al., 2016) and our inner-cutout. 
(a) contrasts our inner-cutout with in- painting, wherein the model in the former scheme computes loss on the entire image and the model in the latter scheme computes loss only for the cutout area. (b) presents the performance on five target tasks, showing that inner-cutout is better suited for target classification tasks (e.g., NCC and ECC), while in-painting is more helpful for target segmentation tasks (e.g., NCS, LCS, and BMS). grid of cubes. In addition to predicting orders (3D Jigsaw), this chest region in CT modality and applied an encoder-decoder proxy task permutes the cubes with random rotations, forcing architecture that is similar to our work. We directly adopt the models to predict the orientation. Following the original pa- pre-trained weights of the dense V-Net architecture provided by per, we limit the directions for cube rotation, i.e., only allowing NiftyNet, so it carries a smaller number of parameters than our 180 horizontal and vertical rotations, to reduce the complexity 3D U-Net (2.60M vs. 16.32M). For target classification tasks, of the task. The eight cubes are then fed into a Siamese network we use the dense V-Net encoder by appending a sequence of f c with eight branches sharing the same weight to extract features. layers; for target segmentation tasks, we use the entire dense V- The feature maps from the last fully-connected or convolution Net. Since NiftyNet is developed in Tensorflow, all five target layer of all branches are concatenated and given as input to the tasks are re-implemented using their build-in configuration. For fully-connected layer of separate tasks, i.e., cube ordering and each target task, we have tuned hyper-parameters (e.g., learning orienting, which are supervised by permutation loss and rota- rate and optimizer) and applied extensive data augmentations tion loss, respectively, with equal weights. (e.g., rotation and scaling). Appendix B.2. Inflated 3D Appendix B. 
Configurations of publicly available models We download the Inflated 3D (I3D) model pre-trained from Flow streams in the Kinetics dataset (Hara et al., 2018) and fine- For publicly available models, we do not re-train their proxy tune it on our five target tasks. The input sub-volume is copied tasks and instead simply endeavor to find the best hyper- into two channels to align with the required input shape. For parameters for each of them in target tasks. We compare them target classification tasks, we take the pre-trained I3D and ap- with our Models Genesis in a user perspective, which might pend a sequence of randomly initialized fully-connected layers. seem to be unfair in a research perspective because many vari- For target segmentation tasks, we take the pre-trained I3D as ables are asymmetric among the competitors, such as program- the encoder and expand a decoder to predict the segmentation ming platform, model architecture, number of parameters, etc. map, resulting in a U-Net like architecture. The decoder is the However, the goal of this section is to experiment with existing same as that implemented in our 3D U-Net, consisting of up- ready-to-use pre-trained models under di erent medical tasks; sampling layers followed by a sequence of convolutional layers, therefore, we presume that all of the publicly available models batch normalization, and ReLU activation. Besides, four skip and their configurations have been carefully composed to the connections are built between the encoder and decoder, wherein optimal setting. feature maps before each pooling layer in the encoder are con- catenated with same-scale feature maps in the decoder. All of Appendix B.1. NiftyNet the layers in the model are trainable during transfer learning. Adam method (Kinga and Adam, 2015) with a learning rate of We examine the e ectiveness of fine-tuning from NiftyNet 1e 4 is used for optimization. in five target tasks. 
We should note that NiftyNet is not ini- tially designed for transfer learning but is one of the few pub- Appendix B.3. MedicalNet licly available supervised pre-trained 3D models. The model from Gibson et al. (2018a) has been considered as the baseline We download MedicalNet models (Chen et al., 2019b) that in our experiments because it has also been pre-trained on the have been pre-trained on eight publicly available 3D segmenta- 22 Zongwei Zhou et al. / Medical Image Analysis (2020) ˆ Global patch shuing preserves local information while distorting global structure; local pixel shuing maintains global structure but loses local details. ˆ For same-domain transfer learning (e.g., pre-training and fine-tuning in CT images), global-shuing and local- shuing reveal no significant di erence in terms of target task performance. Note that local-shuing is preferable when recognizing small objects in target tasks (e.g., pulmonary nodule and embolism), whereas patch- shuing is beneficial for large objects (e.g., brain tumor and liver). ˆ For cross-domain transfer learning (e.g., pre-training in CT and fine-tuning in MRI images), models pre-trained by our local-shuing noticeably outperform those pre-trained by patch-shuing. Appendix C.2. Compute loss on cutouts vs. entire images The results of our ablation study for in-painting and inner- cutout on five target tasks are presented in Fig. A.11. We set all the hyper-parameters the same except for one factor: where to compute MSE loss, only cutout areas or the entire image. In general, there is a marginal di erence in target segmentation Fig. C.12: We extensively search for the optimal size of cutout regions tasks, but inner-cutout is superior to in-painting in target clas- spanning from 0% to 90%, incremented by 10%. The points plotted sification tasks. These results are in line with our hypothesis within the red shade denote no significant di erence ( p > 0:05) from in Sec. 
3.1: the model must distinguish original versus trans- the pinnacle from the curve. The horizontal red and gray lines refer formed parts within the image, preserving the context if it is to the performances achieved by Models Genesis and learning from original and, otherwise, in-painting the context. Seemingly, in- scratch, respectively. This ablation study reveals that cutting 20%— painting that only computes loss on cutouts can fail to learn 40% of regions out could produce the most robust performance of tar- comprehensive representation as it is unable to leverage ad- get tasks. As a result, in our implementation, we cutout around 25% vancements from both ends. regions from each sub-volume. Appendix C.3. Masked area size in outer-cutout tion datasets. ResNet-50 and ResNet-101 backbones are chosen because they are reported by Chen et al. (2019b) as the most When applying cutout transformations to our self-supervised compelling backbones for target segmentation and classifica- learning framework, we have one hyper-parameter to evaluate, tion tasks, respectively. Like I3D, we append a decoder at the i.e., the size of cutout regions. Intuitively, it can influence the end of the pre-trained encoder, randomly initialize its weights, diculty of the image restoration task. To explore the impact and link the encoder with the decoder using skip connections. of this parameter on the performance of target tasks, we have Owing to the 3D ResNet backbones, the resultant segmentation conducted an ablation study to extensively search for the opti- network for MedicalNet is much heavier than our 3D U-Net. To mal value, spanning from 0% to 90%, incremented by intervals be consistent with the original programming platform of Medi- of 10%. Fig. C.12 shows the performance of all five target tasks calNet, we re-implement all five target tasks in PyTorch, using under di erent settings, suggesting that outer-cutout is robust the same data separation and augmentation. 
We report the high- to hyper-parameter changes to some extent. This finding is also est results achieved by any of the two backbones in Table 3. consistent with that recommended in the original in-painting paper, where Pathak et al. (2016) removed a number of smaller possibly overlapping masks, covering up to 1/4 of the image. Appendix C. Ablation experiments Altogether, we finally cutout less than 1/4 of the entire sub- volume in both outer and inner cutout implementations. Appendix C.1. Local pixel shuing vs. global patch shuing In the main paper, we have reported results of patch- Appendix D. Qualitative assessment of image restoration shuing (Chen et al., 2019a) as a baseline in Table 3 and our local-shuing in Fig. 3. To underline the value of preserving local and global structural consistency in the proxy task, we Since there is no such metric to directly determine the power provide an explicit comparison between the two counterparts in of image representation, rather than constrain the representa- Fig. A.10, arriving at three findings: tion, our paper aims to design an image restoration task to let the Zongwei Zhou et al. / Medical Image Analysis (2020) 23 model learn generic image representation from 3D medical im- ages. In doing so, as seen in Sec. 5.4, we have modified the def- inition of a good representation. As presented in Sec. 3.1, Gen- esis CT and Genesis X-ray are pre-trained on LUNA 2016 (Se- tio et al., 2017) and ChestX-ray8 (Wang et al., 2017b), respec- tively, using a series of self-supervised learning schemes with di erent image transformations. In this section, we have (1) il- lustrated more examples of our four individual transformations (i.e., non-linear, local-shuing, outer-cutout, and inner-cutout) in Fig. D.13; (2) evaluated the power of the pre-trained model by assessing restoration quality on previously unseen patients’ images not only from the LUNA 2016 dataset (see Fig. 
D.14), but also from different modalities, covering CT, X-ray, Ultrasound, and MRI (see Fig. D.15). The qualitative assessment shows that our pre-trained model is not merely overfitting on anatomical patterns in specific patients, but indeed can be robustly used for restoring images, and thus can be generalized to many target tasks.

To assess the image restoration quality at the time of inference, we pass the transformed images to the models that have been trained with different self-supervised learning schemes, including four individual schemes and one combined scheme. In our visualization, the input images have undergone four individual transformations as well as eight different combined transformations, including the identity mapping (i.e., no transformation). As shown in Fig. D.14, the combined scheme can restore the unseen image by handling a variety of transformations (framed in red), whereas the models trained with an individual scheme can only restore unseen images that have undergone the same transformation that they were trained on (framed in green). This qualitative observation is consistent with our experimental finding in Sec. 4.1: the combined learning scheme achieves superior and more robust results over the individual schemes in transfer learning.

In Fig. D.15, we have further provided a qualitative assessment of image restoration quality by Genesis CT and Genesis X-ray, across medical imaging modalities. In our visualization, the input images are selected from four different medical modalities, covering X-ray, CT, Ultrasound, and MRI. It is clear from the figure that even though the models are only trained on a single image modality, they can largely maintain the texture and structures during restoration, not only within the same modality but also across different ones. These observations suggest that Models Genesis are of great potential in transferring learned image representation across diseases, organs, datasets, and modalities.
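The cutout setting of Appendix C.3 and the restoration check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the helper names `inner_cutout` and `restoration_mse`, the block-size range, the uniform-noise fill, and the use of mean squared error as a score are all choices made for this example, and the identity function stands in for a pre-trained model.

```python
import numpy as np


def inner_cutout(volume, max_fraction=0.25, max_blocks=10, rng=None):
    """Replace random 3D blocks with uniform noise (simplified inner-cutout).

    Hypothetical helper: up to max_blocks blocks are masked, stopping early
    if the next block would bring the masked fraction to max_fraction or more,
    so strictly less than max_fraction of the voxels is altered.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = volume.astype(np.float32, copy=True)
    mask = np.zeros(volume.shape, dtype=bool)
    for _ in range(max_blocks):
        # Block edge lengths of roughly 1/8 to 1/4 of each axis (assumed range).
        size = tuple(int(rng.integers(max(1, s // 8), max(2, s // 4 + 1)))
                     for s in volume.shape)
        start = tuple(int(rng.integers(0, s - b + 1))
                      for s, b in zip(volume.shape, size))
        block = tuple(slice(a, a + b) for a, b in zip(start, size))
        trial = mask.copy()
        trial[block] = True
        if trial.mean() >= max_fraction:
            break
        mask = trial
        out[block] = rng.random(size, dtype=np.float32)
    return out, mask


def restoration_mse(original, restored, mask=None):
    """Mean squared error between two volumes, optionally only inside mask."""
    diff = (np.asarray(original, dtype=np.float32)
            - np.asarray(restored, dtype=np.float32)) ** 2
    return float(diff[mask].mean()) if mask is not None else float(diff.mean())


rng = np.random.default_rng(0)
x = rng.random((32, 32, 16), dtype=np.float32)  # stand-in for a normalized sub-volume
x_t, mask = inner_cutout(x, rng=rng)


def identity_restore(v):
    """Placeholder for a pre-trained restoration model (returns input as is)."""
    return v


print("masked fraction below 1/4:", bool(mask.mean() < 0.25))
print("MSE inside cutouts:", restoration_mse(x, identity_restore(x_t), mask))
```

Replacing `identity_restore` with a trained model and comparing the score inside and outside the cutouts would give a numerical counterpart to the visual comparison in Fig. D.14, under the same assumptions.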
Fig. D.13: Illustration of the proposed four image transformations. For simplicity and clarity, we illustrate the transformation on a 2D CT slice, but our Genesis Chest CT is trained using 3D sub-volumes directly, transformed in a 3D manner; our 3D image transformations, with the exception of the non-linear transformation, cannot be approximated in 2D. For ease of understanding, in (a) non-linear transformation, we have displayed an image undergoing different translating functions in Columns 2 to 7; in (b) local-shuffling, (c) outer-cutout, and (d) inner-cutout transformation, we have illustrated each of the processes step by step in Columns 2 to 6, where the first and last columns denote the original images and the final transformed images, respectively. In local-shuffling (b), a different window is automatically generated and used in each step.

Fig. D.14: The left and right panels show the qualitative assessment of image restoration quality using Genesis CT and Genesis X-ray, respectively. These models are trained with different training schemes, including four individual schemes (Columns 3 to 6) and a combined scheme (Column 7). As illustrated in Fig. 1, each original image $x$ can possibly undergo twelve different transformations. We test the models with all these possible twelve transformed images $\tilde{x}$. We specify types of the image transformation $f(\cdot)$ for each row and the training scheme $g(\cdot)$ for each column. First of all, it can be seen that the models trained with individual schemes can restore previously unseen images that have undergone the same transformation very well (framed in green), but fail to handle other transformations. Taking the non-linear transformation $f_{NL}(\cdot)$ as an example, any individual training scheme besides the non-linear transformation itself cannot invert the pixel intensity from transformed whitish to the original blackish.
As expected, the model trained with the combined scheme successfully restores original images from various transformations (framed in red). Second, the model trained with the combined scheme is superior to other models even if they are trained with and tested on the same transformation. For example, in the local-shuffling case $f_{LS}(\cdot)$, the image recovered from the local-shuffling pre-trained model $g_{LS}(\cdot)$ is noisy and lacks texture. However, the model trained with the combined scheme $g_{NL,LS,OC,IC}(\cdot)$ generates an image with more underlying structures, which demonstrates that learning with augmented tasks can even improve the performance on each of the individual tasks. Third, the model trained with the combined scheme significantly outperforms models trained with individual training schemes when restoring images that have undergone seven different combined transformations (Rows 6 to 12). For example, the model trained with non-linear transformation $g_{NL}(\cdot)$ can only recover the intensity distribution in the transformed image that has undergone $f_{NL,IC}(\cdot)$ but leaves the inner cutouts unchanged. These observations suggest that the model trained with the proposed unified self-supervised learning framework can successfully learn general anatomical structures and yield promising transferability on different target tasks. The quality assessment of image restoration further confirms our experimental observation, provided in Sec. 4.1, that the combined learning scheme exceeds each individual scheme in transfer learning.

Fig. D.15: The left and right panels visualize the qualitative assessment of image restoration quality by Genesis CT and Genesis X-ray, respectively, across medical imaging modalities.
For testing, we use the pre-trained model to directly restore images from LUNA 2016 (CT) (Setio et al., 2017), ChestX-ray8 (X-ray) (Wang et al., 2017b), CIMT (Ultrasound) (Hurst et al., 2010; Zhou et al., 2019a), and BraTS (MRI) (Menze et al., 2015). Though the models are only trained on a single image modality, they can largely maintain the texture and structures during restoration not only within the same modality (framed in red), but also across different modalities.
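The count of twelve transformations referred to in the Fig. D.14 caption can be reproduced with a short enumeration. The sketch below assumes that non-linear (NL) and local-shuffling (LS) can each be switched on or off independently, while outer-cutout (OC) and inner-cutout (IC) are mutually exclusive. This exclusivity is an assumption of the example; it is adopted here because it yields the identity mapping, four individual transformations, and seven combined ones, which matches the rows described in the caption. The function name `enumerate_transformations` and the tuple encoding are illustrative only.

```python
from itertools import product


def enumerate_transformations():
    """List every transformation combination as a tuple of short labels.

    Assumption for this sketch: NL and LS are independent on/off switches,
    and at most one of OC / IC is applied.
    """
    combos = []
    for use_nl, use_ls, cutout in product((False, True), (False, True),
                                          (None, "OC", "IC")):
        labels = []
        if use_nl:
            labels.append("NL")
        if use_ls:
            labels.append("LS")
        if cutout is not None:
            labels.append(cutout)
        combos.append(tuple(labels))
    # Sort by number of components so the identity mapping comes first.
    return sorted(combos, key=len)


all_transforms = enumerate_transformations()
identity = [t for t in all_transforms if len(t) == 0]
individual = [t for t in all_transforms if len(t) == 1]
combined = [t for t in all_transforms if len(t) >= 2]

# Prints: 12 1 4 7
print(len(all_transforms), len(identity), len(individual), len(combined))
```

Under this encoding, the row $f_{NL,IC}(\cdot)$ of Fig. D.14 corresponds to the tuple `("NL", "IC")`.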

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Apr 9, 2020
