GlyMDB: Glycan Microarray Database and analysis toolset

Motivation: Glycan microarrays are capable of illuminating the interactions of glycan-binding proteins (GBPs) against hundreds of defined glycan structures, and have revolutionized the investigations of protein–carbohydrate interactions underlying numerous critical biological activities. However, it is difficult to interpret microarray data and identify structural determinants promoting glycan binding to glycan-binding proteins due to the ambiguity in microarray fluorescence intensity and complexity in branched glycan structures. To facilitate analysis of glycan microarray data alongside protein structure, we have built the Glycan Microarray Database (GlyMDB), a web-based resource including a searchable database of glycan microarray samples and a toolset for data/structure analysis.

Results: The current GlyMDB provides data visualization and glycan-binding motif discovery for 5203 glycan microarray samples collected from the Consortium for Functional Glycomics. The unique feature of GlyMDB is to link microarray data to PDB structures. The GlyMDB provides different options for database query, and allows users to upload their microarray data for analysis. After search or upload is complete, users can choose the criterion for binder versus non-binder classification. They can view the signal intensity graph including the binder/non-binder threshold followed by a list of glycan-binding motifs. One can also compare the fluorescence intensity data from two different microarray samples. A protein sequence-based search is performed using BLAST to match microarray data with all available PDB structures containing glycans. The glycan ligand information is displayed, and links are provided for structural visualization and redirection to other modules in GlycanStructure.ORG for further investigation of glycan-binding sites and glycan structures.
Availability and implementation: http://www.glycanstructure.org/glymdb.
Contact: wonpil@lehigh.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Downloaded from https://academic.oup.com/bioinformatics/article-abstract/36/8/2438/5678782 by DeepDyve user on 04 May 2020

1 Introduction

Glycans are abundant on the surface of both eukaryotic and prokaryotic cells, serving as the first cellular components encountered by approaching molecules, cells and pathogens. Therefore, glycans have critical roles in various biological processes, such as cell adhesion, signal transduction, host–pathogen interactions and immune activities (Varki, 2017). Glycosylation, the most common post-translational modification, is the enzymatic process to attach carbohydrates covalently to proteins and lipids to form glycoproteins and glycolipids (Apweiler et al., 1999). Glycans can also be specifically recognized and non-covalently bound by a set of proteins, known as glycan-binding proteins (GBPs) or lectins. Compared to the other three major classes of biomolecules (proteins, nucleic acids and lipids), glycans are the most heterogeneous by virtue of different anomeric configurations (α and β) of glycosidic linkages, multiple branched structures and various chemical modifications. Due to the complex nature of glycans, characterizing the binding specificities and identifying the binding determinants are major questions in glycobiology research.

In the past two decades, glycan microarrays have revolutionized the analysis of protein–glycan binding specificities. Glycan microarrays are composed of various saccharides, either chemically synthesized or purified from natural sources, immobilized on array surfaces (Rillahan and Paulson, 2011). They are incubated with increasing concentrations of a GBP, and fluorescence emission of either the fluorescent-tagged GBP or secondary reagent is measured, which illuminates the binding specificity and indirectly the affinity of the GBP toward various glycan structures (Heimburg-Molinaro et al., 2011). The Consortium for Functional Glycomics (CFG) has greatly expanded the availability of glycan array data by making the results of glycan array experiments publicly available. Though these data provide a rich insight into the specificities of GBPs, data interpretation remains challenging, particularly when the proteins have more than one binding motif or the glycans have subunits that block protein–glycan binding. Manual interpretation of complicated glycan array data is tedious and error-prone, and hence automated methods need to be developed to solve these problems in an efficient and robust manner.

There are a few web resources for glycan microarray database and related bioinformatics tools. Glycosciences Laboratory (https://glycosciences.med.ic.ac.uk/glycanLibraryIndex.html), the National Center for Functional Glycomics (https://ncfg.hms.harvard.edu/ncfg-data) and the CFG (http://functionalglycomics.org/) provide access to the microarray data from the experiments that they have performed. However, they do not provide analysis for protein–glycan interactions in terms of three-dimensional structures. GlycoPattern (Agravat et al., 2014) provides a set of tools supporting the analysis of glycan microarray data. While highly useful in identifying glycan determinants bound by GBPs, it only allows users to upload their own microarray data.
MCAW-DB (Hosoda et al., 2018) is a glycan profile database containing the multi-sequence alignment analysis of 1081 glycan microarray samples collected from the CFG, and it only focuses on glycan sequence alignment. In addition, there are several reported tools, such as GlycoViewer (Joshi et al., 2010) and GLAD (Mehta and Cummings, 2019) for microarray data visualization and mining, and GlycanMotifMiner (Cholleti et al., 2012) and GlycoSearch (Kletter et al., 2015) for glycan patterns discovery.

In this work, we present the Glycan Microarray DataBase (GlyMDB), a database of glycan microarray samples with a toolset for data analysis. The GlyMDB provides a user-friendly web interface that enables users to search from the database or upload their own microarray samples for data visualization, binder/non-binder classification, binding motif discovery and data comparison of two different samples. When protein sequence information is available, either included in the queried samples or provided by users when uploading data, BLAST (Camacho et al., 2009) search is performed to find relevant PDB entries based on the protein sequence identity. We discuss the GlyMDB web interface and usage of each tool in the following sections. A stepwise guide to use the GlyMDB is also available in glycanstructure.org/glymdb/howto.

Fig. 1. The GlyMDB search interface. (A) Upload glycan microarray data. Shown above, a spreadsheet named ‘my_microarray.xls’ is selected and protein sequence information is written in the textbox. (B) Search database. Users can select to make a query by the name (e.g. ‘DC-SIGN’), protein sequence or PDB ID (e.g. ‘1SL5’). When searching by protein sequence or PDB ID, users need to set the threshold of sequence identity. Three options can be used to filter the results by GBP species, family and glycan array version.

2 Materials and methods

2.1 Web interface

We obtained publicly available glycan array data from the CFG in the form of spreadsheet files, which are parsed automatically by Python scripts and the extracted information is stored and managed using sqlite3. The web interface of GlyMDB is built with Django and JavaScript. There are two options on the search interface: users can (i) upload microarray data in a spreadsheet file or (ii) search microarray data stored in the database (Fig. 1). To upload their own microarray data, users need to prepare a spreadsheet file containing three columns: glycan number, structure and signal intensity. An example is shown in the Supplementary Material S1. Optionally, users can specify the protein sequence information of the uploaded microarray data (if available), and the GlyMDB provides the option to perform PDB matching in the result page. To search the microarray data stored in GlyMDB, users can make a query by protein name, sequence or PDB ID. When users select to make a query by protein name, the GlyMDB attempts to match the keywords in both the lectin sample name and the description originally provided on the CFG website. Since not all microarray samples include protein sequence information, a query is made among the samples with protein sequence available if users select to search by protein sequence or PDB ID. BLAST is used to align the queried protein sequence or the sequence associated to the queried PDB ID against the protein sequences of all stored microarray samples. By default, the threshold of sequence identity is 95%, but users can change the value based on their requirements. There are several filters available to narrow the query result by the species, family and glycan array version (Fig. 1).

When the upload or search is complete, the GlyMDB shows the result page (Figs 2 and 3). The result page starts with a list of samples (Figs 2A and 3A), which include the information extracted from the uploaded sample or the database query result. The same protein can have multiple entries in the sample list if there are data from the experiments on different versions of glycan arrays or under different protein concentrations. By selecting one entry from the sample list and one criterion for binder/non-binder classification (Fig. 2A), users can view the bar chart of fluorescence intensity and the threshold distinguishing binders from non-binders (Fig. 2B). In addition, the GlyMDB also provides users with the GLAD format input file (below the bar chart) in order to utilize the recently developed GLAD web application for the data visualization and analysis of glycan microarray data (Mehta and Cummings, 2019). The GlyMDB also shows a list of common motifs that make positive or negative contributions to binding interactions (Fig. 2C), which will be discussed in the following section. If users provide the protein sequence when uploading their own microarray data, or select a sample (stored in GlyMDB) with protein sequence available, the GlyMDB shows the option for matching the microarray data with PDB structures. BLAST search is performed to generate a list of PDB IDs with sequence identity above the selected threshold, and Glycan Reader (Jo et al., 2011; Park et al., 2017) is used to process the PDB files and to extract the information of glycan ligands (Fig. 2D). The links for structural visualization and PDB file download are provided (Fig. 2E). In addition, users are able to compare two samples by selecting two entries from the sample list, which can be two different GBPs, or different samples of the same GBP (Fig. 3A). The GlyMDB uses a heatmap to show the similarity and difference of fluorescence intensity between two selected samples (Fig. 3B).

Fig. 2. The GlyMDB result interface. (A) To view the results of binder/non-binder classification and motifs discovery, users should select only one sample from the list. The PDB file matching function works only if the selected sample has protein sequence information available. (B) The bar chart of fluorescence intensity and the sorted lists of binders and non-binders. (C) Top ranking motifs that make significant contributions to protein–glycan binding. (D) The list of PDB files matched to the selected microarray sample. There are links for visualizing PDB structures, downloading PDB files from RCSB, searching PDB IDs in our glycan-binding site database and searching glycan ligands in our glycan fragment database. (E) Structural visualization of the PDB files.

2.2 Glycan microarray data analysis toolset

2.2.1 Binder/non-binder classification

To split the glycan array into binders and non-binders, users need to select a threshold of fluorescence intensity. In GlyMDB, there are two options: P-value of z-score, and percentage of maximum intensity. A z-score is a measure of how many standard deviations a data point is above or below the mean of population. It is used as a statistical test for the significance of a sample with the null hypothesis that a randomly selected sample is a non-binder. The threshold is set to a P-value converted from a z-score. Though the default value is 0.15, which is the same as the one used in GLYMMR (Cholleti et al., 2012), users can choose any number between 0 and 1. As the second option, users can use a certain percentage of the highest fluorescence intensity as the threshold. The default value is 10%, which means that a glycan is classified to be a binder if its fluorescence intensity exceeds 10% of the maximum intensity observed on the glycan array.

2.2.2 Glycan-binding motif discovery

Though glycan array data can illuminate the binding affinity of GBPs toward a variety of defined glycans on the array, additional data interpretation is necessary to identify the glycan structure determinants for GBP specificities. To discover the motifs that make significant contributions to protein–glycan binding, the GlyMDB first breaks each glycan into fragments and then takes advantage of the support vector machine (SVM) algorithm to select the fragments with highest importance. We chose to use the SVM with a linear kernel rather than other kernels because it outputs the weights assigned to input features (i.e. glycan fragments), and we use these weights as the measurement for the importance of each glycan fragment. As shown in Figure 4A, a given glycan sequence is fragmented by enumerating all connected substructures (Jo and Im, 2013). After that, we obtain a set of unique fragments, each of which is present in at least one glycan. For example, the CFG version 5.1 array has 610 glycans in total. For the same glycan attached to different spacer arms, we only keep the one with the highest fluorescence intensity. Consequently, there are 541 unique glycans and we have 14 973 unique fragments. For each of 541 glycans, the GlyMDB generates a fingerprint showing whether the glycan contains each of 14 973 fragments or not. The fingerprints are combined into a 541 × 14 973 binary matrix. Meanwhile, a binary array with a length of 541 is generated to indicate whether each glycan is a binder or not (Fig. 4B). They are the input for training an SVM classifier, and we use the SVM module in the Scikit-learn package, which is a free machine learning library for Python. After training the SVM classifier, we make a record of the weights assigned to each glycan fragment. The fragments with positive weights are considered to be the motifs that contribute to protein–glycan binding and those with negative weights are considered to be the motifs that prevent protein–glycan binding. Since the number of features (i.e. fragments) is much greater than the number of training samples, recursive feature elimination (Guyon et al., 2002) is used to recursively reduce the number of features (e.g. from 14 973 to 256 features for the CFG version 5.1 array). The final SVM classifier is trained on the remaining 256 features, and fragments with positive and negative weights are ranked separately. More details are given in the Supplementary Material S2. The GlyMDB ignores a fragment if it is a substructure of a larger fragment and the sub-fragment’s weight is less than or equal to the super-fragment’s weight. In addition, the top-ranked fragments with positive (or negative) weights are ignored if they are present in <5 or 1/3 of the total number of binders (or non-binders), whichever is less. In the final output, the GlyMDB displays up to five top-ranked positive and negative weight fragments (if any).

Fig. 4. Glycan fragment and fingerprint. (A) Glycan fragments are generated by enumerating all connected substructures of a given glycan sequence. (B) Glycan fingerprints are binary arrays indicating whether a glycan includes each fragment or not.

2.2.3 Glycan array sample comparison

To investigate whether a GBP has consistent binding specificity when an experiment is performed under different protein concentrations or using different glycan arrays, or to investigate whether two different lectins have similar binding specificity, one needs to compare the glycan microarray data of two different samples. The GlyMDB allows users to compare the microarray data of two different samples, and a heat map based on the fluorescence intensity of each glycan is made for one-to-one comparison (Fig. 3). The heat map is generated independently for two selected samples. For each sample, red corresponds to the maximum intensity and white corresponds to the intensity lower than the threshold for binder/non-binder classification. Thus, only binders are highlighted with red color and all non-binders are shown in white.

Fig. 3. The GlyMDB result interface for two sample comparison. (A) To compare two different microarray samples, users should select two samples from the list. (B) The heat map with one-to-one comparison of the intensity of each glycan in two selected samples. If users click a glycan ID, the glycan sequence is shown.

2.2.4 Cross-linking glycan microarray to PDB

Though microarray data contain substantial information about the specificity of GBPs, they do not provide three-dimensional structural information, such as the binding site on a protein for a glycan. In contrast, PDB files contain such general structural information, yet they are not enough to elucidate the specificity of proteins, since the glycan ligands in PDB structures are generally limited in length and variety. To bridge the gap between microarray data and PDB structures, the GlyMDB attempts to find all relevant PDB structures for each given microarray sample.

When users select or upload a microarray sample with protein sequence information available, we first use BLAST to query this sequence against all PDB sequences and record the PDB IDs if the sequence identities are above the user-specified threshold. Glycan Reader is used to automatically detect and annotate glycan units, glycosidic linkages and chemical modifications. By default, the list of PDB IDs is sorted by the length of the largest glycan ligand contained in each PDB structure, and users can filter the list by glycan ligand length. The PDB list can also be sorted in the order of sequence identity and PDB resolution. In addition, the links are provided for downloading PDB files from RCSB, searching PDB IDs in our glycan-binding site database and searching glycan ligands in our glycan fragment database (Jo and Im, 2013). The PDB search and visualization are illustrated with Aspergillus fumigatus lectin that shows specificity for glycans with terminal fucose, particularly the glycans with terminal Fucα1-2Galβ1-4(Fucα1-3)GlcNAc substructure. After PDB search, we found four PDB entries: 4D4U, 4AH4, 4AGT and 4AHA. As shown in Figure 5, 4D4U is a dimer containing multiple glycan-binding sites, and it can bind Fucα1-2Galβ1-4(Fucα1-3)GlcNAc with two different binding poses.

Fig. 5. PDB search and visualization. PDB ID 4D4U is matched to the microarray data of Aspergillus fumigatus lectin that shows specificity to glycans with terminal fucose. The PDB structure shows that the dimer in 4D4U contains multiple binding sites and binds Fucα1-2Galβ1-4(Fucα1-3)GlcNAc with different binding poses.

3 Results and discussion

As of June 2019, the GlyMDB contains 5203 glycan microarray samples collected from the CFG. Multiple experimental data of the same GBP on different glycan arrays (from version 1.0 to 5.2) or under different concentrations are counted as multiple samples. Among 5203 microarray samples, 1849 have protein sequence information available (Fig. 6A). We performed BLAST search against all protein sequences from PDB protein structures (as of June 2019) with sequence similarity >95%, and the numbers of matched PDB entries are shown in Figure 6B. We extracted the glycan information from each PDB file and the numbers of PDB entries containing glycan ligands are also shown in Figure 6B. Since multiple microarray samples can have the same protein sequence that is matched to the same PDB file, we removed redundancy, and there are 1965 unique PDB entries. A total of 771 out of 1965 PDB entries contain glycan ligands, and the length distribution of the largest glycan ligand in each PDB file is shown in Figure 6C.

Fig. 6. Statistics of GlyMDB and related PDB files. (A) Numbers of microarray samples and numbers of microarray samples with protein sequence information available, grouped by CFG glycan array versions. (B) Numbers of PDB structures and numbers of PDB structures containing glycan ligands, grouped by CFG glycan array versions. The sequence identity of the PDB structure(s) to the corresponding microarray sample is >95%. (C) Length distribution of the largest glycan ligand in each PDB file. Numbers of unique proteins are calculated by removing multiple PDB entries corresponding to the same protein.

Multivalency is common in protein–glycan interactions. In these cases, the glycan-binding sites occur between protein monomers instead of within one monomer. However, it is difficult, merely from the sequence, to identify whether the binding interaction is multivalent, which is one of the reasons that we wished to bridge microarray data to PDB structures. BLAST protein searches can find available multimeric protein structures even if the queried protein sequence only covers one monomer. With the multimeric structures, one can perform further investigations to locate the glycan-binding sites and deduce how the protein interacts with the glycan ligand.

In the current release of GlyMDB, we have some requirements on the format of user-uploaded microarray data. The glycan sequence should be represented by the text nomenclature recommended by CFG (http://www.functionalglycomics.org/static/consortium/Nomenclature.shtml). To make it easier for users to specify glycan structures, we will support more glycan sequence formats and glycan accession numbers, such as GlyTouCan ID (Tiemeyer et al., 2017), in the future release of GlyMDB. For structural visualization, we utilize NGL viewer (Rose et al., 2018), which is a web application for macromolecular structure visualization, and it is also the 3D structure viewer embedded in the RCSB Protein Data Bank website. NGL provides a comprehensive set of molecular representations and allows users to highlight and focus on each selected ligand. We plan to embed other 3D structure viewers, such as LiteMol (Sehnal et al., 2017), into our website, and users can take advantage of the features in different structure viewers and choose the one that satisfies their requirements.

4 Summary

We have described the development and usage of GlyMDB, which is an integrated platform for database query, user upload and data/structure analysis. It can assist glycoscientists in searching for publicly available microarray data and processing their own data. In addition to the functional features of binder/non-binder classification, glycan-binding motif discovery and glycan array sample comparison, the GlyMDB is the first tool to cross-link microarray samples to PDB structures, and this can supplement the structural information that is not included in microarray data. These tools are expected to be useful in investigating the specificity of protein–glycan interactions in both sequence and structural levels. The database will be updated quarterly and is freely available at http://www.glycanstructure.org.

Funding

This work was supported by the National Science Foundation [DBI-1707207 to W.I.]; and National Institutes of Health Grants [P41GM103694 to R.D.C., U01GM125267 to R.D.C., Will York].

Conflict of Interest: none declared.

References

Agravat,S.B. et al. (2014) GlycoPattern: a web platform for glycan array mining. Bioinformatics, 30, 3417–3418.
Apweiler,R. et al. (1999) On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim. Biophys. Acta, 1473, 4–8.
Camacho,C. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
Cholleti,S.R. et al. (2012) Automated motif discovery from glycan array data. OMICS, 16, 497–512.
Guyon,I. et al. (2002) Gene selection for cancer classification using support vector machines. Mach. Learn., 46, 389–422.
Heimburg-Molinaro,J. et al. (2011) Preparation and analysis of glycan microarrays. Curr. Protoc. Protein Sci., 12, 10.
Hosoda,M. et al. (2018) MCAW-DB: a glycan profile database capturing the ambiguity of glycan recognition patterns. Carbohydr. Res., 464, 44–56.
Jo,S. and Im,W. (2013) Glycan fragment database: a database of PDB-based glycan 3D structures. Nucleic Acids Res., 41, D470–D474.
Jo,S. et al. (2011) Glycan Reader: automated sugar identification and simulation preparation for carbohydrates and glycoproteins. J. Comput. Chem., 32, 3135–3141.
Joshi,H.J. et al. (2010) GlycoViewer: a tool for visual summary and comparative analysis of the glycome. Nucleic Acids Res., 38, W667–W670.
Kletter,D. et al. (2015) Exploring the specificities of glycan-binding proteins using glycan array data and the GlycoSearch software. Methods Mol. Biol., 1273, 203–214.
Mehta,A.Y. and Cummings,R.D. (2019) GLAD: GLycan Array Dashboard, a visual analytics tool for glycan microarrays. Bioinformatics, 35, 3536–3537.
Park,S.J. et al. (2017) Glycan Reader is improved to recognize most sugar types and chemical modifications in the Protein Data Bank. Bioinformatics, 33, 3051–3057.
Rillahan,C.D. and Paulson,J.C. (2011) Glycan microarrays for decoding the glycome. Annu. Rev. Biochem., 80, 797–823.
Rose,A.S. et al. (2018) NGL viewer: web-based molecular graphics for large complexes. Bioinformatics, 34, 3755–3758.
Sehnal,D. et al. (2017) LiteMol suite: interactive web-based visualization of large-scale macromolecular structure data. Nat. Methods, 14, 1121–1122.
Tiemeyer,M. et al. (2017) GlyTouCan: an accessible glycan structure repository. Glycobiology, 27, 915–919.
Varki,A. (2017) Biological roles of glycans. Glycobiology, 27, 3–49.
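
Section 2.2.1 describes two criteria for splitting a glycan array into binders and non-binders: a P-value converted from a z-score (default 0.15) and a percentage of the maximum intensity (default 10%). A minimal Python sketch of both criteria follows. It is an illustration under stated assumptions, not the GlyMDB implementation: the text does not say how the z-score is converted to a P-value, so the one-sided upper-tail normal probability, the use of the population standard deviation over all glycans in the sample, and the function names are all assumptions made here.

```python
import math
import statistics


def z_score_p_values(intensities):
    """Upper-tail P-value of each intensity's z-score.

    Assumption: mean and (population) standard deviation are taken over all
    glycans in the sample, and the null distribution is standard normal.
    """
    mean = statistics.mean(intensities)
    sd = statistics.pstdev(intensities)
    if sd == 0:
        # No variation at all: no glycan stands out from the others.
        return [1.0] * len(intensities)
    p_values = []
    for value in intensities:
        z = (value - mean) / sd
        # P(Z >= z) for a standard normal variable.
        p_values.append(0.5 * math.erfc(z / math.sqrt(2)))
    return p_values


def binders_by_p_value(intensities, p_threshold=0.15):
    """Binder if the P-value is below the threshold (0.15 is the stated default)."""
    return [p < p_threshold for p in z_score_p_values(intensities)]


def binders_by_percent_max(intensities, percent=10.0):
    """Binder if the intensity exceeds `percent` % of the array maximum."""
    cutoff = max(intensities) * percent / 100.0
    return [value > cutoff for value in intensities]


# Invented intensities for eight glycans; two of them clearly stand out.
example = [100, 120, 90, 110, 5000, 80, 4500, 95]
print(binders_by_p_value(example))
print(binders_by_percent_max(example))
```

On real arrays the two criteria need not agree: the P-value criterion depends on the spread of the whole sample, whereas the percentage criterion depends only on the single strongest signal.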
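
Section 2.2.2 states that, for the same glycan attached to different spacer arms, only the entry with the highest fluorescence intensity is kept, and that each remaining glycan is turned into a binary fingerprint over the set of unique fragments. The sketch below illustrates these two bookkeeping steps in Python. It rests on assumptions not confirmed by the text: the spacer arm is taken to be a trailing `-Sp<number>` suffix on the structure string, the fragment sets are supplied by hand (enumerating connected substructures of a branched glycan is not attempted here), and all names and values are hypothetical.

```python
import re

# Assumption: the spacer arm appears as a trailing "-Sp<number>" suffix.
SPACER_SUFFIX = re.compile(r"-Sp\d+$")


def dedupe_spacers(rows):
    """Keep the highest intensity per glycan, ignoring the spacer arm.

    rows: iterable of (structure, intensity) pairs.
    Returns a dict mapping spacer-free structure -> highest intensity.
    """
    best = {}
    for structure, intensity in rows:
        glycan = SPACER_SUFFIX.sub("", structure)
        if glycan not in best or intensity > best[glycan]:
            best[glycan] = intensity
    return best


def fingerprints(fragments_by_glycan):
    """Build the binary glycan-by-fragment matrix.

    fragments_by_glycan: dict mapping glycan -> set of fragment strings.
    Returns (sorted list of unique fragments, list of 0/1 rows), one row
    per glycan in the dict's insertion order.
    """
    vocabulary = sorted(set().union(*fragments_by_glycan.values()))
    matrix = [
        [1 if fragment in fragments else 0 for fragment in vocabulary]
        for fragments in fragments_by_glycan.values()
    ]
    return vocabulary, matrix


# Invented rows: the same glycan printed on two different spacer arms.
rows = [
    ("Galb1-4GlcNAcb-Sp0", 1200),
    ("Galb1-4GlcNAcb-Sp8", 3400),
    ("Fuca1-2Galb-Sp8", 150),
]
unique = dedupe_spacers(rows)

# Hand-written fragment sets standing in for the real enumeration step.
fragment_sets = {
    "Galb1-4GlcNAcb": {"Galb", "GlcNAcb", "Galb1-4GlcNAcb"},
    "Fuca1-2Galb": {"Fuca", "Galb", "Fuca1-2Galb"},
}
vocabulary, matrix = fingerprints(fragment_sets)
print(unique)
print(vocabulary)
print(matrix)
```

For the CFG version 5.1 array mentioned in the text, the same construction would yield a 541 × 14 973 matrix, with the binder/non-binder labels held in a separate array of length 541.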
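
Section 2.2.2 also describes training a linear-kernel SVM with Scikit-learn on the fingerprint matrix, reducing the features by recursive feature elimination, and reading positive and negative weights as binding and non-binding motifs. The sketch below shows one way this could look using `sklearn.svm.SVC` and `sklearn.feature_selection.RFE`. The exact estimator class, the regularization constant `C`, the elimination step size and the toy data are assumptions made for illustration; the authors' actual settings are in their Supplementary Material S2 and are not reproduced here.

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Toy fingerprint matrix (rows: glycans, columns: fragments); invented data.
fragments = ["Fuca1-2Galb", "Neu5Aca2-3Galb", "Galb", "GlcNAcb"]
X = [
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
# 1 = binder, 0 = non-binder (for example from a threshold on intensity).
y = [1, 1, 1, 0, 0, 0]

# A linear kernel exposes one weight per input feature through coef_.
# C=1.0 and step=1 are illustrative choices, not the published settings.
selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=2, step=1)
selector.fit(X, y)

# Fragments that survived the elimination, in their original column order.
selected = [name for name, keep in zip(fragments, selector.support_) if keep]

# Weights of the final classifier, which was refit on the surviving features.
weights = dict(zip(selected, selector.estimator_.coef_[0]))

positive_motifs = sorted((f for f in selected if weights[f] > 0),
                         key=lambda f: -weights[f])
negative_motifs = sorted((f for f in selected if weights[f] < 0),
                         key=lambda f: weights[f])
print("candidate binding motifs:", positive_motifs)
print("candidate non-binding motifs:", negative_motifs)
```

In this toy matrix the fragment present in every glycan and the fragment spread evenly across both classes carry no discriminating information, so they are the ones eliminated. The additional filters described in the text (dropping sub-fragments whose weight does not exceed that of a super-fragment, and dropping rarely occurring fragments) are not shown.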

Publisher
Oxford University Press
Copyright
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
eISSN
1367-4811
DOI
10.1093/bioinformatics/btz934

Abstract

Motivation: Glycan microarrays are capable of illuminating the interactions of glycan-binding proteins (GBPs) against hundreds of defined glycan structures, and have revolutionized the investigations of protein–carbohydrate interactions underlying numerous critical biological activities. However, it is difficult to interpret microarray data and identify structural determinants promoting glycan binding to glycan-binding proteins due to the ambiguity in micro- array fluorescence intensity and complexity in branched glycan structures. To facilitate analysis of glycan micro- array data alongside protein structure, we have built the Glycan Microarray Database (GlyMDB), a web-based re- source including a searchable database of glycan microarray samples and a toolset for data/structure analysis. Results: The current GlyMDB provides data visualization and glycan-binding motif discovery for 5203 glycan micro- array samples collected from the Consortium for Functional Glycomics. The unique feature of GlyMDB is to link microarray data to PDB structures. The GlyMDB provides different options for database query, and allows users to upload their microarray data for analysis. After search or upload is complete, users can choose the criterion for bind- er versus non-binder classification. They can view the signal intensity graph including the binder/non-binder thresh- old followed by a list of glycan-binding motifs. One can also compare the fluorescence intensity data from two differ- ent microarray samples. A protein sequence-based search is performed using BLAST to match microarray data with all available PDB structures containing glycans. The glycan ligand information is displayed, and links are provided for structural visualization and redirection to other modules in GlycanStructure.ORG for further investigation of glycan-binding sites and glycan structures. Availability and implementation: http://www.glycanstructure.org/glymdb. 
Contact: wonpil@lehigh.edu Supplementary information: Supplementary data are available at Bioinformatics online. three major classes of biomolecules (proteins, nucleic acids and lip- 1 Introduction ids), glycans are the most heterogeneous by virtue of different Glycans are abundant on the surface of both eukaryotic and pro- anomeric configurations (a and b) of glycosidic linkages, multiple karyotic cells, serving as the first cellular components encountered branched structures and various chemical modifications. Due to the by approaching molecules, cells and pathogens. Therefore, glycans complex nature of glycans, characterizing the binding specificities have critical roles in various biological processes, such as cell adhe- and identifying the binding determinants are major questions in gly- sion, signal transduction, host–pathogen interactions and immune cobiology research. activities (Varki, 2017). Glycosylation, the most common post- In the past two decades, glycan microarrays have revolutionized translational modification, is the enzymatic process to attach carbo- the analysis of protein–glycan binding specificities. Glycan microar- hydrates covalently to proteins and lipids to form glycoproteins and rays are composed of various saccharides, either chemically synthe- glycolipids (Apweiler et al., 1999). Glycans can also be specifically sized or purified from natural sources, immobilized on array recognized and non-covalently bound by a set of proteins, known as surfaces (Rillahan and Paulson, 2011). They are incubated with glycan-binding proteins (GBPs) or lectins. Compared to the other increasing concentrations of a GBP, and fluorescence emission of V The Author(s) 2019. Published by Oxford University Press. All rights reserved. 
For permissions, please e-mail: journals.permissions@oup.com 2438 Downloaded from https://academic.oup.com/bioinformatics/article-abstract/36/8/2438/5678782 by DeepDyve user on 04 May 2020 GlyMDB 2439 either the fluorescent-tagged GBP or secondary reagent is measured, which illuminates the binding specificity and indirectly the affinity of the GBP toward various glycan structures (Heimburg-Molinaro et al., 2011). The Consortium for Functional Glycomics (CFG) has greatly expanded the availability of glycan array data by making the results of glycan array experiments publicly available. Though these data provide a rich insight into the specificities of GBPs, data inter- pretation remains challenging, particularly when the proteins have more than one binding motif or the glycans have subunits that block protein–glycan binding. Manual interpretation of complicated gly- can array data is tedious and error-prone, and hence automated methods need to be developed to solve these problems in an efficient and robust manner. There are a few web resources for glycan microarray database and related bioinformatics tools. Glycosciences Laboratory (https:// glycosciences.med.ic.ac.uk/glycanLibraryIndex.html), the National Center for Functional Glycomics (https://ncfg.hms.harvard.edu/ ncfg-data) and the CFG (http://functionalglycomics.org/) provide ac- cess to the microarray data from the experiments that they have per- formed. However, they do not provide analysis for protein–glycan interactions in terms of three-dimensional structures. GlycoPattern (Agravat et al., 2014) provides a set of tools supporting the analysis of glycan microarray data. While highly useful in identifying glycan determinants bound by GBPs, it only allows users to upload their own microarray data. 
MCAW-DB (Hosoda et al., 2018) is a glycan profile database containing the multi-sequence alignment analysis of 1081 glycan microarray samples collected from the CFG, and it only focuses on glycan sequence alignment. In addition, there are several reported tools, such as GlycoViewer (Joshi et al., 2010) and GLAD (Mehta and Cummings, 2019) for microarray data visualization and mining, and GlycanMotifMiner (Cholleti et al., 2012) and GlycoSearch (Kletter et al., 2015) for glycan pattern discovery.

In this work, we present the Glycan Microarray DataBase (GlyMDB), a database of glycan microarray samples with a toolset for data analysis. The GlyMDB provides a user-friendly web interface that enables users to search from the database or upload their own microarray samples for data visualization, binder/non-binder classification, binding motif discovery and data comparison of two different samples. When protein sequence information is available, either included in the queried samples or provided by users when uploading data, BLAST (Camacho et al., 2009) search is performed to find relevant PDB entries based on the protein sequence identity. We discuss the GlyMDB web interface and usage of each tool in the following sections. A stepwise guide to using the GlyMDB is also available at glycanstructure.org/glymdb/howto.

Fig. 1. The GlyMDB search interface. (A) Upload glycan microarray data. Shown above, a spreadsheet named ‘my_microarray.xls’ is selected and protein sequence information is written in the textbox. (B) Search database. Users can select to make a query by the name (e.g. ‘DC-SIGN’), protein sequence or PDB ID (e.g. ‘1SL5’). When searching by protein sequence or PDB ID, users need to set the threshold of sequence identity. Three options can be used to filter the results by GBP species, family and glycan array version

2 Materials and methods

2.1 Web interface

We obtained publicly available glycan array data from the CFG in the form of spreadsheet files, which are parsed automatically by Python scripts, and the extracted information is stored and managed using sqlite3. The web interface of GlyMDB is built with Django and JavaScript. There are two options on the search interface: users can (i) upload microarray data in a spreadsheet file or (ii) search microarray data stored in the database (Fig. 1). To upload their own microarray data, users need to prepare a spreadsheet file containing three columns: glycan number, structure and signal intensity. An example is shown in the Supplementary Material S1. Optionally, users can specify the protein sequence information of the uploaded microarray data (if available), and the GlyMDB provides the option to perform PDB matching in the result page. To search the microarray data stored in GlyMDB, users can make a query by protein name, sequence or PDB ID. When users select to make a query by protein name, the GlyMDB attempts to match the keywords in both the lectin sample name and the description originally provided on the CFG website. Since not all microarray samples include protein sequence information, a query is made among the samples with protein sequence available if users select to search by protein sequence or PDB ID. BLAST is used to align the queried protein sequence or the sequence associated to the queried PDB ID against the protein sequences of all stored microarray samples. By default, the threshold of sequence identity is 95%, but users can change the value based on their requirements. There are several filters available to narrow the query result by the species, family and glycan array version (Fig. 1).

When the upload or search is complete, the GlyMDB shows the result page (Figs 2 and 3). The result page starts with a list of samples (Figs 2A and 3A), which include the information extracted from the uploaded sample or the database query result. The same protein can have multiple entries in the sample list if there are data from the experiments on different versions of glycan arrays or under different protein concentrations. By selecting one entry from the sample list and one criterion for binder/non-binder classification (Fig. 2A), users can view the bar chart of fluorescence intensity and the threshold distinguishing binders from non-binders (Fig. 2B). In addition, the GlyMDB also provides users with the GLAD format input file (below the bar chart) in order to utilize the recently developed GLAD web application for the data visualization and analysis of glycan microarray data (Mehta and Cummings, 2019). The GlyMDB also shows a list of common motifs that make positive or negative contributions to binding interactions (Fig. 2C), which will be discussed in the following section. If users provide the protein sequence when uploading their own microarray data, or select a sample (stored in GlyMDB) with protein sequence available, the GlyMDB shows the option for matching the microarray data with PDB structures. BLAST search is performed to generate a list of PDB IDs with sequence identity above the selected threshold, and Glycan Reader (Jo et al., 2011, Park et al., 2017) is used to process the PDB files and to extract the information of glycan ligands (Fig. 2D). The links for structural visualization and PDB file download are provided (Fig. 2E). In addition, users are able to compare two samples by selecting two entries from the sample list, which can be two different GBPs, or different samples of the same GBP (Fig. 3A). The GlyMDB uses a heatmap to show the similarity and difference of fluorescence intensity between two selected samples (Fig. 3B).

Fig. 2. The GlyMDB result interface. (A) To view the results of binder/non-binder classification and motif discovery, users should select only one sample from the list. The PDB file matching function works only if the selected sample has protein sequence information available. (B) The bar chart of fluorescence intensity and the sorted lists of binders and non-binders. (C) Top ranking motifs that make significant contributions to protein–glycan binding. (D) The list of PDB files matched to the selected microarray sample. There are links for visualizing PDB structures, downloading PDB files from RCSB, searching PDB IDs in our glycan-binding site database and searching glycan ligands in our glycan fragment database. (E) Structural visualization of the PDB files

Fig. 3. The GlyMDB result interface for two sample comparison. (A) To compare two different microarray samples, users should select two samples from the list. (B) The heat map with one-to-one comparison of the intensity of each glycan in two selected samples. If users click a glycan Id, the glycan sequence is shown

2.2 Glycan microarray data analysis toolset

2.2.1 Binder/non-binder classification

To split the glycan array into binders and non-binders, users need to select a threshold of fluorescence intensity. In GlyMDB, there are two options: P-value of z-score, and percentage of maximum intensity. A z-score is a measure of how many standard deviations a data point is above or below the mean of the population. It is used as a statistical test for the significance of a sample with the null hypothesis that a randomly selected sample is a non-binder. The threshold is set to a P-value converted from a z-score. Though the default value is 0.15, which is the same as the one used in GLYMMR (Cholleti et al., 2012), users can choose any number between 0 and 1. As the second option, users can use a certain percentage of the highest fluorescence intensity as the threshold. The default value is 10%, which means that a glycan is classified to be a binder if the fluorescence intensity is >10% of the maximum intensity observed on the glycan array.

2.2.2 Glycan-binding motif discovery

Though glycan array data can illuminate the binding affinity of GBPs toward a variety of defined glycans on the array, additional data interpretation is necessary to identify the glycan structure determinants for GBP specificities. To discover the motifs that make significant contributions to protein–glycan binding, the GlyMDB first breaks each glycan into fragments and then takes advantage of the support vector machine (SVM) algorithm to select the fragments with highest importance. We chose to use the SVM with a linear kernel rather than other kernels because it outputs the weights assigned to input features (i.e. glycan fragments), and we use these weights as the measurement for the importance of each glycan fragment. As shown in Figure 4A, a given glycan sequence is fragmented by enumerating all connected substructures (Jo and Im, 2013). After that, we obtain a set of unique fragments, each of which is present in at least one glycan. For example, the CFG version 5.1 array has 610 glycans in total. For the same glycan attached to different spacer arms, we only keep the one with the highest fluorescence intensity. Consequently, there are 541 unique glycans and we have 14 973 unique fragments. For each of 541 glycans, the GlyMDB generates a fingerprint showing whether the glycan contains each of 14 973 fragments or not. The fingerprints are combined into a 541 × 14 973 binary matrix. Meanwhile, a binary array with a length of 541 is generated to indicate whether each glycan is a binder or not (Fig. 4B). They are the input for training an SVM classifier, and we use the SVM module in the Scikit-learn package, which is a free machine learning library for Python. After training the SVM classifier, we make a record of the weights assigned to each glycan fragment. The fragments with positive weights are considered to be the motifs that contribute to protein–glycan binding and those with negative weights are considered to be the motifs that prevent protein–glycan binding. Since the number of features (i.e. fragments) is much greater than the number of training samples, recursive feature elimination (Guyon et al., 2002) is used to recursively reduce the number of features (e.g. from 14 973 to 256 features for the CFG version 5.1 array). The final SVM classifier is trained on the remaining 256 features, and fragments with positive and negative weights are ranked separately. More details are given in the Supplementary Material S2. The GlyMDB ignores a fragment if it is a substructure of a larger fragment and the sub-fragment’s weight is less than or equal to the super-fragment’s weight. In addition, the top-ranked fragments with positive (or negative) weights are ignored if they are present in <5 or 1/3 of the total number of binders (or non-binders), whichever is less. In the final output, the GlyMDB displays up to five top-ranked positive and negative weight fragments (if any).

Fig. 4. Glycan fragment and fingerprint. (A) Glycan fragments are generated by enumerating all connected substructures of a given glycan sequence. (B) Glycan fingerprints are binary arrays indicating whether a glycan includes each fragment or not

2.2.3 Glycan array sample comparison

To investigate whether a GBP has consistent binding specificity when an experiment is performed under different protein concentrations or using different glycan arrays, or to investigate whether two different lectins have similar binding specificity, one needs to compare the glycan microarray data of two different samples. The GlyMDB allows users to compare the microarray data of two different samples, and a heat map based on the fluorescence intensity of each glycan is made for one-to-one comparison (Fig. 3). The heat map is generated independently for two selected samples. For each sample, red corresponds to the maximum intensity and white corresponds to the intensity lower than the threshold for binder/non-binder classification. Thus, only binders are highlighted with red color and all non-binders are shown in white.

2.2.4 Cross-linking glycan microarray to PDB

Though microarray data contain substantial information about the specificity of GBPs, they do not provide three-dimensional structural information, such as the binding site on a protein for a glycan. In contrast, PDB files contain such general structural information, yet they are not enough to elucidate the specificity of proteins, since the glycan ligands in PDB structures are generally limited in length and variety. To bridge the gap between microarray data and PDB structures, the GlyMDB attempts to find all relevant PDB structures for each given microarray sample.

When users select or upload a microarray sample with protein sequence information available, we first use BLAST to query this sequence against all PDB sequences and record the PDB IDs if the sequence identities are above the user-specified threshold. Glycan Reader is used to automatically detect and annotate glycan units, glycosidic linkages and chemical modifications. By default, the list of PDB IDs is sorted by the length of the largest glycan ligand contained in each PDB structure, and users can filter the list by glycan ligand length. The PDB list can also be sorted in the order of sequence identity and PDB resolution. In addition, the links are provided for downloading PDB files from RCSB, searching PDB IDs in our glycan-binding site database and searching glycan ligands in our glycan fragment database (Jo and Im, 2013). The PDB search and visualization are illustrated with Aspergillus fumigatus lectin that shows specificity for glycans with terminal fucose, particularly the glycans with terminal Fucα1-2Galβ1-4(Fucα1-3)GlcNAc substructure. After PDB search, we found four PDB entries: 4D4U, 4AH4, 4AGT and 4AHA. As shown in Figure 5, 4D4U is a dimer containing multiple glycan-binding sites, and it can bind Fucα1-2Galβ1-4(Fucα1-3)GlcNAc with two different binding poses.

Fig. 5. PDB search and visualization. PDB ID 4D4U is matched to the microarray data of Aspergillus fumigatus lectin that shows specificity to glycans with terminal fucose. The PDB structure shows that the dimer in 4D4U contains multiple binding sites and binds Fucα1-2Galβ1-4(Fucα1-3)GlcNAc with different binding poses

3 Results and discussion

As of June 2019, the GlyMDB contains 5203 glycan microarray samples collected from the CFG. Multiple experimental data of the same GBP on different glycan arrays (from version 1.0 to 5.2) or under different concentrations are counted as multiple samples. Among 5203 microarray samples, 1849 have protein sequence information available (Fig. 6A). We performed BLAST search against all protein sequences from PDB protein structures (as of June 2019) with sequence identity >95%, and the numbers of matched PDB entries are shown in Figure 6B. We extracted the glycan information from each PDB file and the numbers of PDB entries containing glycan ligands are also shown in Figure 6B. Since multiple microarray samples can have the same protein sequence that is matched to the same PDB file, we removed redundancy, and there are 1965 unique PDB entries. A total of 771 out of 1965 PDB entries contain glycan ligands, and the length distribution of the largest glycan ligand in each PDB file is shown in Figure 6C.

Fig. 6. Statistics of GlyMDB and related PDB files. (A) Numbers of microarray samples and numbers of microarray samples with protein sequence information available, grouped by CFG glycan array versions. (B) Numbers of PDB structures and numbers of PDB structures containing glycan ligands, grouped by CFG glycan array versions. The sequence identity of the PDB structure(s) to the corresponding microarray sample is >95%. (C) Length distribution of the largest glycan ligand in each PDB file. Numbers of unique proteins are calculated by removing multiple PDB entries corresponding to the same protein

Multivalency is common in protein–glycan interactions. In these cases, the glycan-binding sites occur between protein monomers instead of within one monomer. However, it is difficult, merely from the sequence, to identify whether the binding interaction is multivalent, which is one of the reasons that we wished to bridge microarray data to PDB structures. BLAST protein searches can find available multimeric protein structures even if the queried protein sequence only covers one monomer. With the multimeric structures, one can perform further investigations to locate the glycan-binding sites and deduce how the protein interacts with the glycan ligand.

In the current release of GlyMDB, we have some requirements on the format of user-uploaded microarray data. The glycan sequence should be represented by the text nomenclature recommended by CFG (http://www.functionalglycomics.org/static/consortium/Nomenclature.shtml). To make it easier for users to specify glycan structures, we will support more glycan sequence formats and glycan accession numbers, such as GlyTouCan ID (Tiemeyer et al., 2017), in the future release of GlyMDB. For structural visualization, we utilize NGL viewer (Rose et al., 2018), which is a web application for macromolecular structure visualization, and it is also the 3D structure viewer embedded in the RCSB Protein Data Bank website. NGL provides a comprehensive set of molecular representations and allows users to highlight and focus on each selected ligand. We plan to embed other 3D structure viewers, such as LiteMol (Sehnal et al., 2017), into our website, so that users can take advantage of the features in different structure viewers and choose the one that satisfies their requirements.

4 Summary

We have described the development and usage of GlyMDB, which is an integrated platform for database query, user upload and data/structure analysis. It can assist glycoscientists in searching for publicly available microarray data and processing their own data. In addition to the functional features of binder/non-binder classification, glycan-binding motif discovery and glycan array sample comparison, the GlyMDB is the first tool to cross-link microarray samples to PDB structures, and this can supplement the structural information that is not included in microarray data. These tools are expected to be useful in investigating the specificity of protein–glycan interactions in both sequence and structural levels. The database will be updated quarterly and is freely available at http://www.glycanstructure.org.

Funding

This work was supported by the National Science Foundation [DBI-1707207 to W.I.]; and National Institutes of Health Grants [P41GM103694 to R.D.C., U01GM125267 to R.D.C., Will York].

Conflict of Interest: none declared.

References

Agravat,S.B. et al. (2014) GlycoPattern: a web platform for glycan array mining. Bioinformatics, 30, 3417–3418.
Apweiler,R. et al. (1999) On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim. Biophys. Acta, 1473, 4–8.
Camacho,C. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
Cholleti,S.R. et al. (2012) Automated motif discovery from glycan array data. OMICS, 16, 497–512.
Guyon,I. et al. (2002) Gene selection for cancer classification using support vector machines. Mach. Learn., 46, 389–422.
Heimburg-Molinaro,J. et al. (2011) Preparation and analysis of glycan microarrays. Curr. Protoc. Protein Sci., 12, 10.
Hosoda,M. et al. (2018) MCAW-DB: a glycan profile database capturing the ambiguity of glycan recognition patterns. Carbohydr. Res., 464, 44–56.
Jo,S. and Im,W. (2013) Glycan fragment database: a database of PDB-based glycan 3D structures. Nucleic Acids Res., 41, D470–D474.
Jo,S. et al. (2011) Glycan Reader: automated sugar identification and simulation preparation for carbohydrates and glycoproteins. J. Comput. Chem., 32, 3135–3141.
Joshi,H.J. et al. (2010) GlycoViewer: a tool for visual summary and comparative analysis of the glycome. Nucleic Acids Res., 38, W667–W670.
Kletter,D. et al. (2015) Exploring the specificities of glycan-binding proteins using glycan array data and the GlycoSearch software. Methods Mol. Biol., 1273, 203–214.
Mehta,A.Y. and Cummings,R.D. (2019) GLAD: GLycan Array Dashboard, a visual analytics tool for glycan microarrays. Bioinformatics, 35, 3536–3537.
Park,S.J. et al. (2017) Glycan Reader is improved to recognize most sugar types and chemical modifications in the Protein Data Bank. Bioinformatics, 33, 3051–3057.
Rillahan,C.D. and Paulson,J.C. (2011) Glycan microarrays for decoding the glycome. Annu. Rev. Biochem., 80, 797–823.
Rose,A.S. et al. (2018) NGL viewer: web-based molecular graphics for large complexes. Bioinformatics, 34, 3755–3758.
Sehnal,D. et al. (2017) LiteMol suite: interactive web-based visualization of large-scale macromolecular structure data. Nat. Methods, 14, 1121–1122.
Tiemeyer,M. et al. (2017) GlyTouCan: an accessible glycan structure repository. Glycobiology, 27, 915–919.
Varki,A. (2017) Biological roles of glycans. Glycobiology, 27, 3–49.
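
The two thresholding options of Section 2.2.1 (P-value of z-score, and percentage of maximum intensity) can be illustrated with a minimal sketch. It is not the GlyMDB implementation: the function names are hypothetical, and it assumes a one-sided upper-tail P-value computed from the mean and population standard deviation of all intensities on the array, which the text does not spell out.

```python
from statistics import NormalDist, mean, pstdev


def classify_by_pvalue(intensities, p_threshold=0.15):
    """Flag binders (True) whose upper-tail P-value is below p_threshold.

    Assumption: z-scores use the mean and population SD of all intensities,
    and the P-value is one-sided (only unusually high signals count).
    """
    mu = mean(intensities)
    sigma = pstdev(intensities)
    if sigma == 0:
        # All signals identical: nothing stands out from the background.
        return [False] * len(intensities)
    std_normal = NormalDist()
    flags = []
    for x in intensities:
        z = (x - mu) / sigma
        p_value = 1.0 - std_normal.cdf(z)
        flags.append(p_value < p_threshold)
    return flags


def classify_by_percent_of_max(intensities, percent=10.0):
    """Flag binders (True) whose intensity exceeds percent% of the maximum."""
    cutoff = max(intensities) * percent / 100.0
    return [x > cutoff for x in intensities]


# Made-up fluorescence intensities for eight glycans (illustrative only).
signals = [120.0, 95.0, 15000.0, 300.0, 22000.0, 80.0, 1800.0, 60.0]
by_pvalue = classify_by_pvalue(signals)          # default P-value 0.15
by_percent = classify_by_percent_of_max(signals)  # default 10% of maximum
```

On this toy array both criteria flag only the two strongest signals (15000 and 22000); on real data the two options can disagree, which is why the choice is left to the user.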
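
The fragment and fingerprint construction of Section 2.2.2 (Fig. 4) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: a glycan is modelled as a rooted tree of `(residue, [(linkage, child), ...])` tuples, fragments are encoded as ad hoc strings, the linkage from a fragment's root to its parent is dropped, and all function names are hypothetical. The fragment definition actually used by GlyMDB (Jo and Im, 2013) may differ in these details.

```python
from itertools import product


def rooted_fragments(node):
    """Return all connected fragments that have `node` as their root."""
    name, children = node
    options = []
    for linkage, child in children:
        # Each child branch is either omitted (None) or attached with any
        # connected fragment rooted at that child.
        attached = [f"{linkage}:{frag}" for frag in sorted(rooted_fragments(child))]
        options.append([None] + attached)
    fragments = set()
    for combo in product(*options):
        kept = sorted(c for c in combo if c is not None)
        fragments.add(name + (f"({','.join(kept)})" if kept else ""))
    return fragments


def all_fragments(node):
    """Return every connected substructure of the glycan rooted at `node`."""
    fragments = set(rooted_fragments(node))
    for _linkage, child in node[1]:
        fragments |= all_fragments(child)
    return fragments


def fingerprints(glycans):
    """Build the fragment vocabulary and one binary fingerprint per glycan."""
    per_glycan = [all_fragments(g) for g in glycans]
    vocabulary = sorted(set().union(*per_glycan))
    matrix = [[1 if frag in frags else 0 for frag in vocabulary]
              for frags in per_glycan]
    return vocabulary, matrix


# Fuca1-2Galb1-4(Fuca1-3)GlcNAc and Galb1-4GlcNAc as toy inputs.
tetrasaccharide = ("GlcNAc", [("b1-4", ("Gal", [("a1-2", ("Fuc", []))])),
                              ("a1-3", ("Fuc", []))])
disaccharide = ("GlcNAc", [("b1-4", ("Gal", []))])
vocabulary, matrix = fingerprints([tetrasaccharide, disaccharide])
```

Each row of `matrix` corresponds to one glycan and each column to one fragment in `vocabulary`. Together with a binder/non-binder label per row, a matrix of this shape is the kind of input the text describes for the linear SVM and recursive feature elimination steps.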

Journal

Bioinformatics, Oxford University Press

Published: Dec 16, 2019
