The goal of the RBVI is to advance the frontiers of biomedical research while developing new applications of immediate use to scientists. Accordingly, technology research and development is a major activity at our Center. However, the development of new biomedical research tools is most effective when pursued in the context of specific challenging problems. Thus we have identified a set of "driving biomedical research projects" (DBPs - listed below) that we are pursuing in collaboration with outside investigators. In addition, we endeavor to make our advanced software tools easily accessible to others. One approach to this is to actively engage the greater research community to collaborate, and thus we have identified several "collaboration and service projects" below as well. Lastly, we recognize that biomedical research is inherently dynamic, and although we do our best to keep these project descriptions up to date, it is unavoidable that some become stale or obsolete.
Technology Research and Development Projects
TR&D 1: The Chimera Molecular Modeling Environment
UCSF Chimera is a versatile molecular visualization and analysis system that is freely available to the academic community, government, and public and has been cited in more than 7,700 published scientific papers. Chimera has been used for visualizing molecular models (ranging in size from small molecules to virus capsids), molecular ensembles, and volumetric data. In the past five years, we have distributed Chimera extensively throughout the world and have added many new and requested features. In TR&D 1, we plan to enhance Chimera further through addition of new visualization and analysis functionality, including several new web services, a web-based version that takes advantage of new browser technology, improved ease of use, and increases in performance. Note that our previous project "Tools for Integrated Sequence/Structure Analysis" has been folded into the overall Chimera project, but work continues on adding functionality to Chimera's sequence/structure capabilities. Impact of this Research: By providing a single extensible platform for applying analysis techniques and visually examining results in a facile manner, Chimera aids scientists in understanding their data and generating new hypotheses. Furthermore, our planned enhancements will be useful in creating images and animations that can more effectively communicate data, results, and hypotheses to collaborators and the scientific community.
TR&D 2: Software for Interactive Analysis of Large Molecular Assemblies
This project continues our successful development of software for analyzing large molecular assemblies such as viruses, ribosomes, and chromosomes. The endeavor to solve and understand structures of large molecular assemblies has undergone phenomenal growth in recent years, especially by the cutting-edge collaborations between crystallography and electron microscopy. There are two equally important aspects of this challenging problem, both now handled very well by the Chimera system: first, the research issues of modeling and analyzing the data to obtain and interpret these complex structures; and second, the production of visualizations that can effectively communicate the results to both technical and general audiences. In this project we will develop new tools within Chimera that support analysis of quaternary levels of structure, serving researchers using the techniques of electron cryo-microscopy (cryoEM) single particle reconstruction and tomography, x-ray crystallography, and high-resolution three-dimensional light microscopy. The proposed new capabilities will enable making animations illustrating the architecture and function of molecular assemblies, interpreting density maps covering a wide range of resolutions, modeling of structurally heterogeneous assemblies imaged by tomography and light microscopy, and analysis of atomic models of assemblies composed of tens to thousands of macromolecules. Impact of this Research: Chimera has become the leading package for cryoEM researchers analyzing density maps and atomic models of large molecular assemblies. It is used by hundreds of such labs, and is the primary software for visualization and analysis used by archivists at the public data repositories (PDB, EM Data Bank, VIPER Database). Additional details for this project are available here.
TR&D 3: The Structure-Function Linkage Database
The prediction of protein function from structure or sequence data remains a problem best addressed by leveraging information available from previously determined structure-function relationships. The Structure-Function Linkage Database (SFLD) is a resource designed to address difficult problems in linking sequence and structural information for divergent enzymes to their associated functions. The core problem addressed by the SFLD is functional inference in mechanistically diverse enzyme superfamilies, which are difficult to annotate because their sequences, structures and conserved active sites are similar yet their overall functions, substrates and products vary widely. Our approach is based upon the principles of chemistry-constrained evolution, which shows that some step or characteristic of chemical activity tends to be the conserved functional element retained in enzyme evolution rather than the ability to bind a specific substrate or catalyze an overall chemical transformation. This principle has allowed for the organization of many enzymes into their appropriate superfamilies and families and elaboration of principles for enzyme annotation consistent with the aspects of function conserved at each of these levels. The resulting framework provides different levels of granularity at which function can be assigned and thereby is less prone to over-prediction of functional characteristics than are representations based only on overall reaction. Impact of this Research: TR&D 3 provides a formalized and rules-based approach for assigning function to sequences and structures that are less than 30% identical to characterized enzymes, for correcting misannotations prevalent in public databases, and for accessing information about similarities in ligand structures, operon context, and mechanism for applications in protein engineering and methods development for sequence and structure analysis.
The formal ontology we developed to describe functional characteristics in a structurally relevant context has provided the means to extend these principles to other types of proteins.
TR&D 4: Integrated Analysis and Visualization of Biological Context
To address emerging opportunities for biological inquiry at the systems level, TR&D 4 aims to create visualization and analysis tools that will allow users to view biological systems at multiple levels of granularity and move in a seamless way across them. The goal is to provide biologist users with improved methods for managing large amounts of data derived from high throughput experiments or computation, viewing that data in the broad context provided by network representations, and simple ways to focus on more detailed parts of a network by clicking on the constituent nodes or edges (or segments of the network). Consistent with our core mission to enable research focused on the sequence-structure-function paradigm, the tools we have previously developed and that we propose here will integrate, for example, viewing of systems-level relationships among sequences, structures, or function provided as network representations, with the ability to “drill down” to view subsets of those relationships, including, for example, the individual component structures of protein molecules or the structural elements of a virus. Conversely, a user could begin with a view of a single structure or sequence alignment, then expand to view that information in the context of other related structures or sequence alignments, or to an even larger context representing all the superfamilies in a fold class, viewed as a network. This project integrates well with our other TR&D projects, especially the SFLD, and together they provide the foundation for viewing data from their specific domain areas at multiple levels of granularity. In many cases, these efforts will result in enhancements to Cytoscape or the development of new Cytoscape apps.
Impact of this Research: Although network-level representations for systems analysis have rapidly grown in sophistication over the last few years, applications that provide user-friendly tools tailored specifically for managing and relating sequence and structural information and associating it with function do not currently exist in one broadly useful package. The tools we have implemented as part of this project have extended the offerings of the RBVI to now include the broader context required by systems-level research. In addition, we have provided the growing developer community in systems biology with specialized capabilities in handling and representing sequence, structure, and functional information.
Driving Biomedical Projects
DBP 1: Modeling Macromolecular Assemblies (PI: Andrej Sali)
The primary aim of the Integrative Modeling Platform (IMP) is to design and implement a software platform for the modeling of macromolecular assemblies that will support an integrative approach for interpreting a large variety of low-resolution experimental and computational information. Complete lists of macromolecular components of biological systems are increasingly becoming available. We now need to explain the functionality of the system as a whole in terms of the properties of its components and interactions between them (Figure 1). To do so, a comprehensive characterization of the structures and dynamics of macromolecular assemblies is essential.
UCSF Chimera is an ideal complement to IMP. Where IMP will provide the computational toolkit for constructing plausible models from restraints, Chimera will provide the visualization capabilities that are needed for end users in analyzing the modeling results and developers in improving the modeling process. End users will want to view the ensemble of models generated from the restraints, as well as how well restraints are satisfied by various combinations of models. Developers will want to examine intermediate states in the optimization process to see whether algorithms are behaving in expected patterns. Both these tasks may be greatly facilitated by the implementation of Chimera tools that:
- Graphically render spatial results such as molecular structures and their relative orientations.
- Display spatial restraints such as NMR distance constraints in the same view.
- Display how well ambiguous restraints are satisfied. (An example of an “ambiguous restraint” is an immunoprecipitation experiment that indicates which protein types but not which instances interact with each other, resulting in an ambiguity when there is more than one copy of a protein per assembly.)
- Interact with the IMP computation process to extract and display intermediate results, which can greatly aid developers in understanding how well or poorly an algorithm is performing.
The combination of IMP and Chimera tools will make restraint optimization a viable technology for determining the structure of macromolecular assemblies, both for researchers to apply to systems of interest and to develop new methods.
There are two main areas of development for IMP-Chimera linkage: creation of a shared data format suitable for both computation and visualization, and the development of tools for invoking and monitoring IMP tools, including visualization of the resulting ensembles of models and associated restraints.
Shared Data Format
There is no standard file format capable of storing ensembles of large molecular assemblies along with the markup and extra information necessary to analyze the ensemble. To address this shortfall, we are developing the RMF (Rich Molecular Format) file format. Currently, the file format uses the Avro library for data serialization and supports:
- coarse-grained and multi-scale representation of molecules
- multiple conformations of each molecule
- spheres, cylinders, surfaces and other geometry
- association of properties, such as scores, with parts of the molecules
- support for very large data sets
Integration of RMF into Chimera has the potential to improve the modeling process in several ways. Most basically, by providing a format that is rich enough to store the data IMP needs and that can be read by Chimera, the amount of conversion between data formats, along with the corresponding loss of dexterity and management difficulties, is reduced. More interestingly, having such a format will increase the amount of data accessible to Chimera, allowing for richer displays and better control of how the data is displayed (e.g., relationships between particles making up the coarse-grained molecules are captured in the file, so Chimera can take advantage of them in the same way it can take advantage of assumed relationships between atoms in a PDB file to provide cartoon, surface and other viewing modes). Finally, the file format provides a conduit for data from Chimera back to IMP, creating the possibility of simple interactive editing of the molecules, sampling spaces and other aspects of the modeling process.
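To make the listed capabilities more concrete, the following is a minimal in-memory sketch of the kind of information such a format carries: a hierarchy of coarse-grained particles, one coordinate per conformation, and named properties such as scores. It is an illustration only and is not the RMF library or its API; the `Node` class, its fields, and `frame_coordinates` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


@dataclass
class Node:
    """Hypothetical node in a hierarchical, multi-scale description of an assembly."""
    name: str
    radius: float = 0.0                                   # bead radius for coarse-grained display
    frames: List[Coord] = field(default_factory=list)     # one coordinate per conformation
    properties: Dict[str, float] = field(default_factory=dict)  # e.g. per-part scores
    children: List["Node"] = field(default_factory=list)  # finer-scale parts, if any

    def leaves(self) -> List["Node"]:
        """Return the finest-scale particles below this node, in order."""
        if not self.children:
            return [self]
        found: List["Node"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def frame_coordinates(root: Node, frame: int) -> Dict[str, Coord]:
    """Collect the coordinates of every leaf particle for one conformation."""
    return {leaf.name: leaf.frames[frame] for leaf in root.leaves()}


# Made-up example: one assembly, two coarse-grained subunits, two conformations each.
assembly = Node(
    "assembly",
    children=[
        Node("subunit_A", radius=10.0, frames=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             properties={"score": 0.9}),
        Node("subunit_B", radius=12.0, frames=[(25.0, 0.0, 0.0), (24.0, 1.0, 0.0)],
             properties={"score": 0.7}),
    ],
)
second_conformation = frame_coordinates(assembly, 1)
```

Because the parent-child relationships are explicit, a viewer could choose to draw either the whole assembly or its individual beads, which is the kind of display control described above.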
Invocation and Visualization Tools
There are a number of challenges that we encounter when applying integrative modeling tools:
- Modeling tools may require a variety of input data, such as restraints derived from a variety of experiments. The conversion of experimental data into modeling restraints ranges from simple, e.g., NMR distance constraints between atoms, to very complex, e.g., protein proximity inferred from protein interaction networks.
- The sampling-based modeling approach typically results in a large ensemble of structures that fit the data sufficiently well. Each of these models consists of structural data at multiple scales, each of which may have associated restraint satisfaction criteria. Visualization of these results is sometimes difficult. For example, structures are often not at high enough resolution to provide coordinates for each atom, so they must be approximated with simple geometric primitives such as spheres. The types and number of restraints are typically large, so they must be displayed selectively. The relationship between structures and restraints may be difficult to display graphically, e.g., ambiguous restraints as described above.
- Comparison of models of large ensembles requires capabilities for (a) clustering the models based on user-defined distance metrics, and (b) displaying the structure variability and similarity within and between clusters. Practically, large ensembles also place great demands on storage capacity and computational and graphics performance.
- Modeling runs often take a long time due to the large solution space. During execution, the ability to monitor intermediate results is important (a) to the end user who will want to know how far a calculation has progressed, and (b) to the developer who will want to know whether the tool is performing properly. For example, one of the components of IMP is MultiFit, the first computational method for simultaneous fitting of protein structures into a cryo-EM density map of their assembly. MultiFit combines computational fitting and computational docking techniques with an inferential optimization framework for fast and guided sampling of many possible configurations. During execution, an end user would want to know how many configurations will be sampled, as well as how many have been completed, as an estimate for the required computation time. A developer would want to know whether configuration selection and evaluation are working as expected without waiting until the calculation is complete, particularly if the job will run for a long time.
To address these challenges, we will collaboratively construct front end tools for IMP applications that:
- Simplify invocation by providing a graphical interface for specifying the required input data files and parameters. For example, Chimera already supports a basic interface that enables users to run MultiFit by selecting displayed atomic models and density maps as input data. We will add support for more complex input data from Chimera. The Chimera-Cytoscape linkage, in particular, will be of great use in combining multiple data types. For example, to expand the MultiFit tool, we can add functionality that (a) derives proximity restraints from a protein interaction network displayed in Cytoscape, (b) enables user specification of more optimization parameters, and (c) invokes MultiFit.
- Monitor calculation progress. Typically, applications produce and store intermediate results in forms that require minimal development effort. To simplify progress monitoring, we will evaluate standardizing the format for intermediate results. We can then build libraries for accessing standard intermediate results, whether the application is invoked via web service or locally.
- Display intermediate and final results. By exchanging data with IMP applications using RMF files, Chimera will have access to organizational information such as the relationship between restraints, atomic structures and coarse-grained geometry. Using this information, we will construct interfaces for visualizing structural aspects of models (either as atomic structures, if available, or as coarse-grained geometry) and selectively displaying restraint data associated with individual or groups of components. For ensemble analysis, Chimera already has some tools for clustering and displaying ensembles, and these may be expanded for basic clustering analysis; for more complex analysis, using a combination of Cytoscape’s clusterMaker (for clustering) and structureViz (for displaying structures in Chimera) tools will provide a good solution.
By cooperatively developing Chimera tools that complement IMP applications, we will provide the scientific community with easy access to integrative modeling methods for determination of macromolecular assemblies, both for research and development.
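The ensemble comparison described above, clustering models with a user-defined distance metric, can be illustrated with a short sketch. It assumes each model is a list of bead coordinates in a shared order and uses a simple greedy "leader" clustering with an unsuperposed RMSD; neither is claimed to be what Chimera, IMP, or clusterMaker actually implements, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Model = Sequence[Tuple[float, float, float]]  # bead positions, same order in every model


def rmsd(a: Model, b: Model) -> float:
    """Root-mean-square deviation between matching beads (no superposition)."""
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def cluster(models: Sequence[Model],
            distance: Callable[[Model, Model], float],
            cutoff: float) -> List[List[int]]:
    """Greedy leader clustering.

    Each model joins the first cluster whose representative (its first member)
    lies within `cutoff`; otherwise it starts a new cluster. Returns lists of
    model indices.
    """
    clusters: List[List[int]] = []
    for index, model in enumerate(models):
        for members in clusters:
            if distance(models[members[0]], model) <= cutoff:
                members.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Made-up two-bead models: the first two are nearly identical, the third is far away.
models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
]
groups = cluster(models, rmsd, cutoff=1.0)  # [[0, 1], [2]]
```

Passing the metric as a function keeps the clustering step independent of how similarity is defined, so a restraint-satisfaction score or a contact-based distance could be substituted for `rmsd` without changing `cluster`.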
DBP 2: Incorporation of SFLD into the InterPro resource of the EMBL-EBI (PIs: Rob Finn & Alex Bateman)
Significance
The genome projects represent a transformative achievement of our generation, revolutionizing the way we frame and pursue biological inquiry and its application to human health. The enormous volume and density of genomic information requires new ways of thinking, along with new approaches to access and exploit it. Fundamental to achieving these goals is our ability to determine the protein functions associated with genomic data. As of March 2015, there are over 90 million protein sequences in the UniProtKB/TrEMBL database, the primary worldwide repository (along with Genbank) of such information. As this number continues to increase, the identification of the functions of these proteins falls further and further behind, and even using very high throughput methods, a vanishingly small proportion of these proteins will ever be experimentally characterized. Thus, large-scale computational methods are required to infer their molecular and systems level functions and interpret their roles in health and disease.
InterPro is a primary resource of the European Bioinformatics Institute and provides protein family data to the UniProt Consortium, a major data resource delivering protein function information to a worldwide scientific community (200,000 visitors and 50 million searches per month). The goal of InterPro is to enable functional inference of proteins by using the signatures (predictive models) provided by the different member databases, for classifying proteins into families and predicting many of their functionally relevant properties. There are currently 11 InterPro member databases, which include the Prosite database of protein families, Pfam (probably the widest available collection of multiple sequence alignments and hidden Markov Models (HMMs) describing protein domains), TIGRFAMs, the Superfamily library of profile HMMs representing all proteins of known structure, and the CATH-Gene3D database.
The Structure–Function Linkage Database (SFLD, http://sfld.rbvi.ucsf.edu/) is a core technological research and development project of the RBVI. It links evolutionarily related sequences and structures of functionally diverse enzyme superfamilies to their chemical reactions [Akiva, 2014, 24271399]. SFLD differs from other major resources for protein annotation such as InterPro and UniProtKB in its focus on the annotation of conserved molecular functional features that define reaction families in many different functionally diverse enzyme superfamilies. The SFLD hierarchy identifies the members of homologous superfamilies using a combination of methods, then classifies the members of each superfamily into multiple levels of subgroups as needed, and into families.
Subgroups distinguish sequence and structural properties and are comprised both of unknowns and families of known reaction specificity. Families are defined as sets of proteins that are predicted to catalyze the same reaction in the same way, i.e., using similar mechanisms and structural determinants of reaction specificity. The SFLD hierarchy is illustrated in Figure 1 using the Enolase superfamily as a model.
The goal of this DBP is to incorporate SFLD data into the InterPro database.
Approach
Preliminary work already initiated between InterPro and SFLD indicates that the annotations provided by SFLD will enhance InterPro functional assignments in ways that are not currently offered by other member databases, especially with respect to the detail at which family reaction specificities are defined and mapped to a hierarchical classification of superfamily members based on their sequence and structural features.
Figure 2 illustrates the relationships among the 11 different member databases in InterPro and SFLD, following the integration process. Adopting into InterPro SFLD profile HMMs and post-processing methods should allow SFLD to be readily integrated into InterPro as a member database.
Figure 2. Schematic showing the relationship of the 11 different member databases in InterPro and SFLD. The different databases, shown on the blue arrows, annotate using different methods (dark blue boxes) and to different definitions of a protein family (grey boxes). InterPro members that are part of this proposal are shown in yellow. SFLD (red) indicates features we will add to InterPro.
We are initially concentrating on the SFLD Enolase and Radical SAM superfamilies to develop and refine our pipeline, prior to rolling out the process to all Core SFLD superfamilies. There are two main tasks that are under-way:
- Formalization of the rules for assigning enzymes to specific superfamilies, subgroups and families. We are designing and creating an interface for storing the protocols used for assigning a sequence to a specific level of our hierarchy, including the specific E-value and Bit Score cut-offs used, whether there are length restrictions or specific amino acid residue motifs. These formalizations can then be exported into annotated multiple sequence alignments (MSAs) which are ultimately passed onto the InterPro team. These annotated MSAs will form the basis of the InterPro HMMs. The new format MSAs have been designed and will soon be implemented in the SFLD.
- Assessment of the accuracy of the post-processing rules. This is a critical step in determining whether our HMMs and other post-processing rules are accurate enough to add value not only to the SFLD but also to InterPro. This involves two steps: determining the accuracy of our HMMs alone, and determining the value added by the post-processing steps. For example, in some cases where no family-specific amino acid residues have been assigned (due to lack of the necessary mechanistic or other relevant information), the post-processing steps add no further specificity. The sets of proteins for which this applies can either be revisited to see if there is any further information now available in the literature, or simply processed at the HMM plus bit score cut-off value.
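The kind of formalized assignment rule described in the first task (E-value and bit score cut-offs, optional length restrictions, and residue motifs) can be sketched as a small data structure with a check function. This is an illustrative sketch only: the class and field names are hypothetical, the cut-off values are made up, and residue positions are simplified to 1-based positions in the query sequence rather than columns of an annotated alignment.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AssignmentRule:
    """Hypothetical record of the protocol for one level of the hierarchy."""
    level: str                      # "superfamily", "subgroup", or "family"
    name: str
    max_evalue: float               # HMM hit must have an E-value at or below this
    min_bit_score: float            # and a bit score at or above this
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    required_residues: Optional[Dict[int, str]] = None  # position -> allowed residues


@dataclass
class Hit:
    """Hypothetical HMM search result for one query sequence."""
    evalue: float
    bit_score: float
    sequence: str


def passes(rule: AssignmentRule, hit: Hit) -> bool:
    """Apply the score cut-offs first, then the post-processing checks."""
    if hit.evalue > rule.max_evalue or hit.bit_score < rule.min_bit_score:
        return False
    length = len(hit.sequence)
    if rule.min_length is not None and length < rule.min_length:
        return False
    if rule.max_length is not None and length > rule.max_length:
        return False
    for position, allowed in (rule.required_residues or {}).items():
        if position > length or hit.sequence[position - 1] not in allowed:
            return False
    return True


# Made-up rule and sequences, for illustration only.
rule = AssignmentRule("family", "example_family", max_evalue=1e-20, min_bit_score=150.0,
                      min_length=5, required_residues={2: "KR", 4: "DE"})
accepted = passes(rule, Hit(1e-30, 200.0, "MKADG"))   # True: scores and motif satisfied
rejected = passes(rule, Hit(1e-30, 200.0, "MAADG"))   # False: position 2 is not K or R
```

A rule with `required_residues=None` reduces to the HMM plus bit score cut-off alone, which corresponds to the case noted above where the post-processing steps add no further specificity.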
Once these two tasks are complete, we will pass the annotated MSAs to InterPro and they will perform their own analyses on them, which will include running the HMMs against all of UniProtKB and InterPro. From this we will learn where we overlap with other InterPro member databases. Their results will also be useful in development of new automated update protocols for the SFLD.
Future studies are planned (and funding is currently being sought) to directly compare the complementary views of functional annotation between InterPro, TIGRFAMs, PANTHER, Pfam and the SFLD. Thus, a major future outcome will be the creation of an interoperable InterPro resource in which these complementary data sources work together proactively to improve consistency, accuracy, depth of functional annotation, and targeting of effort. Beyond simply indicating which particular protein families are found in which sequence data sets, this approach will assert which complexes and pathways are present, expanding InterPro’s reach from molecular functions of single genes to biological processes mediated by systems of genes. Applications range from small, targeted sequencing projects to genome annotation to large-scale metagenomics analyses.
InterPro is also interested in exploring the use of sequence similarity networks [Atkinson, 2009, 19190775] as a visualization tool for viewing similarities between proteins of known and unknown function such as those created and provided for download by the SFLD. These tools could enhance dissemination of the broader family information available from InterPro and could be particularly valuable for metagenomic sequence annotation under development by the EBI Metagenomics Portal. That effort will be considered after SFLD has been incorporated into InterPro.
DBP 3: Interactive Visualization of Intramolecular Contact Networks in Protein Structures (PI: James Fraser)
Figure 1. Non-interactive visualization of a CONTACT network in CypA using the networkx library in Python. The thickness of each edge indicates the number of CONTACT pathways within the network that include that inter-residue interaction. With an interactive Cytoscape approach, the user will be able to adjust steric-overlap and other energetic parameters and visualize how inter-residue connectivity changes in real-time, which will help build intuition about the complex, dynamic networks of atoms that underlie all protein structures.
There is an unmet need to model protein conformational heterogeneity to improve rational drug design and protein engineering. Indeed, many fundamental biological processes, such as protein folding, enzymatic catalysis, ligand binding, and allostery can require a significant degree of conformational flexibility. Statistical inference methods aid in evaluating the precision of structural distributions determined by NMR; however, these ensembles are generally underdetermined. X-ray diffraction provides detailed experimental readout of protein structure, but is subject to crystal lattice constraints. Despite lattice constraints, both discrete alternative conformations and harmonic displacements are likely present in the crystal and can be modeled using standard techniques. This project will examine the coupling between these discrete alternative conformations in a new way, by leveraging the network visualization technologies of Cytoscape and Chimera.
To meet this need, we will couple molecular modeling methods and network visualization. Modeling contacts between alternative conformations in multiconformer protein structures using the CONTACT algorithm is a recent development with massive potential for understanding the functional relevance of “dynamic close packing” in proteins. However, the networks of these contacts within proteins are complex, and methods for visualizing them are currently underdeveloped. We propose to overcome this problem by leveraging the powerful visualization capabilities of Cytoscape, which has been successfully used to study intracellular signaling networks, genetic interaction data, etc., to visualize a new type of network: steric contacts between atoms within flexible protein structures. Many of the same well-developed network topology exploration tools in Cytoscape can be applied to steric contact networks, but will gain new meaning when they are applied to van der Waals interactions between sterically overlapping atoms in mutually exclusive alternative sidechain conformations instead of, for example, genes with epistatic interactions from systems biology data. For example, an edge weight might indicate the number of all-atom contact pathways that link a pair of tightly coupled residues (Figure 1), instead of the number of genes that interact with a given gene. In line with the paradigm that looking at protein structures in new ways can lead to new insights into their architectures and functions, this work will establish a new way of looking at steric contact networks in multiconformer protein structures, and may inspire new insights into the contributions of protein dynamics to biology.
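The edge weighting described above, where an edge's weight is the number of contact pathways that include a given inter-residue interaction, can be sketched in a few lines. This is a minimal illustration under the assumption that each pathway is available as an ordered list of residue identifiers; it is not the CONTACT implementation, and the residue labels are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Edge = Tuple[str, str]


def edge_weights(pathways: Iterable[Sequence[str]]) -> Dict[Edge, int]:
    """Count, for each residue pair, how many pathways contain that contact.

    Pairs are stored in sorted order so (A, B) and (B, A) are the same edge,
    and a pair is counted at most once per pathway.
    """
    weights: Counter = Counter()
    for path in pathways:
        edges = {tuple(sorted(pair)) for pair in zip(path, path[1:])}
        weights.update(edges)
    return dict(weights)


# Placeholder residue labels, for illustration only.
pathways = [
    ["res10", "res20", "res30"],
    ["res20", "res30", "res40"],
    ["res30", "res20"],
]
weights = edge_weights(pathways)
# The res20-res30 contact appears in all three pathways, so its weight is 3.
```

The resulting dictionary maps directly onto an edge attribute in a network viewer, where it could drive edge thickness as in Figure 1.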
Improving the CONTACT algorithm
Currently, CONTACT analysis of multiconformer models is based entirely on steric repulsions between protein atoms. This is reasonable as a first approximation, since overlapping van der Waals spheres indeed imply highly unfavorable energies and thus essentially mutually exclusive pairs of local protein conformations. However, other fundamental forces such as favorable hydrogen bonds and electrostatic attractions and repulsions (Figure 2) are currently neglected. Furthermore, explicit water molecules often accompany one but not the other alternative conformation, acting as integral components of either the A or B state (Figure 2) – but current methods do not model interactions with them. To improve these aspects of CONTACT, we will incorporate favorable hydrogen bonds and electrostatic interactions in addition to unfavorable van der Waals repulsions, as they may play an equally important role in defining which conformers at adjacent residues comfortably co-exist. To account for sidechain-water couplings, we will build waters at expected hydrogen-bonding geometries relative to protein chemical groups when discrete but low-contour electron density corresponds to those positions. Importantly, in many cases these waters will bridge alternative sidechains via multiple hydrogen bonds, thereby linking apparently disconnected regions of coupled alternative networks. Together, these improvements will yield more realistic intramolecular contact networks; however, because of the greater variety of inter-atomic interaction types, these networks will be more complex -- therefore, new visualization methods will be important for understanding them.
Figure 2. A network of alternative conformations in catalase (PDB ID 1gwe) with diverse properties. Multiple phenomena define the network: van der Waals interactions (blue dots and line segments) between sidechains, a hydrogen bond (dotted green line) through a partial-occupancy water (brown), coupling through the locally mobile backbone (black), and perhaps electrostatic forces between the Lys (green) and nearby polar residues (blue: Glu, yellow: Asp, purple: Ser). This particular network is distal from the active site and is therefore putatively not critical for function.
Visualizing complex contact networks
To interpret the complex atomic-interaction networks from our upgraded CONTACT algorithm, we will take advantage of the network visualization functionality of Cytoscape. For data transfer between the two programs, network data will simply be exported from CONTACT in JSON format for easy import into Cytoscape. In Cytoscape, we will implement interactive sliders to vary the primary input parameters to CONTACT -- especially the Tstress value, which indicates the threshold van der Waals overlap for defining a steric clash. CONTACT network connectivity is particularly sensitive to this value, so having an interactive slider will give the invaluable ability to interactively visualize changes in network structure as clash sensitivity is varied. Moreover, we will compare this sensitivity in networks (1) from the current CONTACT based entirely on steric contacts vs. (2) from our new, more complex CONTACT with additional hydrogen-bond, electrostatic, and water interactions. We hypothesize that the sensitivity to clash threshold will drop in these more complex networks, which would indicate the additional interaction types have helped build a more robust model of the complex interactions inside protein structures. Although such explorations could in principle be performed in a less interactive, less graphical way, we feel that Cytoscape interactivity will vastly accelerate the process of evaluating the relative robustness of different CONTACT variants. Furthermore, interactive exploration of these networks across proteins in Cytoscape will be a powerful tool for building intuition about the general principles of energetic coupling in proteins -- and the idiosyncratic deviations in specific proteins that underlie their unique roles in biology.
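As a rough sketch of the data flow described above, the snippet below filters a list of inter-residue contacts by a clash threshold and serializes the surviving network as JSON. Several things here are assumptions rather than details from CONTACT: the input is taken to be (residue, residue, overlap) tuples, a contact is kept when its overlap is at or above the threshold, and the output follows the node/edge "elements" layout used by Cytoscape.js-style JSON. All names and values are placeholders.

```python
import json
from typing import Dict, List, Sequence, Tuple

Contact = Tuple[str, str, float]  # (residue, residue, van der Waals overlap), assumed layout


def clash_network(contacts: Sequence[Contact], threshold: float) -> Dict:
    """Keep contacts whose overlap reaches the threshold and build a JSON-ready network."""
    kept: List[Contact] = [c for c in contacts if c[2] >= threshold]
    node_ids = sorted({residue for a, b, _ in kept for residue in (a, b)})
    return {
        "elements": {
            "nodes": [{"data": {"id": node_id}} for node_id in node_ids],
            "edges": [
                {"data": {"id": f"{a}-{b}", "source": a, "target": b, "overlap": overlap}}
                for a, b, overlap in kept
            ],
        }
    }


# Placeholder contacts with made-up overlap values.
contacts = [
    ("res1", "res2", 0.5),
    ("res2", "res3", 0.2),
    ("res3", "res4", 0.9),
]

# Re-running the filter at different thresholds mimics moving a slider.
permissive = clash_network(contacts, threshold=0.4)   # keeps two contacts
strict = clash_network(contacts, threshold=0.8)       # keeps one contact
exported = json.dumps(permissive, indent=2)
```

Counting nodes and edges at each threshold gives a simple numerical view of the sensitivity discussed above, which could then be compared between the steric-only and extended variants of the network.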
DBP 4: HIV Accessory and Regulatory Complexes (PIs: Nevan Krogan & Yifan Cheng)
The HIV Accessory and Regulatory Complexes (HARC) Center aims to create a comprehensive structural picture of the interactions between human immunodeficiency virus (HIV) proteins and intracellular host molecules during the viral lifecycle. HIV has a small genome and therefore relies heavily on the host cellular machinery to replicate; HIV and human proteins act together to propagate the viral genetic information from RNA to DNA to mRNA. Also crucial in the battle between HIV and the human host is the innate immune response, with host antiviral factors acting to restrict infection, but in many cases being blocked by viral countermeasures. Identifying the macromolecular complexes involved in these processes and determining their high-resolution structures would provide new opportunities for targeted drug design against AIDS. This collaborative project involves two of the nine HARC Center principal investigators: Nevan Krogan in systems biology and Yifan Cheng in cryo-electron microscopy (cryoEM).
We have combined novel experimental and computational techniques to obtain a system-wide view of the immediate innate response to HIV1 infection. In the first systematic application of affinity tagging/purification mass spectrometry (AP-MS) to host-pathogen interactions, we identified interactions of all 18 HIV1 proteins and polyproteins in two different cell lines (HEK293 and Jurkat). A novel scoring algorithm was used to identify 497 HIV-human protein-protein interactions with high confidence. Most of the interactions had not been previously described, and we explored the biological significance of: (i) HIV protease cleaving part of the eIF3 translation initiation complex that would otherwise inhibit HIV replication; (ii) recruitment by viral Vif of a new subunit to assemble an active ubiquitin ligase that targets the restriction factor APOBEC3G for degradation. Datasets enriched for innate immunity factors and results from genome-wide RNAi screens for human proteins that affect HIV1 replication are also being assimilated into the systems view.
We are applying single-particle cryoEM to determine the structures of HIV-host macromolecular complexes. Density maps from cryoEM are combined with atomic structures to create high-resolution models of larger assemblies.
Cytoscape-related tools and techniques being developed by the RBVI are essential for our visualization and analysis of systems as networks. Progress will rely on:
- Identifying and implementing clustering methods and parameters appropriate for identifying complexes
- Simultaneous visualization and comparison of multiple datasets, such as from different conditions
- Analyzing 3D structures in concert with network views
Chimera developments in the following areas will be important for gauging the correctness of cryoEM density maps and derived atomic models based on limited-resolution data:
- Comparing density maps to data from small-angle X-ray scattering (SAXS)
- Evaluating orientations of atomic structures within a density map
- Assessing the handedness of a density map
- Identifying structural heterogeneity
Additional genetic, proteomic, and structural analyses of the host factors identified in this study will be combined with pathway information to build a functionally validated network for the cellular response to HIV. Network modules will be identified by clustering regions of the network with dense connectivity, including “fuzzy” clustering to find proteins that may participate in multiple complexes. The same proteomics pipeline as described above will be used to discern host interactions with different viruses, and methods for comparing the resulting networks (as well as host networks with and without infection, or host interactions with the same virus but under different conditions) will be investigated. We have already begun to use RBVI-developed tools that connect the network view with 3D structure analyses, with possible applications including the design of site-directed mutations to modify host-pathogen interactions.
Single-particle cryoEM involves aligning and averaging over many thousands of particles. Especially for the smaller (<300 kDa) complexes targeted by the HARC Center, low signal-to-noise ratios limit our ability to correctly judge the orientation of the different 2D views of individual particles (Figure 1), and different methods often generate different results. One check of an EM map is to compare its expected SAXS profile with an experimental SAXS profile for the same sample. Chimera can already calculate profiles from atomic models via the FoXS program, but to enable this comparison for EM maps it will be necessary to add an algorithm available from the Svergun lab.
A common measure of the quality of fit of an atomic model to an EM map is the cross-correlation coefficient (CCC) between the experimental map and one predicted from the atomic coordinates. We will investigate another approach based on Fourier shell correlation (FSC). Instead of a single number, FSC gives a 1D plot of how well the atomic model matches the experimental map as a function of Fourier space frequency. By comparing the FSC curves of alternate fits, we can assess what resolution would be needed to distinguish the fits; if the experimental resolution is insufficient to distinguish the fits, the alternate fits are compatible with the data. An advantage of the FSC curve is that (unlike the CCC), it does not depend on the resolution used to predict the map of the atomic model. FSC can also be used to evaluate fits to maps vs. their mirror images and to evaluate whether maps of opposite handedness are distinguishable given the resolution.
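To make the FSC idea concrete, here is a minimal Python sketch that computes a Fourier shell correlation curve between two small cubic maps. It is an illustrative toy, not the Chimera implementation: it uses a naive standard-library DFT so that it has no dependencies (an FFT library would be the practical choice for real maps), shells are indexed by integer grid frequency rather than by resolution in Å, and the random test map and the function names are assumptions. Simulating a map from atomic coordinates, which the comparison described above requires, is not shown.

```python
import cmath
import math
import random
from collections import defaultdict


def dft_1d(values):
    """Naive discrete Fourier transform of a short 1D sequence."""
    n = len(values)
    return [
        sum(values[x] * cmath.exp(-2j * math.pi * k * x / n) for x in range(n))
        for k in range(n)
    ]


def dft_3d(grid):
    """3D DFT of a cubic map grid[x][y][z], applied one axis at a time."""
    n = len(grid)
    data = {
        (x, y, z): complex(grid[x][y][z])
        for x in range(n) for y in range(n) for z in range(n)
    }
    for axis in range(3):
        out = {}
        for a in range(n):
            for b in range(n):
                keys = []
                for i in range(n):
                    idx = [a, b]
                    idx.insert(axis, i)  # running index goes on this axis
                    keys.append(tuple(idx))
                line = dft_1d([data[k] for k in keys])
                for k, v in zip(keys, line):
                    out[k] = v
        data = out
    return data


def fourier_shell_correlation(map_a, map_b):
    """Return one correlation value per frequency shell, from 0 to n // 2."""
    n = len(map_a)
    fa, fb = dft_3d(map_a), dft_3d(map_b)
    cross = defaultdict(float)
    power_a = defaultdict(float)
    power_b = defaultdict(float)
    for key, va in fa.items():
        # Map DFT indices to signed frequencies, then bin by rounded radius.
        signed = [k if k <= n // 2 else k - n for k in key]
        shell = int(round(math.sqrt(sum(f * f for f in signed))))
        if shell > n // 2:
            continue
        vb = fb[key]
        cross[shell] += (va * vb.conjugate()).real
        power_a[shell] += abs(va) ** 2
        power_b[shell] += abs(vb) ** 2
    curve = []
    for shell in range(n // 2 + 1):
        denom = math.sqrt(power_a[shell] * power_b[shell])
        curve.append(cross[shell] / denom if denom > 0 else 0.0)
    return curve


def random_map(n, seed):
    """Hypothetical stand-in for a density map: uniform random voxels."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(n)] for _ in range(n)] for _ in range(n)]


def mirror_x(grid):
    """Mirror image of a map along the x axis (opposite handedness)."""
    return grid[::-1]


density = random_map(6, seed=1)
same = fourier_shell_correlation(density, density)
flipped = fourier_shell_correlation(density, mirror_x(density))
```

Comparing a map with itself gives a correlation of 1 in every shell. Comparing it with its mirror image shows at which shells the two hands become distinguishable, which is the kind of per-frequency information a single CCC value cannot provide.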
Finally, we will explore the assessment of heterogeneity by fitting an ensemble of a few conformations to a map. The conformations to use can be predicted by normal mode analysis of the best-fit single configuration, and chosen to make large atomic motions coincide with areas of larger map variance. If a superposition of several conformations provides a sufficiently improved fit, it suggests that flexibility is producing a smeared density. Fitting multiple conformations will always produce some margin of improvement, and thus the degree of improvement that suggests flexibility will need to be studied.
DBP 5: Multiscale 3D Architecture of Organelles, Cells, Tissues and Microbial Communities (PI: Manfred Auer)
The Auer lab studies the architecture and function of cellular systems and communities, primarily with electron tomography (ET) and focused ion beam scanning electron microscopy (FIB/SEM). Our continuing collaboration with the RBVI aims to develop novel software to visualize and characterize these complex, morphologically diverse structures. We describe three ongoing projects:
Hair Cell Stereocilia: A Mechanosensitive Organelle
Hearing and balance rely on the proper development and maintenance of the hair bundles of inner ear hair cells. Deafness is the most common sensory impairment in humans, with approximately 1 in 1000 being born deaf and another 1 in 10 later experiencing hearing loss, mainly due to hair cell malfunction and/or degeneration. The hair bundle is an organelle located at the apical surface of a hair cell, typically composed of dozens of stereocilia arranged in rows of different heights. Development is highly organized, producing stereocilia with precise dimensions and mechanical properties that correlate with stimulus sensitivity. A hair bundle exemplifies the fragility and complexity of elaborate molecular machines, as it requires the correct assembly of extracellular, transmembrane and cytoplasmic proteins. Such marvelous machines can only be truly understood by analyzing their molecular compositions, protein-protein interactions, and 3D architecture in situ. Our state-of-the-art ET study will yield the first molecularly detailed description of the 3D organization of the hair bundle, providing a structural framework in which genetic, immunolocalization, and electrophysiological studies can be interpreted.
Breast Cancer Cells Cultured in 3D Matrigel
Our study of breast cancer cells concerns the organization of the cellular intermediate filament network. We have employed FIB/SEM and Serial Block Face Scanning Electron Microscopy (SBF/SEM) to study entire cells. These methods give lower resolution than electron tomography but can cover larger physical dimensions, with resulting data sets up to a Tbyte in size. In preliminary experiments, we have observed striking differences in the intermediate filament network when comparing a premalignant (S1) with a malignant (T4-2) cell line, and we are in the process of quantifying these findings through segmentation and geometric analysis. The ability to depict the entire cytoskeletal network within cells will be important, and while it is now possible to create such data sets, visualization and quantitative analysis will require significant advances, as proposed below.
Organization of Microbial Communities
We are interested in the multiscale organization within microbial communities, ranging from macromolecular complexes to the spatial arrangements of cells in biofilms. We have performed an extensive 3D analysis of the lignocellulose-degrading termite hindgut community, with its ~200 species identified by metagenome sequencing. Preliminary analysis has spurred Chimera developments in segmentation and in bacterial classification by diameter, shape, overall appearance, and characteristic internal features. Clustering of certain classes near the plant biomass has revealed the likely mechanism for biomass degradation, including biomass-facing complexes, vesicles, and enzymes. This preliminary work both illustrates the need for advances in visualizing very large volumes and exemplifies how the biology drove advances in analysis tools, which in turn enabled biological discoveries that otherwise would have remained unnoticed.
This collaboration will drive enhancements to analysis and modeling tools in Chimera. We foresee three fundamental areas of innovation:
- Real-time navigation through large density maps
- Abstract geometric model-building to reduce the complexity of the data
- Methods to understand the heterogeneity of functionally equivalent structures and recognize key common features
An issue with tomographic data is that typically only a small portion can be displayed at full resolution at a time, making real-time exploration difficult. The ability to steer through the data in 3D and follow features of interest in arbitrary directions would be highly desirable. The complexity of the data can be reduced by extracting features of interest and representing them as simple geometric objects instead of their detailed densities. For example, microtubules could be shown as cylinders, serving to both simplify the data and clarify its contents to viewers. Although we have routinely used IMOD and VolumeRover for certain tasks, it is the versatile visualization and interactive segmentation tools in Chimera that have enabled us to build effective models and obtain quantitative geometric information. Further improvements such as user-guided semi-automated segmentation, alternative segmentation algorithms, skeletonization, and data abstraction into geometric models would further enhance the power of Chimera for analyzing our data.
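The idea of replacing a segmented filament with a simple geometric object can be sketched as follows. This is only an illustrative example under stated assumptions, not a Chimera algorithm: the input is taken to be a list of 3D points sampled from the surface of a roughly straight filament, the axis is estimated as the dominant principal direction of the point cloud (found by power iteration on the covariance matrix), and the function name and synthetic test points are hypothetical.

```python
import math


def fit_cylinder(points, iterations=100):
    """Summarize a point cloud as a cylinder: center, axis, length, radius."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points.
    cov = [
        [sum(c[i] * c[j] for c in centered) / n for j in range(3)]
        for i in range(3)
    ]
    # Power iteration converges to the direction of largest variance,
    # which is the long axis for an elongated filament.
    axis = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        axis = [sum(cov[i][j] * axis[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    # Length is the extent along the axis; radius is the mean distance from it.
    along = [sum(c[i] * axis[i] for i in range(3)) for c in centered]
    length = max(along) - min(along)
    radial = [
        math.sqrt(max(0.0, sum(ci * ci for ci in c) - t * t))
        for c, t in zip(centered, along)
    ]
    radius = sum(radial) / n
    return centroid, axis, length, radius


# Synthetic surface points: radius 12 and length 100 along z (arbitrary units).
surface = [
    (12 * math.cos(2 * math.pi * k / 8), 12 * math.sin(2 * math.pi * k / 8), z)
    for z in range(0, 101, 5)
    for k in range(8)
]
center, axis, length, radius = fit_cylinder(surface)
```

Four numbers plus a direction then stand in for hundreds of surface points. The approach assumes a straight segment; a curved filament would need to be split into pieces or traced with a skeletonization step first.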
Another challenge is how to handle the conformational and compositional heterogeneity of molecular machines, including identifying substates of the same entity, discerning different functional states, and evaluating which states are compatible with function and which are pathological or unphysiological. Standard methods of comparison such as cross-correlation have been of limited utility. Our collaborators in the Sethian group have been developing algorithms for comparing similar but nonidentical architectures. This area of structure analysis is the least developed and will require serious work, but we feel confident that the team we have assembled, including the Chimera team, will be able to devise and implement solutions.
Some of the most spectacular results in tomography have been obtained by motif averaging. Motif averaging is likely to be of great value for the analysis of the stereocilia actin core and its different cross-links, where one can expect a pattern of repeating units suitable for classification and averaging. Where local averaging cannot be performed, one can still attempt template matching. Clearly, adding such capabilities to Chimera would be highly desirable.
The proposed advances in Chimera will be of significant importance for our research, and in turn, the RBVI will benefit from our experience in ET and SEM, state-of-the-art data sets, and access to advanced algorithms for tackling the problems we have laid out. What makes Chimera so powerful and useful is that it can integrate developments from a variety of different academic labs into a single software package, on top of existing high-quality visualization and analysis tools, thus providing a one-stop-shop environment for data visualization, exploration, and quantitative analysis.
DBP 6: Modeling Biological Assemblies from cryoEM and cryoET Maps (PI: Wah Chiu)
The Chiu lab has been pioneering the methodology of structure determination by single particle cryo-electron microscopy (cryoEM) for over 2 decades. The lab is supported by the NIH as a P41 resource, the National Center for Macromolecular Imaging (NCMI). We have made substantial advances that enable imaging features as fine as individual protein residues, solving structures of very large asymmetric machines, and revealing the operation of molecular assemblies in their native environments.
Recently, we have shown the feasibility of tracing the Cα backbones of several proteins determined at resolutions near 4Å. Although it is possible to attain resolutions approaching those of X-ray crystallography for molecular assemblies with high symmetry, many biological machines are only approximately symmetric. For example, the capsid structure of bacteriophage P-SSP7 can be solved with icosahedral symmetry, but the phage also has a tail that breaks this symmetry. We have developed a new reconstruction algorithm that does not impose symmetry, and we can now delineate the details of nearly all of the tail proteins, including the portal vertex complex through which the viral genome is packaged and released (Figure 1). To validate this structural model, we are using cryo-electron tomography (cryoET) of P-SSP7 bacteriophages infecting their host, a cyanobacterium. Averaging the subtomograms of phage particles reveals multiple snapshots of the full and empty states, but at low resolution (~40Å). We are striving to improve the resolution (by averaging more data) to allow discerning structural details at different stages of infection.
New tools will be developed within the Chimera package to:
- build Cα backbone models in cryoEM maps at resolutions that have become attainable only recently
- interactively explore functional states seen in cryoET
- create animations to illustrate hypotheses about how molecular machines function
Advances in experimental methodology and software processing have led to single-particle cryoEM maps at 4Å resolution. This resolution is too low for automated X-ray crystallography model-building software because only the largest residues are visible, but sufficiently high that protein backbones can be traced. The technique threads secondary structure elements (α-helices and β-strands) predicted from protein sequence into the observed densities. A diverse collection of algorithmic tools factor into this process: segmentation, helix and sheet density identification, density skeletonization, secondary structure prediction (using web services), path finding, canonical atom placement, and atom position refinement. These calculations are fast, but because the modeling is at the limit of what is possible, human intervention is often needed, and Chimera is ideal for this combination of real-time calculations with manual intervention where needed. Initial algorithms for all of the steps have been created by NCMI in a development software package called Gorgon. We plan to incorporate production versions of the tools into the standard Chimera distribution, an effective means of dissemination to the broader structural biology community.
The highest-resolution cryoEM structures are obtained using purified preparations of a molecular assembly, all identical, allowing thousands to millions of copies to be averaged together. The complementary technique of cryoET can image in vivo samples containing many different states of a molecular assembly, but a very low signal-to-noise ratio limits the observed detail to ~40-50Å unless equivalent structures can be averaged. However, it is difficult to decide which of the many structures are in equivalent states because of the poor resolution, and further, the accurate alignment of the structures for averaging poses severe challenges. NCMI is developing algorithms, some of which are computationally intensive, to address these problems. NCMI-developed subtomogram alignment and averaging methods will be incorporated into Chimera. Morphing between the aligned densities, a current Chimera capability, will provide a powerful visual check on the similarity of the observed structures. These exploratory tools will help to identify the best data for full computational analyses.
Observing several discrete states of a molecular machine carrying out its function gives clues about its temporal workings, but the picture is usually far from complete. During bacteriophage P-SSP7 infection of its host, for example, we believe that the interaction of tail fibers with host cell receptors causes a cascade of structural rearrangements at the portal vertex, leading to the release of viral DNA. This transition is sufficiently fast that we do not observe the intermediate states, although specialized experiments may be able to trap the virus in mid-process. To explore our structural hypotheses, explain them to others, and plan experiments, it is extremely valuable to create an animation of the hypothetical motions. The NCMI creates about 100 animations each year illustrating the architecture and function of molecular machines, almost all being produced using Chimera and then combined and narrated in commercial movie-editing software. Most animations are confined to showing details that have been experimentally observed. A new direction is to animate hypothetical mechanisms of action of a molecular machine. Potentially useful capabilities include more complicated morphing (for example, only moving one component of an assembly at a time) and depiction of stochastic behaviors such as Brownian motion. The RBVI will develop new animation tools, guided by specific NCMI animation needs and experience, and will refine them to assure that they are easy to use.
DBP 7: Structural Data Validation, Improved Formats, and Visualization at RCSB PDB (PIs: Helen Berman, Cathy Lawson, John Westbrook)
The Protein Data Bank (PDB) is a key research resource and a central component to our understanding of living systems. It archives the 3D structures of biological macromolecules determined by X-ray crystallography, nuclear magnetic resonance (NMR) and cryo-electron microscopy (cryoEM). The RCSB PDB is a founding member and archive keeper of the world-wide PDB (wwPDB), and is actively working with wwPDB partners (PDBe, PDBj, and BMRB) to create a single worldwide system for the collection, annotation, validation and archival of PDB data. To support this effort, the wwPDB has created task forces in X-ray crystallography, NMR, and cryoEM to develop recommendations for the validation of experimental structure data. In addition, wwPDB partners are involved in developing new data format standards. The RCSB PDB also strives to increase the familiarity of students, teachers, scientists and the general public with the 3D structures of proteins, nucleic acids, and macromolecular complexes, with outreach efforts including online resources, courses, workshops, exhibits, and other educational materials.
Molecular visualization and data archiving are of vital importance to structural biology and biomedicine. For significant positive impact, we see the following major areas of collaboration between RCSB PDB and RBVI/Chimera:
- Implementation of the recommendations of the wwPDB task forces for data validation by leading software applications such as Chimera will enable researchers to make use of best practices while visualizing structural data.
- Robust support in Chimera for improved data formats will accelerate their adoption and full use of their capabilities for visualization, analysis and data exchange.
- New visualization and analysis tools developed by the RBVI will enable and enhance the RCSB PDB's efforts in education and outreach. For example, a widget for examining structures in a web browser and tools for easy movie creation can have a significant impact.
Structural Data Validation
Supporting the wwPDB task force recommendations is an ongoing activity that will need to closely track the progress of each expert group. The X-ray Task Force, established in 2008, has submitted a paper for publication; recommendations include new metrics to assess the relative quality of X-ray structure models. The new statistics will be maintained and updated routinely by the wwPDB. Easy access to and (where applicable) display of these new relative metrics within Chimera will facilitate evaluation of structural data quality by Chimera users. The NMR Task Force, established in 2009, and the EM Task Force, established in 2010, are in the earlier stages of developing standards. When available, their recommendations will be shared with RBVI.
Improved Data Formats
Areas of focus for data format improvement are the wwPDB working format (PWF) and cryoEM map and segmentation standards.
PWF is being developed to accommodate very large structures and assemblies, structures determined by multiple methods and hybrid experimental techniques at varying levels of accuracy, detailed covalent chemical descriptions, and quantities specific to the experimental method (for example, disorder will be described differently for X-ray, NMR and EM experiments); it also provides a new mechanism for defining and annotating groups of atoms and residues. These types of information will require new software to provide appropriate visual representations and interfaces for the user to examine structures and their parts for reliability, precision, and other properties.
CryoEM is a relatively new method that allows determining the structures of large complexes in important functional states. While the crystallographic and NMR communities have essentially reached consensus on common data formats, most software tools for working with EM maps use proprietary data formats, and their incompatibility is a serious impediment to the exchange of data and algorithms. Two areas of EM format development in which Chimera is poised to contribute significantly are (a) map exchange, and (b) segmentation and coarse-grained structure annotation. We aim to develop a map exchange format that represents all map types (single-particle, helical, crystal, tomogram) with appropriate symmetry parameters, uses a consistent standard for map position relative to coordinate origin, is extensible to additional types of information (e.g., per-voxel density error), and is compact for efficient storage.
Segmentations indicate map regions belonging to individual proteins or subunits. Segmenting a primary map is typically labor-intensive, especially for large tomograms, but tools are being developed to speed the process. Currently there is no accepted format for archival/exchange of segmentation results; we plan to investigate the following: (a) map with zeroes everywhere except for the identified region, (b) map with different integer values representing different regions, and (c) single file containing all segmentation regions and annotations for a given primary map. Ways of describing coarse-grained models such as locations of secondary structure elements or domains will also be investigated. Proof-of-concept examples will be provided and community input sought.
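The trade-offs between candidate representations (a) and (b) can be seen in a small Python sketch. It is a toy under stated assumptions, not a proposed standard: maps are flattened to plain lists of voxel values, label 0 means "unassigned", regions do not overlap, and the function names are hypothetical.

```python
def masks_from_labels(density, labels):
    """Representation (b) to (a): one zero-filled map per labeled region."""
    regions = sorted(set(labels) - {0})
    return {
        region: [d if lab == region else 0.0 for d, lab in zip(density, labels)]
        for region in regions
    }


def labels_from_masks(masks):
    """Representation (a) to (b): merge per-region maps into integer labels."""
    size = len(next(iter(masks.values())))
    labels = [0] * size
    for region, mask in masks.items():
        for i, value in enumerate(mask):
            if value != 0.0:
                if labels[i] != 0:
                    raise ValueError(f"voxel {i} is claimed by two regions")
                labels[i] = region
    return labels


# Six-voxel toy map with two segmented regions.
density = [0.0, 1.2, 0.8, 0.0, 2.5, 2.1]
labels = [0, 1, 1, 0, 2, 2]

masks = masks_from_labels(density, labels)
recovered = labels_from_masks(masks)
```

The round trip succeeds here, but it illustrates a weakness of (a): a voxel that belongs to a region yet has a density of exactly zero cannot be told apart from background, and storage grows with the number of regions. Representation (b) avoids both problems but cannot express overlapping regions, which is one reason to also consider a single annotated file as in (c).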
Education and Outreach
The overall goal of outreach by the RCSB PDB is to promote a structural view of biology. This requires software that can display molecular structures easily and make them attractive and interesting to all audiences. The following examples highlight how Chimera has been instrumental in outreach by the RCSB PDB:
- The Molecular Anatomy Project (MAP) was created to present a structural perspective on human proteins, with web articles featuring a specific molecule or several associated with a specific disease. Ease of use, versatility, and high-quality images have made Chimera an obvious choice for illustrating this resource. RCSB PDB faculty have been teaching a course based on the MAP resource called “Molecular View of Human Anatomy and Diseases.” Students identify and analyze the structures of molecules related to a chosen biological theme and present their research in written and oral formats. In the initial offering of this course, RasMol and Chimera were both taught. However, the students finally chose Chimera to create their images, and many even reported that learning to use Chimera was one of the best parts of the course since it enabled them to explore structures independently and easily. Figure 1 shows an image made by a freshman in the course.
- Middle- and high-school classes often prefer hands-on activities to computer graphics. Using the “Flatten Icosahedron” function in Chimera, a flat version of the dengue virus capsid (to be printed on paper and folded into 3D) was created from the coordinates in the PDB archive. The RCSB PDB distributes this hand-held virus model online and at workshops. Other flattened virus templates have also been created for educational purposes using Chimera.
Going forward, the RCSB PDB faculty and their students would like to explore morphing and any new functions in Chimera that can facilitate making animations for education. Tools for building molecular assemblies from their component parts would also be useful for developing an understanding of intermolecular interactions and the structure and function of molecular machines.
DBP 8: Network Visualization of Whole Genome Sequence Assembly (PI: Joseph DeRisi)
Significance
The emergence of ultra-deep sequencing technologies is a hallmark of the last five years of biomedical science. Previously, large sequencing projects were the exclusive domain of a few dozen genome centers in the US and around the world. Now, after only a few years, the actual process of generating gigabases of DNA sequencing data has been effectively democratized. Technologies from Roche, ABI, and Illumina in particular, are accessible to labs of even modest means, or through local core facilities. The Illumina HiSeq-2000, for example, routinely produces greater than 100 gigabases of sequence, enough for 30x coverage of the human genome in a single paired-end run. Indeed, the cost of sequencing, on a per-base basis, has dropped approximately 50,000-fold in the last six years. Importantly, the increase in sequencing capacity shows no sign of abatement, and at the current pace, we can continue to expect at least a doubling in sequence output and drop in cost.
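The two figures quoted above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the human genome size of about 3.1 gigabases is an approximate value assumed here, the cost decline is treated as a steady exponential, and the function names are illustrative.

```python
import math


def fold_coverage(bases_sequenced, genome_size):
    """Average number of times each genome position is read."""
    return bases_sequenced / genome_size


def halving_time_months(fold_drop, months):
    """Months per halving of cost, assuming a steady exponential decline."""
    return months / math.log2(fold_drop)


run_yield = 100e9      # 100 gigabases from a single run, as stated above
human_genome = 3.1e9   # assumed approximate haploid genome size in bases

coverage = fold_coverage(run_yield, human_genome)   # roughly 32x
halving = halving_time_months(50_000, 6 * 12)       # roughly 4.6 months

print(f"coverage: {coverage:.1f}x, cost halves every {halving:.1f} months")
```

Under these assumptions a 100-gigabase run gives slightly more than the quoted 30x coverage, and a 50,000-fold drop over six years corresponds to the per-base cost halving roughly every four to five months.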
A significant result of this paradigm shift in technology is that virtually any laboratory with modest funding can sequence an entire vertebrate genome at many-fold coverage, or participate in deep metagenomics sequencing, SNP discovery, or any of the other genomic endeavors that were once exclusive to large sequencing centers. However, having the data in hand is merely the first step, and most labs are poorly equipped to deal with the torrent of incoming sequence. For many applications, such as re-sequencing the human genome, the bioinformatic challenges are straightforward, but the real challenges lie in truly de novo sequencing projects, in which the target genome is completely uncharacterized. With the advent of major projects like Genome 10K (http://genome10k.soe.ucsc.edu/) (the sequencing of 10,000 vertebrate genomes), the need for a new generation of analysis tools is acute. At the base level, de novo assembly tools are needed to build contigs of sequence, and then secondary tools are required to build scaffolds, linking contigs together. Many such tools are under development, yet interpreting the results of de novo assembly and scaffold building is significantly hampered by the lack of appropriate visualization tools. The reason for this is simple: the scale of the data is enormous, the number of connections is enormous, and there exists no standard for visually evaluating the results.
Thus, the significance and goal of this DBP is the development of Cytoscape as a state-of-the-art visualization tool for ultra-deep sequencing data and assemblies. To go beyond visualization, we propose a Cytoscape implementation that actually facilitates the assembly process by making use of embedded clustering and similarity tools, allowing the researcher to efficiently pinpoint problematic assemblies, resolve scaffold conflicts, and so on. In this implementation, Cytoscape becomes much more than a visualization tool: it becomes an integral component of the assembly process, interfacing with multiple external programs and de novo assembly algorithms, without the need for scripting and programming. It becomes the much-needed GUI that opens the use of these advanced tools to the general research community. Ultimately, having such a tool will complete the promise of democratizing whole genome sequencing.
Figure 1. Example of node layout, wherein contig size is denoted by node diameter. BLAST similarities are encoded by color, the red nodes all being similar (repeat elements). Zooming into these nodes, the underlying structure of the assembly is revealed. Contig 2 is linked to contig 1 at segment 1.2, with a unidirectional paired end. Internally, both contigs are supported by concordant bidirectional paired-end reads. Edge colors denote strength/type of linkage.
The process of assembling a genome necessarily involves the production of extremely large networks. It is reasonable to expect networks, where each node represents a contig and each edge represents a connection between contigs, to contain tens of thousands of nodes. The underlying data may contain billions of individual reads, each with paired connections or other metadata. Thus, the principal innovation that will result from this DBP is the efficient visualization of very large networks, onto which a variety of metadata is superimposed. In addition, this DBP will result in Cytoscape extensions that will allow the user to interact with underlying assembly and alignment tools directly through the visualization. In effect, Cytoscape becomes the primary GUI for analyzing, optimizing, and implementing whole genome assemblies.
The implications of achieving this goal are significant. Previously, end-users of ultra-deep sequencing data had to rely on multiple command line tools, baroque scripts, and data format conversions. For the majority of biomedical researchers outside of bioinformatics/genomics specialties, the activation barrier and learning curve associated with collecting/running the appropriate tools and interpreting the results are simply too high. As it stands, only a small population of researchers is actually capable of assembling whole genome sequences despite the widespread availability of the data and technology. There are nascent open source collections of assembly tools, such as “AMOS” (http://amos.sf.net); however, the visualization and guidance of assembly tend to be the weakest parts. For example, in AMOS, the HawkEye visualizer is incapable of rendering large amounts of data, and was intended for sequence at a much smaller scale. A Cytoscape visualization and interaction tool would place these algorithms into a context where they can be used easily and the results interpreted simply.
The specific innovations require a basic overview of the whole genome assembly process, as follows. The raw material of a whole genome assembly is the sequence reads themselves. These often come in linked pairs (so-called “paired-end” reads, or “mate-pairs”). Using the Illumina platform as a reference, paired-end reads are typically 100 nt in length and separated by 100-200 nt. Mate-pair libraries may be made wherein the sequence between pairs is much larger, anywhere from 3 kb to 10 kb. Contigs are built by alignment, and then contigs are joined or linked using the paired-end or mate-pair information. In addition, collections of reads may derive from an upstream process meant to reduce complexity. This includes bar-coded libraries built from sub-fractions of the genome (like bands from pulsed-field gels, or fosmids). These variations on library preparation and strategy (often referred to as the “recipe”) often carry metadata essential for proper assembly of the reads into contigs. The metadata may be the barcodes, physical chromosome numbers, etc. The incorporation of metadata into visualization is discussed below. Many algorithms have been built for de novo assembly. Regardless of the algorithm, major difficulties are encountered with repeat elements, duplicated regions, and other sequence features that produce ambiguous assemblies. Furthermore, the process by which reads are assembled into contigs is prone to a variety of errors and can result in chimeric, or bogus, contigs or assemblies.
Thus, de novo assembly of genomes usually entails an iterative approach in which an assembly is evaluated, parts of the data are extracted, re-assembled with varying parameters, and so on. In some cases, an assembly can only be completed with additional wet-bench experiments, but the evaluation pinpoints where those experiments must be done. In other cases, different assembly approaches or algorithms are employed to extend contigs, close gaps, and resolve repeat regions.
Inherent to this process is the need to visualize the assembly results. Contigs may be represented as individual nodes, and connections between contigs would be paired-end, mate-pair, or barcode linkages. These may be directional and may have fine granularity. For example, the exact location within a contig that has a paired-end or mate-pair linkage could be visualized, which is important for resolving false assemblies. In the ideal case, contigs would appear as beads on a string, each with a defined number of linkages to the contig before and after it on the physical map. However, real-world assemblies appear much more complex due to the intrinsic ambiguities of DNA sequence mapping (repeats, etc.).
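The node-and-edge representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation planned for this DBP: the `ReadPairLink` record, its field names, and the helper functions are hypothetical, and the input is assumed to be read pairs whose two ends have already been aligned to contigs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPairLink:
    """One read pair whose ends map to contigs (hypothetical record)."""
    contig_a: str  # contig hit by the first read
    pos_a: int     # position of the first read within contig_a
    contig_b: str  # contig hit by the second read
    pos_b: int     # position of the second read within contig_b
    kind: str      # "paired-end", "mate-pair", or "barcode"


def build_contig_graph(links):
    """Group links into edges keyed by an unordered contig pair.

    Pairs whose two ends fall in the same contig are counted as internal
    support for that contig rather than as an edge.
    """
    edges = defaultdict(list)
    internal = defaultdict(int)
    for link in links:
        if link.contig_a == link.contig_b:
            internal[link.contig_a] += 1
        else:
            key = tuple(sorted((link.contig_a, link.contig_b)))
            edges[key].append(link)
    return dict(edges), dict(internal)


def neighbor_counts(edges):
    """Number of distinct contigs each contig is linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {contig: len(others) for contig, others in neighbors.items()}


# Toy data: three contigs in a chain, with one internal pair on contig2.
links = [
    ReadPairLink("contig1", 950, "contig2", 40, "paired-end"),
    ReadPairLink("contig1", 970, "contig2", 15, "paired-end"),
    ReadPairLink("contig2", 880, "contig3", 60, "mate-pair"),
    ReadPairLink("contig2", 300, "contig2", 480, "paired-end"),
]
edges, internal = build_contig_graph(links)
degrees = neighbor_counts(edges)
```

Because each edge keeps its individual links, the positions within each contig remain available for the fine-grained display described above. In the beads-on-a-string case, interior contigs have exactly two neighbors; a contig with more departs from that ideal and may merit closer inspection.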
Visualization in Cytoscape would facilitate an efficient assembly process. For example, query tools would allow subsets of the network to be selected based on certain characteristics, such as the number of “impossible” linkages, sequence similarity of the contigs themselves, and so on. Thus, an innovative aspect of this DBP is having the Cytoscape platform dynamically interact with underlying tools. A node (contig) could be selected, and then the user could launch a BLAST or BLAT search against all other contigs in the network. The nodes would then be colored on the fly as BLAST/BLAT results came in. In this mode, Cytoscape simply executes the search locally, taking care of the database formatting, the execution of the search, and the parsing of the results. Parts of the network that contain highly similar sequences could be removed for sub-assembly jobs, or otherwise resolved. Clustering tools could then be applied to the network and the overlaid results to parse the network into manageable pieces for resolution. The major innovation here is using Cytoscape as a front end for tools that previously required laborious steps to script, execute, parse, and load into Cytoscape. Also, given the sheer scale of the data, it will be necessary to push efficiency to achieve reasonable on-screen performance.
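As a rough sketch of the parse-and-color step, the Python below reads similarity hits and assigns a display color to each contig. It does not call Cytoscape or run a search. It assumes the results already exist as 12-column BLAST tabular text (as produced by BLAST+ with `-outfmt 6`). The function names, thresholds, and color names are illustrative assumptions.

```python
import csv
import io


def similar_contigs(blast_tab_text, query, min_identity=90.0, min_length=100):
    """Return the subject contigs similar to `query`.

    Expects BLAST tabular output with 12 tab-separated columns: qseqid,
    sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send,
    evalue, bitscore. The thresholds are illustrative, not recommendations.
    """
    hits = set()
    reader = csv.reader(io.StringIO(blast_tab_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        qseqid, sseqid = row[0], row[1]
        pident, length = float(row[2]), int(row[3])
        if (qseqid == query and sseqid != query
                and pident >= min_identity and length >= min_length):
            hits.add(sseqid)
    return hits


def color_nodes(all_contigs, query, hits,
                query_color="blue", hit_color="red", default_color="gray"):
    """Map every contig name to a display color."""
    colors = {}
    for name in all_contigs:
        if name == query:
            colors[name] = query_color
        elif name in hits:
            colors[name] = hit_color
        else:
            colors[name] = default_color
    return colors


# Made-up results for a search with contig7 as the query.
blast_text = "\n".join([
    "contig7\tcontig7\t100.00\t5000\t0\t0\t1\t5000\t1\t5000\t0.0\t9234",
    "contig7\tcontig12\t97.50\t800\t20\t0\t100\t899\t1\t800\t0.0\t1300",
    "contig7\tcontig30\t82.10\t400\t70\t2\t10\t409\t5\t404\t1e-90\t330",
    "contig7\tcontig41\t98.00\t50\t1\t0\t1\t50\t1\t50\t1e-20\t95",
])
hits = similar_contigs(blast_text, "contig7")
colors = color_nodes(["contig7", "contig12", "contig30", "contig41"],
                     "contig7", hits)
```

Here only contig12 passes both thresholds, so it is the only node colored as a hit. The same set of hits could also serve as the selection handed to a sub-assembly job.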
The incorporation of additional metadata adds tremendous value to Cytoscape as a visualization tool for assembly. This would include, but not be limited to, the underlying sequence, barcodes, physical map anchors, read coverage, sequence variations, gene finding, and BLAST/BLAT alignments/similarities. The mapping and visualization of diverse metadata on a very large network will require new approaches to make the resulting maps useful and interpretable. To achieve this end, this DBP will need to interact closely with the RBVI, as described in the Approach section below.
Approach
This DBP necessarily demands a close working relationship with the RBVI. The DeRisi lab has several ongoing whole genome and metagenomic sequencing and assembly projects that can provide real-world, raw data in the context of research projects. The DeRisi lab co-owns a HiSeq-2000 with the Weissman lab, and thus has full access to the sequencing technology.
Currently, the DeRisi Lab has the following multi-scale sequencing projects running:
- Viral metagenomics (3 kb to 100 kb viruses in the context of host sequences)
- Bacterial genomes: Streptomyces, Bartonella (3 to 8 megabases)
- Protozoan genomes: Plasmodium, Crithidia (24 to 35 megabases)
- Vertebrate genomes: Red-tailed boa, Northern Spotted Owl, Komodo dragon (> 1 gigabase)
All sequencing and assembly is being conducted in-house, making access to the primary data and assembly process by the RBVI easy and fast. A central server and the QB3 cluster will be used to store and process the data. Tools that will be deployed and considered for integration include ALLPATHS, SOAPdenovo, ABySS, Edena, and Bambus2, in addition to fast aligners such as Bowtie. Basic search tools, such as BLAST and BLAT, will also be integrated. Besides using the currently available tools, the DeRisi Lab is authoring new assembly tools to improve de novo assembly performance. The PRICE assembler, currently in beta release (http://derisilab.ucsf.edu), is a de novo assembler that uses an inductive approach and performs particularly well in metagenomics contexts. A manuscript describing the algorithm and software is in preparation. The DeRisi lab would work closely with the RBVI to use, interpret, and integrate data and tools with the Cytoscape platform.
DBP 9: Understanding the mechanisms of cell migration (PI: Dyche Mullins)
Significance
Among the most fundamental properties of living cells are the ability to control their shape and the ability to move. In most eukaryotic cells, shape and movement are driven by assembly of crosslinked networks of actin filaments in the cytoplasm.
Figure 1. Light microscopy of neutrophil-like HL60 cells. Left: differential interference contrast imaging. Right: three dimensional reconstruction of some of our recent Bessel Beam microscopy data (rendered as an iso-surface contour by UCSF Chimera and shaded by Cinema4D). New imaging technologies provide dramatic new insights into dynamic cell shape changes driven by actin assembly.
Actin Assembly and Cell Migration
Despite years of study, the connection between actin filament assembly and amoeboid cell locomotion remains unclear. This is due, in part, to inherent molecular and biophysical complexities, but it also reflects the fact that cell locomotion is not one single process. For years the canonical view of migration was that of a single sequence of coordinated events: (1) actin-driven membrane protrusion; (2) integrin-mediated leading-edge adhesion; (3) myosin-driven cell body contraction; and (4) force-dependent trailing edge de-adhesion. Recent work, however, has exploded this simple story, and we now realize that eukaryotic cells use several different mechanisms to crawl. On two-dimensional surfaces most cells depend heavily on integrin-based adhesions. Crawling through complex, three-dimensional environments, however, some cells (e.g. fast-moving leukocytes) can move in an integrin-independent manner (Lammermann, 2008; Lammermann, 2009). The key to this adhesion-independent motility appears to be spatial confinement. When cells are forced to move through restrictions that are small compared to the size of their nuclei, weak electrostatic interactions give them the purchase required to move forward (Heuzé, 2013; Renkawitz, 2010).
Dendritic actin networks help drive both two- and three-dimensional cell migration. Loss of the Arp2/3 complex slows two-dimensional fibroblast migration by 75%, similar to the effect of the actin polymerization inhibitor, Latrunculin B (Wu, 2012). The residual motility of these cells can still respond to external chemical cues, but this slow chemotaxis relies on mechanisms of membrane protrusion that, under normal circumstances, clearly do not contribute much to cell migration. Interestingly, under certain types of extreme confinement, such as when cells are squashed between glass coverslips or confined to very narrow channels, some cells can migrate rapidly in an Arp2/3-independent manner, driven solely by myosin-dependent retrograde flow of formin-nucleated actin filaments (Renkawitz, 2010; Matthieu Piel, personal communication). In more complex three-dimensional environments, however, loss of the Arp2/3 complex abolishes dynamic cell protrusions and dramatically slows migration (Giri, 2013). We are using high-resolution, three-dimensional light microscopy (Figure 1), mechanical measurements, and biochemically defined mutants of actin regulators to determine the role of dendritic actin networks in migration of cells through complex three-dimensional environments.
Innovation
At least four aspects of this project represent significant innovations: (1) One innovative feature is the seamless integration across size scales: from single molecule assays, through complex reconstitutions, to in vivo studies. Compared to studies of other complex cellular structures (e.g. the mitotic spindle) we have the great advantage of being able to reconstitute the basic biological function of the lamellipod (generating force and producing movement) from defined components. This enables us to study regulatory interactions inaccessible in vivo. (2) A second innovation is the use of three-dimensional Bessel Beam microscopy to follow migration of cells through complex environments. To extract maximum information we are working with data visualization specialists (Tom Ferrin and Graham Johnson at UCSF) to create new methods for displaying and analyzing high-resolution, 3D, time-lapse movies. (3) Thirdly, in collaboration with the Fletcher Lab at UC Berkeley, we have developed a unique experimental system that enables us to probe mechanics and composition of functional actin networks in unprecedented ways. (4) Fourthly, we have developed new tools to visualize actin in nuclei of live cells. Because they are based on filament-interaction domains of actin binding proteins these probes, unlike GFP-actin, recognize formin-generated actin filaments.
Decades of work on cell motility have been based on two-dimensional microscopy, producing simple models of protrusion and retraction (Figure 2). Only recently have advances in microscopy made it possible to capture three-dimensional images that characterize cell motion not confined to flat surfaces.
Figure 3. Life history of a fan- or petal-shaped pseudopod projecting up from the surface of a neutrophil-like cell crawling on a two-dimensional surface. Three dimensional data sets from Bessel Beam microscopy have been iso-contour rendered in Chimera, colorized and overlaid to show extension (left) and collapse (right) of the pseudopod. The cell is shown in side view.
Use high-resolution 3D light microscopy to describe the functional dynamics of membrane protrusions in crawling cells.
Our goal is to understand the fundamental molecular and biophysical bases of rapid, three-dimensional cell migration. One approach to studying adhesion-independent migration has been to squeeze cells between passivated coverslips or force them into narrow, microfluidic channels. This likely mimics movement of some cells through tight spaces, such as extravasation of neutrophils from the bloodstream, but it does not reproduce conditions experienced by cells migrating through more complex and compliant matrices. In our studies we will focus on migration of neutrophil-like cells moving through sparse collagen matrices or microfluidic devices that mimic normal tissue geometries.
Follow the life-history of three-dimensional "lamellipodial" protrusions in fast-moving cells.
The morphology and molecular architecture (Iwasa, 2007) of lamellipodial actin networks have been studied on two-dimensional surfaces but they are not well understood in three dimensions. We will use Bessel Beam microscopy (Gao, 2014) to characterize membrane protrusions of neutrophils crawling on flat surfaces and through three-dimensional collagen matrices. To analyze complex, three-dimensional "movies" of locomoting cells we use the open-source data visualization program UCSF Chimera, developed in Tom Ferrin's laboratory at UCSF (Pettersen, 2004). Chimera began life as a tool to visualize molecular structures, but recent revisions (Goddard, 2007) enable it to render density maps generated by three-dimensional light microscopy. The Ferrin laboratory is currently working with us to further extend Chimera's capabilities to create iso-contour surface renderings of cells and collagen matrices, extracted from three-dimensional data sets. One goal of this work is to produce useful three-dimensional analogs of the kymograph or space-time plot (Figure 3), which has proven useful for abstracting information on cellular dynamics from two-dimensional, time-lapse movies.
By Bessel Beam microscopy we found that, even when they are not supported by a flat surface, pseudopodial protrusions are composed of sheet-like "petals" (Figure 4). This is remarkable given that the dominant model in the field is that planar lamellipodial and lamellar actin networks arise from strong interactions with flat surfaces (Burnette, 2014). Some pseudopods consist of a single petal but, in many cases, a protrusion comprises multiple petals, nested to form a rosette. When we followed their entire life history, we noticed that pseudopods often begin as single, dynamic filopodia. Similarly, when they disappear, pseudopod rosettes collapse into a jumble of filopodial spikes. To understand the molecular architecture of these three-dimensional pseudopods, we will perform high-resolution, three-dimensional imaging of the known components of two-dimensional lamellipodial and lamellar networks: Arp2/3 complex, capping protein, cofilin, and tropomyosin. One important question is whether the petals have the same architecture as two-dimensional lamellipodia and lamellae (Iwasa, 2007): are they composed of dendritic actin networks sitting on top of contractile networks of tropomyosin-coated filaments? Does pseudopod collapse represent loss of the dendritic network or contraction of an underlying network? Also, do the residual filopodia that remain after collapse of the rosettes represent structures that were present the entire time? Does each lamellar petal have a filopodium at its heart? Also, do these filopodia contain Ena/VASP- or formin-family proteins? If we find that the protrusion of petals is always preceded by filopodia, that would strongly suggest that pseudopod generation is a multi-step process, with a filopodial initiation phase and a stable, lamellar growth phase. Such a multi-step mechanism might explain several general features of cell migration, including the effect of membrane tension on the outgrowth of new pseudopods (Houk, 2012).
Figure 5. Time-lapse Bessel Beam microscopy of neutrophil-like HL-60 cell crawling through a three-dimensional collagen matrix. Red: filamentous actin labeled by mCherry-utrophin-260. Green: fluorescein collagen. Arrows indicate collagen fibers displaced by passage of the cell. The cell assembles a massive actin ring as it passes through a constriction.
Characterize the interaction of three-dimensional lamellipodial "petals" with the extracellular environment.
Forces generated by cells crawling on two-dimensional surfaces have been measured many times (Plotnikov, 2014) and always turn out to be contractile. The distribution of forces around cells crawling in three dimensions has never been carefully measured. We aim to determine the direction and relative magnitude of the forces applied by fast-moving cells to the extracellular matrix. Briefly, we will image neutrophil-differentiated HL-60 cells, expressing a membrane-targeted mCherry, as they move through fluorescein-labeled collagen fibers. We will extract the network architecture of the collagen matrix from every frame of the movie using "Network Extractor" and "Image Surfer" (Feng, 2007), programs written especially for characterizing three-dimensional collagen matrices. After extracting the network architecture we will identify vertices and midpoints of all the network segments. We will then analyze the movements of the segment midpoints as the cell passes through the network. We will normalize the displacement by the thickness of the fiber so that relative displacement corresponds to relative force. Our preliminary data reveal that, in contrast to two-dimensional cell migration, neutrophils exert almost no pulling forces as they pass through a collagen matrix. Almost all of the forces appear to be pushing outward, away from the membrane. Also, when cells reach a constriction in the collagen matrix they generally polymerize a significant amount of actin in contact with the collagen, leading to very large, outward deformation of the collagen fibers around the cell (Figure 5). We will also correlate the pushing forces with local cell morphology. For example, in what direction do the forces around the cell body point? What types of forces are transmitted by growing and ruffling lamellar petals?
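The midpoint analysis described above might be sketched as follows. This is a simplified illustration, not the analysis code used in this project. It assumes midpoint coordinates before and after passage of the cell are already available (for example, from the extracted network). It also uses a single cell-center point as the reference for "outward", which is a stand-in for the cell membrane. The function name and all inputs are hypothetical.

```python
import math


def classify_displacements(cell_center, midpoints_before, midpoints_after,
                           fiber_thickness):
    """Label each fiber-segment midpoint displacement as push or pull.

    The displacement is projected onto the unit vector pointing from the
    cell center to the midpoint's starting position. A positive projection
    means the fiber moved away from the cell ("push"); a negative one means
    it moved toward the cell ("pull"). The projection is divided by the
    fiber thickness, following the normalization described in the text.
    """
    results = []
    for before, after, thickness in zip(midpoints_before, midpoints_after,
                                        fiber_thickness):
        disp = [a - b for a, b in zip(after, before)]
        radial = [b - c for b, c in zip(before, cell_center)]
        r_norm = math.sqrt(sum(x * x for x in radial))
        if r_norm == 0 or thickness <= 0:
            results.append(("undefined", 0.0))  # no usable direction or scale
            continue
        projection = sum(d * r for d, r in zip(disp, radial)) / r_norm
        relative = projection / thickness
        if relative > 0:
            label = "push"
        elif relative < 0:
            label = "pull"
        else:
            label = "none"
        results.append((label, relative))
    return results


# Toy example: one fiber moves outward, one moves inward (arbitrary units).
results = classify_displacements(
    cell_center=(0.0, 0.0, 0.0),
    midpoints_before=[(2.0, 0.0, 0.0), (0.0, 4.0, 0.0)],
    midpoints_after=[(3.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
    fiber_thickness=[0.5, 2.0],
)
```

In the toy data, the first midpoint moves one unit outward on a thin fiber and the second moves one unit inward on a thicker fiber, so the normalized values differ in both sign and magnitude. A fuller treatment would measure direction relative to the nearest point on the cell surface rather than a single center.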
Compare the migration and membrane dynamics of normal and perturbed cells.
We intend to couple tools and insights developed from methods of the previous two sections with pharmacological and genetic perturbations to work out the molecular mechanisms underlying three-dimensional membrane protrusion and cell migration. We will, for example, compare the morphology and life cycle of membrane protrusions generated by: (i) normal cells; (ii) cells treated with cytoskeletal inhibitors; (iii) cells in which expression of actin regulators has been knocked down; and (iv) cells expressing biochemically defined mutant versions of actin regulatory proteins. Briefly, we will determine the source of the actin filaments (Arp2/3 complex, formins, etc.) generated at sites of intimate contact with constrictions in the collagen matrix. We will determine the roles of WASP and WAVE in three-dimensional cell migration and determine the extent of crosstalk between these nucleation promoting factors. We will, for example, knock down WASP and WAVE expression and characterize the morphology of the cells as well as their mechanical coupling to the collagen matrix. This is an extremely important experiment as we hypothesize that cells employ different biophysical mechanisms to carry out three dimensional migration in the absence of WASP and WAVE. We will knock down WAVE expression and rescue cells with WAVE truncations to determine whether the capacity to activate the Arp2/3 complex is essential for WAVE's role in pseudopod formation and cell migration.
Example Collaboration and Service Projects
- Integrating Chimera Into the NRAMM Automated Processing Pipeline
- Correlative Light and Electron Microscopy: Visualization of molecular machines in their native cell and tissue context
- Visualization of Biological Assemblies at Intermediate Resolution
- CryoEM Studies of Viruses
- Data Management at the RCSB-PDB
- Electron Microscopy Databank
- Three Dimensional Structure of Chromosomes
- Protein Modeling by Satisfaction of Spatial Restraints
- Structure-Based Inhibitor Discovery
- Pharmacogenetics of Membrane Transporters
- The International Genetrap Consortium
- Deciphering Enzyme Specificity
- Integration of Methods for Structural Analysis of Functional Sites: Active Site Profiling and Fuzzy Functional Forms
- Foundations for Genomic Enzymology: Families, Superfamilies, and Suprafamilies
Older Research Projects
A list of some of our older and inactive research projects is available here.
Collaborative Research Opportunities
Collaborative projects bring our expertise in developing computational and visualization tools together with the biomedical expertise of outside scientists. Such efforts may lead to joint publications. Although our resources are limited, we welcome inquiries from scientists interested in collaborating. Inquiries should be directed to Prof. Thomas Ferrin, Director. It would be helpful to include a synopsis of your proposed project in your e-mail.