Wu Enhui, Qiao Liang*
Department of Chemistry, Fudan University, Shanghai 200433, China
Microorganisms are closely linked to human health and disease, and understanding the composition and function of microbial communities is a pressing research question. In recent years, metaproteomics has become an important technique for studying both. However, because microbial community samples are complex and highly heterogeneous, sample processing, mass spectrometry data acquisition, and data analysis are the three major challenges currently facing metaproteomics. Sample pretreatment often has to be optimized for each sample type, with different schemes for microbial separation, enrichment, extraction, and lysis. As in single-species proteomics, mass spectrometry data in metaproteomics are acquired in either data-dependent acquisition (DDA) mode or data-independent acquisition (DIA) mode. DIA can in principle record information on all peptides in a sample and has great development potential, but the complexity of metaproteome samples makes DIA data analysis a major obstacle to deep metaproteome coverage. In data analysis, the most important step is constructing the protein sequence database. The size and completeness of the database strongly affect the number of identifications and also the analysis at the species and functional levels. At present, the gold standard for metaproteome database construction is a protein sequence database derived from the metagenome. Filtering public databases by iterative search has also proven to be of strong practical value. Among specific DIA data analysis strategies, peptide-centric methods are now the clear mainstream.
Advances in deep learning and artificial intelligence are expected to greatly improve the accuracy, coverage, and speed of metaproteomic data analysis. For downstream bioinformatics analysis, a series of annotation tools developed in recent years can perform species annotation at the protein, peptide, and gene levels to determine the composition of microbial communities. Compared with other omics methods, the ability to analyze the function of microbial communities is a distinctive strength of metaproteomics. Metaproteomics has become an important part of multi-omics analysis of microbial communities and still has great development potential in coverage depth, detection sensitivity, and completeness of data analysis.
01 Sample pretreatment
At present, metaproteomics is widely used in research on the human microbiome, soil, food, the ocean, activated sludge, and other fields. Compared with the proteome analysis of a single species, sample pretreatment for the metaproteome of complex samples faces more challenges. The microbial composition of real samples is complex, the dynamic range of abundance is large, the cell wall structures of different types of microorganisms differ greatly, and the samples often contain large amounts of host proteins and other impurities. Therefore, metaproteome analysis often requires pretreatment to be optimized for each sample type, with different schemes for microbial separation, enrichment, extraction, and lysis.
The extraction of microbial metaproteomes from different samples shares some steps and differs in others, but a unified pretreatment workflow for different types of metaproteome samples is currently lacking.
02 Mass spectrometry data acquisition
In shotgun proteome analysis, the peptide mixture obtained after pretreatment is first separated on a chromatographic column, then ionized and introduced into the mass spectrometer for data acquisition. As in single-species proteome analysis, the mass spectrometry data acquisition modes in metaproteome analysis are DDA and DIA.
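The difference between the two acquisition modes can be illustrated with a minimal sketch. It is a simplified model, not instrument control code: the function names, the peak list, the `top_n` value, and the 100 m/z window width are all assumptions chosen for illustration. DDA is modeled as fragmenting only the most intense precursors of a survey scan, and DIA as tiling the whole precursor m/z range with fixed isolation windows.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity) of a precursor ion in an MS1 scan


def dda_select(ms1_peaks: List[Peak], top_n: int) -> List[float]:
    """DDA (simplified): fragment only the top_n most intense precursors."""
    ranked = sorted(ms1_peaks, key=lambda p: p[1], reverse=True)
    return sorted(mz for mz, _ in ranked[:top_n])


def dia_windows(mz_start: float, mz_end: float, width: float) -> List[Tuple[float, float]]:
    """DIA (simplified): tile the precursor range with fixed isolation windows."""
    windows = []
    lo = mz_start
    while lo < mz_end:
        hi = min(lo + width, mz_end)
        windows.append((lo, hi))
        lo = hi
    return windows


def dia_covered(ms1_peaks: List[Peak], windows: List[Tuple[float, float]]) -> List[float]:
    """Precursors that fall inside any DIA window and are co-fragmented there."""
    return sorted(mz for mz, _ in ms1_peaks
                  if any(lo <= mz < hi for lo, hi in windows))


# Hypothetical survey scan with a wide dynamic range of intensities.
peaks = [(402.2, 9e5), (455.7, 2e4), (512.3, 6e6), (640.9, 3e3), (718.4, 1e5)]

selected = dda_select(peaks, top_n=2)              # only the two strongest ions
windows = dia_windows(400.0, 800.0, width=100.0)   # four 100 m/z windows
covered = dia_covered(peaks, windows)              # every precursor in range
```

In this toy scan, DDA fragments two of the five precursors, while all five fall inside a DIA window. The low-abundance ions are therefore sampled in DIA, but together with every other ion in the same window, which is why DIA spectra are harder to interpret.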
As mass spectrometry instruments are continually updated, instruments with higher sensitivity and resolution are being applied to metaproteomics, and the coverage depth of metaproteome analysis keeps improving. For a long time, high-resolution instruments, led by the Orbitrap series, have been widely used in metaproteomics.
Table 1 of the original text shows some representative studies on metaproteomics from 2011 to the present in terms of sample type, analysis strategy, mass spectrometry instrument, acquisition method, analysis software, and number of identifications.
03 Mass spectrometry data analysis
3.1 DDA data analysis strategy
3.1.1 Database Search
3.1.2 de novo sequencing strategy
3.2 DIA data analysis strategy
04 Species classification and functional annotation
The composition of microbial communities at different taxonomic levels is one of the key research areas in microbiome research. In recent years, a series of annotation tools have been developed to annotate species at the protein level, peptide level, and gene level to obtain the composition of microbial communities.
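One common way to annotate at the peptide level is to assign each peptide to the lowest common ancestor (LCA) of all taxa whose proteins contain it. The source does not specify which algorithm the annotation tools use, so the following is only a minimal sketch of the LCA idea. The `LINEAGES` table is a hypothetical, heavily truncated taxonomy (domain, phylum, genus, species), and the peptide-to-taxon matches are made up for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical, truncated lineages written from root to leaf.
LINEAGES: Dict[str, List[str]] = {
    "Bacteroides fragilis": [
        "Bacteria", "Bacteroidota", "Bacteroides", "Bacteroides fragilis"],
    "Bacteroides thetaiotaomicron": [
        "Bacteria", "Bacteroidota", "Bacteroides", "Bacteroides thetaiotaomicron"],
    "Escherichia coli": [
        "Bacteria", "Pseudomonadota", "Escherichia", "Escherichia coli"],
}


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> str:
    """Return the deepest taxon shared by all lineages ('root' if none)."""
    lca = "root"
    for ranks in zip(*lineages):      # walk down the ranks in parallel
        if len(set(ranks)) != 1:      # lineages diverge at this rank
            break
        lca = ranks[0]
    return lca


def annotate_peptides(peptide_to_taxa: Dict[str, List[str]]) -> Dict[str, str]:
    """Assign each peptide to the LCA of the taxa it was matched to."""
    return {
        peptide: lowest_common_ancestor([LINEAGES[t] for t in taxa])
        for peptide, taxa in peptide_to_taxa.items()
    }


# Hypothetical peptide matches against a protein sequence database.
matches = {
    "PEPTIDEA": ["Bacteroides fragilis"],
    "PEPTIDEB": ["Bacteroides fragilis", "Bacteroides thetaiotaomicron"],
    "PEPTIDEC": ["Bacteroides fragilis", "Escherichia coli"],
}
assignments = annotate_peptides(matches)
```

A peptide unique to one species is assigned at species level, while a peptide shared by distant taxa is only informative at a high rank. This is one reason why highly sequence-similar proteins complicate taxonomic annotation.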
The essence of functional annotation is to compare the target protein sequence against a database of functionally annotated protein sequences. Using gene function databases such as GO, COG, KEGG, and eggNOG, different functional annotation analyses can be performed on the proteins identified in metaproteomes. Annotation tools include Blast2GO, DAVID, and KOBAS.
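Once proteins carry functional terms, a community-level functional profile can be obtained by summing protein intensities per term. The sketch below assumes the annotation step has already been done by one of the tools above. The protein accessions, intensities, and the `summarize_functions` helper are hypothetical; the letters are COG functional categories (C: energy production and conversion, G: carbohydrate transport and metabolism). A protein with several terms is counted in full for each of them, which is a simplifying assumption.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_functions(
    intensities: Dict[str, float],
    annotations: Dict[str, List[str]],
) -> Dict[str, float]:
    """Sum protein intensities per functional term.

    Proteins without any annotation are collected under 'unannotated'.
    """
    totals: Dict[str, float] = defaultdict(float)
    for protein, intensity in intensities.items():
        terms = annotations.get(protein) or ["unannotated"]
        for term in terms:
            totals[term] += intensity
    return dict(totals)


# Hypothetical quantified proteins and their COG category annotations.
intensities = {"P1": 100.0, "P2": 50.0, "P3": 25.0}
annotations = {"P1": ["C"], "P2": ["C", "G"]}

profile = summarize_functions(intensities, annotations)
```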
05 Summary and Outlook
Microorganisms play an important role in human health and disease. In recent years, metaproteomics has become an important technique for studying the function of microbial communities. The analytical workflow of metaproteomics is similar to that of single-species proteomics, but because the samples are so complex, specific strategies are needed at every step, from sample pretreatment through data acquisition to data analysis. Thanks to improved pretreatment methods, continuous innovation in mass spectrometry, and the rapid development of bioinformatics, metaproteomics has made great progress in identification depth and scope of application.
In the pretreatment of metaproteome samples, the nature of the sample must be considered first. Separating microorganisms from environmental cells and proteins is one of the key challenges facing metaproteomics, and balancing separation efficiency against microbial loss is a problem that urgently needs to be solved. Second, protein extraction must take into account the differences caused by the structural heterogeneity of different bacteria. Trace-level metaproteome samples also require dedicated pretreatment methods.
In terms of instrumentation, mainstream mass spectrometers have shifted from instruments based on Orbitrap mass analyzers, such as the LTQ-Orbitrap and Q Exactive, to instruments that couple ion mobility with time-of-flight mass analyzers, such as the timsTOF Pro. The timsTOF series, which adds an ion mobility dimension, offers high detection accuracy, low detection limits, and good repeatability. These instruments have gradually become important in many fields that rely on mass spectrometry, including single-species proteomics, metaproteomics, and metabolomics. It is worth noting that the dynamic range of mass spectrometers has long limited the protein coverage depth of metaproteome research. In the future, instruments with a larger dynamic range could improve the sensitivity and accuracy of protein identification in metaproteomes.
For mass spectrometry data acquisition, although DIA has been widely adopted in single-species proteomics, most current metaproteome analyses still use DDA. DIA records fragment ion information for the whole sample and, compared with DDA, has the potential to capture the peptide information of a metaproteome sample in full. However, because DIA data are highly complex, the analysis of DIA metaproteome data still faces great difficulties. The development of artificial intelligence and deep learning is expected to improve the accuracy and completeness of DIA data analysis.
In metaproteomic data analysis, one of the key steps is the construction of the protein sequence database. For popular research areas such as the gut microbiota, gut microbial databases such as IGC and HMP can be used and have given good identification results. For most other metaproteomics analyses, the most effective strategy is still to build a sample-specific protein sequence database from metagenomic sequencing data. For microbial community samples with high complexity and a large dynamic range, the sequencing depth has to be increased so that more low-abundance species are detected, which in turn improves the coverage of the protein sequence database. When sequencing data are lacking, an iterative search can be used to reduce a public database to the entries relevant to the sample. However, iterative search may affect false discovery rate (FDR) control, so the search results need to be checked carefully. The applicability of traditional FDR control models to metaproteomics analysis also remains an open question. In terms of search strategy, a hybrid spectral library can improve the coverage depth of DIA metaproteomics. In recent years, predicted spectral libraries generated by deep learning have shown superior performance in DIA proteomics. However, metaproteome databases often contain millions of protein entries, so the predicted spectral libraries become very large, consume substantial computing resources, and lead to a large search space. In addition, the similarity between protein sequences in metaproteomes varies greatly, which makes it difficult to guarantee the accuracy of the spectral library prediction model. For these reasons, predicted spectral libraries have not yet been widely used in metaproteomics. Finally, new protein inference and taxonomic annotation strategies need to be developed for metaproteomics analyses that involve proteins with highly similar sequences.
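The iterative search and FDR control described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular search engine: the `PSM` record, the score scale, the relaxed 5% threshold for the first pass, and the toy data are all hypothetical. FDR is estimated with the standard target-decoy approach, as the ratio of decoy to target matches above a score cutoff.

```python
from typing import Dict, List, NamedTuple


class PSM(NamedTuple):
    """A peptide-spectrum match (hypothetical record layout)."""
    peptide: str
    protein: str
    score: float      # higher means a better match
    is_decoy: bool


def filter_by_fdr(psms: List[PSM], max_fdr: float) -> List[PSM]:
    """Keep target PSMs above the lowest score cutoff where decoys/targets <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = cutoff = 0
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = i            # deepest rank still within the FDR limit
    return [p for p in ranked[:cutoff] if not p.is_decoy]


def reduce_database(first_pass: List[PSM], database: Dict[str, str],
                    relaxed_fdr: float = 0.05) -> Dict[str, str]:
    """Step 1 of an iterative search: keep only proteins hit in a relaxed first pass."""
    hits = {p.protein for p in filter_by_fdr(first_pass, relaxed_fdr)}
    return {acc: seq for acc, seq in database.items() if acc in hits}


# Hypothetical public database and first-pass search results.
database = {"P1": "MKTAYIAK", "P2": "MSLNQEVR", "P3": "MADELLKR"}
first_pass = [
    PSM("TAYIAK", "P1", 90.0, False),
    PSM("SLNQEVR", "P2", 80.0, False),
    PSM("KAIYATK", "decoy_P1", 40.0, True),
    PSM("ADELLK", "P3", 30.0, False),
]

reduced = reduce_database(first_pass, database)   # database for the second search
```

The spectra would then be searched again against `reduced` together with matching decoys. Because the reduced database was selected using the same spectra, the decoy-based estimate in the second pass can be too optimistic, which is the FDR concern raised above; the sketch does not correct for this.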
In summary, as an emerging technology for microbiome research, metaproteomics has already produced significant results and retains great development potential.