nf-core/mspepid
This pipeline performs peptide identification of MS2 spectra from a proteomics experiment. Different search algorithms and rescoring approaches can be selected for this.
Introduction
This document describes the output produced by the pipeline. All paths are relative to the top-level results directory specified with --outdir.
Pipeline overview
nf-core/mspepid processes mass spectrometry data through the following steps:
- Database preparation - optional entrapment database creation and decoy sequence generation
- Spectra preparation - decompression and vendor-format conversion to mzML
- Spectra identification - database search with Comet and/or Sage
- Rescoring - FDR-based PSM rescoring with Percolator and/or MS2Rescore
Database preparation
Decoy generation
Decoy protein sequences are automatically appended to the target database unless --skip_decoy_generation is set. Decoys are generated by the OpenMS DecoyDatabase tool, using sequence reversal as the default strategy.
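The reversal strategy can be illustrated with a minimal Python sketch. This is not the OpenMS DecoyDatabase implementation: the `DECOY_` prefix and the helper names `read_fasta` and `append_reversed_decoys` are assumptions chosen for illustration only.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical prefix for illustration; the tag used by the pipeline may differ.
DECOY_PREFIX = "DECOY_"


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def append_reversed_decoys(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return all targets followed by one reversed decoy per target."""
    targets = list(records)
    decoys = [(DECOY_PREFIX + header, seq[::-1]) for header, seq in targets]
    return targets + decoys


if __name__ == "__main__":
    fasta = [">sp|P1|TEST", "MKTAY", "IAK"]
    for header, seq in append_reversed_decoys(read_fasta(fasta)):
        print(f">{header}\n{seq}")
```

Because every decoy has the same length and amino-acid composition as its target, the combined database doubles in size while keeping the target entries unchanged.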
Entrapment database
When --entrapment_fold is greater than 0, an entrapment database is created by FDRBench prior to decoy generation. Entrapment sequences (shuffled target proteins) are used for independent FDR estimation during benchmarking.
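As a rough illustration of what shuffled entrapment sequences look like, the following Python sketch shuffles each target protein. It is not FDRBench's procedure, which may shuffle differently. It assumes that the fold value is the number of entrapment sequences generated per target; the function name `make_entrapment` and the `_entrapment_<i>` header suffix are hypothetical.

```python
import random
from typing import List, Tuple


def make_entrapment(
    targets: List[Tuple[str, str]], fold: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Return `fold` shuffled copies of every target sequence.

    Assumption: `fold` is the number of entrapment sequences per target.
    The `_entrapment_<i>` header suffix is hypothetical.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    entrapment = []
    for header, seq in targets:
        for i in range(1, fold + 1):
            residues = list(seq)
            rng.shuffle(residues)
            entrapment.append((f"{header}_entrapment_{i}", "".join(residues)))
    return entrapment


if __name__ == "__main__":
    for header, seq in make_entrapment([("P1", "MKTAYIAK")], fold=2):
        print(f">{header}\n{seq}")
```

Shuffling preserves the amino-acid composition of each protein, so entrapment hits are plausible to the search engine yet known to be false, which is what makes them useful for an independent FDR check.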
Spectra identification
For each enabled search engine, PSM results are written under a subdirectory named after the search engine (e.g. comet/ or sage/).
All results are converted to a common format with psm-utils for downstream processing.
Comet
Output files
comet/
- *.mzid - PSMs in mzIdentML standard format (default output of Comet).
- *.comet.params - the final input parameters of the Comet run.
- *.psmutils.tsv - PSMs in psm-utils TSV format, used as input to rescoring steps.
- *.pin - Percolator input file (PIN format) containing search engine features.
- Additional outputs can be activated, see module description of Comet.
Comet is a widely used open-source database search engine for tandem mass spectra.
Sage
Output files
sage/
- *.results.sage.tsv - PSMs in Sage’s default TSV output format.
- *.results.json - the final input parameters of the Sage run.
- *.psmutils.tsv - PSMs in psm-utils TSV format, used as input to rescoring steps.
- *.pin - Percolator input file (PIN format) containing search engine features.
Sage is a fast, Rust-based database search engine that scales efficiently to large datasets and large protein databases.
Rescoring
PSMs from each search engine are rescored independently. Output directories are nested under the search engine directory.
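The nested layout can be walked programmatically. The sketch below is a hypothetical helper (`collect_outputs` is not part of the pipeline) that gathers result files per search engine, using the glob patterns listed in this document and assuming the engine directories are named comet and sage.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# Glob patterns taken from the output descriptions in this document,
# relative to each search engine directory.
PATTERNS = {
    "search": ["*.psmutils.tsv", "*.pin"],
    "percolator": ["percolator/*.target.psms", "percolator/*.decoy.psms"],
    "ms2rescore": ["ms2rescore/*.target.psms", "ms2rescore/*.decoy.psms"],
}


def collect_outputs(
    outdir: Path, engines: Sequence[str] = ("comet", "sage")
) -> Dict[str, Dict[str, List[Path]]]:
    """Map each search engine to the result files found for each step."""
    found: Dict[str, Dict[str, List[Path]]] = {}
    for engine in engines:
        engine_dir = outdir / engine
        if not engine_dir.is_dir():
            continue  # this engine was not enabled for the run
        found[engine] = {
            step: sorted(p for pattern in patterns for p in engine_dir.glob(pattern))
            for step, patterns in PATTERNS.items()
        }
    return found
```

Calling `collect_outputs(Path("results"))` on a finished run would return, for every engine directory that exists, the search-level files and the files from each rescoring subdirectory; steps that were not run simply yield empty lists.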
Percolator
Output files
<searchengine>/percolator/
- *.target.psms - Target PSMs scored and filtered by Percolator.
- *.decoy.psms - Decoy PSMs (used for FDR calibration).
- Additional outputs can be activated, see module description of Percolator.
Percolator applies semi-supervised machine learning (using an SVM) to re-rank PSMs based on search engine scores and auxiliary features, then estimates the FDR using a target-decoy competition approach.
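The idea behind target-decoy FDR estimation can be sketched in a few lines of Python. This is a simplified illustration, not Percolator's actual estimator: it assumes one best-scoring PSM per spectrum (the competition step has already happened), ignores score ties, and estimates the FDR at a threshold as the ratio of decoys to targets. The function name `target_decoy_qvalues` is hypothetical.

```python
from typing import List, Tuple


def target_decoy_qvalues(psms: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Compute q-values for target PSMs from (score, is_decoy) pairs.

    The FDR at a score threshold is estimated as #decoys / #targets among
    PSMs scoring at or above it; the q-value of a PSM is the minimum FDR
    over all thresholds that still accept that PSM.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # Enforce monotonicity by sweeping from the worst score upwards.
    running_min = float("inf")
    qvalues = [0.0] * len(ranked)
    for i in range(len(ranked) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvalues[i] = running_min
    return [
        (score, min(q, 1.0))
        for (score, is_decoy), q in zip(ranked, qvalues)
        if not is_decoy
    ]


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    for score, q in target_decoy_qvalues(example):
        print(f"score={score:.1f} q={q:.3f}")
```

Filtering the returned list at, for example, q <= 0.01 gives the set of target PSMs accepted at 1% FDR under these simplifying assumptions.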
MS2Rescore / Tims2Rescore
Output files
<searchengine>/ms2rescore/
- *.target.psms - Target PSMs rescored after MS2PIP feature augmentation.
- *.decoy.psms - Decoy PSMs.
- *.pin - The original PSMs after identification, with additional features created by MS2PIP.
MS2Rescore generates additional rescoring features by comparing observed fragment ion spectra to spectra predicted by MS2PIP. The augmented feature set is then passed to Percolator for final scoring. This typically increases the number of PSMs identified at a given FDR threshold, particularly for challenging samples such as immunopeptidomes or non-tryptic digests. If TIMS data is used as input, Tims2Rescore is applied automatically (the default behaviour of newer MS2Rescore implementations).
The MS2PIP fragmentation model is controlled by --ms2rescore_model (default: HCD).
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
- Software versions used: nf_core_mspepid_software_versions.yml.
Nextflow can generate various reports on the execution of the pipeline. These help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.
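For troubleshooting, the trace file can be summarised with a short script. The sketch below assumes the trace is tab-separated with a header row containing a `status` column, as in Nextflow's default trace format; the function name `count_task_status` and the process names in the example data are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Iterable


def count_task_status(lines: Iterable[str]) -> Counter:
    """Count tasks per value of the `status` column in a Nextflow trace file."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["status"] for row in reader)


if __name__ == "__main__":
    # Inline example data; in practice, pass an open file handle for
    # pipeline_info/execution_trace.txt instead.
    example = io.StringIO(
        "task_id\tname\tstatus\texit\n"
        "1\tPROCESS_A (sample1)\tCOMPLETED\t0\n"
        "2\tPROCESS_B (sample1)\tFAILED\t1\n"
    )
    print(count_task_status(example))
```

A non-zero count for a failure status is a quick signal to open execution_report.html or the task's work directory for details.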