From European Proteomics Academy
Original author(s): J. Griss, R. Wang
Usage Details
Input format: MGF
Output format: .clustering
Software Details
Website: https://github.com/spectra-cluster/spectra-cluster-cli
Operating system: Windows, Linux, Mac OSX
License: Apache 2
Initial release: 15.03.2016
Development status: active

The spectra-cluster-cli is a stand-alone, command-line version of the spectra-cluster algorithm for clustering MS/MS spectra. The spectra-cluster algorithm is used to create the PRIDE Cluster resource. The spectra-cluster-cli tool allows users to cluster their own MS/MS data on an ordinary computer.


The tool reads MS/MS spectra in the MGF peak list format and writes the results to a .clustering result file. It currently supports:

  • Clustering of MS/MS spectra (MGF format)
  • Parallel processing of data

Note: Alternatively, a graphical user interface is also available at https://github.com/spectra-cluster/spectra-cluster-gui

Additionally, you can find a detailed example of how to prepare your data for clustering at http://spectra-cluster.github.io.



To run the spectra-cluster-cli, Java must be installed on the computer.
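The Java requirement above can be checked before downloading anything. The following is a minimal sketch that only tests whether a `java` executable is reachable on the PATH; the helper name `java_available` is hypothetical and not part of spectra-cluster-cli.

```python
import shutil
import subprocess


def java_available():
    """Return True if a 'java' executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    # Ask the installed Java runtime to report its version.
    subprocess.run(["java", "-version"], check=False)
else:
    print("Java was not found on the PATH; install a Java runtime first.")
```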

Downloading spectra-cluster-cli

  1. Download the latest release from https://github.com/spectra-cluster/spectra-cluster-cli/releases
  2. Unpack the downloaded zip file. No further installation is required.


Preparing your data

Peak lists must be available in MGF format. If your data is only available in another format, use ProteoWizard's msconvert to convert it to MGF.
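As an illustration of the conversion step, the sketch below builds an msconvert call that writes MGF output. It assumes that `msconvert` is installed and on the PATH; the file name `run1.mzML`, the directory `peaklists`, and both function names are hypothetical.

```python
import subprocess


def build_msconvert_command(input_file, output_dir):
    """Build an msconvert call that converts one file to MGF."""
    # --mgf selects MGF as the output format; -o sets the output directory.
    return ["msconvert", input_file, "--mgf", "-o", output_dir]


def convert_to_mgf(input_file, output_dir):
    """Run the conversion; raises CalledProcessError if msconvert fails."""
    subprocess.run(build_msconvert_command(input_file, output_dir), check=True)


print(build_msconvert_command("run1.mzML", "peaklists"))
```

Calling `convert_to_mgf("run1.mzML", "peaklists")` would then perform the actual conversion.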

Optionally, identification data can be incorporated into the MGF files using the SEQ= tag. A tool to do this automatically is under active development and will be released shortly. The identification data does not influence the clustering process. If identification data is present in the input MGF files, it is automatically written to the resulting .clustering files.
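Until such a tool is available, the SEQ= tag can be added with a short script. The sketch below is only an illustration under two assumptions that the text above does not confirm: that identifications are available as a mapping from spectrum TITLE to peptide sequence, and that the SEQ= line may be placed directly after the TITLE= line. The function name `add_seq_tags` and the example spectrum are hypothetical.

```python
def add_seq_tags(mgf_text, seq_by_title):
    """Return MGF text with a SEQ= line added after each identified TITLE= line."""
    out = []
    for line in mgf_text.splitlines():
        out.append(line)
        if line.startswith("TITLE="):
            title = line[len("TITLE="):].strip()
            seq = seq_by_title.get(title)
            if seq:
                # Assumption: SEQ= sits among the spectrum's header lines.
                out.append("SEQ=" + seq)
    return "\n".join(out) + "\n"


example = """BEGIN IONS
TITLE=scan_1
PEPMASS=500.25
CHARGE=2+
100.1 20.0
200.2 35.5
END IONS
"""

annotated = add_seq_tags(example, {"scan_1": "PEPTIDEK"})
print(annotated)
```

Spectra without an entry in the mapping are left unchanged, so unidentified spectra can still be clustered.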

Performing the clustering


C:\>cd C:\Downloads\spectra-cluster-cli
C:\Downloads\spectra-cluster-cli>java -jar spectra-cluster-cli-1.0-SNAPSHOT.jar -rounds 4 -threshold_end 0.99 -threshold_start 0.9999 -output_path C:\my_data\clustering_results.clustering C:\my_data\peaklists\*.mgf

Detailed instructions

  1. Open a command prompt. On Windows, open the Run dialog, type "cmd" and press Enter (on Windows 7, simply press the Windows key and immediately type "cmd").
  2. Navigate to the directory where you downloaded the spectra-cluster-cli tool.
  3. Launch the spectra-cluster-cli through the command java -jar spectra-cluster-cli-1.0-SNAPSHOT.jar (see above). Note: the spectra-cluster-cli-1.0-SNAPSHOT.jar part has to be adapted based on the downloaded version.
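For scripted use, the same call can be assembled programmatically. The sketch below mirrors the example command shown earlier and uses only options that appear on this page. The function names and the default file names are hypothetical, and the jar name must be adapted to the downloaded version. Because the output file must not already exist, the sketch checks for it before building the command.

```python
import os
import subprocess


def build_command(jar_path, mgf_files, output_path,
                  rounds=4, threshold_start=0.9999, threshold_end=0.99):
    """Build the argument list for one spectra-cluster-cli run."""
    if not mgf_files:
        raise ValueError("no MGF files given")
    if os.path.exists(output_path):
        # The tool requires that the output file does not exist yet.
        raise FileExistsError(output_path + " already exists")
    return ["java", "-jar", jar_path,
            "-rounds", str(rounds),
            "-threshold_start", str(threshold_start),
            "-threshold_end", str(threshold_end),
            "-output_path", output_path] + list(mgf_files)


def run_clustering(jar_path, mgf_files, output_path):
    """Run the clustering; raises CalledProcessError if the tool fails."""
    subprocess.run(build_command(jar_path, mgf_files, output_path), check=True)
```

A call such as `run_clustering("spectra-cluster-cli-1.0-SNAPSHOT.jar", ["a.mgf", "b.mgf"], "results.clustering")` would then start the clustering with the default thresholds used in the example.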

Additional options

 -binary_directory <arg>                 path to the directory to
                                         (temporarily) store the binary
                                         files. By default a temporary
                                         directory is being created
 -cluster_binary_file <arg>              if this option is set, only the
                                         passed binary file will be
                                         clustered and the result written
                                         to the file specified in
                                         '-output_path' in the binary
 -convert_cgf                            if this option is set the passed
                                         CGF file is converted into a
                                         .clustering file
 -fast_mode                              if this option is set the 'fast
                                         mode' is enabled. In this mode,
                                         the radical peak filtering used
                                         for the comparison function is
                                         already applied during spectrum
                                         conversion. Thereby, the
                                         clustering and consensus spectrum
                                         quality is slightly decreased but
                                         speed increases 2-3 fold.
 -fragment_tolerance                     fragment ion tolerance in m/z to
                                         use for fragment peak matching
 -help                                   print this message.
 -keep_binary_files                      if this option is set, the
                                         binary files are not deleted
                                         after clustering.
 -major_peak_jobs <arg>                  number of threads to use for
                                         major peak clustering.
 -merge_binary_results                   if this option is set, the passed
                                         binary results files are merged
                                         into a single .cgf file and
                                         written to '-output_path'
 -output_path <arg>                      path to the outputfile.
                                         Outputfile must not exist.
 -precursor_tolerance <arg>              precursor tolerance (clustering
                                         window size) in m/z used during
 -reuse_binary_files                     if this option is set, the binary
                                         files found in the binary file
                                         directory will be used for
 -rounds <arg>                           number of clustering rounds to
 -threshold_end <arg>                    (lowest) final clustering
 -threshold_start <arg>                  (highest) starting threshold
 -x_learn_cdf <output filename>          (Experimental option) Learn the
                                         used cumulative distribution
                                         function directly from the
                                         processed data. This is only
                                         recommended for high-resolution
                                         data. The result will be written
                                         to the defined file.
 -x_load_cdf <CDF filename>              (Experimental option) Loads the
                                         cumulative distribution function
                                         to use from the specified file.
                                         These files can be created using
                                         the x_learn_cdf parameter
 -x_min_comparisons <arg>                (Experimental option) Sets the
                                         minimum number of comparisons
                                         used to calculate the probability
                                         that incorrect spectra are
 -x_n_prefiltered_peaks <number peaks>   (Experimental option) Set the
                                         number of highest peaks that are
                                         kept per spectrum during loading.
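Some of the options listed above take a value, while others are plain switches. The sketch below shows one way to turn a settings dictionary into command-line arguments for a subset of these options. The function name `options_to_args` is hypothetical, and using one thread per CPU core for -major_peak_jobs is an assumption, not a documented recommendation.

```python
import os

# Options that take a value and plain switches, taken from the list above.
VALUE_OPTIONS = {"binary_directory", "major_peak_jobs", "precursor_tolerance",
                 "rounds", "threshold_start", "threshold_end", "output_path"}
FLAG_OPTIONS = {"fast_mode", "keep_binary_files", "reuse_binary_files"}


def options_to_args(options):
    """Convert a dict of settings into spectra-cluster-cli arguments."""
    args = []
    for name, value in options.items():
        if name in FLAG_OPTIONS:
            if value:
                # Switches are only emitted when enabled.
                args.append("-" + name)
        elif name in VALUE_OPTIONS:
            args.extend(["-" + name, str(value)])
        else:
            raise KeyError("unknown option: " + name)
    return args


extra = options_to_args({
    "major_peak_jobs": os.cpu_count() or 1,  # assumption: one thread per core
    "fast_mode": True,
    "keep_binary_files": False,
})
print(extra)
```

The resulting list can be appended to the `java -jar` call before the MGF file names.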

Analysing clustering results

We are currently working on analysis software for clustering results in the .clustering format.