Spectra-cluster-cli
Original author(s) | J. Griss, R. Wang |
---|---|
Usage Details | |
Input format | MGF |
Output format | .clustering |
Software Details | |
Website | https://github.com/spectra-cluster/spectra-cluster-cli |
Operating system | Windows, Linux, Mac OS X |
License | Apache 2 |
Initial release | 15.03.2016 |
Development status | active |
The spectra-cluster-cli is a stand-alone, command-line version of the spectra-cluster algorithm for clustering MS/MS spectra. The spectra-cluster algorithm is used to create the PRIDE Cluster resource. The spectra-cluster-cli tool allows users to cluster their own MS/MS data on a normal computer.
Features
The spectra-cluster-cli is a command-line tool. It clusters MS/MS spectra in the MGF peak list format and writes the results to a .clustering result file. It currently supports:
- Clustering of MS/MS spectra (MGF format)
- Parallel processing of data
Note: A graphical user interface is also available at https://github.com/spectra-cluster/spectra-cluster-gui
Additionally, you can find a detailed example of how to prepare your data for clustering at http://spectra-cluster.github.io.
Installation
Requirements
To run the spectra-cluster-cli, Java needs to be installed on the computer.
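Whether Java is available can be checked before starting a clustering run. The following is a minimal sketch, assuming the Java launcher is called `java` and is expected on the PATH; the function name `java_available` is only illustrative and not part of the tool.

```python
import shutil
import subprocess


def java_available(executable="java"):
    """Return True if the given Java launcher is on the PATH and starts."""
    path = shutil.which(executable)
    if path is None:
        return False
    # "java -version" exits with status 0 on a working installation.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return result.returncode == 0


print("Java found" if java_available() else "Java not found, please install it first")
```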
Downloading spectra-cluster-cli
- Download the latest release from https://github.com/spectra-cluster/spectra-cluster-cli/releases
- Unpack the downloaded zip file. No further installation is required.
Usage
Preparing your data
Peak lists must be available in MGF format. In case your data is only available in a different format, use ProteoWizard's msconvert to convert it to MGF.
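As an illustration, the conversion can be scripted. This sketch assumes that `msconvert` is installed and on the PATH, and uses its `--mgf` and `-o` (output directory) options; the file and directory names are placeholders.

```python
import subprocess
from pathlib import Path


def msconvert_command(input_file, output_dir, executable="msconvert"):
    """Build the msconvert call that writes an MGF peak list into output_dir."""
    return [executable, str(input_file), "--mgf", "-o", str(output_dir)]


def convert_to_mgf(input_file, output_dir):
    """Run msconvert; raises CalledProcessError if the conversion fails."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(msconvert_command(input_file, output_dir), check=True)


# Only prints the command; call convert_to_mgf(...) to actually run it.
print(" ".join(msconvert_command("run1.mzML", "peaklists")))
```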
Optionally, identification data can be incorporated into the MGF files using the SEQ= tag. A tool to do this automatically is under active development and will be released shortly. The identification data does not influence the clustering process. If identification data is present in the input MGF files, it is automatically written to the resulting .clustering files.
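Until such a tool is available, the tagging can be sketched by hand. The example below is not the tool mentioned above. It assumes that each spectrum has a TITLE= line, that the identifications are available as a mapping from spectrum title to peptide sequence (a hypothetical input format), and that the SEQ= line may be placed directly after the TITLE= line.

```python
def add_seq_tags(mgf_text, sequences):
    """Return MGF text with a SEQ= line added after each matching TITLE= line.

    `sequences` maps a spectrum title to its identified peptide sequence.
    Spectra without an entry are left unchanged. Existing SEQ= lines are not
    detected, so run this only on untagged files.
    """
    out = []
    for line in mgf_text.splitlines():
        out.append(line)
        if line.startswith("TITLE="):
            title = line[len("TITLE="):].strip()
            if title in sequences:
                out.append("SEQ=" + sequences[title])
    return "\n".join(out) + "\n"


example = (
    "BEGIN IONS\n"
    "TITLE=spec1\n"
    "PEPMASS=500.25\n"
    "CHARGE=2+\n"
    "100.0 10\n"
    "END IONS\n"
)
print(add_seq_tags(example, {"spec1": "PEPTIDEK"}))
```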
Performing the clustering
Summary
C:\>cd C:\Downloads\spectra-cluster-cli
C:\Downloads\spectra-cluster-cli>java -jar spectra-cluster-cli-1.0-SNAPSHOT.jar -rounds 4 -threshold_end 0.99 -threshold_start 0.9999 -output_path C:\my_data\clustering_results.clustering C:\my_data\peaklists\*.mgf
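The same call can be assembled from a script, which is convenient when many directories have to be processed. This is a sketch under assumptions: the helper name `clustering_command` is made up, the default values simply repeat the ones from the command above, and the MGF files are expanded with `glob` because no shell expands the `*.mgf` pattern when the arguments are passed as a list.

```python
import glob
import subprocess


def clustering_command(jar, output_path, mgf_files,
                       rounds=4, threshold_start=0.9999, threshold_end=0.99):
    """Build the argument list for one spectra-cluster-cli run."""
    return [
        "java", "-jar", str(jar),
        "-rounds", str(rounds),
        "-threshold_end", str(threshold_end),
        "-threshold_start", str(threshold_start),
        "-output_path", str(output_path),
        *[str(f) for f in mgf_files],
    ]


def run_clustering(jar, output_path, mgf_pattern):
    """Expand the MGF pattern and launch the clustering."""
    mgf_files = sorted(glob.glob(mgf_pattern))
    if not mgf_files:
        raise FileNotFoundError("no MGF files match " + mgf_pattern)
    subprocess.run(clustering_command(jar, output_path, mgf_files), check=True)
```

A call such as `run_clustering("spectra-cluster-cli-1.0-SNAPSHOT.jar", r"C:\my_data\clustering_results.clustering", r"C:\my_data\peaklists\*.mgf")` would then correspond to the command shown above.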
Detailed instructions
- Open a command prompt. On Windows, this is done by opening the Run dialog, typing "cmd" and pressing Enter (on Windows 7, simply press the Windows key and immediately type "cmd").
- Navigate to the directory where you downloaded the spectra-cluster-cli tool.
- Launch the spectra-cluster-cli through the command java -jar spectra-cluster-cli-1.0-SNAPSHOT.jar (see above). Note: the spectra-cluster-cli-1.0-SNAPSHOT.jar part has to be adapted to the downloaded version.
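Because the jar file name changes with each release, a script can look the file up instead of hard-coding it. The sketch below assumes the jar follows the naming pattern `spectra-cluster-cli-*.jar` seen above; `find_cli_jar` is an illustrative helper, not part of the tool.

```python
from pathlib import Path


def find_cli_jar(directory):
    """Return the spectra-cluster-cli jar found in the given directory.

    If several versions are present, the last one in alphabetical order is
    returned, which is not necessarily the newest release.
    """
    jars = sorted(Path(directory).glob("spectra-cluster-cli-*.jar"))
    if not jars:
        raise FileNotFoundError("no spectra-cluster-cli jar in " + str(directory))
    return jars[-1]
```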
Additional options
 -binary_directory <arg>                path to the directory to
                                        (temporarily) store the binary
                                        files. By default a temporary
                                        directory is being created
 -cluster_binary_file <arg>             if this option is set, only the
                                        passed binary file will be
                                        clustered and the result written
                                        to the file specified in
                                        '-output_path' in the binary
                                        format
 -convert_cgf                           if this option is set the passed
                                        CGF file is converted into a
                                        .clustering file
 -fast_mode                             if this option is set the 'fast
                                        mode' is enabled. In this mode,
                                        the radical peak filtering used
                                        for the comparison function is
                                        already applied during spectrum
                                        conversion. Thereby, the
                                        clustering and consensus spectrum
                                        quality is slightly decreased but
                                        speed increases 2-3 fold.
 -fragment_tolerance                    fragment ion tolerance in m/z to
                                        use for fragment peak matching
 -help                                  print this message.
 -keep_binary_files                     if this option is set, the
                                        binary files are not deleted
                                        after clustering.
 -major_peak_jobs <arg>                 number of threads to use for
                                        major peak clustering.
 -merge_binary_results                  if this option is set, the passed
                                        binary results files are merged
                                        into a single .cgf file and
                                        written to '-output_path'
 -output_path <arg>                     path to the output file.
                                        The output file must not exist.
 -precursor_tolerance <arg>             precursor tolerance (clustering
                                        window size) in m/z used during
                                        matching.
 -reuse_binary_files                    if this option is set, the binary
                                        files found in the binary file
                                        directory will be used for
                                        clustering.
 -rounds <arg>                          number of clustering rounds to
                                        use.
 -threshold_end <arg>                   (lowest) final clustering
                                        threshold
 -threshold_start <arg>                 (highest) starting threshold
 -x_learn_cdf <output filename>         (Experimental option) Learn the
                                        used cumulative distribution
                                        function directly from the
                                        processed data. This is only
                                        recommended for high-resolution
                                        data. The result will be written
                                        to the defined file.
 -x_load_cdf <CDF filename>             (Experimental option) Loads the
                                        cumulative distribution function
                                        to use from the specified file.
                                        These files can be created using
                                        the x_learn_cdf parameter
 -x_min_comparisons <arg>               (Experimental option) Sets the
                                        minimum number of comparisons
                                        used to calculate the probability
                                        that incorrect spectra are
                                        clustered.
 -x_n_prefiltered_peaks <number peaks>  (Experimental option) Set the
                                        number of highest peaks that are
                                        kept per spectrum during loading.
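A few of these options can be combined in a small helper that also checks the documented requirement that the output file must not exist yet. This is only a sketch: the function `extra_options` and its keyword arguments are hypothetical, and only option names listed above are used.

```python
from pathlib import Path


def extra_options(output_path, fast_mode=False, keep_binary_files=False,
                  binary_directory=None, precursor_tolerance=None):
    """Build the option part of a spectra-cluster-cli call."""
    # The option list above states that the output file must not exist.
    if Path(output_path).exists():
        raise FileExistsError(str(output_path) + " already exists")
    args = ["-output_path", str(output_path)]
    if fast_mode:
        args.append("-fast_mode")
    if keep_binary_files:
        args.append("-keep_binary_files")
    if binary_directory is not None:
        args += ["-binary_directory", str(binary_directory)]
    if precursor_tolerance is not None:
        args += ["-precursor_tolerance", str(precursor_tolerance)]
    return args
```

The returned list is meant to be placed between the `java -jar` part of the command and the MGF file names.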
Analysing clustering results
We are currently working on analysis software for clustering results in the .clustering format.