Introduction

masstools is a comprehensive R package that provides a collection of useful tools for mass spectrometry data processing and analysis. As an integral component of the tidymass ecosystem, masstools serves as a foundational package that offers essential utilities for metabolomics and mass spectrometry workflows. The package encompasses a wide range of functionality designed to support various aspects of mass spectrometry data analysis, including tandem mass spectrometry (MS2) matching, dot product calculations, and other fundamental operations. masstools provides researchers and analysts with reliable, well-tested tools for handling complex mass spectrometry datasets.

The package includes specialized functions for: - Spectral similarity calculations and MS2 matching - Chemical formula processing - Mass spectrometry data file reading and manipulation - Data visualization tools for mass spectra

Developed primarily by Xiaotao Shen and contributors, masstools is designed to work seamlessly with modern R workflows and integrates well with other packages in the tidymass ecosystem, making it an essential tool for anyone working with mass spectrometry data in R.And masstools is a part of tidymass project.

Motivation for Inclusion in Bioconductor

The inclusion of masstools in Bioconductor is motivated by the need for a lightweight, dependency-conscious utility package that addresses common tasks in metabolomics and untargeted MS pipelines. It helps bridge gaps between core infrastructure packages (e.g., Spectra, MSnbase) and applied workflows by providing optimized, reusable building blocks for data parsing, similarity computation, and identifier conversion.

  • Spectra (core MS infrastructure): Spectra provides a formal class system for raw and processed spectra. masstools does not define new S4 classes, but instead offers functional utilities that can be applied directly to vectors, data frames, or lists of spectra—often after data is read via Spectra or MSnbase. For example, similarity metrics (get_dp()) or plotting (ms2_plot()) can complement Spectra workflows in downstream annotation tasks.

  • MSCoreUtils: This package offers generic low-level tools for MS data manipulation. masstools extends beyond these core operations by offering metabolomics-specific functions such as formula validation, adduct summation, and spectral similarity tailored to compound annotation tasks.

  • ProtGenerics: Used internally to align with Bioconductor’s S4 generic interface, though masstools primarily operates using S3 methods and functional programming paradigms. Future compatibility with Bioconductor S4 classes is planned and partially supported through its flexible function design.

  • MSnbase: While not directly dependent on MSnbase, users often pair MSnbase's data structure and raw data access with masstools utilities for data exploration and spectral annotation, e.g., using MSnExp with get_spectra_match_score() or ms2_plot().

  • []

Chemical formula operation

You can use masstools to do the chemical formula operation.

The sum_formula() function is designed to calculate the resulting chemical formula after applying a specific adduct transformation to a given base formula. This is particularly useful in mass spectrometry and metabolomics, where observed m/z values often correspond to the base compound plus/minus an adduct (e.g., [M+H]+, [M+Na]+, [M-H2O+H]+).

sum_formula(formula = "C9H11NO2", adduct = "M+H")
#> [1] "C9H12NO2"
sum_formula(formula = "C9H11NO2", adduct = "M+")
#> [1] "C9H11NO2"
sum_formula(formula = "C9H11NO2", adduct = "M+CH3COOH")
#> [1] "C11H15NO4"
sum_formula(formula = "C9H11", adduct = "M-H20")
#> [1] NA

The split_formula() function is a utility designed to parse a chemical formula string into a structured tabular form—specifically a data.frame containing the elemental composition (symbols and their counts). This function is a foundational step for any downstream chemical formula manipulation, such as computing molecular weights, summing with adducts (see sum_formula()), or formula validation.

split_formula(formula = "C9H11NO2")
#>   element number
#> 2       C      9
#> 3       H     11
#> 4       N      1
#> 5       O      2
split_formula(formula = "C2H4")
#>   element number
#> 2       C      2
#> 3       H      4

MS2 spectra operation

This section demonstrates a typical workflow for comparing and scoring similarity between two MS2 spectra, such as those from an experimental run and a reference library. It involves preprocessing, visualization, peak matching, and similarity scoring.

###remove the noisy peaks in one ms2 spectrum
exp.spectrum <- data.frame(mz = c(1:10, 1.0001), 
                           intensity = c(1:10, 0.1))
                           
###The `ms2_plot()` function enables side-by-side comparison of two MS2 spectra using either static or interactive plots. It allows control over visual styling (colors, font sizes) and comparison parameters (e.g., ppm tolerance).                          

ms2_plot(exp.spectrum)


###The `removeNoise()` function identifies and removes redundant or spurious peaks based on a ppm-based closeness criterion. For overlapping m/z values, it retains the more intense peak.

exp.spectrum2 = removeNoise(exp.spectrum)

ms2_plot(exp.spectrum, exp.spectrum2)


###match two spectra according to mz
exp.spectrum <- data.frame(mz = 1:10, intensity = 1:10)
lib.spectrum <- data.frame(mz = 1:10, intensity = 1:10)

###The `ms2Match()` function aligns peaks between experimental and library MS2 spectra based on m/z proximity within a ppm tolerance.

ms2Match(exp.spectrum, lib.spectrum)
#>    Lib.index Exp.index Lib.mz Lib.intensity Exp.mz Exp.intensity
#> 1          1         1      1             1      1             1
#> 2          2         2      2             2      2             2
#> 3          3         3      3             3      3             3
#> 4          4         4      4             4      4             4
#> 5          5         5      5             5      5             5
#> 6          6         6      6             6      6             6
#> 7          7         7      7             7      7             7
#> 8          8         8      8             8      8             8
#> 9          9         9      9             9      9             9
#> 10        10        10     10            10     10            10


## calculate the dot product of two matched intensity
## The `getDP()` function calculates a weighted dot product between two intensity vectors. This is a common metric for comparing shape similarity between two spectra.
getDP(exp.int = 1:10, lib.int = 1:10)
#> [1] 1
getDP(exp.int = 10:1, lib.int = 1:10)
#> [1] 0.379698

###matched two spectra and calculate dot product
exp.spectrum <- data.frame(mz = 1:10, intensity = 1:10)
lib.spectrum <- data.frame(mz = 1:10, intensity = 1:10)

## This function combines multiple aspects of similarity into a composite similarity score between an experimental and a library MS2 spectrum.

getSpectraMatchScore(exp.spectrum, lib.spectrum)
#> [1] 1

MS2 plot and MS2 matching plot. The ms2_plot() function is designed to visualize one or two MS2 spectra, enabling side-by-side comparison. This is useful in spectral library matching, experimental result inspection, or quality control.

Here, we simulate two identical MS2 spectra and demonstrate how the function can be used for both single-spectrum display and comparative visualization.

spectrum1 <- data.frame(
    mz = c(
        87.50874,
        94.85532,
        97.17808,
        97.25629,
        103.36186,
        106.96647,
        107.21461,
        111.00887,
        113.79269,
        118.70564
    ),
    intensity =
        c(
            8356.306,
            7654.128,
            9456.207,
            8837.188,
            8560.228,
            8746.359,
            8379.361,
            169741.797,
            7953.080,
            8378.066
        )
)
spectrum2 <- spectrum1
ms2_plot(spectrum1, spectrum2)

# ms2_plot(spectrum1, spectrum2, interactive_plot = TRUE)
ms2_plot(spectrum1)

# ms2_plot(spectrum1, interactive_plot = TRUE)

Match two feature tables

We can match two feature tables according to mz and retention time. In LC-MS-based metabolomics or proteomics, features (i.e., detected signals) are typically characterized by mass-to-charge ratio (m/z) and retention time (RT). Accurate matching of features between samples, instruments, or datasets is a fundamental requirement for annotation, alignment, and statistical comparison.

The mz_rt_match() function performs feature-wise matching between two datasets (data1 and data2) using both m/z and RT criteria, allowing users to define mass tolerance in ppm and RT tolerance either in absolute units or relative percentage.

data1 <- data.frame(mz = 1:10, rt = 1:10)
data2 <- data.frame(mz = 1:10, rt = 1:10)
mz_rt_match(data1, data2, mz.tol = 10)
#>    Index1 Index2 mz1 mz2 mz error rt1 rt2 rt error
#> 1       1      1   1   1        0   1   1        0
#> 2       2      2   2   2        0   2   2        0
#> 3       3      3   3   3        0   3   3        0
#> 4       4      4   4   4        0   4   4        0
#> 5       5      5   5   5        0   5   5        0
#> 6       6      6   6   6        0   6   6        0
#> 7       7      7   7   7        0   7   7        0
#> 8       8      8   8   8        0   8   8        0
#> 9       9      9   9   9        0   9   9        0
#> 10     10     10  10  10        0  10  10        0

Compound ID converter

Two web tools are used for compound compound convert.

1. cts.fiehnlab

cts.fiehnlab is http://cts.fiehnlab.ucdavis.edu/service/convert. It support a lot of databases.

We can use the trans_id_database() to get the databases that cts.fiehnlab. Caution: trans_id_database function temporarily can not be used.

database_name <- trans_id_database(server = "cts.fiehnlab")
head(database_name$From$From)
head(database_name$To$From)

We can see that it support a lot of (> 200) databases.

We can try the most common convert, from KEGG to HMDB.

The trans_ID() function facilitates the conversion of metabolite or chemical identifiers between widely used databases such as KEGG, PubChem, HMDB, ChEBI, and others. This is a critical task in metabolomics and cheminformatics, especially when integrating datasets from different sources, performing compound annotation, or preparing data for submission to external tools and repositories.

trans_ID(
  query = "C00001",
  from = "KEGG",
  to = "Human Metabolome Database",
  top = 1,
  server = "cts.fiehnlab"
)
#>     KEGG Human Metabolome Database
#> 1 C00001                        NA

Now, trans_ID doesn’t support verctor query. So you can use the purrr::map() to achive this.

c("C00001", "C00001", "C00001") %>%
  purrr::map(
    .f = function(x) {
      trans_ID(
        query = x,
        from = "KEGG",
        to = "Human Metabolome Database",
        top = 1,
        server = "cts.fiehnlab"
      )
    }
  ) %>%
  do.call(rbind, .) %>%
  as.data.frame()
#>     KEGG Human Metabolome Database
#> 1 C00001                        NA
#> 2 C00001                        NA
#> 3 C00001                        NA

2. chemspider

This is from https://www.chemspider.com/InChI.asmx.

We can use the trans_id_database() to get the databases that chemspider

database_name2 <- trans_id_database(server = "chemspider")
database_name2$From
database_name2$To

This is very useful if you want to get the inchikey, inchi or smiles for one compound. But this web only support “ChemSpider ID” (csid), so we need use cts.fiehnlab convert to csid first.

trans_ID(
  query = "C00001",
  from = "KEGG",
  to = "ChemSpider",
  top = 1,
  server = "cts.fiehnlab"
)
#>     KEGG ChemSpider
#> 1 C00001         NA
trans_ID(
  query = "140526",
  from = "csid",
  to = "mol",
  top = 1,
  server = "chemspider"
)
#> [1] NA

Get compound class based on classyfire

Refer this publication: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-016-0174-y

The get_compound_class() function provides a convenient interface to query the ClassyFire database (https://classyfire.wishartlab.com), retrieving the chemical taxonomy and classification of a compound given its InChIKey. ClassyFire organizes chemical compounds into a structured taxonomy (Kingdom → Superclass → Class → Subclass…), and is widely used in cheminformatics, metabolomics, and compound annotation.

result <- 
get_compound_class(
  inchikey = "QZDWODWEESGPLC-UHFFFAOYSA-N",
  server = "http://classyfire.wishartlab.com/entities/",
  sleep = 5
)
result
#> Kingdom : Organic compounds
#> └─Superclass : Organoheterocyclic compounds
#>   └─Class : Pyridines and derivatives

Other tools

Rename one vector with duplicated items.

In many data processing workflows—especially during column naming, feature annotation, or ID assignment—duplicate names can lead to ambiguity, overwriting, or unexpected behavior in downstream analyses.

The name_duplicated() function provides a simple and robust mechanism to disambiguate repeated values in a character vector by appending sequential suffixes (e.g., _1, _2, …).

name_duplicated(c("a", "a", "b", "c", "a", "b", "c", "a"))
#> [1] "a_1" "a_2" "b_1" "c_1" "a_3" "b_2" "c_2" "a_4"
name_duplicated(c(rep(1, 5), 2))
#> [1] "1_1" "1_2" "1_3" "1_4" "1_5" "2"
name_duplicated(1:5)
#> [1] 1 2 3 4 5

Set working directory in Windows

Copy the file path in File explorer in Windows.

Then type in R:

setwd_win()

Then paste the file path and type Enter.

Check the operate system

The get_os() function is a lightweight utility that determines the operating system (OS) on which the current R session is running. This is often necessary when designing scripts or packages that involve platform-dependent behavior.

get_os()
#>   sysname 
#> "windows"

Check version of masstools

Output logo and version of masstools.

masstools_logo()
#>                        _______          _     
#>                       |__   __|        | |    
#>   _ __ ___   __ _ ___ ___| | ___   ___ | |___ 
#>  | '_ ` _ \ / _` / __/ __| |/ _ \ / _ \| / __|
#>  | | | | | | (_| \__ \__ \ | (_) | (_) | \__ \
#>  |_| |_| |_|\__,_|___/___/_|\___/ \___/|_|___/
#>                                               
#> 

Check conflicts of masstools

The masstools_conflicts() function is a diagnostic tool designed to identify function name conflicts between functions exported by the masstools package and functions from other packages currently attached in the R session.

Function name conflicts are common in the R ecosystem due to the way the search path is managed. When multiple packages export functions with the same name (e.g., filter() from both stats and dplyr), the one most recently attached typically “masks” the others. This can lead to confusing or unintended behavior.

masstools_conflicts()
#> ── Conflicts ────────────────────────────────────────── masstools_conflicts() ──
#>  methods::body<-()    masks base::body<-()
#>  tidyr::extract()     masks magrittr::extract()
#>  dplyr::filter()      masks stats::filter()
#>  methods::kronecker() masks base::kronecker()
#>  dplyr::lag()         masks stats::lag()
#>  purrr::set_names()   masks magrittr::set_names()

List all pacakges in masstools

The masstools_packages() function provides a programmatic way to retrieve the list of R packages imported by masstools, as declared in its DESCRIPTION file under the Imports field. This utility is particularly helpful in diagnostic functions, dependency management, and package introspection.

masstools_packages()
#>  [1] "dplyr"        "jsonlite"     "remotes"      "magrittr"     "tibble"      
#>  [6] "tidyr"        "stringr"      "methods"      "crayon"       "cli"         
#> [11] "purrr"        "pbapply"      "httr"         "rvest"        "xml2"        
#> [16] "stats"        "utils"        "MSnbase"      "ProtGenerics" "lifecycle"   
#> [21] "ggplot2"      "masstools"

Session information

sessionInfo()
#> R version 4.4.3 (2025-02-28 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.4  forcats_1.0.0    stringr_1.5.1    purrr_1.0.4     
#>  [5] readr_2.1.5      tidyr_1.3.1      tibble_3.3.0     ggplot2_3.5.2   
#>  [9] tidyverse_2.0.0  dplyr_1.1.4      magrittr_2.0.3   masstools_0.99.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] pbapply_1.7-2               remotes_2.5.0              
#>   [3] rlang_1.1.6                 clue_0.3-66                
#>   [5] matrixStats_1.5.0           compiler_4.4.3             
#>   [7] vctrs_0.6.5                 reshape2_1.4.4             
#>   [9] rvest_1.0.4                 ProtGenerics_1.38.0        
#>  [11] pkgconfig_2.0.3             crayon_1.5.3               
#>  [13] fastmap_1.2.0               XVector_0.46.0             
#>  [15] labeling_0.4.3              rmarkdown_2.29             
#>  [17] tzdb_0.5.0                  UCSC.utils_1.2.0           
#>  [19] preprocessCore_1.68.0       xfun_0.52                  
#>  [21] MultiAssayExperiment_1.32.0 zlibbioc_1.52.0            
#>  [23] cachem_1.1.0                GenomeInfoDb_1.42.3        
#>  [25] jsonlite_2.0.0              DelayedArray_0.32.0        
#>  [27] BiocParallel_1.36.0         parallel_4.4.3             
#>  [29] cluster_2.1.8               R6_2.6.1                   
#>  [31] bslib_0.9.0                 stringi_1.8.7              
#>  [33] RColorBrewer_1.1-3          limma_3.62.2               
#>  [35] GenomicRanges_1.58.0        jquerylib_0.1.4            
#>  [37] Rcpp_1.1.0                  SummarizedExperiment_1.36.0
#>  [39] iterators_1.0.14            knitr_1.50                 
#>  [41] IRanges_2.40.1              timechange_0.3.0           
#>  [43] Matrix_1.7-2                igraph_2.1.4               
#>  [45] tidyselect_1.2.1            rstudioapi_0.17.1          
#>  [47] abind_1.4-8                 yaml_2.3.10                
#>  [49] doParallel_1.0.17           codetools_0.2-20           
#>  [51] affy_1.84.0                 curl_6.4.0                 
#>  [53] lattice_0.22-6              plyr_1.8.9                 
#>  [55] withr_3.0.2                 Biobase_2.66.0             
#>  [57] evaluate_1.0.4              desc_1.4.3                 
#>  [59] xml2_1.3.8                  pillar_1.11.0              
#>  [61] affyio_1.76.0               BiocManager_1.30.26        
#>  [63] MatrixGenerics_1.18.1       foreach_1.5.2              
#>  [65] stats4_4.4.3                MSnbase_2.32.0             
#>  [67] MALDIquant_1.22.3           ncdf4_1.24                 
#>  [69] generics_0.1.4              hms_1.1.3                  
#>  [71] S4Vectors_0.44.0            scales_1.4.0               
#>  [73] glue_1.8.0                  lazyeval_0.2.2             
#>  [75] tools_4.4.3                 mzID_1.44.0                
#>  [77] QFeatures_1.16.0            vsn_3.74.0                 
#>  [79] mzR_2.40.0                  fs_1.6.6                   
#>  [81] XML_3.99-0.18               grid_4.4.3                 
#>  [83] impute_1.80.0               MsCoreUtils_1.18.0         
#>  [85] colorspace_2.1-1            GenomeInfoDbData_1.2.13    
#>  [87] PSMatch_1.10.0              cli_3.6.5                  
#>  [89] S4Arrays_1.6.0              AnnotationFilter_1.30.0    
#>  [91] pcaMethods_1.98.0           gtable_0.3.6               
#>  [93] selectr_0.4-2               sass_0.4.10                
#>  [95] digest_0.6.37               BiocGenerics_0.52.0        
#>  [97] SparseArray_1.6.2           htmlwidgets_1.6.4          
#>  [99] farver_2.1.2                htmltools_0.5.8.1          
#> [101] pkgdown_2.1.3               lifecycle_1.0.4            
#> [103] httr_1.4.7                  statmod_1.5.0              
#> [105] MASS_7.3-65