VSOLassoBag is a variable-selection oriented Lasso bagging algorithm for biomarker development in omics-based translational research. It provides a bagging Lasso framework for selecting significant variables from multiple models. A main application of this package is to screen for a limited number of variables that are less dependent on the training data set. The package was initially developed to aggregate the variables that Lasso selects across bootstrapped sample sets: the variables selected most frequently across the resulting models are considered stable variables for discriminating between sample sets. In real data sets, however, it is usually difficult to determine the frequency cut-off point. The package therefore provides methods to help determine the cut-off point for variable selection, namely (1) curve elbow point detection and (2) a parametric statistical test. The source code and documentation are freely available on GitHub (https://github.com/likelet/VSOLassoBag).
If you find this tool useful, please cite:
Liang J, Wang C, Zhang D, et al. VSOLassoBag: A variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research [published online ahead of print, 2023 Jan 3]. J Genet Genomics. 2023;S1673-8527(22)00285-5. doi:10.1016/j.jgg.2022.12.005
VSOLassoBag runs on Windows and on most POSIX systems. To install the latest version of this package, start R and enter:
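The original command is not preserved in this text. As a hedged sketch, one way to install the package is directly from the GitHub repository named above, using the remotes package. This route is an assumption rather than the documented one, and Bioconductor dependencies such as SummarizedExperiment may need to be installed separately.

```r
# Assumption: install from the GitHub repository mentioned in this document.
# The installation command originally shown here is not available.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("likelet/VSOLassoBag")
```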
To perform an analysis with the VSOLassoBag function, you need to provide:
ExpressionData is an object constructed by SummarizedExperiment, containing the independent variables (e.g., an expression matrix) and the dependent/outcome variable(s). Please see the SummarizedExperiment documentation for how to construct SummarizedExperiment objects.
A character string indicating the type of the dependent variable(s), passed to the parameter a.family.
Note: a.family is determined entirely by the type of the dependent/outcome variable(s).
You can also tune other parameters to better balance performance against the resources required for the analysis. Key parameters include bootN, the number of bagging iterations, and bagFreq.sigMethod, the method used to decide the cut-off point.
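To make the role of bootN concrete, the following is a minimal sketch of Lasso bagging written directly with glmnet. It is not the package's internal implementation: the use of cv.glmnet with lambda.min, the simulated data, and all variable names are assumptions made for illustration. Note that glmnet expects samples as rows, whereas the VSOLassoBag input stores variables as rows.

```r
library(glmnet)

set.seed(1)
n <- 60   # samples
p <- 30   # candidate variables
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("X_", seq_len(p))))
# Only X_1 and X_2 truly drive the outcome in this toy data set
y <- 2 * x[, 1] - 2 * x[, 2] + rnorm(n)

bootN <- 20  # number of bootstrap (bagging) iterations
freq <- setNames(integer(p), colnames(x))

for (b in seq_len(bootN)) {
  idx <- sample(n, replace = TRUE)                 # bootstrap resample
  cvfit <- cv.glmnet(x[idx, ], y[idx], family = "gaussian")
  beta <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]  # drop intercept
  sel <- names(beta)[beta != 0]                    # variables kept by Lasso
  freq[sel] <- freq[sel] + 1L                      # update selection counts
}

# Observed selection frequency, most frequently selected first
head(sort(freq, decreasing = TRUE))
```

A larger bootN gives a smoother frequency curve at the cost of proportionally more model fits.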
Independent variables: a matrix-like object with variables as rows and observations as columns; all entries should be numeric. It can be obtained with the SummarizedExperiment::assay function.
Example independent variables matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
X_1 -1.0059043 -0.5831154 0.5188620 0.08714558 0.45080348 0.1149953
X_2 1.2907372 -0.1005391 0.3385391 0.28154759 -0.14453034 1.0213830
X_3 -0.5321077 1.9809263 0.5795232 0.98104200 -0.02066287 -1.0550853
X_4 -0.9984327 1.4455361 1.3130022 0.47038414 -1.23164279 2.3385465
X_5 -0.2032319 -0.6088388 -1.3633668 -0.22835534 2.33349367 0.2281900
X_6 2.4383744 0.7443007 1.7134854 0.62076548 0.07363745 -0.0915762
Dependent/outcome variable(s): stored as sample information, with one row per sample in the independent variables matrix. They can be retrieved with the SummarizedExperiment::colData function.
Example out.mat vector:
[1] 0.1580122 0.4223877 4.3253401 3.9309707 1.2470942 -0.4556825
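The two inputs described above can be bundled into a single object. The sketch below builds a small SummarizedExperiment from a simulated matrix and outcome vector. The assay name expr, the sample names S1 to S4, and the outcome column name y are hypothetical choices for illustration.

```r
library(SummarizedExperiment)

set.seed(1)
n_vars <- 6
n_samples <- 4

# Independent variables: variables as rows, samples as columns
x <- matrix(rnorm(n_vars * n_samples), nrow = n_vars,
            dimnames = list(paste0("X_", seq_len(n_vars)),
                            paste0("S", seq_len(n_samples))))

# Outcome: one value per sample (one per column of x)
y <- rnorm(n_samples)

se <- SummarizedExperiment(
  assays  = list(expr = x),
  colData = DataFrame(y = y, row.names = colnames(x))
)

dim(assay(se))   # variables x samples
colData(se)$y    # outcome vector, one entry per sample
```

The column name chosen in colData is what would be passed as outcomevariable in the VSOLassoBag calls shown in this document.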
a.family: a character string determined by the data type of out.mat; one of binomial, gaussian, cox, multinomial, mgaussian, or poisson.
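As a rough guide to how the outcome type maps to a.family, the helper below encodes the usual glmnet family conventions. The function guess_family is hypothetical and not part of VSOLassoBag, and its rules are heuristics. For example, it treats any integer-typed outcome as a count, which would be wrong for a 0/1 coded class label.

```r
# Hypothetical helper: suggest a family name from the outcome's data type.
guess_family <- function(out) {
  if (inherits(out, "Surv")) {
    return("cox")                       # survival::Surv object
  }
  if (is.matrix(out) && is.numeric(out) && ncol(out) > 1) {
    return("mgaussian")                 # several continuous outcomes
  }
  if (is.factor(out) || is.character(out) || is.logical(out)) {
    n_classes <- length(unique(out[!is.na(out)]))
    return(if (n_classes == 2) "binomial" else "multinomial")
  }
  if (is.integer(out)) {
    return("poisson")                   # integer storage treated as counts
  }
  if (is.numeric(out)) {
    return("gaussian")                  # continuous outcome
  }
  stop("Unsupported outcome type")
}

guess_family(c(0.158, 0.422, 4.325))        # continuous
guess_family(factor(c("case", "control")))  # two classes
```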
A list with:
results: a data frame containing the variables with selection frequency >= 1. If bagFreq.sigMethod = "PST", the P.value and the P.adjust of each variable are provided. If bagFreq.sigMethod = "CEP", or when the elbow point indicators are used, elbow point(s) are marked with * and a threshold is assigned to each variable.
permutations: the permutation test results.
model: the regression model built on the LassoBag results.
For tutorial purposes, we use two examples with different cut-off point decision methods to show how to interpret the results. The data are simulated from a Gaussian model and can be loaded with data(ExpressionData).
“CEP” (i.e. “Curve Elbow Point Detection”) is the default and recommended method for cut-off point decision. This method assumes that a sharp decrease in the observed frequency separates important features from unimportant ones, and it detects the elbow point(s) on the observed frequency curve accordingly. Features with an observed frequency higher than the elbow point are then inferred to be important.
Note: More than one elbow point may be detected on the curve when a loose threshold is used. It is therefore recommended to start with a stricter threshold (a larger kneedle.S); the algorithm will automatically loosen the S parameter if no elbow point can be found.
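The idea of cutting the frequency curve at a sharp decrease can be sketched with a much simpler rule than the Kneedle-based detection the package controls through kneedle.S. The example below is a simplified stand-in, not the package's algorithm: it sorts the frequencies and cuts at the single largest drop. The frequencies are the ones reported in the example output in this document.

```r
# Observed selection frequencies out of 100 bootstrap runs
freq <- c(X_3 = 100, X_7 = 100, X_2 = 99, X_10 = 98, X_6 = 97, X_9 = 88,
          X_5 = 87, X_8 = 82, X_4 = 62, X_1 = 45, X_108 = 28, X_468 = 25,
          X_169 = 21, X_404 = 19)

freq <- sort(freq, decreasing = TRUE)

# Drop in frequency from each variable to the next one on the curve
drops <- -diff(freq)

# Simplified rule: cut after the variable that precedes the largest drop
cut_after <- unname(which.max(drops))
selected <- names(freq)[seq_len(cut_after)]

cut_after
selected
```

A single largest-drop rule returns exactly one cut point, whereas the note above explains that the package's detection can return several elbow points depending on the threshold.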
library(VSOLassoBag)
data(ExpressionData)
set.seed(19084)
VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y", bootN = 100, a.family = "gaussian", bagFreq.sigMethod = "PST", do.plot = FALSE)
Results of important variables (this call sets bagFreq.sigMethod = "PST", so P.value and P.adjust are reported):
Thu Nov 21 04:51:04 2024 -- INIT process completed, start analyzing
Thu Nov 21 04:51:04 2024 -- start calculating observed frequency
2024-11-21 04:51:56.776522--Bagging finished ...
Thu Nov 21 04:51:57 2024 -- Using Non-permutation method to determine a cutoff point
Using Parametric Statistical Test...
Thu Nov 21 04:51:57 2024 -- Done
variable Frequency P.value P.adjust
3 X_3 100 0.000000e+00 0.000000e+00
7 X_7 100 0.000000e+00 0.000000e+00
2 X_2 99 3.872023e-116 3.071805e-114
10 X_10 98 5.138189e-113 3.057222e-111
6 X_6 97 3.375159e-110 1.606576e-108
9 X_9 88 1.233395e-89 4.892466e-88
5 X_5 87 1.214149e-87 4.128108e-86
8 X_8 82 3.167383e-78 9.422964e-77
4 X_4 62 4.736353e-47 1.252502e-45
1 X_1 45 1.290263e-26 3.070827e-25
59 X_108 28 2.930143e-11 6.339763e-10
226 X_468 25 3.983529e-09 7.900667e-08
86 X_169 21 1.372018e-06 2.511849e-05
193 X_404 19 1.836787e-05 3.122538e-04
CEP Observed Frequency Curve (generated by do.plot = TRUE):
“PST” (i.e. “Parametric Statistical Test”) is an alternative method for cut-off point decision that is as fast and memory-efficient as “CEP”. It assumes that the expected selection frequency of all variables follows a binomial distribution, so we can first model this theoretical background distribution and then obtain the statistical significance (p-value) of each variable.
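The binomial reasoning can be illustrated in base R. This is a sketch under stated assumptions, not the package's exact procedure: the background selection probability p0 is estimated here as the average selection rate over all variables, and p-values are adjusted with the Benjamini-Hochberg method. Neither choice is confirmed by this document. The background frequencies are simulated.

```r
set.seed(42)
bootN <- 100

# Simulated frequencies: 10 frequently selected variables plus 490
# background variables that are rarely selected
signal <- c(100, 100, 99, 98, 97, 88, 87, 82, 62, 45)
background <- rbinom(490, size = bootN, prob = 0.03)
freq <- c(signal, background)
names(freq) <- paste0("V", seq_along(freq))

# Assumption: background probability = average selection rate
p0 <- mean(freq) / bootN

# One-sided test: P(X >= observed frequency) under Binomial(bootN, p0)
p_value <- pbinom(freq - 1, size = bootN, prob = p0, lower.tail = FALSE)

# Assumption: Benjamini-Hochberg correction for multiple testing
p_adjust <- p.adjust(p_value, method = "BH")

res <- data.frame(variable = names(freq), Frequency = freq,
                  P.value = p_value, P.adjust = p_adjust)
res <- res[order(res$P.value), ]
head(res, 12)
```

Variables whose adjusted p-value falls below a chosen significance level would then be retained as important.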
library(VSOLassoBag)
data(ExpressionData)
set.seed(19084)
VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y", bootN = 100, a.family = "gaussian", bagFreq.sigMethod = "CEP", do.plot = TRUE)
Results of the CEP method (this call sets bagFreq.sigMethod = "CEP", so elbow points and thresholds are reported):
Thu Nov 21 04:51:57 2024 -- INIT process completed, start analyzing
Thu Nov 21 04:51:57 2024 -- start calculating observed frequency
2024-11-21 04:52:49.284108--Bagging finished ...
Thu Nov 21 04:52:49 2024 -- Using Non-permutation method to determine a cutoff point
Detecting Elbow Points on the Observed Frequency Curve...
Using S = 10 for elbow point dection.
Thu Nov 21 04:52:49 2024 -- Done
variable Frequency elbow.point Diff Thres
3 X_3 100 0 0.000000
7 X_7 100 0 0.000000
2 X_2 99 1 0.000000
10 X_10 98 1 0.000000
6 X_6 97 1 0.000000
9 X_9 88 * 9 4.822785
5 X_5 87 1 4.822785
8 X_8 82 5 4.822785
4 X_4 62 * 20 15.822785
1 X_1 45 17 15.822785
59 X_108 28 17 15.822785
226 X_468 25 3 15.822785
86 X_169 21 4 15.822785
193 X_404 19 2 15.822785
PST Observed Frequency Distribution (generated by do.plot = TRUE):
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] VSOLassoBag_0.99.1 rmdformats_1.0.4
loaded via a namespace (and not attached):
[1] generics_0.1.3 sass_0.4.9
[3] utf8_1.2.4 SparseArray_1.7.2
[5] shape_1.4.6.1 lattice_0.22-6
[7] digest_0.6.37 magrittr_2.0.3
[9] evaluate_1.0.1 grid_4.4.2
[11] bookdown_0.41 iterators_1.0.14
[13] fastmap_1.2.0 foreach_1.5.2
[15] jsonlite_1.8.9 glmnet_4.1-8
[17] Matrix_1.7-1 GenomeInfoDb_1.43.1
[19] survival_3.7-0 httr_1.4.7
[21] fansi_1.0.6 UCSC.utils_1.3.0
[23] scales_1.3.0 pbapply_1.7-2
[25] codetools_0.2-20 jquerylib_0.1.4
[27] abind_1.4-8 cli_3.6.3
[29] crayon_1.5.3 rlang_1.1.4
[31] XVector_0.47.0 Biobase_2.67.0
[33] munsell_0.5.1 splines_4.4.2
[35] DelayedArray_0.33.2 POT_1.1-11
[37] cachem_1.1.0 yaml_2.3.10
[39] S4Arrays_1.7.1 tools_4.4.2
[41] parallel_4.4.2 colorspace_2.1-1
[43] ggplot2_3.5.1 GenomeInfoDbData_1.2.13
[45] SummarizedExperiment_1.37.0 BiocGenerics_0.53.3
[47] buildtools_1.0.0 vctrs_0.6.5
[49] R6_2.5.1 stats4_4.4.2
[51] matrixStats_1.4.1 lifecycle_1.0.4
[53] zlibbioc_1.52.0 IRanges_2.41.1
[55] S4Vectors_0.45.2 pkgconfig_2.0.3
[57] pillar_1.9.0 bslib_0.8.0
[59] gtable_0.3.6 glue_1.8.0
[61] Rcpp_1.0.13-1 GenomicRanges_1.59.1
[63] xfun_0.49 tibble_3.2.1
[65] MatrixGenerics_1.19.0 sys_3.4.3
[67] knitr_1.49 htmltools_0.5.8.1
[69] rmarkdown_2.29 maketools_1.3.1
[71] compiler_4.4.2
© Copyright 2018-2022, Center of Bioinformatics, Sun Yat-sen University Cancer Center. Revision 322baf5b.