Title: | Principal Components Lasso |
---|---|
Description: | A method for fitting the entire regularization path of the principal components lasso for linear and logistic regression models. The algorithm uses cyclic coordinate descent in a path-wise fashion. See URL below for more information on the algorithm. See Tay, K., Friedman, J., Tibshirani, R. (2018) 'Principal component-guided sparse regression' <arXiv:1810.04651>. |
Authors: | Jerome Friedman, Kenneth Tay, Robert Tibshirani |
Maintainer: | Rob Tibshirani <[email protected]> |
License: | GPL-3 |
Version: | 1.2 |
Built: | 2025-02-25 03:45:12 UTC |
Source: | https://github.com/cran/pcLasso |
Does k-fold cross-validation for pcLasso.
cv.pcLasso(x, y, w = rep(1, length(y)), ratio = NULL, theta = NULL, groups = vector("list", 1), family = "gaussian", nfolds = 10, foldid = NULL, keep = FALSE, verbose = FALSE, ...)
x |
|
y |
|
w |
Observation weights. Default is 1 for each observation. |
ratio |
Ratio of shrinkage between the second and first principal components
in the absence of the |
theta |
Multiplier for the quadratic penalty: a non-negative real number.
|
groups |
A list describing which features belong in each group. The
length of the list should be equal to the number of groups, with
|
family |
Response type. Either |
nfolds |
Number of folds for CV (default is 10). Although |
foldid |
An optional vector of values between 1 and |
keep |
If |
verbose |
Print out progress along the way? Default is |
... |
Other arguments that can be passed to |
This function runs pcLasso nfolds+1 times: the first to get the lambda sequence, and the remaining nfolds times to compute the fit with each of the folds omitted. The error is accumulated, and the mean error and standard deviation over the folds are computed. Note that cv.pcLasso does NOT search for values of theta or ratio. A specific value of theta or ratio should be supplied.
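The fold-by-fold procedure described above can be sketched in base R. This is an illustration only, not the package's code: a closed-form ridge fit (the hypothetical helper ridge_fit) stands in for pcLasso, the lambda values are made up, and the standard error formula is an assumption.

```r
# Sketch of k-fold cross-validation over a fixed lambda sequence.
# ridge_fit is a hypothetical stand-in for pcLasso, used only for illustration.
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

nfolds <- 5
foldid <- sample(rep(seq(nfolds), length.out = n))  # fold label per observation
lambda <- c(10, 1, 0.1)                             # made-up lambda sequence

# Ridge coefficients for one lambda; the intercept is recovered by centering
ridge_fit <- function(x, y, lam) {
  mx <- colMeans(x); my <- mean(y)
  xc <- sweep(x, 2, mx)
  beta <- solve(crossprod(xc) + lam * diag(ncol(x)), crossprod(xc, y - my))
  list(a0 = my - sum(mx * beta), beta = beta)
}

# Fit with each fold omitted and record the held-out mean squared error
fold_err <- matrix(NA_real_, nfolds, length(lambda))
for (k in seq(nfolds)) {
  test <- foldid == k
  for (j in seq_along(lambda)) {
    fit <- ridge_fit(x[!test, , drop = FALSE], y[!test], lambda[j])
    pred <- fit$a0 + x[test, , drop = FALSE] %*% fit$beta
    fold_err[k, j] <- mean((y[test] - pred)^2)
  }
}

cvm  <- colMeans(fold_err)                     # mean CV error per lambda
cvse <- apply(fold_err, 2, sd) / sqrt(nfolds)  # assumed standard error formula
```

Fixing the lambda sequence before the fold loop is what makes the per-fold errors comparable column by column.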
An object of class "cv.pcLasso", which is a list with the ingredients of the cross-validation fit.
glmfit |
A fitted |
theta |
Value of |
lambda |
The values of |
nzero |
If the groups overlap, the number of non-zero coefficients
in the model |
orignzero |
If the groups are overlapping, this is the number of
non-zero coefficients in the model |
fit.preval |
If |
cvm |
The mean cross-validated error: a vector of length
|
cvse |
Estimate of standard error of |
cvlo |
Lower curve = |
cvup |
Upper curve = |
lambda.min |
The value of |
lambda.1se |
The largest value of |
foldid |
If |
name |
Name of error measurement used for CV. |
call |
The call that produced this object. |
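As a rough illustration of how the summary components relate to one another, the sketch below derives the error band and the two reported lambda values from cvm and cvse. The numbers are invented, and the one-standard-error rule is the glmnet convention, assumed here to carry over rather than taken from the package source.

```r
# Invented CV summaries over a decreasing lambda sequence
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cvm    <- c(1.40, 1.10, 1.00, 1.02, 1.08)
cvse   <- c(0.10, 0.12, 0.11, 0.10, 0.10)

# Band around the CV curve (assumed: mean error minus/plus one standard error)
cvlo <- cvm - cvse
cvup <- cvm + cvse

# lambda with the smallest mean CV error
i_min      <- which.min(cvm)
lambda.min <- lambda[i_min]

# Largest lambda whose error is within one standard error of the minimum
lambda.1se <- max(lambda[cvm <= cvup[i_min]])
```

With these inputs lambda.min is 0.25 and lambda.1se is 0.5: the more heavily regularized model is preferred when its error is statistically indistinguishable from the best one.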
pcLasso and plot.cv.pcLasso.
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)
    groups <- vector("list", 4)
    for (k in 1:4) {
        groups[[k]] <- 5 * (k-1) + 1:5
    }
    cvfit1 <- cv.pcLasso(x, y, groups = groups, ratio = 0.8)

    # change no. of CV folds
    cvfit2 <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, nfolds = 5)

    # specify which observations are in each fold
    foldid <- sample(rep(seq(5), length = length(y)))
    cvfit3 <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, foldid = foldid)

    # keep=TRUE to have pre-validated fits and foldid returned
    cvfit4 <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, keep = TRUE)
Fit a model using the principal components lasso for an entire regularization path indexed by the parameter lambda. Fits linear and logistic regression models.
pcLasso(x, y, w = rep(1, length(y)), family = c("gaussian", "binomial"), ratio = NULL, theta = NULL, groups = vector("list", 1), lambda.min.ratio = ifelse(nrow(x) < ncol(x), 0.01, 1e-04), nlam = 100, lambda = NULL, standardize = F, SVD_info = NULL, nv = NULL, propack = T, thr = 1e-04, maxit = 1e+05, verbose = FALSE)
x |
Input matrix, of dimension |
y |
Response variable. Quantitative for |
w |
Observation weights. Default is 1 for each observation. |
family |
Response type. Either |
ratio |
Ratio of shrinkage between the second and first principal components
in the absence of the |
theta |
Multiplier for the quadratic penalty: a non-negative real number.
|
groups |
A list describing which features belong in each group. The
length of the list should be equal to the number of groups, with
|
lambda.min.ratio |
Smallest value for |
nlam |
Number of |
lambda |
A user supplied |
standardize |
If |
SVD_info |
A list containing SVD information. Usually this should not
be specified by the user: the function will compute it on its own by default.
Since the initial SVD of |
nv |
Number of singular vectors to use in the singular value decompositions. If not specified, the full SVD is used. |
propack |
If |
thr |
Convergence threshold for the coordinate descent algorithm. Default
is |
maxit |
Maximum number of passes over the data for all lambda values;
default is |
verbose |
Print out progress along the way? Default is |
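The objective function for "gaussian" is
    1/2 RSS/nobs + lambda*||beta||_1 + theta/2 * sum of the quadratic penalty for group k,
where the sum is over the feature groups 1, ..., K. The objective function for "binomial" is
    -loglik/nobs + lambda*||beta||_1 + theta/2 * sum of the quadratic penalty for group k.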
pcLasso can handle overlapping groups. In this case, the original x matrix is expanded to a nobs x (p_1+...+p_K) matrix (where p_k is the number of features in group k) such that columns p_1+...+p_{k-1}+1 to p_1+...+p_k represent the feature matrix for group k. pcLasso returns the model coefficients for both the expanded feature space and the original feature space.
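The column layout of the expanded matrix can be illustrated in base R. This is a sketch of the indexing only, under the assumption that expansion amounts to stacking each group's columns side by side; it does not call the package, and the names x_expanded, starts and ends are hypothetical.

```r
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)

# Group 1 overlaps group 2 on features 6 to 8
groups <- list(1:8, 6:10, 11:15, 16:20)

# Stack the columns of each group side by side: nobs x (p_1 + ... + p_K)
x_expanded <- x[, unlist(groups), drop = FALSE]

# Column range occupied by each group in the expanded matrix
p_k    <- lengths(groups)
ends   <- cumsum(p_k)        # p_1 + ... + p_k
starts <- ends - p_k + 1     # p_1 + ... + p_{k-1} + 1

# The block for group k reproduces that group's features from the original x
k <- 2
same_block <- identical(x_expanded[, starts[k]:ends[k]], x[, groups[[k]]])
```

Here the expanded matrix has 23 columns, because features 6 to 8 appear once for group 1 and once for group 2.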
One needs to specify the strength of the quadratic penalty either by specifying ratio, which is the ratio of shrinkage between the second and first principal components in the absence of the l1 (lasso) penalty, or by specifying the multiplier theta. ratio is unitless and is more convenient.
pcLasso always mean centers the columns of the x matrix. If standardize=TRUE, pcLasso will also scale the columns to have standard deviation 1. In all cases, the beta coefficients returned are for the original x values (i.e. uncentered and unscaled).
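What it means for coefficients to be reported on the original scale can be checked with a small base R sketch. Ordinary least squares stands in for the penalized fit here, and sd() is an assumed choice of scaling; the package may compute the column scale differently.

```r
set.seed(1)
x <- matrix(rnorm(100 * 3, mean = 5, sd = 2), 100, 3)
y <- drop(x %*% c(1, -2, 0.5)) + rnorm(100)

# Center and scale the columns (assumption: sample standard deviation)
mx <- colMeans(x)
sx <- apply(x, 2, sd)
xs <- scale(x, center = mx, scale = sx)

# Fit on the standardized features; lm is only a stand-in for pcLasso
fit_s <- lm(y ~ xs)
b_s   <- coef(fit_s)[-1]

# Map the coefficients back to the original x values
beta <- b_s / sx                          # undo the scaling
a0   <- coef(fit_s)[1] - sum(mx * beta)   # undo the centering

# The back-transformed fit matches a fit on the raw features
fit_o <- lm(y ~ x)
matches <- isTRUE(all.equal(unname(c(a0, beta)), unname(coef(fit_o))))
```

For a penalized fit the two routes no longer give identical coefficients, since the penalty depends on the scale of the features, but the back-transformation step is the same.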
An object of class "pcLasso".
beta |
If the groups overlap, a |
origbeta |
If the groups overlap, a |
a0 |
Intercept sequence of length |
lambda |
The actual sequence of |
nzero |
If the groups overlap, the number of non-zero coefficients in the
expanded feature space for each value of |
orignzero |
If the groups are overlapping, this is the number of
non-zero coefficients in the original feature space of the model for each
|
jerr |
Error flag for warnings and errors (largely for internal debugging). |
theta |
Value of |
origgroups |
If the |
groups |
If the groups are not overlapping, this has the same
value as |
SVD_info |
A list containing SVD information. See param |
mx |
If groups overlap, column means of the expanded |
origmx |
Column means of the original |
my |
If |
overlap |
A logical flag indicating if the feature groups were overlapping or not. |
nlp |
Actual number of passes over the data for all lambda values. |
family |
Response type. |
call |
The call that produced this object. |
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)

    # all features in one group by default
    fit1 <- pcLasso(x, y, ratio = 0.8)
    # print(fit1) # Not run

    # features in groups
    groups <- vector("list", 4)
    for (k in 1:4) {
        groups[[k]] <- 5 * (k-1) + 1:5
    }
    fit2 <- pcLasso(x, y, groups = groups, ratio = 0.8)

    # groups can be overlapping
    groups[[1]] <- 1:8
    fit3 <- pcLasso(x, y, groups = groups, ratio = 0.8)

    # specify ratio or theta, but not both
    fit4 <- pcLasso(x, y, groups = groups, theta = 10)

    # family = "binomial"
    y2 <- sample(0:1, 100, replace = TRUE)
    fit5 <- pcLasso(x, y2, ratio = 0.8, family = "binomial")

    # example where SVD is computed once, then re-used
    fit1 <- pcLasso(x, y, ratio = 0.8)
    fit2 <- pcLasso(x, y, ratio = 0.8, SVD_info = fit1$SVD_info)
Plots the cross-validation curve produced by a cv.pcLasso object, along with upper and lower standard deviation curves, as a function of the lambda values used.
    ## S3 method for class 'cv.pcLasso'
    plot(x, sign.lambda = 1, orignz = TRUE, ...)
x |
Fitted " |
sign.lambda |
Either plot against |
orignz |
If |
... |
Other graphical parameters to plot. |
A plot is produced and nothing is returned.
pcLasso and cv.pcLasso.
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)
    groups <- vector("list", 4)
    for (k in 1:4) {
        groups[[k]] <- 5 * (k-1) + 1:5
    }
    cvfit <- cv.pcLasso(x, y, ratio = 0.8, groups = groups)
    plot(cvfit)

    # plot flipped: x-axis tracks -log(lambda) instead
    plot(cvfit, sign.lambda = -1)

    # if groups overlap, orignz can be used to decide which space to count the
    # number of non-zero coefficients at the top
    groups[[1]] <- 1:8
    cvfit <- cv.pcLasso(x, y, ratio = 0.8, groups = groups)
    plot(cvfit)                  # no. of non-zero coefficients in original space
    plot(cvfit, orignz = FALSE)  # no. of non-zero coefficients in expanded space
This function returns the predictions for a new data matrix from a cross-validated pcLasso model by using the stored "glmfit" object and the optimal value chosen for lambda.
    ## S3 method for class 'cv.pcLasso'
    predict(object, xnew, s = c("lambda.1se", "lambda.min"), ...)
object |
Fitted " |
xnew |
Matrix of new values for |
s |
Value of the penalty parameter |
... |
Potentially other arguments to be passed to and from methods; currently not in use. |
This function makes it easier to use the results of cross-validation to make a prediction. Note that xnew should have the same number of columns as the original feature space, regardless of whether the groups are overlapping or not.
Predictions which the cross-validated model makes for xnew at the optimal value of lambda. Note that the default for lambda is "lambda.1se", to make this function consistent with cv.glmnet in the glmnet package. The output is predictions of E(y | xnew): these are probabilities for the binomial family.
cv.pcLasso and predict.pcLasso.
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)
    y <- rnorm(100)
    cvfit <- cv.pcLasso(x, y, ratio = 0.8)
    predict(cvfit, xnew = x[1:5, ])
    predict(cvfit, xnew = x[1:5, ], s = "lambda.min")
This function returns the predictions from a "pcLasso" object for a new data matrix.
    ## S3 method for class 'pcLasso'
    predict(object, xnew, ...)
object |
Fitted " |
xnew |
Matrix of new values for |
... |
Potentially other arguments to be passed to and from methods; currently not in use. |
Note that xnew should have the same number of columns as the original feature space, regardless of whether the groups are overlapping or not.
Predictions of E(y | xnew) which the model object makes at xnew. These are probabilities for the binomial family.
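For the binomial family, a prediction on the probability scale is the inverse logit of the linear predictor. The sketch below shows that mapping for a single lambda value with invented coefficients; a0 and beta here are hypothetical numbers, and the code illustrates the standard logistic link rather than the package's internals.

```r
# Invented intercept and coefficients for one value of lambda
a0   <- -0.5
beta <- c(1.2, 0, -0.8)

# Two new observations with three features each
xnew <- rbind(c(0, 0, 0),
              c(1, 2, 1))

eta  <- a0 + drop(xnew %*% beta)  # linear predictor
prob <- plogis(eta)               # inverse logit: values strictly in (0, 1)
```

The second feature has a zero coefficient, so changing it leaves the predicted probability unchanged, which is how sparsity shows up at prediction time.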
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)

    # family = "gaussian"
    y <- rnorm(100)
    fit1 <- pcLasso(x, y, ratio = 0.8)
    predict(fit1, xnew = x[1:5, ])

    # family = "binomial"
    y2 <- sample(0:1, 100, replace = TRUE)
    fit2 <- pcLasso(x, y2, ratio = 0.8, family = "binomial")
    predict(fit2, xnew = x[1:5, ])