pcLasso is a package that fits the principal components lasso, a new method for obtaining sparse models for supervised learning problems. pcLasso shrinks predictions toward the top principal components of the feature matrix, combining the power of principal components regression with the sparsity of the lasso. The method can also handle grouped features, where the features belong to one or more of several groups. In that case, pcLasso shrinks each group's component of the parameter vector toward the top principal components of the corresponding feature matrix.
We introduce some notation that we will use throughout this vignette. Let there be $n$ observations, each with feature vector $x_i \in \mathbb{R}^p$ and response $y_i$. Let $X \in \mathbb{R}^{n \times p}$ denote the overall feature matrix, and let $y \in \mathbb{R}^n$ denote the vector of responses. Assume our data features come in $K$ groups, and let $X_k \in \mathbb{R}^{n \times p_k}$ denote the feature matrix for group $k$. In the simplest case, $K = 1$ and there is no feature grouping.
For each feature matrix $X_k$, let $X_k = U_k D_k V_k^T$ be its singular value decomposition (SVD). Let $D_k$ have diagonal entries $d_{k1}, d_{k2}, \ldots$, and let $D_{d_{k1}^2 - d_{kj}^2}$ denote the diagonal matrix whose $j$th diagonal entry is $d_{k1}^2 - d_{kj}^2$.
pcLasso solves the optimization problem

$\underset{\beta_0,\beta}{\text{minimize}} \quad \dfrac{1}{n} \displaystyle\sum_{i=1}^n w_i \ell (y_i, \beta_0 + \beta^T x_i) + \lambda \|\beta\|_1 + \dfrac{\theta}{2} \sum_{k = 1}^K \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k.$
In the above, $\ell(y_i, \eta_i)$ is the negative log-likelihood contribution of observation $i$; e.g. in the Gaussian case $\ell(y, \eta) = \frac{1}{2}(y - \eta)^2$. $w_i$ is the weight given to observation $i$ (by default $w_i = 1$ for all $i$), and $\beta_k$ is the subvector of $\beta$ corresponding to group $k$. $\lambda$ and $\theta$ are non-negative hyperparameters.
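To see what the quadratic penalty does, here is a small base-R sketch (hypothetical data; pcLasso computes all of this internally) that builds $V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T$ from an SVD and evaluates the quadratic form $\beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k$: the penalty is zero along the top principal component and grows toward the bottom one.

```r
# Sketch (base R, hypothetical data): the quadratic penalty for one group,
# built from the SVD. Shrinkage is zero along the top principal component.
set.seed(1)
n <- 50; p <- 5
Xk <- matrix(rnorm(n * p), n, p)
sv <- svd(Xk)                    # Xk = U D V^T
d2 <- sv$d^2                     # squared singular values d_{kj}^2
Dmat <- diag(d2[1] - d2)         # jth diagonal entry: d_{k1}^2 - d_{kj}^2
penalty <- function(beta) {
  drop(t(beta) %*% sv$v %*% Dmat %*% t(sv$v) %*% beta)
}
penalty(sv$v[, 1])  # ~0: no extra shrinkage along the top principal component
penalty(sv$v[, p])  # d2[1] - d2[p]: maximal shrinkage along the bottom one
```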
pcLasso solves this optimization problem for a grid of $\lambda$ values; $\theta$ must be specified in advance by the user.
pcLasso uses cyclical coordinate descent: it successively optimizes the objective function over each parameter with all other parameters fixed, cycling repeatedly until convergence.
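To make the mechanics concrete, here is a minimal sketch of cyclical coordinate descent for the plain lasso (Gaussian loss with $\theta = 0$). This is illustrative only: pcLasso's actual coordinate updates additionally account for the quadratic principal-components penalty, and the function below is not part of the package.

```r
# Minimal cyclical coordinate descent for the plain lasso (theta = 0).
# Illustrative only: pcLasso's updates also involve the PC penalty.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

cd_lasso <- function(X, y, lambda, iters = 100) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  r <- y                           # residual when beta = 0
  for (it in seq_len(iters)) {
    for (j in seq_len(p)) {
      r <- r + X[, j] * beta[j]    # drop coordinate j from the fit
      zj <- sum(X[, j] * r) / n    # unpenalized coordinate-wise update
      beta[j] <- soft_threshold(zj, lambda) / (sum(X[, j]^2) / n)
      r <- r - X[, j] * beta[j]    # add the updated coordinate back
    }
  }
  beta
}
```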
The package also includes methods for prediction and plotting, and a function which performs k-fold cross-validation.
For more details, please see our paper on arXiv.
The simplest way to obtain pcLasso is to install it directly from CRAN. Type the following command in the R console:

install.packages("pcLasso")

This command downloads the R package and installs it in the default directories. Users may add a repos option to the function call to specify which repository to download from, depending on their location and preferences.
Alternatively, users can download the package source at CRAN and type Unix commands to install it to the desired location.
The purpose of this section is to give users a general sense of the package. We will briefly go over the main functions of the package, as well as some key options. More details are given in later sections.
First, we load the pcLasso package:

library(pcLasso)
Let’s generate some data:
set.seed(944)
n <- 100
p <- 20
x <- matrix(rnorm(n*p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)
We fit the model using the most basic call to pcLasso:

fit <- pcLasso(x, y, ratio = 0.8)
In addition to the feature matrix x and the response y, the user must specify either theta or ratio (but not both). theta is the hyperparameter value for the last term in the objective function. The scale for theta depends on x and y and hence can be hard to determine a priori. Instead of specifying theta, we recommend that the user specify ratio, a hyperparameter whose value lies in the interval (0, 1]. Roughly speaking, smaller values of ratio represent greater shrinkage toward the top principal components. (ratio = 1 gives the usual lasso.)
If the argument verbose is set to TRUE, pcLasso prints to the console which lambda in the lambda sequence it is currently processing while the function runs.
If ratio is passed to pcLasso, the function internally computes the corresponding value of theta and uses that in minimizing the objective function. This value can be retrieved as theta in the output list.
The function pcLasso returns a pcLasso object. At the moment, the only way to extract model coefficients is to use list syntax on the returned object. For example, the code below extracts the intercept and coefficients for the model at the 20th lambda value:
# intercept
fit$a0[20]
#> [1] -0.05855159
# coefficients
fit$beta[, 20]
#> [1] 0.26955362 0.34871182 0.31431625 0.44873422 0.17694987 0.21373963
#> [7] 0.00000000 -0.05236147 0.00000000 0.00000000 -0.17767456 0.11606285
#> [13] 0.01645924 0.11182802 0.00000000 -0.10867816 0.06198964 0.12127325
#> [19] -0.04744480 -0.07523572
A pcLasso object has an nzero key which tells us the number of non-zero model coefficients at each value of lambda:
fit$nzero
#> [1] 0 1 1 4 4 4 5 5 5 6 7 7 11 12 13 14 15 16 16 16 17 17 17 18 18
#> [26] 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20
#> [51] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
#> [76] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
We can make predictions from a pcLasso object by calling the predict method. Each column gives the predictions for one value of lambda.
# get predictions for 20th model
predict(fit, x[1:5, ])[, 20]
#> [1] 1.0830384 -0.3009271 1.8452090 1.3588647 -1.1403960
By default, pcLasso assumes that all the features in x belong to one group. If the features come in groups, pcLasso can make use of this information in the model-fitting procedure.
Assume our features come in 4 (non-overlapping) groups of size 5:
groups <- vector("list", 4)
for (k in 1:4) {
  groups[[k]] <- 5 * (k-1) + 1:5
}
groups
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 6 7 8 9 10
#>
#> [[3]]
#> [1] 11 12 13 14 15
#>
#> [[4]]
#> [1] 16 17 18 19 20
We can use this information in our fit by specifying the groups option. groups must be a list of length K (the number of groups), where groups[[k]] is a vector of column indices for the features which belong to group k. (For example, groups[[1]] <- c(3, 4, 6) means that columns 3, 4 and 6 of x belong to group 1.) Every feature must belong to at least one group.

fit <- pcLasso(x, y, ratio = 0.8, groups = groups)
One advantage of pcLasso is that the algorithm works with overlapping groups as well. For example, we can modify the groups list so that features 6 and 7 also belong to group 1, then make the same call to pcLasso:
groups[[1]] <- 1:7
groups
#> [[1]]
#> [1] 1 2 3 4 5 6 7
#>
#> [[2]]
#> [1] 6 7 8 9 10
#>
#> [[3]]
#> [1] 11 12 13 14 15
#>
#> [[4]]
#> [1] 16 17 18 19 20
fit <- pcLasso(x, y, ratio = 0.8, groups = groups)

One thing to note with overlapping groups is that the model coefficients and the number of non-zero coefficients are stored differently in the output object. To get the coefficients in the original feature space, look at the origbeta key:
# intercept at 20th model: same as before
fit$a0[20]
#> [1] -0.03609611
# coefficients at 20th model: look at origbeta instead
fit$origbeta[, 20]
#> [1] 0.9453523 1.0498390 1.0494266 1.3574616 0.7357579 0.3295442 0.0000000
#> [8] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
To get the number of non-zero coefficients in the original feature space, look at orignzero:
fit$orignzero
#> [1] 0 1 1 3 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 6 7 8 8 8 10
#> [26] 10 12 12 12 13 13 13 15 15 16 17 17 17 17 17 17 18 19 19 19 19 19 19 19 19
#> [51] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
#> [76] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
For more information on what the algorithm does in the case of overlapping groups, see Section 3.2 of our paper.
We can perform k-fold cross-validation (CV) with cv.pcLasso. It does 10-fold cross-validation by default:

cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8)
We can change the number of folds using the nfolds option:

cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, nfolds = 5)
If we want to specify which observation belongs to which fold, we can do so with the foldid option: a vector of length n whose ith element is the fold number for observation i.
foldid <- sample(rep(seq(10), length = n))
cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, foldid = foldid)
A cv.pcLasso call returns a cv.pcLasso object. We can plot this object to get the CV curve with error bars (one standard error in each direction):

plot(cvfit)

The left vertical dotted line represents lambda.min, the lambda value which attains the minimum CV error, while the right vertical dotted line represents lambda.1se, the largest lambda value whose CV error is within one standard error of the minimum.
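The selection rule behind these two values can be sketched in a few lines of base R. Here cvm and cvsd stand for the mean CV error and its standard error over a decreasing lambda sequence; these names, and the helper itself, are illustrative rather than fields or functions of the package.

```r
# Sketch of how lambda.min and lambda.1se are chosen, given a decreasing
# lambda sequence with mean CV error cvm and standard error cvsd.
pick_lambdas <- function(lambda, cvm, cvsd) {
  imin <- which.min(cvm)
  list(lambda.min = lambda[imin],
       # largest lambda whose CV error is within one SE of the minimum
       lambda.1se = max(lambda[cvm <= cvm[imin] + cvsd[imin]]))
}
pick_lambdas(c(1, 0.5, 0.25, 0.1), cvm = c(10, 5, 4, 6), cvsd = rep(1, 4))
```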
The numbers at the top of the plot show the number of non-zero coefficients for each model in the original feature space. If the groups are overlapping, we can instead plot the number of non-zero coefficients for each model in the expanded feature space by setting orignz = FALSE:

plot(cvfit, orignz = FALSE)
The two special lambda values can be extracted directly from the cv.pcLasso object as well:

cvfit$lambda.min
cvfit$lambda.1se
Predictions can be made from the fitted cv.pcLasso object. By default, predictions are given for lambda equal to lambda.1se. To get predictions at lambda.min, set s = "lambda.min":

predict(cvfit, x[1:5, ], s = "lambda.min")
Here are some other options that one may specify for the pcLasso and cv.pcLasso functions:
w: A vector of length n of observation weights. The squared residuals of the observations are weighted according to this vector. By default, this is set to 1 for all observations.
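For concreteness, the weighted squared-error loss described above can be written as a small helper (illustrative, not a package function; pcLasso applies the weights internally):

```r
# Weighted squared-error loss, (1/(2n)) * sum_i w_i (y_i - eta_i)^2.
# Illustrative helper; pcLasso applies the weights internally.
weighted_loss <- function(y, eta, w = rep(1, length(y))) {
  sum(w * (y - eta)^2) / (2 * length(y))
}
weighted_loss(c(1, 2), c(0, 0))               # unweighted: (1 + 4) / 4 = 1.25
weighted_loss(c(1, 2), c(0, 0), w = c(2, 1))  # upweights the first observation
```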
family: The default value for the family option of the pcLasso and cv.pcLasso functions is gaussian. Use this default when y is a quantitative variable (i.e. takes values along the real number line). The objective function for the Gaussian family is
$\underset{\beta_0,\beta}{\text{minimize}} \quad \dfrac{1}{2n} \displaystyle\sum_{i=1}^n w_i (y_i - \beta_0 - \beta^T x_i)^2 + \lambda \|\beta\|_1 + \dfrac{\theta}{2} \sum_{k = 1}^K \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k.$
As before, λ and θ are hyperparameters. The user passes a specific value of theta (or its ratio equivalent) to the pcLasso and cv.pcLasso functions, and the function computes the model fit for a path of lambda values.
For binary prediction, use family = "binomial". In this setting, the response y should be a numeric vector containing just 0s and 1s.
SVD_info: A list containing SVD information that the algorithm uses. The user typically does not need to provide this option to pcLasso; pcLasso will compute it from the given data. Internally, cv.pcLasso provides this option to pcLasso for computational efficiency.
lambda: The lambda sequence at which the model fit will be computed. This is typically not provided by the user: the program can construct the sequence on its own. When automatically generated, the lambda sequence is determined by lambda.max (internally computed) and lambda.min.ratio, the ratio of the smallest value of the generated lambda sequence (call it lambda.min) to lambda.max. The program generates nlam values (default 100) linear on the log scale from lambda.max down to lambda.min.
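A sequence "linear on the log scale" as described can be sketched as follows; the lambda.min.ratio default of 0.01 used here is an assumption for illustration, not necessarily the package's default.

```r
# Log-linear lambda sequence from lambda.max down to
# lambda.max * lambda.min.ratio. The default ratio here is illustrative.
lambda_seq <- function(lambda.max, lambda.min.ratio = 0.01, nlam = 100) {
  exp(seq(log(lambda.max), log(lambda.max * lambda.min.ratio),
          length.out = nlam))
}
s <- lambda_seq(2)
range(s)   # runs from 2 down to 2 * 0.01 = 0.02
```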
standardize: If set to TRUE, the columns of the feature matrix x are scaled to have unit variance before the algorithm is run.
For more information, type ?pcLasso or ?cv.pcLasso.