Last updated: 2024-12-16
Knit directory: zinck-website/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The results in this page were generated with repository version d4297f5. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/index.Rmd) and HTML (docs/index.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | ab6400d | Patron | 2024-06-18 | Build and publish the website |
Rmd | a6c38f8 | Patron | 2024-06-18 | Add home, experiment, and simulation pages |
Rmd | 6e66a4e | Patron | 2024-06-17 | Initial commit of Workflowr project |
zinck Research Website!
This website documents the research and experiments related to the zinck package.
zinck exploits a zero-inflated variant of the Latent Dirichlet Allocation (LDA) model to generate valid knockoffs that capture the key characteristics of microbiome data, mainly its compositional nature and high sparsity. It combines variable selection with FDR control to identify microbial biomarkers. This package provides an implementation of zinck, which is trained either using the Automatic Differentiation Variational Inference (ADVI) algorithm or using a collapsed Gibbs sampler, facilitating variable selection for both continuous and binary outcomes.
zinck
Before going into the structure of zinck, we define the following parameters:
\(D\) is the number of biological samples, \(p\) the number of microbial features, \(K\) the number of latent clusters, and \(N_d\) the sequencing depth of the \(d^{th}\) sample, where \(d=1,\ldots,D\).
\(\boldsymbol{\theta}_{d} = (\theta_{d1},\ldots,\theta_{dK})'\) is the vector of cluster mixing probabilities for the \(d^{th}\) biological sample, where \(d \in \{1,2,\ldots,D\}\).
\(\boldsymbol{\beta}_{k} = (\beta_{k1},\ldots,\beta_{k(p-1)})'\), with \(\beta_{kp}=1-\sum_{i=1}^{p-1}\beta_{ki}\), is the vector of feature proportions for cluster \(k\), where \(k=1,2,\ldots,K\).
\(\mathbf{z}_{d}=\left(z_{d1},\ldots,z_{dN_{d}}\right)'\) is the vector of cluster assignments for the \(d^{th}\) biological sample. For instance, \(z_{dn}=k\) implies that the \(n^{th}\) sequencing read in the \(d^{th}\) sample belongs to the \(k^{th}\) cluster.
\(\mathbf{w}_{d}=\left(w_{d1},\ldots,w_{dN_{d}}\right)'\) is the vector of the features drawn for each sequencing read for the \(d^{th}\) biological sample.
\(\boldsymbol{\alpha}=(\alpha,\ldots,\alpha)^{K \times 1}\): symmetric hyperparameter of the Dirichlet prior of \(\boldsymbol{\theta}_{d}\).
\(\boldsymbol{\pi}_{k}=(\pi_{k1},\ldots,\pi_{k(p-1)})'\): hyperparameter of the ZIGD distribution, where \(\pi_{kj}\) is the probability that the \(j^{th}\) feature is a structural zero in the \(k^{th}\) cluster, for \(j=1,\ldots,p-1\) and \(k=1,\ldots,K\).
\(\mathbf{a}=(a,\ldots,a)^{(p-1) \times 1}\) and \(\mathbf{b}=(b, \ldots, b)^{(p-1) \times 1}\) are the symmetric hyperparameter vectors on the ZIGD of \(\boldsymbol{\beta}_{k}\).
Here, ZIGD refers to the zero-inflated Generalized Dirichlet distribution.
zinck is a probabilistic hierarchical model with the following specification:
\[ \begin{aligned} w_{dn} \mid z_{dn},\boldsymbol{\beta}_{z_{dn}} & \sim \text{Multinomial}(\boldsymbol{\beta}_{z_{dn}}) \\ \boldsymbol{\beta}_{z_{dn}} \mid \boldsymbol{\pi}_{z_{dn}},\mathbf{a},\mathbf{b} & \sim \text{ZIGD}\left(\boldsymbol{\pi}_{z_{dn}},\mathbf{a},\mathbf{b}\right) \\ z_{dn} \mid \boldsymbol{\theta}_{d} & \sim \text{Multinomial}(\boldsymbol{\theta}_{d}) \\ \boldsymbol{\theta}_{d} & \sim \text{Dirichlet}(\boldsymbol{\alpha}) \end{aligned} \]
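As a small worked step (a direct consequence of the specification above, not an additional modelling assumption), summing over the latent cluster assignment gives the probability that a single read lands on feature \(j\), conditional on the mixing probabilities and the cluster-level feature proportions:
\[ P(w_{dn}=j \mid \boldsymbol{\theta}_d,\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_K) = \sum_{k=1}^{K} P(z_{dn}=k \mid \boldsymbol{\theta}_d)\,P(w_{dn}=j \mid z_{dn}=k,\boldsymbol{\beta}_k) = \sum_{k=1}^{K}\theta_{dk}\,\beta_{kj}, \qquad j=1,\ldots,p. \]
In other words, each read in sample \(d\) is drawn from a mixture of the \(K\) cluster-level feature distributions, weighted by \(\boldsymbol{\theta}_d\).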
We denote by \(x_{dj}\) the elements of the observed sample-by-taxon matrix \(\mathbf{X}^{D \times p}\), where \(x_{dj}\) is the observed read count of the \(j^{th}\) taxon in the \(d^{th}\) sample: \[x_{dj}=\sum_{n=1}^{N_d} \mathbb{1}_{\{w_{dn}=j\}}\]
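The hierarchy above can be sketched as a small forward simulation. This is an illustrative Python sketch, not the zinck package's code: the function names are hypothetical, the ZIGD draw assumes a stick-breaking construction with zero-inflated Beta fractions (each fraction is zero with probability \(\pi\), otherwise Beta\((a,b)\)), and a single scalar \(\pi\) is shared across features and clusters for brevity.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a_k, 1.0) for a_k in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_zigd(pi, a, b, rng):
    """Draw a length-p probability vector from an ASSUMED ZIGD construction.

    Assumption: stick-breaking where the j-th fraction q_j is 0 with
    probability pi[j] (structural zero) and Beta(a[j], b[j]) otherwise.
    """
    beta, remaining = [], 1.0
    for pi_j, a_j, b_j in zip(pi, a, b):
        q = 0.0 if rng.random() < pi_j else rng.betavariate(a_j, b_j)
        beta.append(q * remaining)
        remaining *= 1.0 - q
    beta.append(remaining)  # beta_p = 1 - sum of the first p-1 entries
    return beta

def simulate_counts(D, p, K, depths, alpha, pi, a, b, seed=0):
    """Simulate a D x p count matrix X from the hierarchical model."""
    rng = random.Random(seed)
    # One feature-proportion vector per cluster (symmetric hyperparameters).
    betas = [sample_zigd([pi] * (p - 1), [a] * (p - 1), [b] * (p - 1), rng)
             for _ in range(K)]
    X = []
    for d in range(D):
        theta = sample_dirichlet([alpha] * K, rng)
        row = [0] * p
        for _ in range(depths[d]):
            z = rng.choices(range(K), weights=theta)[0]     # cluster of read n
            w = rng.choices(range(p), weights=betas[z])[0]  # feature of read n
            row[w] += 1                                     # x_dj = sum_n 1{w_dn = j}
        X.append(row)
    return X, betas

X, betas = simulate_counts(D=4, p=6, K=2, depths=[50, 60, 70, 80],
                           alpha=1.0, pi=0.3, a=1.0, b=2.0, seed=1)
```

Each row of `X` sums to the corresponding sequencing depth, and features whose stick-breaking fraction is zero in every cluster never receive reads, which is how the zero inflation shows up as sparsity in the counts.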
zinck
We can exploit the knockoff generative model to learn the structure of the sample-feature count matrix \(\mathbf{X}\) and generate a valid knockoff copy, which is then used for FDR-controlled feature selection. We fit the augmented LDA model to the microbiome count matrix \(\mathbf{X}\). The latent parameters of the model, namely \(\boldsymbol{\theta}_{d}\) and \(\boldsymbol{\beta}_{k}\), are learnt by approximating their joint posterior distribution via the Automatic Differentiation Variational Inference (ADVI) algorithm or by drawing MCMC samples with a collapsed Gibbs sampler. We use the learnt parameters \(\tilde{\boldsymbol{\theta}}_d\) and \(\tilde{\boldsymbol{\beta}}_k\), for \(d=1,2,\ldots,D\) and \(k=1,2,\ldots,K\), to generate a knockoff copy. For each sequencing read \(n\) in sample \(d\), we first sample a cluster allocation \(z_{dn}\) from \(\text{Multinomial}(1,\tilde{\boldsymbol{\theta}}_d)\) and then sample a feature \(w_{dn}\) from the selected cluster, that is, \(w_{dn} \sim \text{Multinomial}(1,\tilde{\boldsymbol{\beta}}_{z_{dn}})\). Finally, we form the knockoff matrix \(\tilde{\mathbf{X}}^{D \times p} = \{\tilde{x}_{dj}\}\) by accumulating the taxon read counts per sample, in the same way that \(x_{dj}\) is obtained from the \(w_{dn}\).
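The sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than the package's implementation: `generate_knockoff` is a hypothetical name, the fitted parameters are passed in as plain lists (the ADVI or Gibbs fitting step is not shown), and the sketch assumes the knockoff keeps each sample's original sequencing depth \(N_d\).

```python
import random

def generate_knockoff(theta_hat, beta_hat, depths, seed=0):
    """Generate a knockoff count matrix from fitted parameters.

    theta_hat: D lists of K cluster mixing probabilities (one per sample)
    beta_hat:  K lists of p feature proportions (one per cluster)
    depths:    D sequencing depths N_d (assumed reused for the knockoff)
    """
    rng = random.Random(seed)
    K, p = len(beta_hat), len(beta_hat[0])
    X_tilde = []
    for theta_d, N_d in zip(theta_hat, depths):
        row = [0] * p
        for _ in range(N_d):
            z = rng.choices(range(K), weights=theta_d)[0]      # z_dn ~ Multinomial(1, theta_d)
            w = rng.choices(range(p), weights=beta_hat[z])[0]  # w_dn ~ Multinomial(1, beta_z)
            row[w] += 1                                        # accumulate read counts
        X_tilde.append(row)
    return X_tilde

# Toy fitted parameters: D = 2 samples, K = 2 clusters, p = 3 features.
theta_hat = [[0.9, 0.1], [0.2, 0.8]]
beta_hat = [[0.7, 0.3, 0.0], [0.0, 0.5, 0.5]]
X_tilde = generate_knockoff(theta_hat, beta_hat, depths=[100, 200], seed=42)
```

Because a feature with zero proportion in a cluster can never be drawn from that cluster, zeros in \(\tilde{\boldsymbol{\beta}}_k\) carry over to the knockoff counts, and each knockoff row sums to the depth supplied for that sample.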
zinck
The figure below illustrates the hierarchical structure of the zinck model:
Nodes represent random variables, and edges indicate dependencies. The shaded node is observed, while the non-shaded nodes are latent. Replicated variables are indicated by plates: the outermost plate, labeled with \(D\), signifies multiple biological samples, each indexed by \(d\), and the inner plate, marked by \(N_d\), represents the sequencing reads within each sample. The plate on the right, labeled with \(K\), indicates replication across clusters.