Metadata-Version: 2.4
Name: varbook
Version: 0.1.0
Summary: varbook is a package & tool that comprehensively generates annotations and plots for genetic variant effect prioritization, curation, and viewing.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: seaborn>=0.12.0
Requires-Dist: plotly>=5.14.0
Requires-Dist: markdown>=3.4.0
Requires-Dist: weasyprint>=60.0
Requires-Dist: pytest>=7.0.0

# varbook

## Overview

varbook is a package & tool that comprehensively generates annotations and plots for genetic variant effect prioritization, curation, and viewing. At a high level it has three different functions/modules:
1) `annotate`: generate annotation columns
2) `plot`: generate plots, either for models or for variants
3) `write`: write Markdown files and build a notebook from them
All of these modules have functions callable via Python and via CLI.

The goal is to progressively add annotations and plots for more and more variants over time, collecting them in an easily readable "notebook" of variants.

## Module functions

### `annotate` module

#### `annotate kmeans` command
```
varbook annotate kmeans INPUT_VARIANTS_TSV VARIANT_ID_COL --subcluster-fixed-n SUBCLUSTER_COL N_SUBCLUSTERS_PER_CLUSTER --datasets DATASET1 DATASET2 CLUSTER_COL N_CLUSTERS --models OPTIONAL_MODEL1 OPTIONAL_MODEL2 ... -o OUTPUT_VARIANTS_TSV
```
Performs KMeans clustering (Euclidean distance) and generates a TSV containing the columns:
1) VARIANT_ID_COL
2) kmeans cols of format: f"kmeans_{n_clusters}-{dataset}"
3) kmeans subcluster cols of format: f"kmeans_{n_clusters}_subcluster{n_subclusters}_{group_name}_{config_name}_{sub_distance}_{sub_scaler_suffix}"
If specific models are preferred instead of entire datasets, you may override the datasets by providing the desired model names. Only one of --datasets or --models is required.
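
To make the output column concrete, here is a minimal, dependency-free sketch of k-means (Lloyd's algorithm) with Euclidean distance that names its result column in the `kmeans_{n_clusters}-{dataset}` format. It is only an illustration: varbook lists scikit-learn as a dependency and presumably relies on it, and the helper `kmeans_labels`, the seeding from the first points, and the example scores are all assumptions rather than varbook's code.

```python
import math

def kmeans_labels(points, n_clusters, n_iter=20):
    """Lloyd's algorithm with Euclidean distance (hypothetical helper, not varbook's code)."""
    # Assumption: seed centroids with the first n_clusters points to stay deterministic.
    centroids = [list(p) for p in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Made-up per-model scores (for example, logfc in two models) for four variants.
variant_ids = ["chr1:100:A:G", "chr1:200:C:T", "chr2:300:G:A", "chr2:400:T:C"]
scores = [[0.1, 0.2], [0.0, 0.1], [2.0, 2.1], [2.2, 1.9]]

n_clusters, dataset = 2, "KUN_FB"
column = f"kmeans_{n_clusters}-{dataset}"  # column naming from the list above
labels = kmeans_labels(scores, n_clusters)
rows = [{"variant_id": v, column: lab} for v, lab in zip(variant_ids, labels)]
```

Each entry of `rows` corresponds to one line of the output TSV: the variant ID plus its cluster label under the generated column name.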

### `plot` module

#### `plot models` submodule

##### `plot models heatmap` command
Format
```
varbook plot models heatmap INPUT_VARIANTS_TSV HEATMAP_COL X_COL Y_COL --dataset DATASET1 DATASET2 --models MODEL1 MODEL2 --no-row-cluster --no-col-cluster --add-genomic-pct FASTA CHROM_SIZES
```
TODO: Is that right? 
Example
```
TODO
```
TODO: write out desc

#### `plot variant` submodule

##### `plot variant model-scatterplot` command

Generates a variant-specific scatterplot showing values across models for specified X and Y axes.

**Format:**
```bash
varbook plot variant model-scatterplot VARIANTS_TSV VARIANT_ID X_COL Y_COL \
  [--datasets DATASET1 ...] [--models MODEL1 ...] \
  [--variant-datasets VARIANT_DATASET] \
  [--interactive-plot] \
  [-o MD_PATH HTML_PATH] \
  [--label-cols LABEL_COL1 ...] \
  [--label-names NAME1 ...] \
  [--label-colors COLOR1 ...]
```

**Basic Example:**
```bash
varbook plot variant model-scatterplot variants.tsv chr10:123:A:G logfc aaq \
  --datasets KUN_FB \
  --variant-datasets chd_variants \
  --interactive-plot
```

**Example with Custom Labels:**
```bash
varbook plot variant model-scatterplot variants.tsv chr10:123:A:G logfc aaq \
  --datasets KUN_FB \
  --variant-datasets chd_variants \
  --interactive-plot \
  --label-cols model_prioritized_by_peak model_prioritized_by_promoter model_prioritized_by_outofpeak \
  --label-names "Prioritized in Peak" "Prioritized in Promoter" "Prioritized Out of Peak" \
  --label-colors green red blue
```

**Key Features:**
- **Model selection**: Use `--datasets` for entire datasets or `--models` for specific models (mutually exclusive)
- **Interactive plots**: Add `--interactive-plot` for plotly HTML instead of static matplotlib PNG
- **Auto-discovery**: Use `--variant-datasets` to automatically generate output paths following the default structure
- **Custom labeling**: Color-code points by categorical labels using `--label-cols`, `--label-names`, and `--label-colors`
  - Label columns should be in the format `{label_prefix}-{model_name}` (e.g., `model_prioritized_by_peak-KUN_FB_neuron`)
  - Points matching each label are colored according to `--label-colors` (defaults: green, red, blue, etc.)
  - Unlabeled points shown in gray as "Other"
- **Marker shapes**: Automatically assigned based on model type (circle for KUN_FB, square for HDMA, triangle for KUN_THYROID)
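
The custom labeling rule can be sketched as a small lookup. This is a hedged illustration, not varbook's implementation: the function `point_style` is hypothetical, and it assumes that label columns hold booleans and that the first matching label wins when several apply.

```python
def point_style(row, model, label_cols, label_names, label_colors):
    """Return (legend name, color) for one model's point (hypothetical helper)."""
    for col, name, color in zip(label_cols, label_names, label_colors):
        # Label columns follow the {label_prefix}-{model_name} format.
        # Assumption: a truthy value means the model carries this label.
        if row.get(f"{col}-{model}", False):
            return name, color
    # Points without any matching label are drawn in gray as "Other".
    return "Other", "gray"

# One variant row with made-up label values for two models.
row = {
    "model_prioritized_by_peak-KUN_FB_neuron": True,
    "model_prioritized_by_promoter-KUN_FB_neuron": False,
    "model_prioritized_by_peak-KUN_FB_microglia": False,
    "model_prioritized_by_promoter-KUN_FB_microglia": False,
}
label_cols = ["model_prioritized_by_peak", "model_prioritized_by_promoter"]
label_names = ["Prioritized in Peak", "Prioritized in Promoter"]
label_colors = ["green", "red"]

neuron_style = point_style(row, "KUN_FB_neuron", label_cols, label_names, label_colors)
microglia_style = point_style(row, "KUN_FB_microglia", label_cols, label_names, label_colors)
```

Here `neuron_style` resolves to the "Prioritized in Peak" label in green, while `microglia_style` falls back to the gray "Other" group.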

**Output:**
- Markdown file with plot reference and legend
- HTML (interactive) or PNG (static) plot file
- Auto-generated legend showing label colors and marker shape meanings

##### `plot variant model-specificity-barplot` command
Format
```
varbook plot variant model-specificity-barplot MODEL_PRIORITIZATION_COL_ID organs --datasets DATASET1 DATASET2 ... --models MODEL1 MODEL2 ...
```
Examples
```
TODO
```

TODO
Mainly used to assess the organ, cell-type, or system specificity of a variant across its prioritized models, relative to all models.
If specific models are preferred instead of entire datasets, you may override the datasets by providing the desired model names. Only one of --datasets or --models is required.
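
The `--datasets` / `--models` rule that recurs across commands can be sketched with `argparse`. This is a minimal illustration under stated assumptions, not varbook's actual parser: `build_parser` and `select_models` are hypothetical, `HDMA_heart` is a made-up model name, and dataset membership is inferred from the default `{dataset}_{model}` name prefix.

```python
import argparse

def build_parser():
    """Parser where exactly one of --datasets / --models must be given (sketch)."""
    parser = argparse.ArgumentParser(prog="model-selection-sketch")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--datasets", nargs="+", default=[])
    group.add_argument("--models", nargs="+", default=[])
    return parser

def select_models(all_models, datasets, models):
    """Explicit model names override datasets; otherwise match by dataset prefix."""
    if models:
        return [m for m in all_models if m in models]
    # Assumption: model names start with "<dataset>_" (the default name format).
    return [m for m in all_models if any(m.startswith(f"{d}_") for d in datasets)]

all_models = ["KUN_FB_microglia", "KUN_FB_neuron", "HDMA_heart"]

args = build_parser().parse_args(["--datasets", "KUN_FB"])
by_dataset = select_models(all_models, args.datasets, args.models)

args = build_parser().parse_args(["--models", "HDMA_heart"])
by_model = select_models(all_models, args.datasets, args.models)
```

With this setup, passing both flags at once makes `argparse` exit with a usage error, which matches the "only one is required" wording above.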

##### `plot variant profiles` command
Format
```
varbook plot variant profiles VARIANTS_TSV VARIANT_ID MODEL_PATH_COL --datasets DATASET1 DATASET2 ... --models MODEL1 MODEL2 ...
```
The goal of this command is to plot the model predictions. Each model in the selected dataset(s) and model(s) is expected to have a path column.

### `write` module

#### `write html` command

Generates a standalone HTML report from markdown files with embedded interactive plots.

**Format:**
```bash
varbook write html OUTPUT_FILE [options]
```

**Options:**
- `--beginning-files MD_FILE [MD_FILE ...]` - Markdown files for the beginning section (manual mode)
- `--variant-files MD_FILE [MD_FILE ...]` - Variant-specific markdown files (manual mode)
- `--ending-files MD_FILE [MD_FILE ...]` - Markdown files for the ending section (manual mode)
- `--variant-datasets DATASET [DATASET ...]` - Variant dataset names for auto-discovery mode
- `--toc` - Include table of contents with navigation
- `--debug-paths` - Show source file paths in output for debugging

**Manual mode example:**
```bash
varbook write html report.html \
  --beginning-files intro.md methods.md \
  --variant-files variant1.md variant2.md \
  --ending-files conclusions.md \
  --toc
```

**Auto-discovery mode example:**
```bash
varbook write html report.html \
  --variant-datasets Broad_neurological_disorders \
  --toc \
  --debug-paths
```

**Features:**
- **Interactive plots**: Properly embeds plotly HTML plots via iframes
- **Auto-discovery**: Scans `{VARBOOK_DEFAULT_OUTPUT_DIR}/{variant_dataset}/` for files
- **Hierarchical structure**: Follows same ordering as PDF (intro → variants → conclusion)
- **Navigation**: Optional table of contents with anchor links
- **Responsive**: Mobile-friendly layout
- **Self-contained**: All content embedded in single HTML file (except large plots via iframe)
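
As a rough illustration of the iframe embedding mentioned in the first feature, the sketch below rewrites Markdown image links that point at `.html` files into `<iframe>` tags before the Markdown is rendered. The regex, the function `embed_html_plots`, and the iframe attributes are assumptions for illustration; this document does not specify how varbook performs the embedding.

```python
import html
import re

# Matches Markdown image syntax whose target ends in .html, e.g. ![alt](plot.html)
HTML_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+\.html)\)")

def embed_html_plots(markdown_text, height=600):
    """Swap image links to .html plots for iframes (hypothetical helper)."""
    def to_iframe(match):
        alt, src = match.groups()
        return (
            f'<iframe src="{html.escape(src, quote=True)}" '
            f'title="{html.escape(alt, quote=True)}" '
            f'width="100%" height="{height}"></iframe>'
        )
    return HTML_IMAGE.sub(to_iframe, markdown_text)

embedded = embed_html_plots("![Scatterplot](model-scatterplot/KUN_FB.html)")
untouched = embed_html_plots("![Heatmap](heatmap/KUN_FB.png)")
```

Links to static images such as PNG files are left as ordinary Markdown, so they render as usual.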

**Output structure:**
1. Beginning files (if provided)
2. Table of contents (if `--toc`)
3. Variant files (in hierarchical order if auto-discovery)
4. Ending files (if provided)

**Advantages over PDF:**
- ✅ Interactive plotly plots work fully
- ✅ Faster generation (no PDF rendering)
- ✅ Can be viewed in browser with full interactivity
- ✅ Easy to share (single HTML file)
- ✅ Supports navigation between sections

#### `write pdf` command

Generates a static PDF from markdown files.

**Format:**
```bash
varbook write pdf OUTPUT_FILE [options]
```

**Options:** Same as `write html`

**Note:** PDFs cannot embed interactive plots. For reports with interactive plotly visualizations, use `write html` instead.

**Example:**
```bash
varbook write pdf report.pdf \
  --variant-datasets Broad_neurological_disorders \
  --toc
```

## Options

### Vascore column naming format

By default, dataset-specific columns are expected to follow the format `{annot}-{dataset}`.
By default, model-specific columns are expected to follow the format `{annot}-{model}`.
By default, each model name is prefixed with the dataset it originates from: `{dataset}_{model_wo_dataset}`.
For example:
Given dataset=`KUN_FB` and model=`KUN_FB_microglia`:
- The dataset-specific col of `prioritized_models_count` for dataset `KUN_FB` will be named `prioritized_models_count-KUN_FB`.
- The model-specific col of `logfc` for model `KUN_FB_microglia` will be named `logfc-KUN_FB_microglia`.

### Environment Variables

The following environment variables can be set to customize varbook behavior:

```bash
# Column naming formats
VARBOOK_MODEL_SPECIFIC_COLUMN_FORMAT="%A-%M"       # default: "%A-%M"
VARBOOK_DATASET_SPECIFIC_COLUMN_FORMAT="%A-%D"     # default: "%A-%D"
VARBOOK_MODEL_NAME_FORMAT="%D_%m"                  # default: "%D_%m"

# Output directories
VARBOOK_DEFAULT_OUTPUT_DIR="varbook_gen/"          # default: "varbook_gen/"
VARBOOK_BUILD_VARIANT_PROFILES_PATH=""             # optional template path for profiles
```

**Substitution patterns:**
- `%A` = annotation/column name (e.g., "logfc", "aaq")
- `%D` = dataset name (e.g., "KUN_FB")
- `%M` = full model name (e.g., "KUN_FB_microglia")
- `%m` = model name without dataset prefix (e.g., "microglia")
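
The substitution patterns can be illustrated with a short expansion routine. This is a sketch under assumptions, not varbook's code: `expand` and `column_names` are hypothetical helpers, and only the environment variable names and default values are taken from the block above.

```python
import os
import re

def expand(fmt, values):
    """Replace %A, %D, %M and %m in fmt; unknown patterns are left untouched."""
    return re.sub(r"%([ADMm])", lambda m: values.get(m.group(1), m.group(0)), fmt)

def column_names(annot, dataset, model_short, env=os.environ):
    """Build model- and dataset-specific column names from the format variables."""
    values = {"A": annot, "D": dataset, "m": model_short}
    # The full model name (%M) is itself derived from VARBOOK_MODEL_NAME_FORMAT.
    values["M"] = expand(env.get("VARBOOK_MODEL_NAME_FORMAT", "%D_%m"), values)
    return {
        "model": expand(env.get("VARBOOK_MODEL_SPECIFIC_COLUMN_FORMAT", "%A-%M"), values),
        "dataset": expand(env.get("VARBOOK_DATASET_SPECIFIC_COLUMN_FORMAT", "%A-%D"), values),
    }

# Passing env={} forces the documented defaults regardless of the real environment.
names = column_names("logfc", "KUN_FB", "microglia", env={})
custom = column_names(
    "logfc", "KUN_FB", "microglia",
    env={"VARBOOK_MODEL_SPECIFIC_COLUMN_FORMAT": "%M.%A"},
)
```

With the defaults, `names["model"]` is `logfc-KUN_FB_microglia`, matching the column naming example earlier in this section.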

### Default Output Paths

When `--variant-datasets` is provided to plot commands, varbook automatically generates output paths in this structure:

```
{VARBOOK_DEFAULT_OUTPUT_DIR}/
  {variant_dataset}/
    heatmap/                    # Dataset-level heatmaps
      {models}.png
      heatmap.before.md         # Optional intro text
      heatmap.after.md          # Optional conclusion text
    {variant_id}/
      intro.md                  # Optional variant introduction
      model-scatterplot/
        {models}.html           # or .png for static
        {models}.md
        model-scatterplot.before.md   # Optional
        model-scatterplot.after.md    # Optional
      model-specificity-barplot/
        {models}.png
        model-specificity-barplot.before.md   # Optional
        model-specificity-barplot.after.md    # Optional
      profiles/
        {models}.png
        profiles.before.md      # Optional
        profiles.after.md       # Optional
      conclusion.md             # Optional variant conclusion
```

Where `{models}` is either:
- The dataset name (e.g., "KUN_FB"), or
- A sorted concatenation of model names from `--models` (e.g., "KUN_FB_microglia_KUN_FB_neuron"), or
- A user-specified string via `-o` (for manual paths)
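
A minimal sketch of how such paths might be assembled for `model-scatterplot` follows. The helpers `models_label` and `scatterplot_paths` are hypothetical, and the handling of several datasets at once is left out because it is not specified here.

```python
from pathlib import PurePosixPath

def models_label(dataset=None, models=(), override=None):
    """Pick the {models} part of the file name (hypothetical helper)."""
    if override:
        return override                   # user-specified string
    if models:
        return "_".join(sorted(models))   # sorted concatenation of model names
    return dataset                        # plain dataset name

def scatterplot_paths(variant_dataset, variant_id, label, interactive,
                      output_dir="varbook_gen/"):
    """Return (markdown path, plot path) inside the default directory layout."""
    plot_dir = PurePosixPath(output_dir) / variant_dataset / variant_id / "model-scatterplot"
    suffix = ".html" if interactive else ".png"
    return plot_dir / f"{label}.md", plot_dir / f"{label}{suffix}"

label = models_label(models=["KUN_FB_neuron", "KUN_FB_microglia"])
md_path, plot_path = scatterplot_paths(
    "chd_variants", "chr10:123:A:G", label, interactive=True
)
```

In this example `label` is `KUN_FB_microglia_KUN_FB_neuron`, and both paths land under `varbook_gen/chd_variants/chr10:123:A:G/model-scatterplot/`.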

### Plot Commands with Default Paths

All `plot variant` commands now support `--variant-datasets` for automatic path generation:

```bash
# Using default paths (recommended)
varbook plot variant model-scatterplot variants.tsv var123 logfc aaq \
  --datasets KUN_FB \
  --variant-datasets chd_variants \
  --interactive-plot

# Using manual paths (backwards compatible)
varbook plot variant model-scatterplot variants.tsv var123 logfc aaq \
  --datasets KUN_FB \
  -o output.md output.html \
  --interactive-plot
```

**Benefits of default paths:**
1. Consistent directory structure
2. Automatic PDF generation via auto-discovery
3. Support for .before.md and .after.md annotations
4. Easier to manage many variants

### PDF Generation with Auto-Discovery

The `write pdf` command has two modes:

**Manual mode (original):**
```bash
varbook write pdf output.pdf \
  --beginning-files intro.md \
  --variant-files var1.md var2.md \
  --ending-files conclusion.md \
  --toc
```

**Auto-discovery mode (new):**
```bash
varbook write pdf output.pdf \
  --variant-datasets chd_variants \
  --toc \
  --debug-paths
```

Auto-discovery mode:
- Scans `{VARBOOK_DEFAULT_OUTPUT_DIR}/{variant_dataset}/` for variants
- Automatically includes all plot types in hierarchical order
- Includes .before.md and .after.md files where they exist
- `--debug-paths` adds source file paths to PDF footer for debugging

### Hierarchical PDF Structure

When using auto-discovery, the PDF is organized hierarchically:

1. **Variant Dataset** (e.g., chd_variants)
   - Dataset-level heatmap (if exists)
     - heatmap.before.md
     - Heatmap files
     - heatmap.after.md
   - **Variant 1**
     - intro.md
     - model-scatterplot/
       - model-scatterplot.before.md
       - Scatterplot files
       - model-scatterplot.after.md
     - model-specificity-barplot/
       - model-specificity-barplot.before.md
       - Barplot files
       - model-specificity-barplot.after.md
     - profiles/
       - profiles.before.md
       - Profile files
       - profiles.after.md
     - conclusion.md
   - **Variant 2**
     - ...

This structure allows for:
- Narrative flow with intro/conclusion sections
- Contextual annotations with .before/.after files
- Consistent organization across all variants
- Easy addition of new variants without restructuring
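
The hierarchical ordering above can be sketched as a directory walk. This is an illustration under assumptions rather than varbook's implementation: `discover`, `section`, and `existing` are hypothetical helpers, variants are assumed to be visited in sorted directory-name order, and every file in a plot directory is returned, leaving it to the writer to decide which ones to render.

```python
import tempfile
from pathlib import Path

PLOT_TYPES = ["model-scatterplot", "model-specificity-barplot", "profiles"]

def existing(*paths):
    """Keep only the paths that are present on disk."""
    return [p for p in paths if p.exists()]

def section(plot_dir):
    """Order one plot directory: .before.md, plot files, .after.md."""
    if not plot_dir.is_dir():
        return []
    before = plot_dir / f"{plot_dir.name}.before.md"
    after = plot_dir / f"{plot_dir.name}.after.md"
    body = sorted(p for p in plot_dir.iterdir()
                  if p.is_file() and p not in (before, after))
    return existing(before) + body + existing(after)

def discover(dataset_dir):
    """List files of one variant dataset in the hierarchical order."""
    dataset_dir = Path(dataset_dir)
    ordered = section(dataset_dir / "heatmap")
    # Assumption: variants are visited in sorted directory-name order.
    variant_dirs = sorted(d for d in dataset_dir.iterdir()
                          if d.is_dir() and d.name != "heatmap")
    for variant_dir in variant_dirs:
        ordered += existing(variant_dir / "intro.md")
        for plot_type in PLOT_TYPES:
            ordered += section(variant_dir / plot_type)
        ordered += existing(variant_dir / "conclusion.md")
    return ordered

# Small demo tree, created in scrambled order on purpose.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "chd_variants"
    for rel in [
        "var1/conclusion.md",
        "var1/profiles/KUN_FB.png",
        "var1/model-scatterplot/KUN_FB.md",
        "var1/model-scatterplot/model-scatterplot.before.md",
        "var1/intro.md",
        "heatmap/KUN_FB.png",
    ]:
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()
    found = [p.relative_to(root).as_posix() for p in discover(root)]
```

In the demo, `found` starts with the dataset-level heatmap, then lists the variant's intro, its scatterplot section (the `.before.md` file first), its profiles, and finally its conclusion.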

### Markdown Linking

Markdown files may contain links to HTML files, such as interactive plotly plots generated by `plot variant model-scatterplot`:

```markdown
## Model Scatterplot

![Scatterplot](model-scatterplot/KUN_FB.html)
```

These links will be properly resolved during PDF generation.

# Developer FAQ

Various varbook functions and submodules have "prototype" files that they should be based on. These prototypes are imported from previous projects; they are pasted, disconnected fragments that should be reimplemented and/or serve as the base for implementing the functions and commands.

Pytest tests should be added for each new function.