# varbook-container overview

In the varbook-container project, our goal is to implement the package `./varbook` using its README.md for instructions. Use the `uv` environment in this dir.

## Hierarchical Structure

The varbook system uses a 3-level hierarchy for organizing variant and model analyses:

### Structure Levels

1. **Variant Dataset** (top level): The main dataset of variants (e.g., "Broad_neurological_disorders")
2. **Variant Subdataset** (optional): Subsets of variants (e.g., "optional_variant_subset" or "cluster_3")
3. **Model Dataset**: Named groups of models to analyze (e.g., "KUN_FB Models and KUN_HDMA_Eye")

### Path Structure

```
varbook_gen/
  {variant_dataset}/                  # e.g., Broad_neurological_disorders
    {variant_subdataset}/             # optional, e.g., cluster_3
      {model_dataset}/                # e.g., "KUN_FB Models"
        heatmap/                      # Model-level heatmap
          {models}.png
          {models}.md
        {variant_id}/                 # Per-variant analyses
          model-scatterplot/
          model-specificity-barplot/
          profiles/
```

### ToC Hierarchy

```
Broad_neurological_disorders (variant_dataset)
  optional_variant_subset (variant subdataset)
    KUN_FB Models and KUN_HDMA_Eye (model_dataset)
      Heatmap (component)
      Variants Table (component)
```

### Command Examples

**With model dataset name:**

```bash
varbook plot models heatmap variants.tsv score_fetal_brain logfc aaq \
  --variant-datasets Broad_neurological_disorders:cluster_3 \
  --models KUN_FB* KUN_HDMA_Eye_c13_astrocyte \
  --model-dataset "KUN_FB Models and KUN_HDMA_Eye"
```

**Without variant subdataset:**

```bash
varbook plot models heatmap variants.tsv score_fetal_brain logfc aaq \
  --variant-datasets Broad_neurological_disorders \
  --models KUN_FB* \
  --model-dataset "KUN_FB Models"
```

### Variant Selection

- For each model_dataset, only variants prioritized in ANY model in that set are included
- Source: `/oak/stanford/groups/akundaje/.../splits/broad.model_prioritized_by_any.tsv`
- Can be generated using `split-columns` and `merge-columns` CLI commands

### Snakemake Workflow Pattern

```python
# Pseudocode: shows the loop nesting only, not runnable as written
for variant_dataset in variant_datasets:
    for variant_subdataset in variant_subdatasets:  # optional level
        for model_set in model_sets:
            # Generate heatmap for all prioritized variants in model_set
            generate heatmap

            # Optionally cluster and generate per-variant plots
            for cluster in kmeans_clusters:
                if generate_variant_plots:
                    for variant in prioritized_variants:
                        varbook plot variant model-scatterplot
                        varbook plot variant model-specificity-barplot
                        varbook plot variant profiles
```

### Auto-generated Descriptions

All auto-generated `.before.md` and `.after.md` files should be minimal or empty to avoid verbosity.

## Logging

All thought processes and changes performed by you should be documented in the appropriate log in `./logs`.

Always include an outline of the changes you're making in the log, in readable form. Don't log code; outline overviews and updates instead. Using variable names is encouraged.

## Logging File Name Format

For changes to varscore modules, submodules, and commands, use this format for the file name: 'varscore.{module}.{submodule}.{cmd}.tsv'. See `./logs` for examples of file names.

## Writing logs

Write this before making any new additions to logs:

```
---
$(date)
```

### Examples:

```
---
Fri Nov 21 05:23:16 PM PST 2025
Modifying the `plot()` function to have optional legends for `jsd` clusters.
```

### Todo

After all the prototype-provided commands have been implemented, stop. After completing a todo, create a git commit. You may also add todo items needed to implement everything in the README.md. If you find any functional errors in the commands described in the varbook README.md, stop and let me know. Before writing code, create an outline of the solution for each task (which you can add to the log).

#### Todo list

Make sure you read and understand the Snakefile and varbook scripts before working on these TODOs.

- [ ] Implement cluster-level upset plots, then generate them
- [ ] Fix the variants' scatterplot generation rule in Snakefile; it's failing to generate
- [ ] Make sure the profiles rules use the --motifs-tsv option generated by finemo.
- [ ] Implement a `varbook plot models histogram`, to be used for our purposes for the number of HPO terms of the variants
- [ ] Create a snakemake rule to generate the HPO histograms at the model dataset level
- [ ] Create a snakemake rule to generate the HPO histograms at the cluster level
- [ ] Fix the scatterplots for the clusters, then generate them
- [ ] Rename "microglia-specific cluster (#3)" to "glutamatergic neuron 7 GoF cluster (#3)" in the Snakefile and in the files
- [ ] Generate plots for cluster #15, which should be named "glutamatergic neuron 2 to 7 and nIPC and interneurons 2 & 4 LoF cluster (#15)"
- [ ] Generate plots for cluster #17, which should be named "early & late radial glia, nIPC, and oIPC LoF cluster (#17)"
- [ ] Generate plots for cluster #27, which should be named "glutamatergic neuron 1 to 7, early & late radial glia, oIPC, nIPC, and more LoF cluster (#27)"
- [ ] Generate plots for cluster #14, which should be named "glutamatergic neuron 2 to 7 GoF cluster (#14)"
- [ ] Generate plots for cluster #10, which should be named "early & late radial glia and oIPC GoF cluster (#10)"
- [ ] Generate plots for cluster #9, which should be named "nIPC GoF cluster (#9)"
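
## Path Layout Sketch

As a reference for the Path Structure section, here is a minimal sketch of how a `--variant-datasets` value such as `Broad_neurological_disorders:cluster_3` could map onto the `varbook_gen/` layout. The helper names `parse_variant_dataset` and `model_dataset_dir` are hypothetical and are not part of varbook. The sketch also assumes that the subdataset directory is simply omitted when no subdataset is given; check the varbook README.md before relying on that.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_variant_dataset(spec: str) -> Tuple[str, Optional[str]]:
    """Split a 'dataset[:subdataset]' value into its two parts.

    Hypothetical helper; the colon syntax is taken from the command examples.
    """
    dataset, sep, subdataset = spec.partition(":")
    # An absent or empty subdataset is reported as None.
    return dataset, (subdataset if sep and subdataset else None)


def model_dataset_dir(root: Path, spec: str, model_dataset: str) -> Path:
    """Build {root}/{variant_dataset}/[{variant_subdataset}/]{model_dataset}.

    Assumption: the subdataset level is skipped when it is not provided.
    """
    dataset, subdataset = parse_variant_dataset(spec)
    path = root / dataset
    if subdataset is not None:
        path = path / subdataset
    return path / model_dataset


if __name__ == "__main__":
    out = model_dataset_dir(
        Path("varbook_gen"),
        "Broad_neurological_disorders:cluster_3",
        "KUN_FB Models",
    )
    print(out / "heatmap")
```

Keeping the parsing and path building in one place would let the Snakefile and the CLI agree on where each heatmap and per-variant plot is written.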