# Merged Expression Data Files

Generated: 2025-12-18

## Files Created

### 1. MASTER_all_expression_wide.txt (70 MB)

**Format:** WIDE - One row per gene, many columns

**Columns:** 225 total
- gene_id
- mean_c0 through mean_c22 (23 cluster means)
- pct_c0 through pct_c22 (23 cluster percentages)
- mean_pcw16, mean_pcw20, mean_pcw21, mean_pcw24 (4 PCW means)
- pct_pcw16, pct_pcw20, pct_pcw21, pct_pcw24 (4 PCW percentages)
- mean_c0_pcw16 through mean_c22_pcw24 (85 cluster×PCW means)
- pct_c0_pcw16 through pct_c22_pcw24 (85 cluster×PCW percentages)

**Best for:** Quick lookups, Excel, human reading

### 2. MASTER_all_expression_long.txt (430 MB)

**Format:** LONG - One row per gene-condition-metric combination

**Columns:** 5
- gene_id: Ensembl ID
- condition: Which cluster/PCW/combination (e.g., "c15", "pcw20", "c15_pcw20")
- condition_type: "cluster", "pcw", or "cluster_pcw"
- metric: "mean_expression" or "pct_expressing"
- value: The actual value

**Rows:** 7,471,520

**Best for:** Databases, filtering, R/Python analysis

### 3. COMBINED_by_cluster.txt (18 MB)

**Format:** Mean and pct side-by-side for each cluster

**Columns:** gene_id, c0_mean, c0_pct, c1_mean, c1_pct, ...

**Best for:** Easy comparison of mean vs pct for clusters

### 4. COMBINED_by_pcw.txt (4.3 MB)

**Format:** Mean and pct side-by-side for each PCW

**Columns:** gene_id, pcw16_mean, pcw16_pct, pcw20_mean, pcw20_pct, ...

**Best for:** Easy comparison of mean vs pct for timepoints

### 5. COMBINED_by_cluster_pcw.txt (50 MB)

**Format:** Mean and pct side-by-side for cluster×PCW combinations

**Columns:** gene_id, c0_pcw16_mean, c0_pcw16_pct, c0_pcw20_mean, ...

**Best for:** Detailed temporal analysis within clusters

## Usage Examples

### WIDE FORMAT (MASTER_all_expression_wide.txt)

R:

```R
data <- read.table("expression_data/MASTER_all_expression_wide.txt",
                   header=TRUE, row.names=1, sep="\t")

# Get SBF2 across all conditions
sbf2 <- data["ENSG00000133703", ]

# Get mean in cluster c15
sbf2_c15_mean <- data["ENSG00000133703", "mean_c15"]
```

Python:

```python
import pandas as pd

data = pd.read_table("expression_data/MASTER_all_expression_wide.txt", index_col=0)
sbf2 = data.loc["ENSG00000133703"]
```

### LONG FORMAT (MASTER_all_expression_long.txt)

R:

```R
library(dplyr)

data <- read.table("expression_data/MASTER_all_expression_long.txt",
                   header=TRUE, sep="\t")

# Get all SBF2 data
sbf2_data <- data %>% filter(gene_id == "ENSG00000133703")

# Get cluster means only
sbf2_cluster_means <- data %>%
  filter(gene_id == "ENSG00000133703",
         condition_type == "cluster",
         metric == "mean_expression")
```

Python:

```python
import pandas as pd

data = pd.read_table("expression_data/MASTER_all_expression_long.txt")

# Get SBF2 cluster means
sbf2_cluster_means = data[
    (data["gene_id"] == "ENSG00000133703")
    & (data["condition_type"] == "cluster")
    & (data["metric"] == "mean_expression")
]
```

### COMBINED FORMAT (easier mean vs pct comparison)

R:

```R
data <- read.table("expression_data/COMBINED_by_cluster.txt",
                   header=TRUE, row.names=1, sep="\t")

# Find genes with high mean AND high pct in c15
markers <- data[data$c15_mean > 1.0 & data$c15_pct > 50, ]
```

## Which File to Use?

| Task | Recommended File |
|------|-----------------|
| Quick gene lookup | MASTER_wide or COMBINED files |
| Filtering by condition | MASTER_long |
| Database import | MASTER_long |
| Excel analysis | MASTER_wide or COMBINED files |
| Compare mean vs pct | COMBINED files |
| SQL queries | MASTER_long |
| Machine learning | MASTER_long |

## Total Size

All merged files: ~572 MB

Original files remain in subdirectories (01_by_cluster/, 02_by_pcw/, 03_by_cluster_and_pcw/)
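
## Sketch: How Wide Columns Map to Long Rows

The wide and long MASTER files describe the same values in two layouts. The sketch below shows one way the wide column names (for example `mean_c15_pcw20`) could map onto the long-format fields `condition`, `condition_type`, and `metric`. It is an illustration built from the column descriptions above, not the script that generated the files. The helper names `parse_wide_column` and `wide_to_long` are hypothetical, and the toy values are made up. As a consistency check, the 224 value columns of the wide file (225 minus `gene_id`) divide the 7,471,520 long rows evenly, which would correspond to 33,355 genes if every gene has every value.

```python
import pandas as pd

# Wide-column prefix -> value used in the long file's "metric" column
METRIC_NAMES = {"mean": "mean_expression", "pct": "pct_expressing"}


def parse_wide_column(column):
    """Split a wide column name into (condition, condition_type, metric)."""
    prefix, condition = column.split("_", 1)
    if "_" in condition:
        condition_type = "cluster_pcw"   # e.g. "c15_pcw20"
    elif condition.startswith("pcw"):
        condition_type = "pcw"           # e.g. "pcw20"
    else:
        condition_type = "cluster"       # e.g. "c15"
    return condition, condition_type, METRIC_NAMES[prefix]


def wide_to_long(wide):
    """Reshape a wide table (index named gene_id) into the five long columns."""
    melted = wide.reset_index().melt(
        id_vars="gene_id", var_name="column", value_name="value"
    )
    parsed = pd.DataFrame(
        [parse_wide_column(c) for c in melted["column"]],
        columns=["condition", "condition_type", "metric"],
        index=melted.index,
    )
    combined = pd.concat([melted[["gene_id", "value"]], parsed], axis=1)
    return combined[["gene_id", "condition", "condition_type", "metric", "value"]]


# Toy values, made up for illustration; not taken from the real files.
toy = pd.DataFrame(
    {"mean_c15": [1.2], "pct_c15": [40.0], "mean_pcw20": [0.8], "pct_c15_pcw20": [35.0]},
    index=pd.Index(["GENE_A"], name="gene_id"),
)
long_toy = wide_to_long(toy)
print(long_toy)
```

To try this on the real data, pass the table loaded with `index_col=0` from the wide-format Python example to `wide_to_long`.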
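
## Sketch: Reading the Long File in Chunks

At 430 MB and about 7.5 million rows, the long file can be slow or memory-hungry to load in one call. A minimal sketch, assuming the file is tab-separated with the five columns listed above, is to stream it in chunks and keep only the rows for one gene. The function name `read_gene_from_long` and the default chunk size are illustrative choices, not part of the original pipeline.

```python
import pandas as pd


def read_gene_from_long(path, gene_id, chunksize=500_000):
    """Return all long-format rows for one gene without loading the whole file."""
    matches = []
    # chunksize makes read_csv yield DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(path, sep="\t", chunksize=chunksize):
        matches.append(chunk[chunk["gene_id"] == gene_id])
    if not matches:
        # File had no data rows at all
        return pd.DataFrame(
            columns=["gene_id", "condition", "condition_type", "metric", "value"]
        )
    return pd.concat(matches, ignore_index=True)


# Example call, using the path and gene ID from the examples above:
# sbf2 = read_gene_from_long(
#     "expression_data/MASTER_all_expression_long.txt", "ENSG00000133703"
# )
```

The result can then be filtered on `condition_type` and `metric` in the same way as the long-format Python example.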