# Merged Expression Data Files

Generated: 2025-12-18

## Files Created

### 1. MASTER_all_expression_wide.txt (70 MB)
**Format:** WIDE - One row per gene, many columns
**Columns:** 225 total
  - gene_id
  - mean_c0 through mean_c22 (23 cluster means)
  - pct_c0 through pct_c22 (23 cluster percentages)
  - mean_pcw16, mean_pcw20, mean_pcw21, mean_pcw24 (4 PCW means)
  - pct_pcw16, pct_pcw20, pct_pcw21, pct_pcw24 (4 PCW percentages)
  - mean_c0_pcw16 through mean_c22_pcw24 (85 cluster×PCW means; only 85 of the 92 possible cluster×PCW combinations have columns)
  - pct_c0_pcw16 through pct_c22_pcw24 (85 cluster×PCW percentages)

**Best for:** Quick lookups, Excel, human reading

### 2. MASTER_all_expression_long.txt (430 MB)
**Format:** LONG - One row per gene-condition-metric combination
**Columns:** 5
  - gene_id: Ensembl ID
  - condition: Which cluster/PCW/combination (e.g., "c15", "pcw20", "c15_pcw20")
  - condition_type: "cluster", "pcw", or "cluster_pcw"
  - metric: "mean_expression" or "pct_expressing"
  - value: Numeric value of the given metric for that gene and condition

**Rows:** 7,471,520
**Best for:** Databases, filtering, R/Python analysis
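
How the long layout relates to the wide one can be sketched with a pandas `melt`. This is a minimal illustration on made-up toy values, not the script that generated these files; the `classify` helper, the `long_df` name, and the toy gene IDs are hypothetical. With 224 value columns per gene in the wide file, the 7,471,520 rows reported above are consistent with 33,355 genes (7,471,520 / 224), assuming every gene has one row per column.

```python
import io
import pandas as pd

# Toy stand-in for the wide file (values are made up for illustration).
toy_wide = io.StringIO(
    "gene_id\tmean_c0\tpct_c0\tmean_pcw16\tpct_pcw16\tmean_c0_pcw16\tpct_c0_pcw16\n"
    "GENE_A\t0.5\t10.0\t1.1\t20.0\t0.4\t8.0\n"
    "GENE_B\t3.0\t80.0\t1.6\t70.0\t2.9\t75.0\n"
)
wide = pd.read_table(toy_wide)

# One row per gene-column pair.
long_df = wide.melt(id_vars="gene_id", var_name="column", value_name="value")

# Split e.g. "mean_c0_pcw16" into the prefix "mean" and the condition "c0_pcw16".
parts = long_df["column"].str.split("_", n=1, expand=True)
long_df["metric"] = parts[0].map({"mean": "mean_expression", "pct": "pct_expressing"})
long_df["condition"] = parts[1]


def classify(condition: str) -> str:
    """Hypothetical helper: derive condition_type from the condition label."""
    if "_" in condition:
        return "cluster_pcw"
    return "pcw" if condition.startswith("pcw") else "cluster"


long_df["condition_type"] = long_df["condition"].map(classify)
long_df = long_df[["gene_id", "condition", "condition_type", "metric", "value"]]
```

Each gene contributes one long row per value column, so the two toy genes with six value columns yield twelve rows.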

### 3. COMBINED_by_cluster.txt (18 MB)
**Format:** Mean and pct side-by-side for each cluster
**Columns:** gene_id, c0_mean, c0_pct, c1_mean, c1_pct, ...
**Best for:** Easy comparison of mean vs pct for clusters

### 4. COMBINED_by_pcw.txt (4.3 MB)
**Format:** Mean and pct side-by-side for each PCW
**Columns:** gene_id, pcw16_mean, pcw16_pct, pcw20_mean, pcw20_pct, ...
**Best for:** Easy comparison of mean vs pct for timepoints

### 5. COMBINED_by_cluster_pcw.txt (50 MB)
**Format:** Mean and pct side-by-side for cluster×PCW combinations
**Columns:** gene_id, c0_pcw16_mean, c0_pcw16_pct, c0_pcw20_mean, ...
**Best for:** Detailed temporal analysis within clusters
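
As a sketch of the temporal analysis this file is meant for, the snippet below pulls one gene's values for one cluster and arranges them as a small PCW-by-metric table. It assumes the `<cluster>_<pcw>_<metric>` column naming shown above; the `cluster_trajectory` helper and the toy data are hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for COMBINED_by_cluster_pcw.txt (values are made up).
toy = io.StringIO(
    "gene_id\tc0_pcw16_mean\tc0_pcw16_pct\tc0_pcw20_mean\tc0_pcw20_pct"
    "\tc1_pcw16_mean\tc1_pcw16_pct\n"
    "GENE_A\t0.4\t8.0\t0.9\t22.0\t2.0\t60.0\n"
)
data = pd.read_table(toy, index_col=0)


def cluster_trajectory(df, gene, cluster):
    """Return a table indexed by PCW with 'mean' and 'pct' columns."""
    # The trailing "_pcw" keeps "c1" from also matching "c10", "c11", ...
    prefix = f"{cluster}_pcw"
    cols = [c for c in df.columns if c.startswith(prefix)]
    # "c0_pcw16_mean" -> ("pcw16", "mean")
    keys = [tuple(c.split("_")[1:]) for c in cols]
    index = pd.MultiIndex.from_tuples(keys, names=["pcw", "metric"])
    values = df.loc[gene, cols].to_numpy()
    return pd.Series(values, index=index).unstack("metric")


traj = cluster_trajectory(data, "GENE_A", "c0")
```

Combinations that are absent from the file simply do not appear as rows in the result.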

## Usage Examples

### WIDE FORMAT (MASTER_all_expression_wide.txt)

R:
```R
data <- read.table("expression_data/MASTER_all_expression_wide.txt",
                   header=TRUE, row.names=1, sep="\t")

# Get SBF2 (ENSG00000133812) across all conditions
sbf2 <- data["ENSG00000133812", ]

# Get mean in cluster c15
sbf2_c15_mean <- data["ENSG00000133812", "mean_c15"]
```

Python:
```python
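import pandas as pd
data = pd.read_table("expression_data/MASTER_all_expression_wide.txt", index_col=0)
# Get SBF2 (ENSG00000133812) across all conditions
sbf2 = data.loc["ENSG00000133812"]
# Get mean in cluster c15
sbf2_c15_mean = data.loc["ENSG00000133812", "mean_c15"]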
```
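
Because the wide file mixes cluster, PCW, and cluster×PCW columns, it is often useful to select one group by name pattern. The sketch below uses toy values and assumes the column naming listed under "Files Created"; the anchored regular expression is one possible way to pick only the per-cluster means.

```python
import io
import pandas as pd

# Toy stand-in for MASTER_all_expression_wide.txt (values are made up).
toy_wide = io.StringIO(
    "gene_id\tmean_c0\tmean_c1\tpct_c0\tpct_c1\tmean_pcw16\tmean_c0_pcw16\n"
    "GENE_A\t0.5\t2.0\t10.0\t60.0\t1.1\t0.4\n"
    "GENE_B\t3.0\t0.1\t80.0\t5.0\t1.6\t2.9\n"
)
data = pd.read_table(toy_wide, index_col=0)

# Anchored pattern: keeps mean_c0, mean_c1, ... but not mean_pcw16 or mean_c0_pcw16.
cluster_means = data.filter(regex=r"^mean_c\d+$")

# For each gene, the cluster column with the highest mean expression.
top_cluster = cluster_means.idxmax(axis=1)
```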

### LONG FORMAT (MASTER_all_expression_long.txt)

R:
```R
library(dplyr)
data <- read.table("expression_data/MASTER_all_expression_long.txt",
                   header=TRUE, sep="\t")

# Get all SBF2 (ENSG00000133812) data
sbf2_data <- data %>% filter(gene_id == "ENSG00000133812")

# Get cluster means only
sbf2_cluster_means <- data %>%
  filter(gene_id == "ENSG00000133812",
         condition_type == "cluster",
         metric == "mean_expression")
```

Python:
```python
import pandas as pd
data = pd.read_table("expression_data/MASTER_all_expression_long.txt")

# Get SBF2 (ENSG00000133812) cluster means
sbf2_cluster_means = data[
    (data["gene_id"] == "ENSG00000133812") &
    (data["condition_type"] == "cluster") &
    (data["metric"] == "mean_expression")
]
```
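
At 430 MB, the long file may be slow or memory-hungry to load in one call. One option is to read it in chunks and keep only the rows of interest. The sketch below runs on a toy in-memory table; the `load_gene` helper and the chunk size are illustrative assumptions, not tuned values.

```python
import io
import pandas as pd

# Toy stand-in for MASTER_all_expression_long.txt (values are made up).
toy_long = io.StringIO(
    "gene_id\tcondition\tcondition_type\tmetric\tvalue\n"
    "GENE_A\tc0\tcluster\tmean_expression\t0.5\n"
    "GENE_A\tc0\tcluster\tpct_expressing\t10.0\n"
    "GENE_B\tc0\tcluster\tmean_expression\t3.0\n"
    "GENE_A\tpcw16\tpcw\tmean_expression\t1.1\n"
    "GENE_A\tc1\tcluster\tmean_expression\t2.0\n"
)


def load_gene(source, gene_id, chunksize=100_000):
    """Read a long-format table in chunks, keeping only one gene's rows."""
    kept = []
    for chunk in pd.read_table(source, chunksize=chunksize):
        kept.append(chunk[chunk["gene_id"] == gene_id])
    return pd.concat(kept, ignore_index=True)


# A tiny chunk size here just exercises the chunking on the toy data.
gene_a = load_gene(toy_long, "GENE_A", chunksize=2)
```

For the real file, `source` would be the path `expression_data/MASTER_all_expression_long.txt`.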

### COMBINED FORMAT (easier mean vs pct comparison)

R:
```R
data <- read.table("expression_data/COMBINED_by_cluster.txt",
                   header=TRUE, row.names=1, sep="\t")

# Find genes with high mean AND high pct in c15
markers <- data[data$c15_mean > 1.0 & data$c15_pct > 50, ]
```
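
A Python counterpart to the R filter above might look like the following. It runs on toy values and, like the R example, assumes the pct columns are on a 0 to 100 scale; the thresholds are the same illustrative ones.

```python
import io
import pandas as pd

# Toy stand-in for COMBINED_by_cluster.txt (values are made up).
toy = io.StringIO(
    "gene_id\tc15_mean\tc15_pct\n"
    "GENE_A\t2.5\t80.0\n"
    "GENE_B\t0.2\t90.0\n"
    "GENE_C\t1.8\t30.0\n"
)
data = pd.read_table(toy, index_col=0)

# Find genes with high mean AND high pct in c15
markers = data[(data["c15_mean"] > 1.0) & (data["c15_pct"] > 50)]

# Strongest candidates first
markers = markers.sort_values("c15_mean", ascending=False)
```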

## Which File to Use?

| Task | Recommended File |
|------|-----------------|
| Quick gene lookup | MASTER_all_expression_wide.txt or COMBINED_by_*.txt |
| Filtering by condition | MASTER_all_expression_long.txt |
| Database import | MASTER_all_expression_long.txt |
| Excel analysis | MASTER_all_expression_wide.txt or COMBINED_by_*.txt |
| Compare mean vs pct | COMBINED_by_*.txt |
| SQL queries | MASTER_all_expression_long.txt |
| Machine learning | MASTER_all_expression_long.txt |
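
For the database and SQL rows in the table, a minimal sketch of loading the long format into SQLite with only the Python standard library is shown below. The table name `expression`, the index name, and the toy rows are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Toy stand-in for MASTER_all_expression_long.txt (values are made up).
toy_long = io.StringIO(
    "gene_id\tcondition\tcondition_type\tmetric\tvalue\n"
    "GENE_A\tc0\tcluster\tmean_expression\t0.5\n"
    "GENE_A\tc1\tcluster\tmean_expression\t2.0\n"
    "GENE_A\tc1\tcluster\tpct_expressing\t60.0\n"
    "GENE_B\tc0\tcluster\tmean_expression\t3.0\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression ("
    "gene_id TEXT, condition TEXT, condition_type TEXT, metric TEXT, value REAL)"
)

reader = csv.reader(toy_long, delimiter="\t")
next(reader)  # skip the header row
conn.executemany("INSERT INTO expression VALUES (?, ?, ?, ?, ?)", reader)

# An index on gene_id keeps per-gene lookups fast on a large table.
conn.execute("CREATE INDEX idx_expression_gene ON expression (gene_id)")
conn.commit()

# Cluster means for one gene, highest first.
rows = conn.execute(
    "SELECT condition, value FROM expression "
    "WHERE gene_id = ? AND condition_type = 'cluster' "
    "AND metric = 'mean_expression' ORDER BY value DESC",
    ("GENE_A",),
).fetchall()
```

The long layout maps directly onto a single five-column table, which is why it is the suggested input for database work.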

## Total Size

All merged files: ~572 MB

Original files remain in the subdirectories 01_by_cluster/, 02_by_pcw/, and 03_by_cluster_and_pcw/.