# Gene Expression Data Files

Generated: 2025-12-18 23:35:12.654007

## Dataset Information

- Total genes: 33355
- Total cells: 57868
- Clusters: 23 (c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22)
- PCW timepoints: 4 (pcw16, pcw20, pcw21, pcw24)
- Cluster×PCW combinations: 85 (of 92 possible; not every cluster is present at every timepoint)

## Directory Structure

### 01_by_cluster/
Expression matrices with genes as rows and clusters as columns.
- mean_expression_by_cluster.txt: Mean expression per gene per cluster
- median_expression_by_cluster.txt: Median expression per gene per cluster
- pct_expressing_by_cluster.txt: % of cells expressing each gene per cluster

### 02_by_pcw/
Expression matrices with genes as rows and PCW timepoints as columns.
- mean_expression_by_pcw.txt: Mean expression per gene per PCW
- median_expression_by_pcw.txt: Median expression per gene per PCW
- pct_expressing_by_pcw.txt: % of cells expressing each gene per PCW

### 03_by_cluster_and_pcw/
Combined cluster×PCW analysis.
- mean_expression_cluster_pcw.txt: Mean expression per gene for each cluster×PCW combo
- pct_expressing_cluster_pcw.txt: % expressing for each cluster×PCW combo
- metadata_cluster_pcw.txt: Information about each combination (cell counts)
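
The column naming used in the combined files is not documented here. Assuming each column is labelled `<cluster>_<pcw>` (for example `c3_pcw20`, a hypothetical format that should be checked against the real header), the matrix can be reshaped so that clusters and timepoints are addressed separately. A minimal sketch on made-up values:

```python
import re

import pandas as pd

# ASSUMPTION: labels look like "<cluster>_<pcw>", e.g. "c3_pcw20".
CONDITION_RE = re.compile(r"^(c\d+)_(pcw\d+)$")


def split_condition(label):
    """Split a combined column label into (cluster, pcw)."""
    match = CONDITION_RE.match(label)
    if match is None:
        raise ValueError(f"unexpected column label: {label!r}")
    return match.group(1), match.group(2)


def to_multiindex_columns(df):
    """Return a copy of df whose columns are a (cluster, pcw) MultiIndex."""
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(
        [split_condition(c) for c in df.columns], names=["cluster", "pcw"]
    )
    return out


# Toy stand-in for mean_expression_cluster_pcw.txt (IDs and values are made up).
toy = pd.DataFrame(
    {"c0_pcw16": [0.5, 0.0], "c0_pcw20": [0.7, 0.1], "c1_pcw16": [0.0, 2.0]},
    index=pd.Index(["ENSG00000000001", "ENSG00000000002"], name="gene_id"),
)
by_condition = to_multiindex_columns(toy)

# All available timepoints for cluster c0, one column per PCW.
c0_over_time = by_condition["c0"]
```

Raising on an unexpected label, rather than skipping it, makes a wrong guess about the naming scheme visible immediately.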

### 04_individual_pcw/
Separate files for each PCW timepoint showing cluster breakdowns.
- pcw16_mean_by_cluster.txt, pcw16_pct_by_cluster.txt
- pcw20_mean_by_cluster.txt, pcw20_pct_by_cluster.txt
- pcw21_mean_by_cluster.txt, pcw21_pct_by_cluster.txt
- pcw24_mean_by_cluster.txt, pcw24_pct_by_cluster.txt
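
Because the per-timepoint files follow a regular naming pattern, their paths can be built rather than typed out. The helper below is a sketch; `individual_pcw_paths` and its `root` and `kind` parameters are illustrative names, not part of the dataset.

```python
from pathlib import Path

# Timepoints listed under "Dataset Information".
PCW_TIMEPOINTS = ["pcw16", "pcw20", "pcw21", "pcw24"]


def individual_pcw_paths(root, kind="mean"):
    """Map each PCW timepoint to its per-cluster file under 04_individual_pcw/.

    kind is "mean" or "pct", matching the two file families in that directory.
    """
    if kind not in {"mean", "pct"}:
        raise ValueError(f"kind must be 'mean' or 'pct', got {kind!r}")
    folder = Path(root) / "04_individual_pcw"
    return {pcw: folder / f"{pcw}_{kind}_by_cluster.txt" for pcw in PCW_TIMEPOINTS}


# Example: paths relative to the dataset root directory.
mean_paths = individual_pcw_paths(".", kind="mean")
```

Each path can then be passed to `pd.read_table(path, index_col=0)` as in the usage examples.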

### 05_gene_annotations/
Gene-level metadata and annotations.
- gene_metadata.txt: Overall statistics for each gene
- highly_variable_genes.txt: Top 2000 most variable genes

### 06_summary/
Summary statistics and top gene lists.
- cell_counts_per_condition.txt: Number of cells in each cluster×PCW combo
- top50_genes_per_cluster.txt: Top 50 genes for each cluster
- top50_genes_per_pcw.txt: Top 50 genes for each PCW timepoint
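
How the top 50 lists were ranked is not stated here. One plausible reading, shown purely as an assumption, ranks genes by mean expression within each column. The sketch below applies that ranking to a toy matrix; `top_genes_per_column` is a hypothetical helper.

```python
import pandas as pd


def top_genes_per_column(expr, n=50):
    """Return {column: [gene IDs]} with the n highest-valued genes per column.

    ASSUMPTION: "top" means highest mean expression; the shipped lists may
    have been built with a different criterion.
    """
    return {col: expr[col].nlargest(n).index.tolist() for col in expr.columns}


# Toy genes x clusters matrix (IDs and values are made up).
toy_expr = pd.DataFrame(
    {"c0": [0.1, 2.0, 1.5, 0.0], "c1": [3.0, 0.2, 0.0, 1.0]},
    index=pd.Index(
        ["ENSG00000000001", "ENSG00000000002", "ENSG00000000003", "ENSG00000000004"],
        name="gene_id",
    ),
)
top2 = top_genes_per_column(toy_expr, n=2)
```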

## File Format

All files are tab-delimited text (.txt) with:
- Header row with column names
- First column: gene_id (Ensembl ID)
- No quotes around strings
- Missing values: represented as 0
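
A loader that checks these conventions on read can catch a mis-parsed file early. The following is a minimal sketch, assuming the gene-level files use unversioned human Ensembl gene IDs in the first column; `load_expression_table` is an illustrative name, and the check may need relaxing for files that are not keyed by gene.

```python
import csv
import io
import re

import pandas as pd

# ASSUMPTION: unversioned human Ensembl gene IDs (ENSG + 11 digits).
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}$")


def load_expression_table(source):
    """Read a tab-delimited genes x conditions table and validate its index."""
    table = pd.read_csv(source, sep="\t", index_col=0, quoting=csv.QUOTE_NONE)
    bad_ids = [g for g in table.index if not ENSEMBL_GENE_RE.match(str(g))]
    if bad_ids:
        raise ValueError(f"{len(bad_ids)} non-Ensembl IDs, e.g. {bad_ids[0]!r}")
    if table.index.has_duplicates:
        raise ValueError("duplicate gene IDs in first column")
    return table


# Toy file content standing in for a real matrix (values are made up).
toy_file = io.StringIO(
    "gene_id\tc0\tc1\n"
    "ENSG00000000001\t0.5\t0\n"
    "ENSG00000000002\t0\t1.25\n"
)
toy_table = load_expression_table(toy_file)
```

`quoting=csv.QUOTE_NONE` mirrors the "no quotes around strings" convention, so a stray quote character is read literally instead of swallowing fields.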

## Usage Examples

### R
```R
# Load cluster expression data
cluster_expr <- read.table('01_by_cluster/mean_expression_by_cluster.txt',
                          header=TRUE, row.names=1, sep='\t')

# Get expression of a specific gene by Ensembl ID (e.g., KRAS)
kras_expr <- cluster_expr['ENSG00000133703', ]

# Find top genes in cluster c15
top_c15 <- cluster_expr[order(cluster_expr$c15, decreasing=TRUE), ]
head(top_c15, 20)
```

### Python (pandas)
```python
import pandas as pd

# Load cluster expression data
cluster_expr = pd.read_table('01_by_cluster/mean_expression_by_cluster.txt',
                            index_col=0)

# Get expression of a specific gene by Ensembl ID (e.g., KRAS)
kras_expr = cluster_expr.loc['ENSG00000133703']

# Find top genes in cluster c15
top_c15 = cluster_expr.sort_values('c15', ascending=False)
print(top_c15.head(20))
```

## Notes

- Gene IDs are Ensembl gene IDs (format: ENSG followed by 11 digits, e.g., ENSG00000133703)
- Expression values are normalized (log-transformed)
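- A mean expression of 0 means the gene was not detected in any cell of that group; a median of 0 only means it was not detected in at least half of the cells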
- Percentage expressing: % of cells with expression > 0
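
The relationship between the mean, median, and percentage-expressing values can be illustrated on a toy per-cell matrix. This is a sketch of the definitions above on made-up numbers, not the pipeline that produced the files; all variable names are illustrative.

```python
import pandas as pd

# Toy per-cell values: rows are cells, columns are genes (made-up numbers).
cells = pd.DataFrame(
    {
        "geneA": [0.0, 0.0, 0.0, 0.0, 1.2],
        "geneB": [0.0, 0.0, 0.0, 0.0, 0.0],
        "geneC": [0.5, 1.0, 0.0, 2.0, 0.3],
    },
    index=["cell1", "cell2", "cell3", "cell4", "cell5"],
)
# Cluster label for each cell.
cluster_of_cell = pd.Series(["c0", "c0", "c1", "c1", "c1"], index=cells.index)

grouped = cells.groupby(cluster_of_cell)
# Transpose so genes are rows and clusters are columns, as in 01_by_cluster/.
mean_by_cluster = grouped.mean().T
median_by_cluster = grouped.median().T
# Percentage expressing: share of cells with expression > 0, times 100.
pct_by_cluster = (cells > 0).astype(float).groupby(cluster_of_cell).mean().T * 100
```

Here `geneA` in `c1` has a positive mean but a median of 0, because only one of the three cells expresses it. `geneB` is 0 in every table because no cell expresses it.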