---
title: "Database Formats and Multi-Contig Support"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Database Formats and Multi-Contig Support}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r setup}
library(misha)
```

# Overview

Starting with misha 5.3.0, databases can be stored in two formats:

- **Indexed format** (default, recommended): Single unified files for sequences and tracks
- **Per-chromosome format** (legacy): Separate files for each chromosome

The indexed format provides better performance and scalability, especially for genomes with many contigs (>50 chromosomes).

## Key Features

- **Automatic format detection** - misha automatically detects which format your database uses
- **Fully backward compatible** - existing databases continue to work without modification
- **Transparent to users** - same API for both formats
- **Migration tools** - convert databases when convenient
- **Performance benefits** - 4-14% faster for large-scale analyses

# Database Formats

## Indexed Format (Recommended)

The indexed format uses unified files:

**Sequence data:**
- `seq/genome.seq` - All chromosome sequences concatenated
- `seq/genome.idx` - Index mapping chromosome names to positions

**Track data:**
- `tracks/mytrack.track/track.dat` - All chromosome data concatenated
- `tracks/mytrack.track/track.idx` - Index with offset/length per chromosome

**Advantages:**
- Fewer file descriptors (important for genomes with 100+ contigs)
- Better performance for large workloads (14% faster)
- Smaller disk footprint
- Faster track creation and conversion

## Per-Chromosome Format (Legacy)

The per-chromosome format uses separate files:

**Sequence data:**
- `seq/chr1.seq`, `seq/chr2.seq`, ... - One file per chromosome

**Track data:**
- `tracks/mytrack.track/chr1.track`, `chr2.track`, ... - One file per chromosome

**When to use:**
- Compatibility with older misha versions (<5.3.0)
- Small genomes (<25 chromosomes), where the performance difference is negligible
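
The two layouts can be told apart directly from the files on disk. A minimal sketch using only base R (`db_root` is a placeholder path, not a misha variable):

```{r, eval = FALSE}
db_root <- "/path/to/mydb"

# Indexed format: a single unified sequence file plus its index
file.exists(file.path(db_root, "seq", "genome.seq"))

# Legacy format: one .seq file per chromosome
length(Sys.glob(file.path(db_root, "seq", "chr*.seq"))) > 0
```

In practice you rarely need this: `gdb.info()` reports the format directly, as shown later in this vignette.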

# Creating Databases

## New Databases (Indexed Format)

By default, new databases use the indexed format:

```{r, eval = FALSE}
# Create database from FASTA file
gdb.create("mydb", "/path/to/genome.fa")

# Or download pre-built genome
gdb.create_genome("hg38", path = "/path/to/install")
```

## Force Legacy Format

To create a database in legacy format (for compatibility with older misha):

```{r, eval = FALSE}
# Set option before creating database
options(gmulticontig.indexed_format = FALSE)
gdb.create("mydb", "/path/to/genome.fa")
```

# Checking Database Format

Use `gdb.info()` to check your database format:

```{r, eval = FALSE}
gsetroot("/path/to/mydb")
info <- gdb.info()
print(info$format) # "indexed" or "per-chromosome"
```

Example output:

```{r, eval = FALSE}
info <- gdb.info()
# $path
# [1] "/path/to/mydb"
#
# $is_db
# [1] TRUE
#
# $format
# [1] "indexed"
#
# $num_chromosomes
# [1] 24
#
# $genome_size
# [1] 3095693983
```

# Converting Databases

## Convert Entire Database

Convert all tracks and sequences to indexed format:

```{r, eval = FALSE}
gsetroot("/path/to/mydb")
gdb.convert_to_indexed()
```

This will:
1. Convert sequence files (`chr*.seq` → `genome.seq` + `genome.idx`)
2. Convert all tracks to indexed format
3. Validate conversions
4. Remove old files after successful conversion

## Convert Individual Tracks

Convert specific tracks while keeping others in legacy format:

```{r, eval = FALSE}
# 1D tracks
gtrack.convert_to_indexed("mytrack")

# 2D tracks (rectangles and points)
gtrack.2d.convert_to_indexed("my2dtrack")
```

## Convert Intervals

Convert interval sets to indexed format:

```{r, eval = FALSE}
# 1D intervals
gintervals.convert_to_indexed("myintervals")

# 2D intervals
gintervals.2d.convert_to_indexed("my2dintervals")
```

# Migration Guide

## When to Migrate

**High priority (significant benefits):**
- Genomes with many contigs (>50 chromosomes)
- Analyses that frequently scan large regions (10M+ bp)
- 2D track workflows
- File descriptor limit issues

**Medium priority (moderate benefits):**
- Repeated extraction workflows
- Regular analyses on medium-sized regions (1-10M bp)

**Low priority (minimal benefits):**
- Small genomes (<25 chromosomes)
- One-off analyses
- Simple queries on small regions
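
As a rough rule of thumb, the chromosome count reported by `gdb.info()` can guide the decision. The helper below is illustrative only (thresholds taken from the guidance above; it is not part of misha):

```{r, eval = FALSE}
gsetroot("/path/to/mydb")
info <- gdb.info()

if (info$format == "indexed") {
    message("Already indexed - nothing to do")
} else if (info$num_chromosomes > 50) {
    message("High priority: many-contig genomes benefit most from conversion")
} else if (info$num_chromosomes < 25) {
    message("Low priority: performance difference is negligible")
} else {
    message("Medium priority: convert when convenient")
}
```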

## Migration Workflow

**Step 1: Backup (optional but recommended)**

```{r, eval = FALSE}
# Create backup of important database
system("cp -r /path/to/mydb /path/to/mydb.backup")
```

**Step 2: Check current format**

```{r, eval = FALSE}
gsetroot("/path/to/mydb")
info <- gdb.info()
print(paste("Current format:", info$format))
```

**Step 3: Convert**

```{r, eval = FALSE}
gdb.convert_to_indexed()
```

**Step 4: Verify**

```{r, eval = FALSE}
# Check format changed
info <- gdb.info()
print(paste("New format:", info$format))

# Test a few operations
result <- gextract("mytrack", gintervals(1, 0, 1000))
print(head(result))
```

**Step 5: Remove backup (after validation)**

```{r, eval = FALSE}
# After thorough testing
system("rm -rf /path/to/mydb.backup")
```

# Copying Tracks Between Databases

You can freely copy tracks between databases with different formats.

## Method 1: Export and Import

```{r, eval = FALSE}
# Export from source database
gsetroot("/path/to/source_db")
gextract("mytrack", gintervals.all(),
    iterator = "mytrack",
    file = "/tmp/mytrack.txt"
)

# Import to target database (format auto-detected)
gsetroot("/path/to/target_db")
gtrack.import("mytrack", "Copied track", "/tmp/mytrack.txt", binsize = 0)
# Automatically converted to target database format!
```

## Method 2: Batch Copy

```{r, eval = FALSE}
# Copy multiple tracks
tracks <- c("track1", "track2", "track3")

for (track in tracks) {
    # Export from source, and read the description while the
    # source database is still the active root
    gsetroot("/path/to/source_db")
    file_path <- sprintf("/tmp/%s.txt", track)
    gextract(track, gintervals.all(), iterator = track, file = file_path)
    descr <- gtrack.attr.get(track, "description")

    # Import into target
    gsetroot("/path/to/target_db")
    gtrack.import(track, descr, file_path, binsize = 0)
    unlink(file_path)
}
```

# Performance Comparison

Based on comprehensive benchmarks comparing indexed vs legacy formats:

## Operations Faster with Indexed Format

- **Very large workloads (10M+ bp)**: 14% faster
- **2D track operations**: 2% faster (was 36% slower in legacy)
- **Repeated extractions**: 14% faster
- **Real workflows**: 8% faster on average

## Operations Similar Performance

- Single chromosome extraction: Within 5%
- Multi-chromosome (10-22 chr): Within 1%
- PWM operations: Within 3%
- Small workloads (<1M bp): Within 10%

## Summary

- **64% of operations are faster** with indexed format
- **93% within 5%** of legacy format (statistically equal)
- **Average: 4% faster** across all benchmarks
- **No regressions** for common use cases

# Backward Compatibility

## Fully Compatible

- ✅ Existing databases work without modification
- ✅ Existing scripts work without changes
- ✅ Same API for both formats
- ✅ Automatic format detection
- ✅ Can mix formats in same analysis

## Example: Mixed Environment

```{r, eval = FALSE}
# Work with both formats in same session
gsetroot("/path/to/legacy_db")
data1 <- gextract("track1", gintervals(1, 0, 1000))

gsetroot("/path/to/indexed_db")
data2 <- gextract("track2", gintervals(1, 0, 1000))
```

# Troubleshooting

## "File descriptor limit reached"

This occurs with many-contig genomes in legacy format:

**Solution:** Convert to indexed format

```{r, eval = FALSE}
gdb.convert_to_indexed()
```
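
If you suspect you are close to the limit, you can inspect the current soft file-descriptor limit from R before converting (this shells out to a POSIX builtin, so it applies to Linux/macOS only and is not part of misha):

```{r, eval = FALSE}
# Print the shell's soft file-descriptor limit
system("ulimit -n")
```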

## "Track not found after copying files"

After manually copying track directories:

**Solution:** Reload database

```{r, eval = FALSE}
gdb.reload()
```

## "Conversion fails with disk space error"

Conversion temporarily needs free space equal to about twice the track size:

**Solution:** Free disk space or convert tracks individually

```{r, eval = FALSE}
# Convert one track at a time
gtrack.convert_to_indexed("track1")
gtrack.convert_to_indexed("track2")
# etc.
```
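
Rather than listing tracks by hand, `gtrack.ls()` can drive the loop. A sketch, assuming all listed tracks are 1D (2D tracks would need `gtrack.2d.convert_to_indexed()` instead):

```{r, eval = FALSE}
# Convert every 1D track in the database, one at a time
for (track in gtrack.ls()) {
    gtrack.convert_to_indexed(track)
}
```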

# Best Practices

## For New Projects

1. Use indexed format (default) for new databases
2. Use `gdb.create_genome()` for standard genomes
3. Use `gdb.create()` with multi-FASTA for custom genomes

## For Existing Projects

1. Check format with `gdb.info()`
2. Migrate if beneficial (many contigs, large analyses)
3. No rush - legacy format remains fully supported

# Summary

- **Indexed format is the default and recommended** for new databases
- **Legacy format remains fully supported** - no forced migration
- **Automatic format detection** - users don't need to think about it
- **Significant performance benefits** for large-scale analyses
- **Easy migration** with `gdb.convert_to_indexed()`
- **Fully backward compatible** - existing code works unchanged


