Getting started with vcfheader

The fast variant call format (VCF) file header intelligence and audit.

TLDR

The main workflow is to parse a VCF header and write a standalone HTML report.

input <- parse_vcf_header("file.vcf")

vcfheader(
  input,
  file = "file_report.html"
)

Example report preview

A preview of the bundled example HTML report is shown below.

The report includes file metadata, contig summaries, INFO and FORMAT field definitions, and other header entries in a portable HTML layout.

Overview

The vcfheader package reads and interprets Variant Call Format (VCF) header metadata without loading full variant records. It is designed for fast inspection, validation, and reporting of header structure in both small and large VCF files.

This vignette shows how to:

Bundled example files

The package examples use small VCF files shipped with the package, so the vignette works offline and is suitable for package checking.

simple_vcf <- system.file("extdata", "simple.vcf", package = "vcfheader")
sv_vcf <- system.file("extdata", "sv44.vcf", package = "vcfheader")

basename(simple_vcf)
#> [1] "simple.vcf"
basename(sv_vcf)
#> [1] "sv44.vcf"

Read raw header lines

Use read_vcf_header() when you want only the original header lines.

hdr_lines <- read_vcf_header(simple_vcf)
head(hdr_lines, 5)
#> [1] "##fileformat=VCFv4.3"                                                                                                  
#> [2] "##fileDate=20090805"                                                                                                   
#> [3] "##source=myImputationProgramV3.1"                                                                                      
#> [4] "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"                                                      
#> [5] "##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>"

Parse a VCF header

Use parse_vcf_header() to create a structured object containing file metadata, contigs, INFO and FORMAT definitions, sample names, warnings, and errors.

hdr <- parse_vcf_header(simple_vcf)
hdr
#> <vcf_hdr>
#>   fileformat: VCFv4.3 
#>   samples: 3 
#>   contigs: 1 
#>   INFO: 6  FORMAT: 4  FILTER: 2  ALT: 0

Inspect parsed content

The parsed object is a list-like S3 object with standard components.

names(hdr)
#>  [1] "file"        "samples"     "contigs"     "info"        "format"     
#>  [6] "filter"      "alt"         "sample_meta" "pedigree"    "other_kv"   
#> [11] "raw"         "warnings"    "errors"

File-level metadata:

file_meta <- hdr$file
file_meta$input_path <- basename(file_meta$input_path)
file_meta
#> $input_path
#> [1] "simple.vcf"
#> 
#> $fileformat
#> [1] "VCFv4.3"
#> 
#> $fileDate
#> [1] "20090805"
#> 
#> $source
#> [1] "myImputationProgramV3.1"
#> 
#> $reference
#> [1] "file:///seq/references/1000GenomesPilot-NCBI36.fasta"
#> 
#> $phasing
#> [1] "partial"

Sample names:

hdr$samples
#> [1] "NA00001" "NA00002" "NA00003"

INFO definitions:

hdr$info[, c("ID", "Number", "Type", "Description")]
#>   ID Number    Type                 Description
#> 1 NS      1 Integer Number of Samples With Data
#> 2 DP      1 Integer                 Total Depth
#> 3 AF      A   Float            Allele Frequency
#> 4 AA      1  String            Ancestral Allele
#> 5 DB      0    Flag dbSNP membership, build 129
#> 6 H2      0    Flag          HapMap2 membership

FORMAT definitions:

hdr$format[, c("ID", "Number", "Type", "Description")]
#>   ID Number    Type       Description
#> 1 GT      1  String          Genotype
#> 2 GQ      1 Integer  Genotype Quality
#> 3 DP      1 Integer        Read Depth
#> 4 HQ      2 Integer Haplotype Quality

Validation output:

hdr$warnings
#> character(0)
hdr$errors
#> character(0)

Inference from the header

The parser can add lightweight inferred annotations. For example, some common annotation systems can be guessed from reserved INFO tags.

hdr$file$caller_guess
#> NULL

Structural variant example

The structural-variant example includes symbolic ALT definitions and structural INFO fields.

sv_hdr <- parse_vcf_header(sv_vcf)
sv_hdr
#> <vcf_hdr>
#>   fileformat: VCFv4.4 
#>   samples: 1 
#>   contigs: 1 
#>   INFO: 9  FORMAT: 1  FILTER: 0  ALT: 6

Contigs:

sv_hdr$contigs[, intersect(c("ID", "length", "assembly", "md5"), names(sv_hdr$contigs)), drop = FALSE]
#>     ID  length assembly  md5
#> 1 chrA 1000000     <NA> <NA>

ALT definitions:

sv_hdr$alt[, c("ID", "Description")]
#>           ID                 Description
#> 1        INV                   Inversion
#> 2        INS                   Insertion
#> 3        DUP                 Duplication
#> 4 DUP:TANDEM          Tandem Duplication
#> 5        DEL                    Deletion
#> 6        CNV Copy number variable region

INFO fields:

sv_hdr$info[, c("ID", "Number", "Type", "Description")]
#>          ID Number    Type
#> 1    MATEID      A  String
#> 2       END      1 Integer
#> 3     CIPOS      . Integer
#> 4     SVLEN      A Integer
#> 5     CILEN      . Integer
#> 6     EVENT      A  String
#> 7 EVENTTYPE      A  String
#> 8   SVCLAIM      A  String
#> 9 IMPRECISE      0    Flag
#>                                                                                                           Description
#> 1                                                                                                 ID of mate breakend
#> 2                                                        End position of the longest variant described in this record
#> 3                                                     Confidence interval around POS for symbolic structural variants
#> 4                                                                                        Length of structural variant
#> 5                                                                             Confidence interval for the SVLEN field
#> 6                                                                                              ID of associated event
#> 7                                                                                            Type of associated event
#> 8 Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively
#> 9                                                                                      Imprecise structural variation

Generate an HTML report

vcfheader() writes a standalone HTML report. This is the main use case.


hdr <- parse_vcf_header("file.vcf")

out_file <- "file_report.html"

vcfheader(
  hdr,
  file = out_file
)

file.exists(out_file)
#> [1] TRUE
basename(out_file)
#> [1] "file_report.html"

You can also let vcfheader() derive the output path from the original input file path.

vcfheader(hdr)

Bundled example HTML report

The package also ships a prebuilt example report generated from the bundled simple.vcf file.

example_report <- system.file(
  "extdata",
  "simple_vcfheader.html",
  package = "vcfheader"
)


basename(example_report)
#> [1] "simple_vcfheader.html"
file.exists(example_report)
#> [1] TRUE

This file can be opened directly in a browser after installation.

Summary

vcfheader provides a quick way to inspect VCF header metadata, validate structure, extract structured definitions, and produce a portable HTML summary for review or reporting.

For research use only.

VCFheader is a free and open-source tool provided by Switzerland Omics. Use of the software and generated reports is permitted, including in commercial settings, under the MIT License. Attribution to Switzerland Omics and VCFheader should be retained where reasonably practicable. Further information: https://switzerlandomics.ch. Switzerland OmicsĀ® is a registered trade mark. VCF specification references in this report relate to samtools and the broader HTS specifications ecosystem and are distributed under the MIT/Expat License by Genome Research Ltd.Ā Source: https://github.com/samtools/samtools. Further reading on HTS and file format specifications: https://www.htslib.org/doc/#file-formats.