Subtitle Text Analysis with subtools

2026-03-20

Overview

subtools reads and manipulates video subtitle files from a variety of formats (SubRip .srt, WebVTT .vtt, SubStation Alpha .ass/.ssa, SubViewer .sub, MicroDVD .sub) and exposes them as tidy tibbles ready for text analysis.

This vignette walks through:

  1. Reading subtitle files
  2. Exploring the subtitles object
  3. Cleaning subtitles
  4. Combining subtitles
  5. Reading an entire series
  6. Adjusting timecodes
  7. Writing subtitles back to disk
  8. Text analysis with tidytext
  9. Cross-episode analysis

1. Reading subtitles

From a file

read_subtitles() is the main entry point. It auto-detects the file format from the extension and returns a subtitles object — a tibble with four core columns: ID, Timecode_in, Timecode_out, and Text_content.

f_srt <- system.file("extdata", "ex_subrip.srt", package = "subtools")
subs <- read_subtitles(file = f_srt)
subs
#> # A tibble: 6 × 4
#>   ID    Timecode_in Timecode_out Text_content                                   
#>   <chr> <time>      <time>       <chr>                                          
#> 1 1     00'22.5"    00'24.1"     Lorem ipsum dolor sit amet, consectetur adipis…
#> 2 2     00'25.5"    00'27.1"     Donec eu nisl commodo, elementum dui ut, gravi…
#> 3 3     00'28.7"    00'29.1"     Nulla aliquam,                                 
#> 4 4     00'29.9"    00'31.1"     nibh cursus interdum volutpat,                 
#> 5 5     00'31.9"    00'33.1"     dolor lacus hendrerit tellus, vel faucibus jus…
#> 6 6     00'33.9"    00'34.8"     Suspendisse potenti.

The same call works for every supported format. Use format = "auto" (default) or supply the format explicitly.

f_vtt <- system.file("extdata", "ex_webvtt.vtt", package = "subtools")
read_subtitles(file = f_vtt, format = "webvtt")
#> # A tibble: 3 × 4
#>   ID              Timecode_in Timecode_out Text_content                         
#>   <chr>           <time>      <time>       <chr>                                
#> 1 1               00'01"      00'04"       Never drink liquid nitrogen.         
#> 2 2               00'05"      00'09"       — It will perforate your stomach. — …
#> 3 A dangerous cue 00'11"      00'14"       Dès Noël où un zéphyr haï me vêt de …

f_ass <- system.file("extdata", "ex_substation.ass", package = "subtools")
read_subtitles(file = f_ass, format = "substation")
#> # A tibble: 6 × 4
#>   ID    Timecode_in Timecode_out Text_content                                   
#>   <chr> <time>      <time>       <chr>                                          
#> 1 1     00'22.5"    00'24.1"     Lorem ipsum dolor sit amet, consectetur adipis…
#> 2 2     00'25.5"    00'27.1"     Donec eu nisl commodo, elementum dui ut, gravi…
#> 3 3     00'28.7"    00'29.1"     Nulla aliquam,                                 
#> 4 4     00'29.9"    00'31.1"     nibh cursus interdum volutpat,                 
#> 5 5     00'31.9"    00'33.1"     dolor lacus hendrerit tellus, vel faucibus jus…
#> 6 6     00'33.9"    00'34.8"     Suspendisse potenti.

Attaching metadata at read time

Any descriptive information — season, episode, source, language — can be attached as a one-row tibble via the metadata argument. The values are repeated for every subtitle line, keeping the tidy structure intact.

subs_meta <- read_subtitles(
  file = f_srt,
  metadata = tibble::tibble(Season = 1L, Episode = 3L, Language = "en")
)
subs_meta
#> # A tibble: 6 × 7
#>   ID    Timecode_in Timecode_out Text_content            Season Episode Language
#>   <chr> <time>      <time>       <chr>                    <int>   <int> <chr>   
#> 1 1     00'22.5"    00'24.1"     Lorem ipsum dolor sit …      1       3 en      
#> 2 2     00'25.5"    00'27.1"     Donec eu nisl commodo,…      1       3 en      
#> 3 3     00'28.7"    00'29.1"     Nulla aliquam,               1       3 en      
#> 4 4     00'29.9"    00'31.1"     nibh cursus interdum v…      1       3 en      
#> 5 5     00'31.9"    00'33.1"     dolor lacus hendrerit …      1       3 en      
#> 6 6     00'33.9"    00'34.8"     Suspendisse potenti.         1       3 en

Metadata columns travel with the object through all subtools operations.
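As a quick check (a minimal sketch reusing subs_meta from above), the metadata columns are still present after a cleaning step:

```r
# Sketch: metadata columns persist through cleaning operations
subs_meta |>
  clean_tags() |>
  names()
# "Season", "Episode" and "Language" should still be among the columns
```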

From a character vector

as_subtitle() parses an in-memory character vector, which is useful when the subtitle text is already loaded or generated programmatically.

raw <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Hello, world.",
  "",
  "2",
  "00:00:04,000 --> 00:00:06,000",
  "This is subtools."
)
as_subtitle(x = raw, format = "srt")
#> # A tibble: 2 × 4
#>   ID    Timecode_in Timecode_out Text_content     
#>   <chr> <time>      <time>       <chr>            
#> 1 1     00'01"      00'03.5"     Hello, world.    
#> 2 2     00'04"      00'06.0"     This is subtools.

2. Exploring the subtitles object

Quick summary

get_subtitles_info() prints a compact summary: line count, overall duration, and attached metadata fields.

s <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools")
)
get_subtitles_info(x = s)
#> subtitles object
#>   Text lines: 6
#>   Duration: 00:00:12.3
#>   Metadata: 0

Raw text extraction

get_raw_text() collapses all subtitle lines into a single character string, which is useful for passing the whole transcript to external natural language processing tools.

transcript <- get_raw_text(x = s)
transcript
#> [1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eu nisl commodo, elementum dui ut, gravida orci. Nulla aliquam, nibh cursus interdum volutpat, dolor lacus hendrerit tellus, vel faucibus justo nisi quis felis. Suspendisse potenti."

# One line per subtitle, separated by newlines
cat(get_raw_text(x = s, collapse = "\n"))
#> Lorem ipsum dolor sit amet, consectetur adipiscing elit.
#> Donec eu nisl commodo, elementum dui ut, gravida orci.
#> Nulla aliquam,
#> nibh cursus interdum volutpat,
#> dolor lacus hendrerit tellus, vel faucibus justo nisi quis felis.
#> Suspendisse potenti.

Accessing individual columns

Because a subtitles object is a tibble, all dplyr verbs work directly:

library(dplyr)

# Lines spoken after the first 30 seconds
s |>
  filter(Timecode_in > hms::as_hms("00:00:30"))
#> # A tibble: 2 × 4
#>   ID    Timecode_in Timecode_out Text_content                                   
#>   <chr> <time>      <time>       <chr>                                          
#> 1 5     00'31.9"    00'33.1"     dolor lacus hendrerit tellus, vel faucibus jus…
#> 2 6     00'33.9"    00'34.8"     Suspendisse potenti.

# Duration of each subtitle cue (in seconds)
s |>
  mutate(duration_s = as.numeric(Timecode_out - Timecode_in)) |>
  select(ID, Text_content, duration_s)
#> # A tibble: 6 × 3
#>   ID    Text_content                                                  duration_s
#>   <chr> <chr>                                                              <dbl>
#> 1 1     Lorem ipsum dolor sit amet, consectetur adipiscing elit.           1.60 
#> 2 2     Donec eu nisl commodo, elementum dui ut, gravida orci.             1.60 
#> 3 3     Nulla aliquam,                                                     0.400
#> 4 4     nibh cursus interdum volutpat,                                     1.20 
#> 5 5     dolor lacus hendrerit tellus, vel faucibus justo nisi quis f…      1.20 
#> 6 6     Suspendisse potenti.                                               0.900

3. Cleaning subtitles

Subtitle files frequently contain formatting tags, closed-caption descriptions, and other non-speech artefacts that should be removed before text analysis.

Remove formatting tags

clean_tags() strips HTML-style tags (used in SRT and WebVTT) and curly-brace override blocks (used in SubStation Alpha).

tagged <- as_subtitle(
  x = c(
    "1",
    "00:00:01,000 --> 00:00:03,000",
    "<i>This is <b>important</b>.</i>",
    "",
    "2",
    "00:00:04,000 --> 00:00:06,000",
    "<font color=\"red\">Warning!</font>"
  ),
  format = "srt",
  clean.tags = FALSE   # keep tags so we can demonstrate cleaning
)
tagged$Text_content
#> [1] "<i>This is <b>important</b>.</i>"    "<font color=\"red\">Warning!</font>"

clean_tags(x = tagged)$Text_content
#> [1] "This is important." "Warning!"

Remove closed captions

clean_captions() removes text enclosed in parentheses or square brackets — typically sound descriptions and speaker identifiers used in accessibility captions.

bb <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  clean.tags = FALSE
)
bb$Text_content
#> [1] "Oh, my God. Christ!"                         
#> [2] "Shit."                                       
#> [3] "[SIRENS WAILING IN DISTANCE]"                
#> [4] "Oh, God. Oh, my God."                        
#> [5] "Oh, my God. Oh, my God. Think, think, think."

clean_captions(x = bb)$Text_content
#> [1] "Oh, my God. Christ!"                         
#> [2] "Shit."                                       
#> [3] "Oh, God. Oh, my God."                        
#> [4] "Oh, my God. Oh, my God. Think, think, think."

Remove arbitrary patterns

clean_patterns() accepts any regular expression, giving full flexibility for project-specific cleaning.

# Remove speaker labels such as "WALTER:" or "JESSE:"
s_labeled <- as_subtitle(
  x = c(
    "1", "00:00:01,000 --> 00:00:03,000", "WALTER: We need to cook.",
    "",
    "2", "00:00:04,000 --> 00:00:06,000", "JESSE: Yeah, Mr. White!"
  ),
  format = "srt", clean.tags = FALSE
)

clean_patterns(x = s_labeled, pattern = "^[A-Z]+: ")$Text_content
#> [1] "We need to cook." "Yeah, Mr. White!"

Chaining cleaning steps

Because each cleaning function returns a subtitles object, steps can be piped:

s_clean <- read_subtitles(file = f_srt, clean.tags = FALSE) |>
  clean_tags() |>
  clean_captions() |>
  clean_patterns(pattern = "^-\\s*")   # remove leading dialogue dashes

s_clean$Text_content
#> [1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit."         
#> [2] "Donec eu nisl commodo, elementum dui ut, gravida orci."           
#> [3] "Nulla aliquam,"                                                   
#> [4] "nibh cursus interdum volutpat,"                                   
#> [5] "dolor lacus hendrerit tellus, vel faucibus justo nisi quis felis."
#> [6] "Suspendisse potenti."

4. Combining subtitles

Collapsing multiple objects into one

bind_subtitles() merges any number of subtitles (or multisubtitles) objects. With collapse = TRUE (default), timecodes are shifted so that each file follows the previous one sequentially.

s1 <- read_subtitles(
  file = system.file("extdata", "ex_subrip.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
s2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)

combined <- bind_subtitles(s1, s2)
nrow(combined)
#> [1] 10
range(combined$Timecode_in)
#> Time differences in secs
#> [1]   22.500 1292.851

Keeping a list structure

Set collapse = FALSE to get a multisubtitles object — a named list of subtitles — when you want to process episodes independently before merging.

multi <- bind_subtitles(s1, s2, collapse = FALSE)
class(multi)
#> [1] "multisubtitles"
print(multi)
#> A multisubtitles object with 2 elements
#> subtitles object [[1]]
#> # A tibble: 6 × 5
#>   ID    Timecode_in Timecode_out Text_content                            Episode
#>   <chr> <time>      <time>       <chr>                                     <int>
#> 1 1     00'22.5"    00'24.1"     Lorem ipsum dolor sit amet, consectetu…       1
#> 2 2     00'25.5"    00'27.1"     Donec eu nisl commodo, elementum dui u…       1
#> 3 3     00'28.7"    00'29.1"     Nulla aliquam,                                1
#> 4 4     00'29.9"    00'31.1"     nibh cursus interdum volutpat,                1
#> 5 5     00'31.9"    00'33.1"     dolor lacus hendrerit tellus, vel fauc…       1
#> 6 6     00'33.9"    00'34.8"     Suspendisse potenti.                          1
#> 
#> 
#> subtitles object [[2]]
#> # A tibble: 4 × 5
#>   ID    Timecode_in Timecode_out Text_content                            Episode
#>   <chr> <time>      <time>       <chr>                                     <int>
#> 1 180   21'15.769"  21'23.069"   Rushmore deserves an aquarium. A first…       2
#> 2 181   21'23.069"  21'25.670"   - I don't know. What do you think, Ern…       2
#> 3 182   21'25.746"  21'32.170"   - What kind of fish? - Barracudas. Sti…       2
#> 4 183   21'32.851"  21'36.570"   - Piranhas? Really? - Yes, I'm talking…       2

get_subtitles_info() also works on multisubtitles:

get_subtitles_info(x = multi)
#> multisubtitles object with 2 subtitles.

5. Reading an entire series

For TV series organised in a standard directory tree, subtools provides convenience readers that handle the hierarchy automatically and extract Season/Episode metadata from folder and file names.

Series_Collection/
|-- BreakingBad/
|   |-- Season_01/
|   |   |-- S01E01.srt
|   |   |-- S01E02.srt
|   |-- Season_02/
|       |-- S02E01.srt

# Read a single season
season1 <- read_subtitles_season(dir = "BreakingBad/Season_01/")

# Read an entire series (all seasons)
bb_all <- read_subtitles_serie(dir = "BreakingBad/")

# Read multiple series at once
collection <- read_subtitles_multiseries(dir = "Series_Collection/")

Each function returns a single collapsed subtitles object by default (bind = TRUE), with Serie, Season, and Episode columns populated from the directory structure. Pass bind = FALSE to get a multisubtitles list instead.
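Because the Serie, Season, and Episode columns are populated automatically, per-episode summaries follow directly. A sketch, assuming the BreakingBad/ tree shown above exists on disk:

```r
# Sketch: count subtitle lines per episode across the whole series
bb_all <- read_subtitles_serie(dir = "BreakingBad/")

bb_all |>
  dplyr::count(Season, Episode, name = "n_lines")
```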


6. Adjusting timecodes

move_subtitles() shifts all timecodes by a fixed number of seconds. Positive values shift forward; negative values shift backward. This is useful when the subtitle file is out of sync with the video.

subs_shifted <- move_subtitles(x = subs, lag = 2.5)

# Compare first cue before and after
subs$Timecode_in[1]
#> 00:00:22.5
subs_shifted$Timecode_in[1]
#> 00:00:25

move_subtitles() also works on multisubtitles:

multi_shifted <- move_subtitles(x = multi, lag = -1.0)
multi_shifted[[1]]$Timecode_in[1]
#> 00:00:21.5

7. Writing subtitles back to disk

write_subtitles() serialises a subtitles object to a SubRip .srt file.

write_subtitles(x = subs_shifted, file = "synced_episode.srt")
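A round trip through a temporary file is a quick sanity check (a minimal sketch; timecode formatting may be normalised during serialisation, but the cues themselves should survive):

```r
# Sketch: write to a temporary .srt file and read it back
tmp <- tempfile(fileext = ".srt")
write_subtitles(x = subs_shifted, file = tmp)

round_trip <- read_subtitles(file = tmp)
nrow(round_trip) == nrow(subs_shifted)   # same number of cues expected

unlink(tmp)   # clean up
```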

8. Text analysis with tidytext

Tokenising into words

unnest_tokens() extends tidytext::unnest_tokens() with subtitle-aware timecode remapping: each token inherits a proportional slice of the original cue’s time window, enabling timeline-based analyses.

words <- unnest_tokens(tbl = subs)
words
#> # A tibble: 35 × 4
#>    ID    Timecode_in Timecode_out Text_content
#>    <chr> <time>      <time>       <chr>       
#>  1 1     00'22.5010" 00'22.6702"  lorem       
#>  2 1     00'22.6712" 00'22.8404"  ipsum       
#>  3 1     00'22.8414" 00'23.0106"  dolor       
#>  4 1     00'23.0116" 00'23.1128"  sit         
#>  5 1     00'23.1138" 00'23.2489"  amet        
#>  6 1     00'23.2499" 00'23.6234"  consectetur 
#>  7 1     00'23.6244" 00'23.9638"  adipiscing  
#>  8 1     00'23.9648" 00'24.1000"  elit        
#>  9 2     00'25.5010" 00'25.6860"  donec       
#> 10 2     00'25.6870" 00'25.7605"  eu          
#> # ℹ 25 more rows

The Timecode_in / Timecode_out columns now reflect the estimated position of each word within its cue.
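One way to verify the remapping (a sketch using dplyr): every word's time window should fall inside the window of the cue it came from.

```r
library(dplyr)

# Sketch: check that each word's timecodes stay within its parent cue
words |>
  left_join(subs, by = "ID", suffix = c("", "_cue")) |>
  summarise(
    all_within = all(Timecode_in >= Timecode_in_cue &
                     Timecode_out <= Timecode_out_cue)
  )
# all_within should be TRUE if each token's slice lies within its cue
```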

Tokenising into sentences or n-grams

# Bigrams
bigrams <- unnest_tokens(tbl = subs, output = Word, input = Text_content,
                         token = "ngrams", n = 2)
bigrams$Word
#>  [1] "lorem ipsum"            "ipsum dolor"            "dolor sit"             
#>  [4] "sit amet"               "amet consectetur"       "consectetur adipiscing"
#>  [7] "adipiscing elit"        "donec eu"               "eu nisl"               
#> [10] "nisl commodo"           "commodo elementum"      "elementum dui"         
#> [13] "dui ut"                 "ut gravida"             "gravida orci"          
#> [16] "nulla aliquam"          "nibh cursus"            "cursus interdum"       
#> [19] "interdum volutpat"      "dolor lacus"            "lacus hendrerit"       
#> [22] "hendrerit tellus"       "tellus vel"             "vel faucibus"          
#> [25] "faucibus justo"         "justo nisi"             "nisi quis"             
#> [28] "quis felis"             "suspendisse potenti"

Word frequency

library(dplyr)

words |>
  count(Text_content, sort = TRUE) |>
  head(10)
#> # A tibble: 10 × 2
#>    Text_content     n
#>    <chr>        <int>
#>  1 dolor            2
#>  2 adipiscing       1
#>  3 aliquam          1
#>  4 amet             1
#>  5 commodo          1
#>  6 consectetur      1
#>  7 cursus           1
#>  8 donec            1
#>  9 dui              1
#> 10 elementum        1

9. Advanced: cross-episode analysis

The metadata columns added at read time make it straightforward to compare episodes or seasons. The example below simulates a two-episode corpus and computes per-episode word counts — a pattern that scales directly to a full series loaded with read_subtitles_serie().

ep1 <- read_subtitles(
  file = system.file("extdata", "ex_breakingbad.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 1L)
)
ep2 <- read_subtitles(
  file = system.file("extdata", "ex_rushmore.srt", package = "subtools"),
  metadata = tibble::tibble(Episode = 2L)
)
ep3 <- read_subtitles(
  file = system.file("extdata", "ex_webvtt.vtt", package = "subtools"),
  metadata = tibble::tibble(Episode = 3L)
)

corpus <- bind_subtitles(ep1, ep2, ep3)

token_counts <- unnest_tokens(corpus) |>
  count(Episode, Text_content, sort = TRUE)

token_counts |>
  slice_max(n, n = 5, by = Episode)
#> # A tibble: 51 × 3
#>    Episode Text_content     n
#>      <int> <chr>        <int>
#>  1       1 god              5
#>  2       1 oh               5
#>  3       1 my               4
#>  4       1 think            3
#>  5       1 christ           1
#>  6       1 distance         1
#>  7       1 in               1
#>  8       1 shit             1
#>  9       1 sirens           1
#> 10       1 wailing          1
#> # ℹ 41 more rows

TF-IDF across episodes

TF-IDF highlights words that are distinctive to each episode compared with the rest of the corpus.

token_counts |>
  tidytext::bind_tf_idf(Text_content, Episode, n) |>
  arrange(Episode, desc(tf_idf)) |>
  slice_max(tf_idf, n = 5, by = Episode)
#> # A tibble: 48 × 6
#>    Episode Text_content     n     tf   idf tf_idf
#>      <int> <chr>        <int>  <dbl> <dbl>  <dbl>
#>  1       1 god              5 0.217  1.10  0.239 
#>  2       1 oh               5 0.217  1.10  0.239 
#>  3       1 my               4 0.174  1.10  0.191 
#>  4       1 think            3 0.130  0.405 0.0529
#>  5       1 christ           1 0.0435 1.10  0.0478
#>  6       1 distance         1 0.0435 1.10  0.0478
#>  7       1 shit             1 0.0435 1.10  0.0478
#>  8       1 sirens           1 0.0435 1.10  0.0478
#>  9       1 wailing          1 0.0435 1.10  0.0478
#> 10       2 aquarium         3 0.0612 1.10  0.0673
#> # ℹ 38 more rows

Dialogue timeline

Because timecodes are preserved through unnest_tokens(), words can be plotted along a timeline, e.g. to visualise how vocabulary density evolves across a film.

words_ep1 <- unnest_tokens(tbl = ep1) |>
  mutate(minute = as.numeric(Timecode_in) / 60)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(words_ep1, aes(x = minute)) +
    geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white") +
    labs(
      title = "Word density over time",
      x     = "Time (minutes)",
      y     = "Word count"
    ) +
    theme_minimal()
}


Summary

Task                           Function
Read a subtitle file           read_subtitles()
Parse in-memory text           as_subtitle()
Read a full season/series      read_subtitles_season() / read_subtitles_serie() / read_subtitles_multiseries()
Print a summary                get_subtitles_info()
Extract plain text             get_raw_text()
Remove HTML/ASS tags           clean_tags()
Remove closed captions         clean_captions()
Remove custom patterns         clean_patterns()
Merge subtitle objects         bind_subtitles()
Shift timecodes                move_subtitles()
Write to .srt                  write_subtitles()
Tokenise (words, n-grams, …)   unnest_tokens()