llmjson


llmjson repairs malformed JSON strings, particularly those generated by Large Language Models (LLMs). It uses Rust for fast, reliable JSON repair based on a vendored and bug-fixed version of the llm_json crate.

Features

Installation

You can install the development version of llmjson from GitHub:

# install.packages("remotes")
remotes::install_github("DyfanJones/llmjson")

Or r-universe:

install.packages('llmjson', repos = c('https://dyfanjones.r-universe.dev', 'https://cloud.r-project.org'))

System Requirements

This package requires the Rust toolchain to be installed on your system. If you don’t have Rust installed:

Usage

Basic JSON Repair

library(llmjson)

# Repair JSON with trailing comma
repair_json_str('{"key": "value",}')
#> [1] "{\"key\":\"value\"}"

# Repair JSON with unquoted keys
repair_json_str('{key: "value"}')
#> [1] "{\"key\":\"value\"}"

# Repair incomplete JSON
repair_json_str('{"name": "John", "age": 30')
#> [1] "{\"name\":\"John\",\"age\":30}"

# Repair JSON with single quotes
repair_json_str("{'name': 'John'}")
#> [1] "{\"name\":\"John\"}"

Return R Objects Directly

Instead of returning a JSON string, you can get R objects directly:

# Return as R list instead of JSON string
result <- repair_json_str('{"name": "Alice", "age": 30}', return_objects = TRUE)
result
#> $name
#> [1] "Alice"
#>
#> $age
#> [1] 30

# Works with all repair functions
result <- repair_json_file("data.json", return_objects = TRUE)

Handling Large Integers (64-bit)

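JSON integers outside R’s 32-bit integer range (-2,147,483,648 to 2,147,483,647) need special handling. Converting them to double is exact only up to 2^53 (9,007,199,254,740,992), so larger values, like the id below, can lose precision. The int64 parameter controls how these large integers are converted: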

json_str <- '{"id": 9007199254740993}'

# Option 1: "double" (default) - Convert to R numeric (may lose precision)
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "double")
result$id
#> [1] 9.007199e+15  # Lost precision: stored as 9007199254740992, not 9007199254740993

# Option 2: "string" - Preserve exact value as character
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "string")
result$id
#> [1] "9007199254740993"  # Exact value preserved

# Option 3: "bit64" - Use bit64 package for true 64-bit integers
# Requires: install.packages("bit64")
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "bit64")
result$id
#> integer64
#> [1] 9007199254740993  # Exact value preserved with integer type
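The precision loss under "double" comes from R's numeric type rather than from the repair step. A minimal base-R sketch (no llmjson calls) shows why the id used above cannot survive the conversion: doubles represent integers exactly only up to 2^53.

```r
# Base R only: why "double" cannot hold 9007199254740993 exactly.

# Largest value R's native (32-bit) integer type can store.
.Machine$integer.max
#> [1] 2147483647

# Doubles carry a 53-bit significand, so integers are exact only up to 2^53.
limit <- 2^53
sprintf("%.0f", limit)
#> [1] "9007199254740992"

# 2^53 + 1 is not representable and rounds back to 2^53.
(limit + 1) == limit
#> [1] TRUE

# Just below the limit, neighbouring integers are still distinguishable.
(limit - 1) == limit
#> [1] FALSE
```

Values between the 32-bit limit and 2^53 therefore convert to double without loss; "string" or "bit64" only matter for exactness beyond that point.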

Which option should I use?

Schema Validation and Type Conversion

Define schemas to validate JSON structure and ensure correct R types. The schema system is inspired by the structr package and provides an intuitive way to define expected JSON structures:

# Define a schema for a user object
schema <- json_object(
  name = json_string(),
  age = json_integer(),
  email = json_string()
)

# Repair and validate with schema
result <- repair_json_str(
  '{"name": "Alice", "age": "30", "email": "alice@example.com"}',
  schema = schema,
  return_objects = TRUE
)

# Note: age is coerced from string "30" to integer 30
str(result)
#> List of 3
#>  $ name : chr "Alice"
#>  $ age  : int 30
#>  $ email: chr "alice@example.com"

Required vs Optional Fields and Default Values

Control how missing fields are handled with .required and .default parameters:

Required fields (.required = TRUE):
- Missing fields are added with their .default value (or their type’s default if no explicit default is given)
- Always appear in the output

Optional fields (.required = FALSE, the default):
- Missing fields are omitted entirely from the output
- Only appear if present in the input JSON

# Example 1: Required field with explicit default
schema <- json_object(
  name = json_string(.required = TRUE),
  age = json_integer(.default = 25L, .required = TRUE)  # required, will use default if missing
)

result <- repair_json_str('{"name": "Alice"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Alice"
#>
#> $age
#> [1] 25

# Example 2: Optional field (omitted when missing)
schema <- json_object(
  name = json_string(.required = TRUE),
  nickname = json_string(.required = FALSE)  # optional, omitted if not in input
)

result <- repair_json_str('{"name": "Bob"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Bob"
# Note: nickname is not present since it was optional and missing from input

# Example 3: Required field with type default
schema <- json_object(
  name = json_string(.required = TRUE),
  age = json_integer(.required = TRUE)  # required, will use type default (0L) if missing
)

result <- repair_json_str('{"name": "Charlie"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Charlie"
#>
#> $age
#> [1] 0

Nested Schemas and Arrays

Build complex schemas with nested objects and arrays:

# Schema with nested object and array
schema <- json_object(
  name = json_string(),
  address = json_object(
    city = json_string(),
    zip = json_integer()
  ),
  scores = json_array(json_integer())
)

json_str <- '{
  "name": "Alice",
  "address": {"city": "NYC", "zip": "10001"},
  "scores": [90, 85, 95]
}'

result <- repair_json_str(json_str, schema = schema, return_objects = TRUE)
str(result)
#> List of 3
#>  $ name   : chr "Alice"
#>  $ address:List of 2
#>   ..$ city: chr "NYC"
#>   ..$ zip : int 10001
#>  $ scores : int [1:3] 90 85 95

Build Schemas for Better Performance

For repeated use with the same schema, use json_schema() to compile the schema once and reuse it many times.

# Define your schema
schema <- json_object(
  name = json_string(),
  age = json_integer(),
  email = json_string()
)

# Build it once - this creates an optimized internal representation
built_schema <- json_schema(schema)

# Reuse many times - much faster!
# (json_strings is assumed to be a character vector of JSON strings)
for (json_str in json_strings) {
  result <- repair_json_str(json_str, built_schema, return_objects = TRUE)
  # Process result...
}

Performance comparison (complex nested schema):
- Without json_schema(): ~266µs per call
- With json_schema(): ~51µs per call (5.2x faster)
- No schema: ~44µs per call

The performance benefit is especially significant for:
- Complex nested schemas with multiple levels
- Batch processing of many JSON strings
- Performance-critical applications
- Real-time data processing pipelines
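To check the difference on your own machine, a rough timing sketch such as the following may help. The input data and the variable names (json_strings, t_unbuilt, t_built) are made up for illustration, and the schema here is flat, so the gap may be smaller than the figures quoted above for a complex nested schema.

```r
# Illustrative input: 1,000 copies of one malformed JSON string.
json_strings <- rep(
  '{"name": "Alice", "age": "30", "email": "alice@example.com",}',
  1000
)

# Guarded so the sketch is a no-op where llmjson is not installed.
if (requireNamespace("llmjson", quietly = TRUE)) {
  library(llmjson)

  schema <- json_object(
    name = json_string(),
    age = json_integer(),
    email = json_string()
  )
  built_schema <- json_schema(schema)

  # Time the same loop with the unbuilt and the built schema.
  t_unbuilt <- system.time(
    for (s in json_strings) repair_json_str(s, schema = schema, return_objects = TRUE)
  )
  t_built <- system.time(
    for (s in json_strings) repair_json_str(s, schema = built_schema, return_objects = TRUE)
  )

  # Elapsed seconds for each variant.
  print(c(unbuilt = t_unbuilt[["elapsed"]], built = t_built[["elapsed"]]))
}
```

system.time() is coarse; for microsecond-level comparisons a dedicated benchmarking package would give more stable numbers.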

Repair JSON from Files

# Read and repair JSON from a file
repair_json_file("malformed.json")

# With schema validation
schema <- json_object(
  name = json_string(.required = TRUE),
  age = json_integer(.default = 25L, .required = TRUE)  # required field with default
)
result <- repair_json_file("data.json", schema = schema, return_objects = TRUE)
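When several files share one structure, the same call can be mapped over a directory. The sketch below is one way to do that; the directory name, file names, and file contents are invented for the example, and the schema is built once with json_schema() so it can be reused for every file.

```r
# Create two deliberately malformed files in a temporary directory (illustrative data).
json_dir <- file.path(tempdir(), "llmjson-batch-demo")
dir.create(json_dir, showWarnings = FALSE)
writeLines('{"name": "Alice", "age": 30,}', file.path(json_dir, "a.json"))
writeLines('{name: "Bob"', file.path(json_dir, "b.json"))

# Collect every .json file in the directory.
paths <- list.files(json_dir, pattern = "\\.json$", full.names = TRUE)

# Guarded so the sketch is a no-op where llmjson is not installed.
if (requireNamespace("llmjson", quietly = TRUE)) {
  library(llmjson)

  # Build the schema once, then reuse it for each file.
  schema <- json_schema(json_object(
    name = json_string(.required = TRUE),
    age = json_integer(.default = 25L, .required = TRUE)
  ))

  results <- lapply(paths, repair_json_file, schema = schema, return_objects = TRUE)
  names(results) <- basename(paths)
  str(results)
}
```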

Repair JSON from Raw Bytes

# Repair JSON from raw byte vector
raw_data <- charToRaw('{"key": "value",}')
repair_json_raw(raw_data)
#> [1] "{\"key\":\"value\"}"

# With return_objects
result <- repair_json_raw(raw_data, return_objects = TRUE)
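Raw vectors usually come from binary reads rather than charToRaw(). As a sketch, assuming the payload was saved to disk (the temporary file here stands in for a cached response), base R's readBin() yields a raw vector that can be passed straight to repair_json_raw():

```r
# Write a malformed payload to a temporary file (stand-in for a cached response).
path <- tempfile(fileext = ".json")
writeBin(charToRaw('{"key": "value",}'), path)

# Read the whole file back as a raw vector.
raw_data <- readBin(path, what = "raw", n = file.size(path))

# Guarded so the sketch is a no-op where llmjson is not installed.
if (requireNamespace("llmjson", quietly = TRUE)) {
  library(llmjson)
  print(repair_json_raw(raw_data))
}

# Clean up the temporary file.
unlink(path)
```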

Repair JSON from Connections

Read and repair JSON from any R connection (files, URLs, pipes, compressed files, etc.):

# Read from a file connection
conn <- file("malformed.json", "r")
result <- repair_json_conn(conn)
close(conn)

# Read from a URL
conn <- url("https://api.example.com/data.json")
result <- repair_json_conn(conn, return_objects = TRUE)
close(conn)

# Read from a compressed file
conn <- gzfile("data.json.gz", "r")
result <- repair_json_conn(conn, return_objects = TRUE, int64 = "string")
close(conn)

# Use local() with on.exit() to ensure the connection is closed automatically
result <- local({
  conn <- file("malformed.json", "r")
  on.exit(close(conn))
  repair_json_conn(conn, return_objects = TRUE)
})

Use Case: Working with LLM Outputs

Large Language Models often generate JSON that is almost correct but has minor syntax errors. This package helps you handle those cases gracefully:

# LLM might output JSON with trailing commas and unquoted keys
llm_output <- '{
  users: [
    {name: "Alice", age: 30,},
    {name: "Bob", age: 25,},
  ],
}'

# Option 1: Repair and parse with your chosen JSON parser (e.g., jsonlite)
repaired <- repair_json_str(llm_output)
(parsed <- jsonlite::fromJSON(repaired))
#> $users
#>   age  name
#> 1  30 Alice
#> 2  25   Bob

# Option 2: Use schema with return_objects for type safety
schema <- json_object(
  users = json_array(json_object(
    name = json_string(),
    age = json_integer()
  ))
)

result <- repair_json_str(llm_output, schema = schema, return_objects = TRUE)
str(result)
#> List of 1
#>  $ users:List of 2
#>   ..$ :List of 2
#>   .. ..$ name: chr "Alice"
#>   .. ..$ age : int 30
#>   ..$ :List of 2
#>   .. ..$ name: chr "Bob"
#>   .. ..$ age : int 25
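With return_objects = TRUE the users come back as a list of records rather than the data frame that jsonlite produces. If a data frame is what you need, base R can reshape it. The sketch below builds a stand-in for result by hand, with the same structure as the str() output above, so it runs without llmjson; users_df is an illustrative name.

```r
# Stand-in for `result`, matching the str() output shown above.
result <- list(
  users = list(
    list(name = "Alice", age = 30L),
    list(name = "Bob", age = 25L)
  )
)

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(result$users, as.data.frame, stringsAsFactors = FALSE)
users_df <- do.call(rbind, rows)

users_df
#>    name age
#> 1 Alice  30
#> 2   Bob  25
```

This works because the schema guarantees every record has the same fields with the same types; records with differing fields would need to be aligned before rbind().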

Available Functions

Repair Functions

All repair functions support the schema, return_objects, ensure_ascii, and int64 parameters:

Parameters:
- schema - Optional schema definition (R list from json_object(), etc.) or built schema (from json_schema())
- return_objects - If TRUE, returns R objects instead of JSON strings
- ensure_ascii - If TRUE (default), escape non-ASCII characters in the output JSON
- int64 - Policy for handling 64-bit integers: "double" (default), "string", or "bit64"
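The ensure_ascii parameter is the only one not demonstrated elsewhere in this README. A small sketch for comparing the two settings follows; the input string is invented for the example, and no output is shown because the exact escaping is best checked by running it.

```r
# Malformed JSON (trailing comma) containing a non-ASCII character (u-umlaut).
json_str <- "{\"city\": \"Z\u00fcrich\",}"

# Guarded so the sketch is a no-op where llmjson is not installed.
if (requireNamespace("llmjson", quietly = TRUE)) {
  library(llmjson)

  # Default: non-ASCII characters are escaped in the output JSON.
  escaped <- repair_json_str(json_str)

  # Keep non-ASCII characters as-is.
  literal <- repair_json_str(json_str, ensure_ascii = FALSE)

  print(escaped)
  print(literal)
}
```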

Schema Functions

Comparison with Similar Packages

While R has several JSON parsing packages like jsonlite, they typically fail when encountering malformed JSON. llmjson is specifically designed to handle the common errors that LLMs make when generating JSON output, making it ideal for:

Acknowledgments

This package includes a vendored and bug-fixed version of the llm_json Rust crate (v1.0.1) by Ribelo, which is itself a Rust port of the Python json_repair library by Stefano Baccianella (mangiucugna). Our vendored version includes critical bug fixes for array parsing not present in the upstream release.

The schema system was inspired by the structr package, which provides elegant patterns for defining and validating data structures in R.

Code of Conduct

Please note that the llmjson project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.