The statlingua Package

Using LLMs to Help Explain and Interpret Statistical Output

Introduction

Statistical models are indispensable tools for extracting insights from data, yet their outputs can often be cryptic and laden with technical jargon. Deciphering coefficients, p-values, confidence intervals, and various model fit statistics typically requires a solid statistical background. This can create a barrier when communicating findings to a wider audience or even for those still developing their statistical acumen.

The statlingua R package is here to change that! It leverages the power of Large Language Models (LLMs) to translate complex statistical model outputs into clear, understandable, and context-aware natural language. By simply feeding your R statistical model objects into statlingua, you can generate human-readable interpretations, making statistical understanding accessible to everyone, regardless of their technical expertise.

It’s important to note that statlingua itself doesn’t directly call LLM APIs. Instead, it serves as a sophisticated prompt engineering toolkit. It meticulously prepares the necessary inputs (your model summary and contextual information) and then passes them to the ellmer package, which handles the actual communication with the LLM. The primary workhorse function you’ll use in statlingua is explain().

This vignette will guide you through understanding and using statlingua effectively.

Prerequisites

Before diving in, please ensure you have the following:

  1. The statlingua package installed from GitHub (not yet available on CRAN):
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("bgreenwell/statlingua")
  2. The ellmer package installed. statlingua relies on it for LLM communication:
install.packages("ellmer")
  3. Access to an LLM provider (e.g., OpenAI, Google Gemini, or Anthropic) and a corresponding API key. While ellmer supports numerous LLM providers, this vignette specifically uses Google Gemini models via ellmer::chat_google_gemini(); I find Google Gemini to be particularly well-suited for explaining statistical output, and it offers a generous free tier. You’ll need to configure your API key according to the ellmer package’s documentation. For Gemini, this typically involves setting the GEMINI_API_KEY environment variable in your R session or .Renviron file (e.g., Sys.setenv(GEMINI_API_KEY = "YOUR_API_KEY_HERE")); other providers use analogous variables, such as OPENAI_API_KEY or ANTHROPIC_API_KEY.
  4. For the examples in this vignette, you’ll also need the following packages: ISLR2, jsonlite, lme4, MASS, and survival.
install.packages(c("ISLR2", "jsonlite", "lme4", "MASS", "survival"))

How statlingua Works: The explain() Function and ellmer

The primary function you’ll use in statlingua is explain(). This is an S3 generic function, meaning its behavior adapts to the class of the R statistical object you provide (e.g., an "lm" object, "glm" object, "lmerMod" object, etc.).
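You can see which classes your installed version supports by listing the generic’s methods:

library(statlingua)

# List the S3 methods registered for explain(); the available classes depend
# on the installed version of statlingua
methods(explain)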

The process explain() follows to generate an interpretation involves several key steps:

  1. Input & Initial Argument Resolution: You call explain() with your statistical object, an ellmer client, and optionally context, audience, verbosity, and the new style argument. The explain() generic function first resolves audience, verbosity, and style to their specific chosen values (e.g., audience = "novice", style = "markdown") using match.arg(). These resolved values are then passed to the appropriate S3 method for the class of your object.

  2. Model Summary Extraction: Internally, explain() (typically via the .explain_core() helper function or directly in explain.default()) uses the summarize() function to capture a text-based summary of your statistical object. This captured text (e.g., similar to what summary(object) would produce) forms the core statistical information that the LLM will interpret.

  3. System Prompt Assembly (via .assemble_sys_prompt()): This is where statlingua constructs the detailed instructions for the LLM. The internal .assemble_sys_prompt() function pieces together several components, all read from .md files stored within the package’s inst/prompts/ directory. The final system prompt typically includes the following sections, ordered to guide the LLM effectively:

    • Role Definition:
      • A base role description is read from inst/prompts/common/role_base.md.
      • If available, model-specific role details are appended from inst/prompts/models/<model_name>/role_specific.md (where <model_name> corresponds to the class of your object, like “lm” or “lmerMod”). If this file doesn’t exist for a specific model, this part is omitted.
    • Intended Audience and Verbosity:
      • Instructions tailored to the specified audience (e.g., "novice", "researcher") are read from inst/prompts/audience/<audience_value>.md (e.g., inst/prompts/audience/novice.md).
      • Instructions defining the verbosity level (e.g., "brief", "detailed") are read from inst/prompts/verbosity/<verbosity_value>.md (e.g., inst/prompts/verbosity/detailed.md).
    • Response Format Specification (Style):
      • This crucial part is determined by the style argument. Instructions for the desired output format (e.g., "markdown", "html", "json", "text", "latex") are read from inst/prompts/style/<style_value>.md (e.g., inst/prompts/style/markdown.md). This tells the LLM how to structure its entire response.
    • Model-Specific Instructions:
      • Detailed instructions on what aspects of the statistical model to explain are read from inst/prompts/models/<model_name>/instructions.md (e.g., for an "lm" object, it would read from inst/prompts/models/lm/instructions.md). If model-specific instructions aren’t found, it defaults to inst/prompts/models/default/instructions.md.
    • Cautionary Notes:
      • A general caution message is appended from inst/prompts/common/caution.md.

    These components are assembled into a single, comprehensive system prompt that guides the LLM’s behavior, tone, content focus, and output format.

  4. User Prompt Construction (via .build_usr_prompt()): The “user prompt” (the actual query containing the data to be interpreted) is constructed by combining:

    • A leading phrase indicating the type of model (e.g., “Explain the following linear regression model output:”).
    • The model summary (output_summary) captured in step 2.
    • Any additional context string provided by the user via the context argument.
  5. LLM Interaction via ellmer: The assembled sys_prompt is set for the ellmer client object. Then, the constructed usr_prompt is sent to the LLM using client$chat(usr_prompt). ellmer handles the actual API communication.

  6. Output Post-processing (via .remove_fences()): Before returning the explanation, statlingua calls an internal utility, .remove_fences(), to clean the LLM’s raw output. This function attempts to remove common “language fence” wrappers (such as ```markdown ... ``` or ```json ... ```) that LLMs sometimes add around their responses.

  7. Output Packaging: The cleaned explanation string from the LLM is then packaged into a statlingua_explanation object. This object’s text component holds the explanation string in the specified style. It also includes metadata like the model_type, audience, verbosity, and style used. The statlingua_explanation object has a default print method that uses cat() for easy viewing in the console.

This comprehensive and modular approach to prompt engineering allows statlingua to provide tailored and well-formatted explanations for a variety of statistical models and user needs.
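
To make this flow concrete, below is a minimal sketch of the pipeline just described. This is not statlingua’s actual source code; the helpers read_md() and strip_fences() are simplified, illustrative stand-ins for the internals named above:

# Illustrative sketch of explain()'s internals (NOT the actual implementation)
explain_sketch <- function(object, client, context = NULL,
                           audience = "novice", verbosity = "moderate",
                           style = "markdown") {
  prompt_dir <- system.file("prompts", package = "statlingua")
  read_md <- function(...) {  # read one prompt component, if it exists
    path <- file.path(prompt_dir, ...)
    if (file.exists(path)) paste(readLines(path), collapse = "\n") else ""
  }
  model_name <- class(object)[1L]

  # Step 3: assemble the system prompt from its .md components, falling back
  # to the default instructions when no model-specific file exists
  instructions <- read_md("models", model_name, "instructions.md")
  if (instructions == "") {
    instructions <- read_md("models", "default", "instructions.md")
  }
  sys_prompt <- paste(
    read_md("common", "role_base.md"),
    read_md("models", model_name, "role_specific.md"),
    read_md("audience", paste0(audience, ".md")),
    read_md("verbosity", paste0(verbosity, ".md")),
    read_md("style", paste0(style, ".md")),
    instructions,
    read_md("common", "caution.md"),
    sep = "\n\n"
  )

  # Steps 2 and 4: capture the model summary and build the user prompt
  output_summary <- paste(capture.output(summary(object)), collapse = "\n")
  usr_prompt <- paste(
    "Explain the following model output:", output_summary,
    if (!is.null(context)) paste("## Additional context to consider\n", context),
    sep = "\n\n"
  )

  # Steps 5 and 6: query the LLM via ellmer, then strip stray code fences
  client$set_system_prompt(sys_prompt)
  strip_fences <- function(x) gsub("^```[a-zA-Z]*\\s*|\\s*```$", "", x)
  strip_fences(client$chat(usr_prompt))
}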

Understanding explain()’s Arguments

The explain() function is flexible, with several arguments to fine-tune its behavior:

  • object: the fitted model to be explained (e.g., an "lm", "glm", or "lmerMod" object).
  • client: an ellmer chat client (e.g., from ellmer::chat_google_gemini()) that handles the LLM communication.
  • context: an optional string providing background on your data and research question (see below).
  • audience: who the explanation is written for (e.g., "novice", "student", "researcher").
  • verbosity: how detailed the explanation should be (e.g., "brief", "moderate", "detailed").
  • style: the desired output format ("markdown", "html", "json", "text", or "latex").

The Power of context: Why It Matters

You could just pass your model object to explain() and get a basic interpretation. However, to unlock truly insightful and actionable explanations, providing context is paramount.

LLMs are incredibly powerful, but they don’t inherently know the nuances of your specific research. They don’t know what “VarX” really means in your data set, its units, the specific hypothesis you’re testing, or the population you’re studying unless you tell them. The context argument is your channel to provide this vital background.

What makes for effective context? Useful details include:

  • What each variable represents, including its units and how factors are coded.
  • The study design and the population or sample the data come from.
  • The research question or hypotheses you’re investigating.

By supplying such details, you empower the LLM to:

  • Interpret coefficients in real-world units rather than in the abstract.
  • Comment on whether the model seems appropriate for your data and question.
  • Frame its conclusions around your actual research goals.

Think of context as the difference between asking a generic statistician “What does this mean?” versus asking a statistician who deeply understands your research area, data, and objectives. The latter will always provide a more valuable interpretation.
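
As a quick illustration, here is a hypothetical before-and-after (the model and context string are made up for demonstration):

# Hypothetical illustration: the same model explained with and without context
fm <- lm(mpg ~ wt + factor(cyl), data = mtcars)
client <- ellmer::chat_google_gemini(echo = "none")

# Without context, the LLM sees only the raw summary() output
explain(fm, client = client)

# With context, it can interpret coefficients in real units and relate the
# results to the actual research question
explain(fm, client = client, context = "
mpg is fuel economy in miles per gallon; wt is vehicle weight in 1000-lb
units; cyl is the number of cylinders (4, 6, or 8). Data are the 32 cars from
the 1974 Motor Trend road tests. We want to identify drivers of fuel economy.
")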

Some Examples in Action!

Let’s see statlingua shine with some practical examples.

Important Note on API Keys: The following code chunks that call explain() are set to eval = FALSE by default in this vignette. This is because they require an active API key configured for ellmer. To run these examples yourself:

  1. Ensure your API key (e.g., GOOGLE_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY) is set up as an environment variable that ellmer can access.
  2. Change the R chunk option eval = FALSE to eval = TRUE for the chunks you wish to run.
  3. You may need to adjust the ellmer client initialization (e.g., ellmer::chat_openai()) to match your chosen LLM provider.

For the examples in this vignette, we’ll use Google Gemini models via ellmer::chat_google_gemini().

Example 1: Linear Regression (lm) - Sales of Child Car Seats

Let’s use a linear model to predict Sales of child car seats from various predictors using the Carseats data set from package ISLR2. To make this example a bit more complicated, we’ll include pairwise interaction effects in the model (you can include polynomial terms, smoothing splines, or any type of transformation that makes sense). Note that the categorical variables ShelveLoc, Urban, and US are dummy encoded by default.

data(Carseats, package = "ISLR2")  # load the Carseats data

# Fit a linear model to the Carseats data set
fm_carseats <- lm(Sales ~ . + Price:Age + Income:Advertising, data = Carseats)
summary(fm_carseats)  # print model summary
#> 
#> Call:
#> lm(formula = Sales ~ . + Price:Age + Income:Advertising, data = Carseats)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.9208 -0.7503  0.0177  0.6754  3.3413 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
#> CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
#> Income              0.0108940  0.0026044   4.183 3.57e-05 ***
#> Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
#> Population          0.0001592  0.0003679   0.433 0.665330    
#> Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
#> ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
#> ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
#> Age                -0.0579466  0.0159506  -3.633 0.000318 ***
#> Education          -0.0208525  0.0196131  -1.063 0.288361    
#> UrbanYes            0.1401597  0.1124019   1.247 0.213171    
#> USYes              -0.1575571  0.1489234  -1.058 0.290729    
#> Price:Age           0.0001068  0.0001333   0.801 0.423812    
#> Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.011 on 386 degrees of freedom
#> Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
#> F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

The next code chunk loads the statlingua package and establishes a connection to a (default) Google Gemini model. We also define some context for the LLM to use when explaining the above output:

library(statlingua)

# Establish client connection
client <- ellmer::chat_google_gemini(echo = "none")
#> Using model = "gemini-2.0-flash".

# Additional context for the LLM to consider
carseats_context <- "
The model uses a data set on child car seat sales (in thousands of units) at 400 different stores.
The goal is to identify factors associated with sales.
The variables are:
  * Sales: Unit sales (in thousands) at each location (the response variable).
  * CompPrice: Price charged by competitor at each location.
  * Income: Community income level (in thousands of dollars).
  * Advertising: Local advertising budget for the company at each location (in thousands of dollars).
  * Population: Population size in the region (in thousands).
  * Price: Price the company charges for car seats at each site.
  * ShelveLoc: A factor with levels 'Bad', 'Good', and 'Medium' indicating the quality of the shelving location for the car seats. ('Bad' is the reference level).
  * Age: Average age of the local population.
  * Education: Education level at each location.
  * Urban: A factor ('No', 'Yes') indicating if the store is in an urban or rural location. ('No' is the reference level).
  * US: A factor ('No', 'Yes') indicating if the store is in the US or not. ('No' is the reference level).
Interaction terms `Income:Advertising` and `Price:Age` are also included.
The data set is simulated. We want to understand key drivers of sales and how to interpret the interaction terms.
"

Next, let’s use the Google Gemini model to generate an explanation of the model’s output, targeting a "novice" audience with "detailed" verbosity.

explain(fm_carseats, client = client, context = carseats_context,
        audience = "novice", verbosity = "detailed")

Interpretation of Linear Regression Model Output

This output details a linear regression model analyzing car seat sales based on various factors. The model aims to understand how competitor price, income, advertising, population, car seat price, shelving location, age, education, urban/rural location, US location, and interactions between price & age and income & advertising relate to car seat sales. Given that the response variable (Sales) is continuous, a linear regression model is potentially appropriate, assuming the relationships are approximately linear and other assumptions are met.

Call

lm(formula = Sales ~ . + Price:Age + Income:Advertising, data = Carseats)

This is the R command used to fit the linear regression model. It indicates that the model is predicting Sales using all other variables in the Carseats dataset, along with the interaction terms Price:Age and Income:Advertising.

Residuals Summary

    Min      1Q  Median      3Q     Max 
-2.9208 -0.7503  0.0177  0.6754  3.3413

This section describes the distribution of the residuals (the differences between the observed and predicted values of Sales).

  • Min: The smallest residual is -2.9208.
  • 1Q: 25% of the residuals are less than -0.7503.
  • Median: The median residual is 0.0177, which is close to zero, suggesting the model is reasonably well-centered.
  • 3Q: 75% of the residuals are less than 0.6754.
  • Max: The largest residual is 3.3413.

In a good model, the residuals should be approximately symmetrically distributed around zero. The fact that the median is close to zero is a good sign, but it is important to also examine a histogram or density plot of the residuals to assess the overall distribution.

Coefficients Table

This table presents the estimated coefficients for each predictor in the model, along with their standard errors, t-values, and p-values.

                      Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          6.5755654   1.0087470    6.519  2.22e-10
CompPrice            0.0929371   0.0041183   22.567   < 2e-16
Income               0.0108940   0.0026044    4.183  3.57e-05
Advertising          0.0702462   0.0226091    3.107  0.002030
Population           0.0001592   0.0003679    0.433  0.665330
Price               -0.1008064   0.0074399  -13.549   < 2e-16
ShelveLocGood        4.8486762   0.1528378   31.724   < 2e-16
ShelveLocMedium      1.9532620   0.1257682   15.531   < 2e-16
Age                 -0.0579466   0.0159506   -3.633  0.000318
Education           -0.0208525   0.0196131   -1.063  0.288361
UrbanYes             0.1401597   0.1124019    1.247  0.213171
USYes               -0.1575571   0.1489234   -1.058  0.290729
Price:Age            0.0001068   0.0001333    0.801  0.423812
Income:Advertising   0.0007510   0.0002784    2.698  0.007290

Signif. codes

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These codes indicate the level of statistical significance for each coefficient’s p-value:

  • *** indicates a p-value less than 0.001.
  • ** indicates a p-value less than 0.01.
  • * indicates a p-value less than 0.05.
  • . indicates a p-value less than 0.1.

Residual Standard Error

Residual standard error: 1.011 on 386 degrees of freedom

The residual standard error (RSE) is 1.011. This represents the typical size of the prediction errors (the average distance that observed values fall from the regression line), measured in the units of the response variable (thousands of units of car seat sales). The degrees of freedom are 386 (number of observations minus the number of coefficients estimated).

R-squared

Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8719

The Multiple R-squared indicates that the predictors explain about 87.6% of the variability in Sales. The Adjusted R-squared (0.8719), which penalizes the model for the number of predictors, is nearly as large, suggesting the included predictors are collectively useful.

F-statistic

F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

The F-statistic tests the overall significance of the model. The very small p-value (less than 2.2e-16) indicates that at least one of the predictor variables is significantly related to car seat sales. The F-statistic is 210, with 13 degrees of freedom for the numerator (number of predictors) and 386 degrees of freedom for the denominator (number of observations minus the number of predictors minus 1).

Suggestions for Checking Assumptions

To ensure the validity of the linear regression model, it’s crucial to check the key assumptions:

  • Linearity: Plot residuals against fitted values and look for systematic patterns.
  • Homoscedasticity: The spread of the residuals should be roughly constant across fitted values.
  • Normality of residuals: Examine a Q-Q plot or histogram of the residuals.
  • Independence: The observations (here, the 400 stores) should be independent of one another.

Disclaimer: This explanation was generated by a Large Language Model. Critically review the output and consult additional statistical resources or experts to ensure correctness and a full understanding, especially given that the interpretation relies on the provided output and context.

Follow-up Question: Interpreting R-squared

The initial explanation is great, but let’s say a student wants to understand R-squared more deeply for this particular model. We can use the $chat() method of client (an ellmer "Chat" object), which remembers the context of the previous interaction.

query <- paste("Could you explain the R-squared values (Multiple R-squared and",
               "Adjusted R-squared) in simpler terms for this car seat sales",
               "model? What does it practically mean for predicting sales?")
client$chat(query)
#> [1] "## Simplified Explanation of R-squared Values\n\nLet's break down what R-squared and Adjusted R-squared mean in the context of predicting car seat sales:\n\n**Multiple R-squared: 0.8761 (or 87.61%)**\n\nThink of it like this: Imagine you're trying to guess the car seat sales at a store.\n\n*   **Without any information:** If you had absolutely no information about the store (competitor prices, income levels, advertising budgets, etc.), your guesses would likely be pretty far off. There'd be a lot of variation (or \"variance\") in your prediction errors.\n\n*   **With the model:** Now, using our linear regression model (which includes all the predictors like competitor price, income, etc.), we can make much better predictions.\n\nR-squared tells us how much *better* our predictions are with the model compared to just guessing randomly. An R-squared of 0.8761 means that *our model explains 87.61% of the total variation (or variance) in car seat sales across all the stores in our data*. In other words, 87.61% of the differences in sales figures we see between stores can be attributed to the factors included in our model (competitor price, income, advertising, etc.). The remaining 12.39% (100% - 87.61%) is due to other factors *not* included in the model (e.g., random chance, unmeasured variables, store manager skill, local events, etc.).\n\n**In simpler terms:** The model captures a large proportion of the reasons why sales vary from store to store.\n\n**Adjusted R-squared: 0.8719 (or 87.19%)**\n\nAdjusted R-squared is a slightly more refined version of R-squared. The regular R-squared *always* increases when you add more variables to your model, even if those variables don't really help predict sales. It's like saying, \"I can explain sales even better if I add the color of the store's walls to the model!\" (Even though wall color is probably irrelevant).\n\nAdjusted R-squared *penalizes* you for adding useless variables. It only increases if the new variable actually improves the model's predictive power *more than you'd expect by chance*.\n\nIn this case, the Adjusted R-squared (0.8719) is very close to the regular R-squared (0.8761). This suggests that the variables included in our model are, on the whole, useful for predicting sales. The penalty for including 13 variables was very small.\n\n**Practical Meaning for Predicting Sales:**\n\n*   **Good predictive power:** An Adjusted R-squared of 0.8719 indicates that the model has relatively strong predictive power. We can use the model to estimate car seat sales at new stores, and these estimates are likely to be fairly accurate. However, it is essential to remember that there is still 12.81% of the variation that the model *doesn't* explain.\n\n*   **Identifying key factors:** The model helps us identify which factors are most strongly associated with car seat sales (e.g., competitor price, the company's own price, shelf location). This knowledge can be used to make strategic decisions about pricing, advertising, and product placement to improve sales.\n\n*   **Not perfect prediction:** Even with a high R-squared, the model doesn't predict sales *perfectly*. There will still be some error in our predictions. Other factors not included in the model also influence sales.\n\n*   **Useful for comparisons:** The R-squared values are especially useful for comparing this model to other models you might build to predict car seat sales. 
The model with the higher Adjusted R-squared (assuming both models meet the regression assumptions) is generally considered the better model.\n"

The LLM has provided a more detailed explanation of R-squared, tailored to the fm_carseats model and provided context, discussing how much of the variability in Sales is explained by the predictors in the model.

Example 2: Logistic GLM (glm) - Pima Indians Diabetes

Let’s use the Pima.tr data set from the MASS package to fit a logistic regression model. This data set is about the prevalence of diabetes in Pima Indian women. Our goal is to identify factors associated with the likelihood of testing positive for diabetes.

data(Pima.tr, package = "MASS")  # load the Pima.tr data set

# Fit a logistic regression model
fm_pima <- glm(type ~ npreg + glu + bp + skin + bmi + ped + age,
               data = Pima.tr, family = binomial(link = "logit"))
summary(fm_pima)  # print model summary
#> 
#> Call:
#> glm(formula = type ~ npreg + glu + bp + skin + bmi + ped + age, 
#>     family = binomial(link = "logit"), data = Pima.tr)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.9830  -0.6773  -0.3681   0.6439   2.3154  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -9.773062   1.770386  -5.520 3.38e-08 ***
#> npreg        0.103183   0.064694   1.595  0.11073    
#> glu          0.032117   0.006787   4.732 2.22e-06 ***
#> bp          -0.004768   0.018541  -0.257  0.79707    
#> skin        -0.001917   0.022500  -0.085  0.93211    
#> bmi          0.083624   0.042827   1.953  0.05087 .  
#> ped          1.820410   0.665514   2.735  0.00623 ** 
#> age          0.041184   0.022091   1.864  0.06228 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 256.41  on 199  degrees of freedom
#> Residual deviance: 178.39  on 192  degrees of freedom
#> AIC: 194.39
#> 
#> Number of Fisher Scoring iterations: 5

Now, let’s provide some additional context to accompany the output when requesting an explanation from the LLM:

pima_context <- "
This logistic regression model attempts to predict the likelihood of a Pima
Indian woman testing positive for diabetes. The data is from a study on women of
Pima Indian heritage, aged 21 years or older, living near Phoenix, Arizona. The
response variable 'type' is binary: 'Yes' (tests positive for diabetes) or 'No'.

Predictor variables include:
  - npreg: Number of pregnancies.
  - glu: Plasma glucose concentration in an oral glucose tolerance test.
  - bp: Diastolic blood pressure (mm Hg).
  - skin: Triceps skin fold thickness (mm).
  - bmi: Body mass index (weight in kg / (height in m)^2).
  - ped: Diabetes pedigree function (a measure of genetic predisposition).
  - age: Age in years.

The goal is to understand which of these factors are significantly associated
with an increased or decreased odds of having diabetes. We are particularly
interested in interpreting coefficients as odds ratios.
"

# Establish fresh client connection
client <- ellmer::chat_google_gemini(echo = "none")
#> Using model = "gemini-2.0-flash".

This time, we’ll ask statlingua for an explanation, targeting a "researcher" with "moderate" verbosity. This audience would be interested in aspects like odds ratios and model fit.

explain(fm_pima, client = client, context = pima_context,
        audience = "researcher", verbosity = "moderate")
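
Since the context highlights odds ratios, it’s also worth computing them yourself as a cross-check on whatever the LLM reports (exponentiating the coefficients and their profile-likelihood confidence intervals):

# Odds ratios and 95% confidence intervals for cross-checking the explanation
exp(cbind(OR = coef(fm_pima), confint(fm_pima)))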

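Example 3: Cox Proportional Hazards Model (coxph from survival) - Lung Cancer

The explanation below concerns a Cox proportional hazards model for lung cancer survival. A model fit consistent with the formula echoed in the explanation, assuming the survival package’s lung data set and the time-transform example from ?coxph (the exact call and the explain() arguments, such as audience = "manager", are assumptions inferred from the surrounding text), might look like:

library(survival)

# Fit a Cox proportional hazards model to the lung cancer data, allowing the
# effect of age to vary with time via a penalized-spline time transform
fm_lung <- coxph(Surv(time, status) ~ ph.ecog + tt(age), data = lung,
                 tt = function(x, t, ...) pspline(x + t / 365.25))

# Establish fresh client connection and request a concise, manager-friendly
# explanation
client <- ellmer::chat_google_gemini(echo = "none")
explain(fm_lung, client = client, audience = "manager", verbosity = "brief")
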
Explanation of Cox Proportional Hazards Model Output

This output presents the results of a Cox proportional hazards model examining the survival times of lung cancer patients. The model explores the relationship between survival time and ph.ecog (ECOG performance score) and age while accounting for possible non-linear effects of age via a time-transforming spline (tt(age)).

Call:

The model was fit using the formula Surv(time, status) ~ ph.ecog + tt(age), where tt(age) represents a time-transforming function using a spline for age.

Data Summary:

Coefficients Table:

Model Convergence:

The model converged after 4 outer and 10 Newton-Raphson iterations. Theta= 0.7960256 refers to a parameter in the penalized spline fitting process.

Model Fit and Significance:

Recommendations and Cautions:

  1. Proportional Hazards Assumption: Given the time-transforming spline for age, the proportional hazards assumption for age is not assumed. However, it is crucial to assess the proportional hazards assumption for ph.ecog. This can be done using Schoenfeld residuals via cox.zph(). Violations of this assumption would suggest that the effect of ph.ecog changes over time.

  2. Spline Interpretation: While the overall likelihood ratio test is significant, the individual components of the age spline are not. Consider visualizing the effect of the spline using plots to understand how age might be affecting the hazard over time. Given the non-significance of the non-linear component of the age spline, consider refitting the model without the spline (i.e., a standard linear effect for age).

  3. Model Simplicity: If the time-varying effect of age is not critical, simplifying the model by removing the spline and modeling age as a linear term may improve interpretability without sacrificing much predictive power.

  4. Review and Validate: Critically review this interpretation and consult statistical resources or experts to ensure correctness and a full understanding of the model, particularly given the use of splines.

The rendered Markdown output above provides a concise, high-level summary suitable for a manager, focusing on the key predictors of survival and their implications in terms of increased or decreased risk (hazard).

Example 4: Linear Mixed-Effects Model (lmer from lme4) - Sleep Study

Let’s explore the sleepstudy data set from the lme4 package. This data set records the average reaction time per day for subjects in a sleep deprivation study. We’ll fit a linear mixed-effects model to see how reaction time changes over days of sleep deprivation, accounting for random variation among subjects.

This example will also demonstrate the style argument, requesting output as plain text (style = "text") and as a JSON string (style = "json").

library(lme4)

# Load the sleep study data set
data(sleepstudy)

# Fit a linear mixed-effects model allowing for random intercepts and random
# slopes for Days, varying by Subject
fm_sleep <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fm_sleep)  # print model summary
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Reaction ~ Days + (Days | Subject)
#>    Data: sleepstudy
#> 
#> REML criterion at convergence: 1743.6
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.9536 -0.4634  0.0231  0.4634  5.1793 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr
#>  Subject  (Intercept) 612.10   24.741       
#>           Days         35.07    5.922   0.07
#>  Residual             654.94   25.592       
#> Number of obs: 180, groups:  Subject, 18
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  251.405      6.825  36.838
#> Days          10.467      1.546   6.771
#> 
#> Correlation of Fixed Effects:
#>      (Intr)
#> Days -0.138

Now, let’s define context for this sleep study model:

sleepstudy_context <- "
This linear mixed-effects model analyzes data from a sleep deprivation study.
The goal is to understand the effect of days of sleep deprivation ('Days') on
average reaction time ('Reaction' in ms). The model includes random intercepts
and random slopes for 'Days' for each 'Subject', acknowledging that baseline
reaction times and the effect of sleep deprivation may vary across individuals.
We are interested in the average fixed effect of an additional day of sleep
deprivation on reaction time, as well as the extent of inter-subject
variability.
"

# Establish fresh client connection
client <- ellmer::chat_google_gemini(echo = "none")
#> Using model = "gemini-2.0-flash".

Requesting Plain Text Output (style = "text")

Let’s ask statlingua for an explanation as plain text, targeting a "researcher" with "moderate" verbosity.

explain(fm_sleep, client = client, context = sleepstudy_context,
        audience = "researcher", verbosity = "moderate", style = "text")
#> Here's an explanation of the linear mixed-effects model output, incorporating the context of the sleep deprivation study.
#> 
#> LINEAR MIXED-EFFECTS MODEL INTERPRETATION
#> 
#> The model examines the relationship between days of sleep deprivation (Days) and reaction time (Reaction), accounting for individual differences between subjects.  The formula `Reaction ~ Days + (Days | Subject)` indicates that the model includes a fixed effect of 'Days' on 'Reaction' and random intercepts and slopes for 'Days' varying across 'Subject'.
#> 
#> REML VS. ML
#> 
#> The model was fit using REML (Restricted Maximum Likelihood), which is generally preferred for estimating variance components in mixed-effects models.
#> 
#> RANDOM EFFECTS
#> 
#> This section describes the variability between subjects.
#> 
#> *   Subject (Intercept): Variance = 612.10, Std.Dev. = 24.741. This indicates substantial inter-individual variability in baseline reaction times (i.e., at Day 0 of sleep deprivation).  The standard deviation of 24.741 ms suggests that, even before sleep deprivation, subjects' reaction times vary by approximately ±24.741 ms.
#> 
#> *   Subject Days: Variance = 35.07, Std.Dev. = 5.922. This indicates inter-individual variability in the effect of sleep deprivation on reaction time.  Some subjects' reaction times increase more per day of sleep deprivation than others. The standard deviation of 5.922 ms/day suggests that the effect of each additional day of sleep deprivation varies by approximately ±5.922 ms/day across subjects.
#> 
#> *   Corr: 0.07. This is the correlation between random intercepts and random slopes.  A small, positive correlation suggests a weak relationship between a subject's baseline reaction time and how much their reaction time changes with sleep deprivation. Subjects with slightly higher baseline reaction times tend to show a slightly larger increase in reaction time per day of sleep deprivation, although this correlation is very weak.
#> 
#> *   Residual: Variance = 654.94, Std.Dev. = 25.592. This represents the within-subject variability in reaction time that is not explained by the fixed effect of 'Days' or the random effects. This is the variability of reaction time *after* accounting for the subject-specific baselines and slopes. The standard deviation of 25.592 ms represents the typical variation in reaction time for a given subject on a given day.
#> 
#> FIXED EFFECTS
#> 
#> This section describes the average effects across all subjects.
#> 
#> *   (Intercept): Estimate = 251.405, Std. Error = 6.825, t value = 36.838. This is the estimated average reaction time (in ms) at Day 0 of sleep deprivation, across all subjects. The t-value is very large, indicating that this average baseline reaction time is significantly different from zero.
#> 
#> *   Days: Estimate = 10.467, Std. Error = 1.546, t value = 6.771. This is the estimated average increase in reaction time (in ms) for each additional day of sleep deprivation, across all subjects. On average, a subject's reaction time increases by 10.467 ms for each day of sleep deprivation. The t-value is large, indicating a statistically significant effect of days of sleep deprivation on reaction time.
#> 
#> CORRELATION OF FIXED EFFECTS
#> 
#> *   Days -0.138: This is the correlation between the intercept and the slope estimates in the fixed effects part of the model. It's generally less important to interpret than the correlation of random effects. It indicates a slight negative relationship between the estimated average baseline reaction time and the estimated average effect of days.
#> 
#> MODEL APPROPRIATENESS (BASED ON CONTEXT)
#> 
#> Given the repeated measures design (multiple observations per subject), the linear mixed-effects model is appropriate as it accounts for the non-independence of observations within each subject.  The inclusion of both random intercepts and random slopes allows for individual differences in both baseline reaction times and the effect of sleep deprivation, which is a reasonable assumption.  The model assumes linearity between days of sleep deprivation and reaction time.
#> 
#> CHECKING ASSUMPTIONS
#> 
#> To further validate the model:
#> 
#> *   Examine a Q-Q plot of the residuals to assess normality.
#> *   Plot residuals against fitted values to check for homoscedasticity (constant variance).
#> *   Examine Q-Q plots of the random effects for each subject to assess normality of random effects.
#> *   Plot residuals against 'Days' to assess the linearity assumption.
#> 
#> CAUTION: This explanation was generated by a Large Language Model. Critically review the output and consult additional statistical resources or experts to ensure correctness and a full understanding.
#> 

Requesting JSON Output (style = "json")

Now, let’s request the explanation in a structured JSON format (using style = "json"). We’ll target a "student" with "detailed" verbosity.

client <- ellmer::chat_google_gemini(echo = "none")
#> Using model = "gemini-2.0-flash".
ex <- explain(fm_sleep, client = client, context = sleepstudy_context,
              audience = "student", verbosity = "detailed", style = "json")

# The 'text' component of the statlingua_explanation object now holds a JSON
# string which can be parsed using the jsonlite package
jsonlite::prettify(ex$text)
#> {
#>     "title": "Explanation of Linear Mixed-Effects Model for Sleep Deprivation Study",
#>     "model_overview": "This is a linear mixed-effects model, fitted using Restricted Maximum Likelihood (REML), examining the impact of sleep deprivation (Days) on reaction time (Reaction). The model accounts for inter-individual variability by including random intercepts and slopes for each subject, allowing for differences in baseline reaction times and how reaction time changes with sleep deprivation across individuals. This type of model is suitable due to the repeated measures design, where we have multiple observations for each subject.",
#>     "coefficient_interpretation": "The fixed effects estimates provide the average effect of each predictor on reaction time across all subjects. The intercept is estimated at 251.405 ms, which represents the average reaction time at Day 0 (the baseline reaction time) across all subjects. The coefficient for 'Days' is 10.467 ms. This indicates that, on average, reaction time increases by 10.467 milliseconds for each additional day of sleep deprivation. So, each day of sleep deprivation increases reaction time by about 10.5 ms.",
#>     "significance_assessment": "The t-value for the intercept is 36.838 and for 'Days' is 6.771. Although the output doesn't directly provide p-values, these t-values can be used to assess statistical significance. Given the large t-values, especially for 'Days', it's highly likely that the effect of days of sleep deprivation on reaction time is statistically significant. The standard error for 'Days' is 1.546, which provides a measure of the precision of the estimated effect of sleep deprivation. Packages like `lmerTest` can be used with this model to obtain p-values for these fixed effects estimates using methods like Satterthwaite or Kenward-Roger approximations.",
#>     "goodness_of_fit": "The REML criterion at convergence is 1743.6. This value is useful for comparing nested models fitted to the same data. Lower REML values indicate a better fit, but only when comparing models with the same fixed effects. Other metrics like AIC and BIC, which can be obtained from the model, are better suited to non-nested model comparisons.",
#>     "assumptions_check": "Several assumptions should be checked to ensure the validity of this model. First, examine the normality of the residuals using a Q-Q plot or histogram of `resid(object)`. Second, assess homoscedasticity by plotting residuals against fitted values (`fitted(object)`), looking for constant variance. Third, check the normality of the random effects using Q-Q plots of the random effects (`ranef(object)`). Additionally, linearity of the fixed effect 'Days' should be checked, possibly by plotting residuals against 'Days'. Finally, it's important to check for influential observations that might disproportionately affect the model results.",
#>     "key_findings": "- On average, baseline reaction time (Day 0) is estimated to be approximately 251.4 ms.\n- Each additional day of sleep deprivation is associated with an average increase of approximately 10.5 ms in reaction time.\n- There is substantial inter-individual variability in both baseline reaction time (Std.Dev. = 24.741 ms) and the effect of sleep deprivation (Std.Dev. = 5.922 ms) across subjects.\n- The correlation between random intercepts and slopes is low (0.07), suggesting a weak relationship between a subject's baseline reaction time and how their reaction time changes with sleep deprivation.",
#>     "warnings_limitations": "The interpretation relies on the assumption that the model is correctly specified and that the assumptions of linear mixed-effects models are met. The absence of p-values in the base output from `lmer` should be addressed using appropriate packages like `lmerTest` to draw firmer conclusions about statistical significance. Also, remember that correlation does not equal causation; while the model suggests a relationship between sleep deprivation and reaction time, it does not prove that sleep deprivation *causes* the change in reaction time."
#> }
#> 
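
Since the text component holds valid JSON, you can also convert it into a named R list for programmatic access to individual sections:

# Convert the JSON explanation into a named list and pull out one section
ex_list <- jsonlite::fromJSON(ex$text)
names(ex_list)
cat(ex_list$key_findings)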

Inspecting LLM Interaction

If you want to see the exact system and user prompts that statlingua generated and sent to the LLM (via ellmer), as well as the raw response from the LLM, you can print the ellmer "Chat" object (defined as client in this example) after an explain() call. The client object stores the history of the interaction.

print(client)
#> <Chat Google/Gemini/gemini-2.0-flash turns=3 tokens=2337/887 $0.00>
#> ── system [0] ───────────────────────────────────────────────────────────────────────────────────────
#> ## Role
#> 
#> You are an expert statistician and R programmer, gifted at explaining complex concepts simply. Your primary function is to interpret statistical model outputs from R. You understand model nuances, underlying assumptions, and how results relate to real-world research questions.
#> 
#> You are particularly skilled with **Linear Mixed-Effects Models** created using the `lmer()` function from the `lme4` package in R (producing `lmerMod` objects). You understand their estimation via REML or ML, the interpretation of fixed and random effects (including correlations), and the common practice of using packages like `lmerTest` for p-value estimation.
#> 
#> 
#> ## Intended Audience and Verbosity
#> 
#> ### Target Audience: Student
#> Assume the user is learning statistics. Explain concepts thoroughly but clearly, as if teaching. Define key terms as they arise and explain *why* certain statistics are important or what they indicate in the context of the analysis. Maintain an encouraging and educational tone.
#> 
#> ### Level of Detail (Verbosity): Detailed
#> Give a comprehensive interpretation. Include nuances of the model output, a detailed breakdown of all relevant statistics, and a thorough discussion of assumptions and diagnostics. Be as thorough as possible, exploring various facets of the results.
#> 
#> 
#> ## Response Format Specification (Style: Json)
#> 
#> Your response MUST be a valid JSON object that can be parsed directly into an R list.
#> The JSON object should have the following top-level keys, each containing a string with the relevant part of the explanation (formatted as plain text or simple Markdown within the string if appropriate for that section):
#> - "title": A concise title for the explanation.
#> - "model_overview": A general description of the model type and its purpose in this context.
#> - "coefficient_interpretation": Detailed interpretation of model coefficients/parameters.
#> - "significance_assessment": Discussion of p-values, confidence intervals, and statistical significance.
#> - "goodness_of_fit": Evaluation of model fit (e.g., R-squared, AIC, deviance).
#> - "assumptions_check": Comments on important model assumptions and how they might be checked.
#> - "key_findings": A bulleted list (as a single string with newlines `\n` for bullets) of the main conclusions.
#> - "warnings_limitations": Any warnings, limitations, or caveats regarding the model or its interpretation.
#> 
#> Example of expected JSON structure:
#> {
#>   "title": "Explanation of Linear Regression Model for Car Sales",
#>   "model_overview": "This is a linear regression model...",
#>   "coefficient_interpretation": "The coefficient for 'Price' is -0.10, suggesting that...",
#>   "significance_assessment": "The p-value for 'Price' is very small (< 0.001)...",
#>   "goodness_of_fit": "The R-squared value is 0.87, indicating...",
#>   "assumptions_check": "Assumptions such as linearity and homoscedasticity should be checked by examining residual plots.",
#>   "key_findings": "- Price is a significant negative predictor of sales.\n- Advertising has a positive impact on sales.",
#>   "warnings_limitations": "This model is based on simulated data and results should be interpreted with caution."
#> }
#> 
#> Ensure the entire output is ONLY the JSON object.
#> DO NOT wrap your entire response in JSON code fences (e.g., ```json ... ``` or ``` ... ```).
#> DO NOT include any conversational pleasantries or introductory/concluding phrases.
#> 
#> 
#> ## Instructions
#> 
#> You are explaining a **Linear Mixed-Effects Model** (from `lme4::lmer()`, an `lmerMod` object).
#> 
#> **Core Concepts & Purpose:**
#> This model is used for data with hierarchical/nested structures or repeated measures, where observations are not independent. It models fixed effects (average effects of predictors) and random effects (variability between groups/subjects for intercepts and/or slopes).
#> 
#> **Key Assumptions:**
#> * Linearity: The relationship between predictors and the outcome is linear.
#> * Independence of random effects and errors: Random effects are independent of each other and of the residuals (conditional on covariates).
#> * Normality of random effects: Random effects for each grouping factor are normally distributed (typically with mean 0).
#> * Normality of residuals: The errors (residuals) are normally distributed with a mean of 0.
#> * Homoscedasticity: Residuals have constant variance.
#> * (Note on p-values: `lme4::lmer` itself does not calculate p-values for fixed effects by default due to complexities with degrees of freedom. If p-values are present, they often come from wrapper functions or packages like `lmerTest` using approximations like Satterthwaite's or Kenward-Roger methods.)
#> 
#> **Assessing Model Appropriateness (Based on User Context):**
#> If the user provides context:
#> * Comment on the model's appropriateness based on the data structure (e.g., repeated measures, nesting) and research question.
#> * Relate this to the model's assumptions.
#> If no or insufficient context, state inability to fully assess appropriateness.
#> 
#> **Interpretation of the `lmerMod` Output:**
#> * **Formula and Data:** Briefly reiterate what is being modeled.
#> * **REML vs ML:** Note if Restricted Maximum Likelihood (REML) or Maximum Likelihood (ML) was used (REML is default and often preferred for variance components; ML for likelihood ratio tests of fixed effects).
#> * **Random Effects (from `VarCorr()` output):**
#>     * For each grouping factor (e.g., `(1 | group_factor)` or `(predictor | group_factor)`):
#>         * **Variance and Std.Dev. (for intercepts):** Quantifies between-group variability in the baseline outcome.
#>         * **Variance and Std.Dev. (for slopes):** Quantifies how much the effect of that predictor varies across groups.
#>         * **Correlation of Random Effects (e.g., `Corr` between random intercept and slope for the same group):** Explain its meaning (e.g., a correlation of -0.5 between intercept and slope for 'day' within 'subject' means subjects with higher initial values tend to have a less steep increase over days).
#>     * **Residual Variance/Std.Dev.:** Within-group or unexplained variability.
#> * **Fixed Effects (from `coef(summary(object))`):**
#>     * For each predictor:
#>         * **Estimate:** The estimated average effect on the outcome for a one-unit change in the predictor (or difference from the reference level for categorical predictors), accounting for random effects.
#>         * **Std. Error:** Precision of the estimate.
#>         * **t-value (or z-value):** `Estimate / Std. Error`.
#>         * **P-values (if available, typically from `lmerTest` via `summary(as(object, "lmerModLmerTest"))` or `anova(object)` from `lmerTest`):** Interpret as the probability of observing such an extreme t-value if the true fixed effect is zero. If p-values are NOT directly in the basic `summary(object)` output, mention they usually come from add-on packages.
#> * **ANOVA Table for Fixed Effects (if provided, typically from `lmerTest::anova()`):**
#>     * Tests the overall significance of fixed effects. For each term: interpret F-statistic, degrees of freedom (NumDF, DenDF), and p-value.
#> * **Model Fit Statistics (AIC, BIC, logLik, deviance):**
#>     * Explain as measures for model comparison. Lower AIC/BIC, higher logLik (less negative deviance) generally indicate better relative fit.
#> 
#> **Suggestions for Checking Assumptions:**
#> * **Normality of Residuals:** Suggest Q-Q plot or histogram of residuals (`resid(object)`).
#> * **Homoscedasticity:** Suggest plotting residuals vs. fitted values (`fitted(object)`). Look for non-constant spread.
#> * **Normality of Random Effects:** Suggest Q-Q plots of the random effects (e.g., from `ranef(object)`).
#> * **Linearity (Fixed Effects):** Check by plotting residuals against continuous predictors.
#> * **Influential Observations:** Mention checking for influential data points.
#> * **Singularity:** If the model is reported as singular (e.g. random effect variances near zero or correlations near +/-1), this often indicates an overly complex random effects structure that the data cannot support.
#> 
#> **Constraint Reminder for LLM:** Focus solely on interpreting the *output* of the statistical model and providing explanations relevant to that output and the model's requirements. Do not perform new calculations or suggest alternative analyses unless directly prompted by assessing the appropriateness based on provided context. **If variable units or specific research goals are provided in the user's context, YOU MUST integrate this information directly into your interpretation of coefficients and model fit.**
#> 
#> 
#> ## Caution
#> 
#> This explanation was generated by a Large Language Model. Advise the user to critically review the output and consult additional statistical resources or experts to ensure correctness and a full understanding, especially given that the interpretation relies on the provided output and context.
#> ── user [2337] ──────────────────────────────────────────────────────────────────────────────────────
#> Explain the following linear mixed-effects model output:
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Reaction ~ Days + (Days | Subject)
#>    Data: sleepstudy
#> 
#> REML criterion at convergence: 1743.6
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.9536 -0.4634  0.0231  0.4634  5.1793 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr
#>  Subject  (Intercept) 612.10   24.741       
#>           Days         35.07    5.922   0.07
#>  Residual             654.94   25.592       
#> Number of obs: 180, groups:  Subject, 18
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)  251.405      6.825  36.838
#> Days          10.467      1.546   6.771
#> 
#> Correlation of Fixed Effects:
#>      (Intr)
#> Days -0.138
#> 
#> ## Additional context to consider
#> 
#> 
#> This linear mixed-effects model analyzes data from a sleep deprivation study.
#> The goal is to understand the effect of days of sleep deprivation ('Days') on
#> average reaction time ('Reaction' in ms). The model includes random intercepts
#> and random slopes for 'Days' for each 'Subject', acknowledging that baseline
#> reaction times and the effect of sleep deprivation may vary across individuals.
#> We are interested in the average fixed effect of an additional day of sleep
#> deprivation on reaction time, as well as the extent of inter-subject
#> variability.
#> 
#> ── assistant [887] ──────────────────────────────────────────────────────────────────────────────────
#> ```json
#> {
#>   "title": "Explanation of Linear Mixed-Effects Model for Sleep Deprivation Study",
#>   "model_overview": "This is a linear mixed-effects model, fitted using Restricted Maximum Likelihood (REML), examining the impact of sleep deprivation (Days) on reaction time (Reaction). The model accounts for inter-individual variability by including random intercepts and slopes for each subject, allowing for differences in baseline reaction times and how reaction time changes with sleep deprivation across individuals. This type of model is suitable due to the repeated measures design, where we have multiple observations for each subject.",
#>   "coefficient_interpretation": "The fixed effects estimates provide the average effect of each predictor on reaction time across all subjects. The intercept is estimated at 251.405 ms, which represents the average reaction time at Day 0 (the baseline reaction time) across all subjects. The coefficient for 'Days' is 10.467 ms. This indicates that, on average, reaction time increases by 10.467 milliseconds for each additional day of sleep deprivation. So, each day of sleep deprivation increases reaction time by about 10.5 ms.",
#>   "significance_assessment": "The t-value for the intercept is 36.838 and for 'Days' is 6.771. Although the output doesn't directly provide p-values, these t-values can be used to assess statistical significance. Given the large t-values, especially for 'Days', it's highly likely that the effect of days of sleep deprivation on reaction time is statistically significant. The standard error for 'Days' is 1.546, which provides a measure of the precision of the estimated effect of sleep deprivation. Packages like `lmerTest` can be used with this model to obtain p-values for these fixed effects estimates using methods like Satterthwaite or Kenward-Roger approximations.",
#>   "goodness_of_fit": "The REML criterion at convergence is 1743.6. This value is useful for comparing nested models fitted to the same data. Lower REML values indicate a better fit, but only when comparing models with the same fixed effects. Other metrics like AIC and BIC, which can be obtained from the model, are better suited to non-nested model comparisons.",
#>   "assumptions_check": "Several assumptions should be checked to ensure the validity of this model. First, examine the normality of the residuals using a Q-Q plot or histogram of `resid(object)`. Second, assess homoscedasticity by plotting residuals against fitted values (`fitted(object)`), looking for constant variance. Third, check the normality of the random effects using Q-Q plots of the random effects (`ranef(object)`). Additionally, linearity of the fixed effect 'Days' should be checked, possibly by plotting residuals against 'Days'. Finally, it's important to check for influential observations that might disproportionately affect the model results.",
#>   "key_findings": "- On average, baseline reaction time (Day 0) is estimated to be approximately 251.4 ms.\n- Each additional day of sleep deprivation is associated with an average increase of approximately 10.5 ms in reaction time.\n- There is substantial inter-individual variability in both baseline reaction time (Std.Dev. = 24.741 ms) and the effect of sleep deprivation (Std.Dev. = 5.922 ms) across subjects.\n- The correlation between random intercepts and slopes is low (0.07), suggesting a weak relationship between a subject's baseline reaction time and how their reaction time changes with sleep deprivation.",
#>   "warnings_limitations": "The interpretation relies on the assumption that the model is correctly specified and that the assumptions of linear mixed-effects models are met. The absence of p-values in the base output from `lmer` should be addressed using appropriate packages like `lmerTest` to draw firmer conclusions about statistical significance. Also, remember that correlation does not equal causation; while the model suggests a relationship between sleep deprivation and reaction time, it does not prove that sleep deprivation *causes* the change in reaction time."
#> }
#> ```

This transparency is invaluable for debugging, understanding the process, or even refining prompts if you were developing custom extensions for statlingua.
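
If you prefer programmatic access over printing, ellmer’s "Chat" objects also expose the conversation history; for example, client$get_turns() returns the individual system/user/assistant turns (consult ellmer’s documentation for the full interface):

# Extract the individual turns from the chat history
turns <- client$get_turns()
length(turns)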

Conclusion

The statlingua package, working hand-in-hand with ellmer, provides a powerful and user-friendly bridge to the interpretive capabilities of Large Language Models for R users. By mastering the explain() function and its arguments—especially context, audience, verbosity, and style—you can transform standard statistical outputs into rich, understandable narratives tailored to your needs.

Important Considerations:

  • The quality of the LLM’s explanation is heavily influenced by the clarity of the context you provide and the inherent capabilities of the LLM you choose (in this vignette, we’ve focused on Google Gemini).
  • LLM output variability: While statlingua uses detailed prompts to guide the LLM towards the desired output style and content, the nature of generative AI means that responses can vary. The requested style is an aim, and while statlingua includes measures to clean the output (like removing language fences), the exact formatting and content may not always be perfectly consistent across repeated calls or different LLM versions. Always critically review the generated explanations.
  • For the style = "json" option, which requests JSON output, ensure the jsonlite package is available if you intend to parse the JSON string into an R list within your session.

Remember to critically review all explanations generated by the LLM.

Happy explaining!