Bandwidth Selection and Conley Spatial HAC Standard Errors

Alexander Lehner

2026-03-20

Overview

The SpatialInference package provides tools for statistical inference with spatially correlated data. Its main features are:

  1. Conley spatial HAC standard errors — a computationally efficient implementation based on C++ code by Darin Christensen [@ChristensenHartmanSamii2021] of the estimator introduced by @Conley1999.
  2. Covariogram-based bandwidth selection — a data-driven method for choosing the kernel bandwidth, based on extracting the correlation range from the empirical covariogram of regression residuals, as proposed by @Lehner2026.
  3. Diagnostic visualizations — including the inverse-U relationship between bandwidth and Conley SE magnitude [@Lehner2026], and covariogram plots with annotated range estimates.

Setup

Load the package and data containing the centroids of all 3,108 counties of the contiguous United States:

library(SpatialInference)
library(sf); library(ggplot2)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(lfe)
#> Loading required package: Matrix
library(modelsummary)
data("US_counties_centroids")

ggplot(US_counties_centroids) + geom_sf(size = .1) + theme_bw()

[Figure: map of the county centroids of the contiguous United States]

Simulated spatially correlated data

The dataset contains two independently generated spatially correlated noise variables (noise1 and noise2), drawn using geostatistical simulation. Although drawn independently, both exhibit strong spatial clustering — nearby counties have similar values:

ggplot(US_counties_centroids) + geom_sf(aes(col = noise1), size = .1) + theme_bw()
ggplot(US_counties_centroids) + geom_sf(aes(col = noise2), size = .1) + theme_bw()

[Figures: maps of noise1 and noise2 across counties]
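To build intuition for where such variables come from, here is one way a spatially correlated Gaussian field can be generated with gstat's unconditional simulation on a toy grid. This is only a sketch: the grid, variogram model, and parameters below are illustrative choices, not the settings used to create noise1 and noise2.

```r
library(gstat)

# Toy 50 x 50 grid of locations
xy <- expand.grid(x = 1:50, y = 1:50)

# Unconditional Gaussian simulation: dummy = TRUE with a fixed mean (beta)
# and an exponential variogram model with range 10 (illustrative values)
g <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0,
           model = vgm(psill = 1, model = "Exp", range = 10), nmax = 20)

# Draw one realisation; nearby cells will have similar values
sim <- predict(g, newdata = xy, nsim = 1)
```

Drawing two independent realisations from the same model reproduces the key feature of the example data: each field is spatially smooth, yet the two are unrelated by construction.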

For comparison, the dist variable is an example of spatial non-stationarity (a spatial trend rather than spatial autocorrelation):

ggplot(US_counties_centroids) + geom_sf(aes(col = dist), size = .1) + theme_bw()

[Figure: map of the dist variable, a smooth spatial trend]

The spurious regression problem

Even though the two noise variables were drawn independently (so the true coefficient is zero), a naive regression finds a highly significant relationship:

spuriouslm <- fixest::feols(noise1 ~ noise2, data = US_counties_centroids, vcov = "HC1")
spuriouslm
#> OLS estimation, Dep. Var.: noise1
#> Observations: 3,108
#> Standard-errors: Heteroskedasticity-robust 
#>                 Estimate Std. Error      t value   Pr(>|t|)    
#> (Intercept) 6.890000e-16   0.017872 3.860000e-14 1.0000e+00    
#> noise2      8.738022e-02   0.015535 5.624836e+00 2.0216e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.996015   Adj. R2: 0.007316

This is because the OLS errors \(\varepsilon_i\) are not i.i.d. — nearby residuals are positively correlated, which inflates the \(t\)-statistic. A map of the regression residuals reveals the spatial clustering:

US_counties_centroids$resid <- spuriouslm$residuals
ggplot(US_counties_centroids) +
  geom_sf(aes(col = resid), size = .1) + theme_bw() + scale_color_viridis_c()

[Figure: map of the regression residuals, showing spatial clustering]
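The visual impression of clustering can be backed by a formal test. As a sketch (assuming the spdep package, with k-nearest-neighbour weights as an illustrative choice), Moran's I on the residuals would look like:

```r
# Moran's I test for spatial autocorrelation in the residuals
# (sketch; k = 5 nearest neighbours is an arbitrary illustrative choice)
coords <- sf::st_coordinates(US_counties_centroids)
nb <- spdep::knn2nb(spdep::knearneigh(coords, k = 5))
lw <- spdep::nb2listw(nb)
spdep::moran.test(US_counties_centroids$resid, lw)
```

A significantly positive Moran's I confirms that nearby residuals co-move, which is exactly the condition under which HC1 standard errors are too small.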

Estimating the correlation range

Before computing Conley standard errors, we need to choose a bandwidth: the distance within which residuals are allowed to be correlated. The covgm_range() function estimates this from the data by computing the empirical covariogram of the regression residuals and identifying the distance at which covariation first crosses zero [@Lehner2026; @Pebesma2004].

covgm_range(US_counties_centroids) + theme_bw()

[Figure: empirical covariogram of the residuals, with the estimated range annotated]

The plot shows the empirical covariogram — each point represents the average cross-product of residual pairs within a distance bin. The red vertical line marks the estimated correlation range \(\hat{\varsigma}\) (here around 831 km), and the red dashed line marks zero covariance. Beyond the estimated range, the covariogram fluctuates around zero, indicating that residual correlation has effectively ceased.

This is the key input for the Conley standard error: the bandwidth should match the distance over which residuals are actually correlated.

To extract just the numeric range for programmatic use:

# Compute the covariogram manually
covgm <- gstat::variogram(
  spuriouslm$residuals ~ 1,
  data = sf::as_Spatial(US_counties_centroids),
  covariogram = TRUE,
  width = 2e4,
  cutoff = as.numeric(max(sf::st_distance(US_counties_centroids))) * (2/3)
)
estimated_range <- extract_corr_range(covgm)
estimated_range
#> [1] 820.0129
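The logic behind this extraction can be sketched in a few lines of base R. The helper below is an illustration of the zero-crossing idea, not the package's exact implementation: given binned distances and covariogram values, it finds the first distance at which the covariogram crosses zero, interpolating linearly within the crossing bin.

```r
# Sketch of the zero-crossing logic: find the first distance at which the
# binned covariogram turns non-positive, interpolating within that bin.
first_zero_crossing <- function(dist, gamma) {
  i <- which(gamma <= 0)[1]              # first non-positive bin
  if (is.na(i) || i == 1) return(NA_real_)
  # linear interpolation between the last positive bin and the crossing bin
  d0 <- dist[i - 1]; d1 <- dist[i]
  g0 <- gamma[i - 1]; g1 <- gamma[i]
  d0 + (d1 - d0) * g0 / (g0 - g1)
}

# Toy covariogram decaying through zero between 500 and 700
first_zero_crossing(c(100, 300, 500, 700), c(0.8, 0.3, 0.1, -0.1))
#> [1] 600
```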

The inverse-U relationship

A critical insight documented by @Lehner2026 is that the relationship between the bandwidth and the magnitude of Conley standard errors follows an inverse-U shape. This means that bandwidths that are either too narrow or too wide lead to underestimated standard errors, contradicting the conventional wisdom that wider bandwidths are always more conservative.

The inverseu_plot_conleyrange() function visualises this:

inverseu_plot_conleyrange(US_counties_centroids, seq(1, 2501, by = 200)) + theme_bw()

[Figure: Conley SE as a function of the distance cutoff, with the HC1 benchmark]

The grey dashed line represents the HC1 standard error. As the cutoff distance increases from zero, the Conley SE first rises (as positively correlated residual pairs are included), reaches a peak near the true correlation range, and then declines (as uncorrelated pairs dilute the estimate). At very wide bandwidths, the Conley SE can fall below the HC1 level — making wide-bandwidth Conley errors less conservative than heteroskedasticity-robust errors.

Computing Conley standard errors

With the estimated range in hand, we compute the spatial HAC standard error using conley_SE():

spuriouslm_fe <- felm(noise1 ~ noise2 | unit + year | 0 | lat + lon,
                       data = US_counties_centroids, keepCX = TRUE)
regfe_conley <- conley_SE(reg = spuriouslm_fe, unit = "unit", time = "year",
                          lat = "lat", lon = "lon",
                          kernel = "epanechnikov", dist_fn = "Haversine",
                          dist_cutoff = 831)

conley <- round(sapply(regfe_conley, function(x) sqrt(diag(x)))[2], 5)
conley
#> Spatial.noise2 
#>          0.137

The Conley SE at the estimated bandwidth is substantially larger than the HC1 error. Constructing a corrected \(t\)-statistic:

spuriouslm_fe$coefficients[1] / as.numeric(conley)
#> [1] 0.6378118

The \(t\)-statistic is far below 1.96, so we correctly fail to reject the null hypothesis — as expected, given that the two variables are genuinely independent.
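Equivalently, the corrected 95% confidence interval comfortably spans zero. A quick check, plugging in the rounded numbers from the output above:

```r
# 95% CI with the Conley SE (numbers taken from the output above)
beta <- 0.0874          # point estimate on noise2 (rounded)
conley_se <- 0.137      # Conley SE at the estimated bandwidth
beta + c(-1.96, 1.96) * conley_se
#> [1] -0.18112  0.35592
```

The interval covers zero by a wide margin, in line with the variables being genuinely independent.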

Convenience wrappers

The compute_conley_lfe() function provides a shortcut:

compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
#> [1] 0.137

For a complete workflow — regression, Moran’s I test, and Conley SEs in one call — use lm_sac():

lmsac.out <- lm_sac("noise1 ~ noise2 | unit + year | 0 | lat + lon",
                     US_counties_centroids,
                     conley_cutoff = 831, conley_kernel = "epanechnikov")

lmsac.out$conley_SE
#> [1] 0.0000000 0.1370043

gm.param <- list(
  list("raw" = "nobs", "clean" = "Observations", "fmt" = 0),
  list("raw" = "Moran_y", "clean" = "Moran's I [y]", "fmt" = 3),
  list("raw" = "Moran_resid", "clean" = "Moran's I [resid]", "fmt" = 3),
  list("raw" = "y_mean", "clean" = "Mean [y]", "fmt" = 3),
  list("raw" = "y_SD", "clean" = "Std.Dev. [y]", "fmt" = 3)
)

modelsummary(lmsac.out,
             estimate = c("{estimate}"),
             statistic = c("({std.error})", "([{conley}])"),
             gof_omit = "Model|Range_resid|Range_resp|Range_y",
             gof_map = gm.param)
                    (1)
(Intercept)         0.000
                    (0.018)
                    ([0.137])
noise2              0.087
                    (0.018)
                    ([0.137])
Observations        3108

Kernel choices

The spatial HAC estimate is sensitive not only to the bandwidth but also to the kernel function. The package includes six kernels with different distance-decay profiles. The uniform kernel weights all pairs equally within the cutoff; the others downweight more distant pairs with varying decay rates:

compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "uniform")
#> [1] 0.13964
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
#> [1] 0.137
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "bartlett")
#> [1] 0.12253
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "parzen")
#> [1] 0.11584
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "gaussian")
#> [1] 0.13799
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "biweight")
#> [1] 0.13087
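The differences above trace back to the kernels' decay profiles. As a sketch (illustrative formulas, not necessarily the package's exact implementation), three of the weight functions can be written as maps from the normalised distance u = d/cutoff to a weight in [0, 1]:

```r
# Illustrative kernel weight functions on the normalised distance u = d/cutoff
w_uniform      <- function(u) as.numeric(u <= 1)   # flat weight within the cutoff
w_bartlett     <- function(u) pmax(0, 1 - u)       # linear decay to zero
w_epanechnikov <- function(u) pmax(0, 1 - u^2)     # quadratic decay to zero

u <- c(0, 0.5, 1)
rbind(uniform      = w_uniform(u),
      bartlett     = w_bartlett(u),
      epanechnikov = w_epanechnikov(u))
```

At half the cutoff, the uniform kernel still assigns full weight, Bartlett assigns 0.5, and Epanechnikov 0.75, which is why the uniform kernel tends to produce the largest SE here and the faster-decaying kernels smaller ones.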

Based on Monte Carlo evidence in @Lehner2026, the Bartlett and Epanechnikov kernels tend to deliver the best size control in practice.

References