R · lavaan · Performance · Structural Equation Modeling · Benchmarking

Fast Paths in lavaan: A Performance Study

A research-style account of a lavaan performance PR, using matched profiling runs to connect internal fast paths with measured latency and allocation reductions.

2026-05-24 · 18 min read · published · Hard

Abstract

This post studies my PR N3uralN3twork/lavaan#4, feature/fast-v6, as a performance intervention in the lavaan structural equation modeling library. The largest change is not a single algorithmic replacement, but a set of coordinated fast paths: direct baseline-model construction, cached partable metadata, unified model-implied moment calculation, cheaper matrix operations, lower-allocation missing-data likelihood code, and more efficient MIIV/PML linear algebra.

I evaluated the code from the PR against the master branch using a benchmark suite of 25 diverse scenarios.

| Run | Branch | lavaan version | Scenarios | Timed iterations | Timing context |
| --- | --- | --- | --- | --- | --- |
| baseline | master | 0.6.22.2653 | 25 | 1000 each | isolated authoritative |
| V6-Baseline | feature/fast-v6 | 0.6.22.2658 | 25 | 1000 each | isolated authoritative |

Across the 25 matched scenarios, the PR reduced aggregate mean latency by 56.0%, aggregate median latency by 58.6%, and aggregate p95 latency by 48.7%. All matched scenarios improved on median latency. Among the 23 scenarios with allocation measurements in both runs, aggregate allocation fell by 29.9%.

Research Question

The practical question was:

Can a broad internal fast-path PR reduce routine lavaan fitting overhead across diverse SEM scenarios without narrowing the public modeling surface?

That question matters because SEM software spends much of its time in repeated bookkeeping: building parameter tables, deriving observed and latent variable names, producing model-implied covariance and mean structures, differentiating those implied quantities, and constructing baseline models for fit indices. If those operations allocate too much or recompute the same internal state, a user-facing model can feel slow even when the statistical work is modest.

The PR attacks that cost by preserving the existing interface and changing the shape of internal work. It favors cached metadata, direct construction for common cases, fused calculations, and linear algebra helpers that avoid materializing large intermediate matrices.

Methods

The evidence comes from the profiling artifacts in:

FastLavaan\benchmark\runs\baseline
FastLavaan\benchmark\runs\V6-Baseline

Each run contains a profiling-summary.xlsx workbook and a self-contained transcript.html. The benchmark harness used callr, ran each scenario in an isolated authoritative timing context, used 3 orchestration workers, and measured 1000 timed iterations after 2 warmup iterations per scenario.

The analysis below compares only scenario labels present in both runs. For a scenario s, the median speedup is:

$$\mathrm{speedup}_s = \frac{\mathrm{median}_{\mathrm{baseline},s}}{\mathrm{median}_{\mathrm{v6},s}}$$

The aggregate reduction for a latency metric is computed from the sum of matched scenario latencies:

$$\mathrm{reduction} = \frac{\sum_s t_{\mathrm{baseline},s} - \sum_s t_{\mathrm{v6},s}}{\sum_s t_{\mathrm{baseline},s}}$$

This is intentionally a benchmark comparison, not a proof that every individual code edit caused a specific share of the speedup.
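
As a sanity check, both formulas can be applied to the per-scenario medians listed in the full matched scenario table later in this post. The sketch below is plain base R. The variable names are mine, and the values are copied from that table in row order.

```r
# Median latencies (ms) copied from the full matched scenario table,
# in the table's row order. Variable names are illustrative.
baseline_ms <- c(
  44.45, 685.90, 41.78, 65.55, 43.06, 75.12, 63.97, 92.27, 47.31, 55.41,
  44.54, 269.04, 56.99, 52.79, 92.41, 62.73, 80.27, 59.69, 153.10, 412.16,
  333.74, 104.53, 134.91, 342.44, 239.02
)
v6_ms <- c(
  10.02, 171.28, 10.83, 17.31, 12.14, 21.38, 18.72, 27.35, 14.03, 16.73,
  13.70, 89.31, 19.12, 17.99, 32.17, 22.24, 29.49, 24.16, 67.22, 200.24,
  163.24, 53.00, 81.92, 220.71, 156.73
)

# Per-scenario speedup: baseline median divided by V6 median.
speedup <- baseline_ms / v6_ms

# Aggregate reduction: computed from sums, so slow scenarios weigh more.
aggregate_reduction <- (sum(baseline_ms) - sum(v6_ms)) / sum(baseline_ms)

# Median of the per-scenario reductions: every scenario weighs the same.
per_scenario_reduction <- 1 - v6_ms / baseline_ms
median_reduction <- median(per_scenario_reduction)

round(100 * aggregate_reduction, 1)
round(100 * median_reduction, 1)
round(max(speedup), 2)
```

With these inputs the aggregate reduction lands near 58.6% and the median per-scenario reduction near 66.5%, in line with the median-latency figures reported in the Results section. The two differ because the sum-based aggregate is dominated by slow scenarios: the bootstrap scenario alone accounts for 685.90 ms of the 3653.18 ms baseline total.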

Intervention

The PR changes can be read as six coordinated interventions.

1. Baseline Models Stop Taking the Scenic Route

The new file R/lav_lavaan_baseline_utils.R adds a direct path for standard ML independence baseline models. It validates whether the model is eligible, constructs the baseline partable from sample moments, computes the baseline objective from log determinants, and returns the standard chi-square test without forcing the full baseline model through the same expensive path as a general SEM.

There are separate payload builders for simple single-group ML, conditional-x ML, and generated independence models. The key design is conservative: return NULL when the assumptions are not met, and let the existing lavaan machinery handle unsupported cases.
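
To make the "log determinants" remark concrete, here is a minimal base R sketch of the idea. It is not the PR's code. The function name `baseline_chisq_direct()` is hypothetical, and the sketch assumes complete data, a single group, a covariance-only model, and a sample covariance computed with an N divisor. Under those assumptions the independence model implies diag(S), so the ML discrepancy collapses to a difference of log determinants.

```r
# Hypothetical helper (not lavaan's API): chi-square of the independence
# baseline model under ML, computed directly from a sample covariance matrix.
baseline_chisq_direct <- function(S, n_obs) {
  # Conservative eligibility checks: return NULL so a caller can fall back
  # to a general fitting path.
  if (!is.matrix(S) || nrow(S) != ncol(S) || anyNA(S)) {
    return(NULL)
  }
  if (!isSymmetric(unname(S))) {
    return(NULL)
  }
  logdet <- determinant(S, logarithm = TRUE)
  if (logdet$sign <= 0 || any(diag(S) <= 0)) {
    return(NULL)
  }

  p <- nrow(S)
  # Under independence the implied covariance is diag(S), so the ML
  # discrepancy reduces to log|diag(S)| - log|S|.
  f_ml <- sum(log(diag(S))) - as.numeric(logdet$modulus)
  chisq <- n_obs * f_ml          # assumes N scaling, not N - 1
  df <- p * (p - 1) / 2

  list(
    chisq = chisq,
    df = df,
    pvalue = pchisq(chisq, df = df, lower.tail = FALSE)
  )
}

S <- matrix(c(1.0, 0.5,
              0.5, 1.0), nrow = 2, byrow = TRUE)
fit <- baseline_chisq_direct(S, n_obs = 100)
fit$chisq                                             # -100 * log(0.75)
baseline_chisq_direct(matrix(NA_real_, 2, 2), 100)    # NULL: not eligible
```

The NULL return mirrors the conservative design described above: anything the direct path cannot vouch for is left to the general machinery.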

2. Partables and Variable Names Become Cacheable Evidence

Several changes reduce repeated scans over the parameter table:

| Area | Change |
| --- | --- |
| Observed-variable discovery | lav_lavaan_step01_ovnames_group_fast() handles simple single-group CFA models directly. |
| Partable construction | lav_lavaan_step04_partable_cfa_fast() constructs a common CFA partable without walking the full general machinery. |
| Cached attributes | lav_partable_set_cache() now updates attributes in bulk, and lav_partable_remove_cache() keeps only the intended attributes. |
| Block lookup | lav_partable_block_values() can reuse block.values. |
| Name lookup | lav_partable_vnames() can serve cached block, group, and level queries. |
| Model setup | lav_model() detects already-complete cached partables and avoids redundant completion work. |

This matters because variable-name and partable operations are classic constant-factor costs. They are individually small, but repeated in almost every model construction, baseline construction, post-check, and downstream inspection path.
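
The caching pattern itself is small. The sketch below is a toy illustration in base R, not lavaan's implementation: `derive_vnames()`, `with_vnames_cache()`, and `get_vnames()` are hypothetical names, and the parameter table is reduced to three columns.

```r
# Hypothetical helpers (not lavaan's API): derive variable names from a tiny
# parameter table once, store them as an attribute, and reuse them afterward.
derive_vnames <- function(partable) {
  lv <- unique(partable$lhs[partable$op == "=~"])
  all_names <- unique(c(partable$lhs, partable$rhs))
  list(lv = lv, ov = setdiff(all_names, lv))
}

with_vnames_cache <- function(partable) {
  # R copies on modify, so the caller must keep the returned object.
  attr(partable, "vnames") <- derive_vnames(partable)
  partable
}

get_vnames <- function(partable, type = c("ov", "lv")) {
  type <- match.arg(type)
  cached <- attr(partable, "vnames")
  if (is.null(cached)) {
    # Cache miss: fall back to a fresh scan of the table.
    cached <- derive_vnames(partable)
  }
  cached[[type]]
}

partable <- data.frame(
  lhs = c("visual", "visual", "visual", "x1"),
  op  = c("=~", "=~", "=~", "~~"),
  rhs = c("x1", "x2", "x3", "x1"),
  stringsAsFactors = FALSE
)
partable <- with_vnames_cache(partable)
get_vnames(partable, "ov")
get_vnames(partable, "lv")
```

The price of this pattern is invalidation: any code that edits the table has to refresh or drop the attribute, which is why helpers for setting and removing the cache travel together.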

3. Implied Moments Are Computed Once, Then Shared

Before this PR, many callers asked separately for model-implied covariance matrices, means, thresholds, and slopes. The PR introduces lav_model_implied_fast() and state-aware use of the same idea throughout the objective, gradient, bootstrap, simulation, and PML paths.

The LISREL and RAM representations gained specialized helpers:

| Representation | New fast helper | Purpose |
| --- | --- | --- |
| LISREL | lav_lisrel_implied_fast() | Computes requested sigma, mu, thresholds, and slopes while reusing shared matrix products. |
| LISREL | lav_lisrel_vy_diag() | Computes only the observed variance diagonal when the full covariance is unnecessary. |
| LISREL | lav_lisrel_beta_dag_order() | Detects DAG orderings that make (I - BETA)^-1 cheaper. |
| RAM | lav_ram_implied_fast() | Computes requested observed covariance and mean quantities with one inverse. |
| RAM | lav_ram_sigmahat_diag() | Computes covariance diagonals without constructing the whole observed covariance. |

The same fast implied-moment path is then used in bootstrap transformations, old simulation code, pairwise likelihood ratio testing, objective evaluation, and gradient evaluation.
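
The sharing is easiest to see in the textbook all-y LISREL formulas, where both the covariance and the mean go through the product of the loadings and (I - BETA)^-1. The following base R sketch is my own simplified illustration rather than the PR's helper: `implied_fused()` is a hypothetical name, and thresholds and slopes are left out.

```r
# Hypothetical sketch (not lavaan's code) of a fused implied-moment helper
# for an all-y LISREL model:
#   Sigma = L (I - B)^-1 Psi (I - B)^-T L' + Theta
#   mu    = nu + L (I - B)^-1 alpha
implied_fused <- function(lambda, beta, psi, theta, nu, alpha,
                          need_sigma = TRUE, need_mu = TRUE) {
  n_lv <- ncol(lambda)
  # Shared product, computed once and reused by both moments.
  ib_inv <- solve(diag(n_lv) - beta)
  lambda_ib_inv <- lambda %*% ib_inv

  out <- list(sigma = NULL, mu = NULL)
  if (need_sigma) {
    out$sigma <- lambda_ib_inv %*% psi %*% t(lambda_ib_inv) + theta
  }
  if (need_mu) {
    out$mu <- as.numeric(nu + lambda_ib_inv %*% alpha)
  }
  out
}

# Two latent variables, four indicators, one regression f2 ~ f1.
lambda <- matrix(c(1, 0.8, 0, 0,
                   0, 0,   1, 0.7), nrow = 4)
beta   <- matrix(c(0, 0.5, 0, 0), nrow = 2)   # beta[2, 1] = 0.5
psi    <- diag(c(1.0, 0.6))
theta  <- diag(0.3, 4)
nu     <- c(0, 0.1, 0, 0.2)
alpha  <- c(1, 0.5)

implied <- implied_fused(lambda, beta, psi, theta, nu, alpha)
implied$mu
implied$sigma
```

A caller that needs only the mean can pass `need_sigma = FALSE` and skip the quadratic product entirely, which is the same "advertise what you need" contract the fast helpers expose.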

4. Matrix Work Is Rewritten Around the Shape Actually Needed

The PR adds small matrix helpers that show up repeatedly in hot paths:

| Helper | Why it matters |
| --- | --- |
| lav_matrix_diag_prepost() | Replaces explicit diag(d) %*% A %*% diag(d) construction. |
| lav_matrix_rowcol_idx() | Builds matrix-vector indices directly from row, column, and value triples. |
| lav_matrix_symmetric_inverse_chol_first() | Attempts Cholesky inversion before falling back to the existing symmetric inverse machinery. |
| lav_matrix_delta_A_delta() | Names and centralizes the Delta' A Delta information-matrix product. |

These are modest individually, but they remove unnecessary dense diagonal matrices, repeated index reconstruction, and general-purpose inverse calls in places where the matrix shape is already known.
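
Two of these ideas fit in a few lines of base R. The helpers below are hypothetical stand-ins written for this post, not the PR's functions, and the fallback uses plain `solve()` where lavaan would use its own symmetric inverse machinery.

```r
# Hypothetical helpers (not lavaan's code) illustrating two of the ideas.

# diag(d) %*% A %*% diag(d) scales entry (i, j) by d[i] * d[j], so an
# elementwise product with the outer product of d gives the same result
# without building two dense diagonal matrices.
diag_prepost <- function(A, d) {
  A * tcrossprod(d)
}

# Try a Cholesky-based inverse first; fall back to a general inverse when
# chol() fails because the matrix is not positive definite.
sym_inverse_chol_first <- function(S) {
  tryCatch(
    chol2inv(chol(S)),
    error = function(e) solve(S)
  )
}

set.seed(1)
A <- crossprod(matrix(rnorm(16), 4, 4)) + diag(4)   # symmetric, positive definite
d <- c(0.5, 1, 2, 4)

all.equal(diag_prepost(A, d), diag(d) %*% A %*% diag(d))
all.equal(sym_inverse_chol_first(A), solve(A))

# An indefinite symmetric matrix makes chol() fail, so the fallback is used.
indefinite <- matrix(c(1, 2, 2, 1), 2, 2)
all.equal(sym_inverse_chol_first(indefinite), solve(indefinite))
```

The elementwise version touches each entry of A once and allocates one matrix of the same size, whereas the explicit form allocates two extra dense diagonal matrices and performs two full matrix products.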

5. Gradients, Missing Data, and Two-Level Likelihoods Allocate Less

The gradient path now uses cached model-matrix group indices through the new mm.idx slot on lavModel. It also assembles free-parameter gradients more directly, caches conditional-x sample quantities, and factors lav_model_ddelta_dx() into setup, matrix, group, and many-target helpers.

Missing-data likelihood code now prepares pattern metadata once through lav_mvnorm_missing_prepare_samplestats() and stores that prepared object on sample statistics. Pattern likelihoods and derivatives reuse prepared indices, frequencies, means, and covariance summaries. The two-level multivariate normal derivative path similarly accumulates weighted derivative vectors directly instead of first allocating per-cluster derivative matrices and summing them afterward.

6. MIIV, PML, and Utility Paths Avoid General Work

The MIIV utilities replace repeated solve() calls and full duplication-matrix products with Cholesky solves, chol2inv(chol(...)), vech weights, and a helper that applies WLS weights as an operation rather than materializing the whole Kronecker-derived matrix. PML code now requests sigma, mu, and thresholds from lav_model_implied_fast() in one pass.
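
The post does not show the MIIV weight structure, so the sketch below illustrates only the two generic techniques named above, using identities that hold for any conformable matrices. It is not the PR's code, and `apply_kron_weight()` is a hypothetical name.

```r
# Generic identity, not the PR's code:
#   (B %x% A) vec(X) = vec(A X B')
# so a Kronecker-structured weight can be applied without forming it.
apply_kron_weight <- function(A, B, X) {
  as.vector(A %*% X %*% t(B))
}

set.seed(2)
A <- matrix(rnorm(9), 3, 3)
B <- matrix(rnorm(9), 3, 3)
X <- matrix(rnorm(9), 3, 3)

explicit <- as.vector(kronecker(B, A) %*% as.vector(X))   # forms a 9 x 9 matrix
implicit <- apply_kron_weight(A, B, X)                    # stays at 3 x 3
all.equal(explicit, implicit)

# Cholesky-based solve for a symmetric positive definite system S x = b.
S <- crossprod(matrix(rnorm(12), 4, 3)) + diag(3)
b <- rnorm(3)
R <- chol(S)                                   # S = R'R, R upper triangular
x_chol <- backsolve(R, forwardsolve(t(R), b))  # solve R'y = b, then R x = y
all.equal(x_chol, solve(S, b))
```

For p observed variables the explicit Kronecker product has p^2 rows and columns, so its memory grows with p^4. The operator form never leaves p-by-p matrices.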

Smaller changes follow the same pattern: lav_snake_case() uses a cached map and avoids regex work when names are already simple, option normalization uses chartr(), FMG test detection uses direct prefix checks for scalar input, version lookup is cached by lav_version(), and correlation checks avoid pairwise-complete handling when the numeric data contain no missing values.

Results

The headline result is not confined to one scenario. The matched suite improved across simple CFA, mediation, growth, multigroup, FIML, ordinal, two-level, and synthetic workloads.

[Figure: Aggregate benchmark reductions for the feature branch compared with master, across matched scenarios. Latency uses 25 matched scenarios; allocation uses the 23 scenarios with measurements in both runs.]

| Metric | Baseline average | V6 average | Aggregate reduction | Median per-scenario reduction |
| --- | --- | --- | --- | --- |
| Mean latency | 151.75 ms | 66.74 ms | 56.0% | 63.2% |
| Median latency | 146.13 ms | 60.44 ms | 58.6% | 66.5% |
| p95 latency | 196.27 ms | 100.78 ms | 48.7% | 53.5% |
| Allocated bytes | 15.87 MB | 11.13 MB | 29.9% | 39.0% |

The strongest median-latency improvements were concentrated in common ML and SEM workloads, where baseline construction, parameter table setup, implied moments, and gradient work are performed repeatedly.

[Figure: Top median-latency speedups for feature/fast-v6 versus master. Speedup is baseline median divided by V6 median.]

| Scenario | Baseline median | V6 median | Speedup | Median reduction |
| --- | --- | --- | --- | --- |
| cfa_holzinger_ml | 44.45 ms | 10.02 ms | 4.44x | 77.5% |
| bootstrap_mediation_path_sem_ml | 685.90 ms | 171.28 ms | 4.00x | 75.0% |
| mediation_path_sem_ml | 41.78 ms | 10.83 ms | 3.86x | 74.1% |
| growth_demo_ml | 65.55 ms | 17.31 ms | 3.79x | 73.6% |
| sample_cov_holzinger_ml | 43.06 ms | 12.14 ms | 3.55x | 71.8% |
| ri_clpm_synthetic_panel_ml | 75.12 ms | 21.38 ms | 3.51x | 71.5% |
| piecewise_growth_synthetic_ml | 63.97 ms | 18.72 ms | 3.42x | 70.7% |
| multigroup_holzinger_loadings | 92.27 ms | 27.35 ms | 3.37x | 70.4% |
| bifactor_synthetic_battery_ml | 47.31 ms | 14.03 ms | 3.37x | 70.3% |
| parallel_process_growth_ml | 55.41 ms | 16.73 ms | 3.31x | 69.8% |

There were 0 median-latency regressions in the matched scenario set. The smallest median-latency improvement was still material: the ordinal latent regression WLSMV scenario improved from 239.02 ms to 156.73 ms, a 34.4% reduction.

Mechanism

The profiler summaries help explain why a broad PR produced broad speedups. In both runs, the highest diagnostic scores remained around the same families of work: baseline model construction, gradient calculation, sample statistics, and optimization. But the latencies associated with those scenarios fell.

For example, fiml_holzinger_ml had a baseline median of 412.16 ms and a V6 median of 200.24 ms. twolevel_demo_ml moved from 342.44 ms to 220.71 ms. The PR did not erase these scenarios as bottlenecks, but it lowered their absolute cost. That is the signature of constant-factor engineering: the ranking of hard workloads can remain similar while every row gets cheaper.

The best-supported mechanism is not one isolated change. The PR removes repetition across several layers:

| Repeated work before the PR | PR-level response |
| --- | --- |
| Recompute partable-derived names and block metadata | Cache partable attributes and serve cached lav_partable_vnames() queries. |
| Build simple CFA partables through the general path | Add simple CFA fast paths for observed names and partable construction. |
| Ask separately for implied covariance, mean, thresholds, and slopes | Add lav_model_implied_fast() and representation-specific fused helpers. |
| Construct dense diagonal matrices for pre/post scaling | Replace with elementwise lav_matrix_diag_prepost(). |
| Materialize full WLS/duplication matrices in MIIV utilities | Apply vech weights and Cholesky solves directly. |
| Re-scan missing-data pattern objects inside likelihood derivatives | Prepare pattern metadata once and reuse it. |
| Allocate per-cluster derivative matrices before summing | Accumulate weighted derivatives directly. |

This pattern is visible in allocation measurements. Among scenarios with allocation data in both runs (23/25 scenarios), aggregate allocated bytes fell by 29.9%. The largest measured allocation reductions were:

| Scenario | Baseline allocation | V6 allocation | Reduction |
| --- | --- | --- | --- |
| growth_demo_ml | 7.80 MB | 3.13 MB | 59.9% |
| ri_clpm_synthetic_panel_ml | 10.05 MB | 4.53 MB | 55.0% |
| sample_cov_holzinger_ml | 2.23 MB | 1.17 MB | 47.6% |
| synthetic_large_cfa_ml | 57.63 MB | 30.23 MB | 47.5% |
| mimic_holzinger_covariates_ml | 11.35 MB | 6.34 MB | 44.2% |
| wisc_higher_order_cfa_ml | 3.26 MB | 1.85 MB | 43.4% |
| mediation_path_sem_ml | 0.91 MB | 0.52 MB | 42.6% |
| cfa_holzinger_ml | 1.61 MB | 0.94 MB | 41.8% |

The allocation result is especially useful because it is a mechanism check. A latency-only benchmark can be noisy; a latency improvement paired with lower allocation across many scenarios is stronger evidence that the internal work changed in the intended direction.

Code Receipts: Before and After

The benchmark numbers above are the headline, but the PR is easier to understand when the code changes are visible. These examples compare master to feature/fast-v6 and focus on changes inside lavaan/R/ that most clearly explain the latency reductions.

1. Baseline Fit: Try a Direct Independence Payload First

On master, the baseline step always reached for the general independence model fit when baseline tests were requested:

current_verbose <- lav_verbose()
lav_verbose(FALSE)
fit_indep <- try(lav_object_independence(
  object = NULL,
  lavsamplestats = lavsamplestats,
  lavdata = lavdata,
  lavcache = lavcache,
  lavoptions = lavoptions,
  lavpartable = lavpartable,
  lavh1 = lavh1
), silent = TRUE)
lav_verbose(current_verbose)
 
lavbaseline <- list(
  partable = fit_indep@ParTable,
  test = fit_indep@test
)

On feature/fast-v6, lav_lavaan_step15_baseline() first attempts a direct fast payload and only falls back to lav_object_independence() when the model shape is unsupported:

lavbaseline <- lav_lavaan_step15_baseline_fast(
  lavoptions = lavoptions,
  lavsamplestats = lavsamplestats,
  lavdata = lavdata,
  lavpartable = lavpartable,
  lavh1 = lavh1
)
if (!is.null(lavbaseline)) {
  if (lav_verbose()) {
    cat(" done.\n")
  }
  return(lavbaseline)
}
 
fit_indep <- try(lav_object_independence(...), silent = TRUE)

The new helper checks conservative ML eligibility, builds the independence partable from sample moments, and computes the standard chi-square payload directly. This is a good showcase because fit indices touch baseline models even when the user only asked for the main SEM fit. In the matched workbooks, cfa_holzinger_ml improved from 44.45 ms to 10.02 ms median latency (4.44x), while postfit_political_democracy_sem_ml improved from 153.10 ms to 67.22 ms (2.28x).

2. Simple CFA Setup: Skip General Partable Machinery

On master, observed-name discovery repeatedly called the general lav_partable_vnames() path for each name family:

attr(flat_model, "vnames") <- lav_partable_vnames(flat_model, type = "*")
ov_names <- unique(unlist(lav_partable_vnames(flat_model, type = "ov")))
ov_names_y <- unique(unlist(lav_partable_vnames(
  flat_model, type = "ov.nox"
)))
ov_names_x <- unique(unlist(lav_partable_vnames(flat_model, type = "ov.x")))
lv_names <- unique(unlist(lav_partable_vnames(flat_model, type = "lv")))

On feature/fast-v6, simple single-group CFA models get a narrow fast path, and the fallback computes the full vnames cache once before slicing it:

fast <- lav_lavaan_step01_ovnames_group_fast(flat_model, ngroups)
if (!is.null(fast)) {
  return(fast)
}
 
flat_vnames <- lav_partable_vnames(flat_model, type = "*")
attr(flat_model, "vnames") <- flat_vnames
ov_names <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "ov")
ov_names_y <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "ov.nox")
ov_names_x <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "ov.x")
lv_names <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "lv")

The same idea appears in step 4, where lav_lavaan_step04_partable_cfa_fast() constructs a standard CFA partable directly and returns before the general completion path:

lavpartable <- lav_lavaan_step04_partable_cfa_fast(
  slot_par_table = slot_par_table,
  flat_model = flat_model,
  lavoptions = lavoptions,
  lavdata = lavdata,
  constraints = constraints
)
if (!is.null(lavpartable)) {
  return(list(
    lavoptions = lavoptions,
    lavpartable = lavpartable
  ))
}

This removes repeated metadata scans from a very common modeling shape. The benchmark evidence lines up with that target: sample_cov_holzinger_ml moved from 43.06 ms to 12.14 ms median latency (3.55x), and synthetic_large_cfa_ml moved from 80.27 ms to 29.49 ms (2.72x) while allocation fell by 47.5%, from 57.63 MB to 30.23 MB as reported in the allocation table above.

3. Implied Moments: Compute Once, Then Share

On master, lav_model_implied() requested each model-implied quantity through a separate public helper:

sigma_hat <- lav_model_sigma(
  lavmodel = lavmodel, glist = glist,
  delta = delta
)
 
if (lavmodel@meanstructure) {
  mu_hat <- lav_model_mu(lavmodel = lavmodel, glist = glist)
}
if (lavmodel@conditional.x) {
  slopes <- lav_model_pi(lavmodel = lavmodel, glist = glist)
}
if (lavmodel@categorical) {
  th <- lav_model_th(lavmodel = lavmodel, glist = glist)
}

On feature/fast-v6, one request advertises exactly which pieces are needed and shares the representation-level work:

implied_fast <- lav_model_implied_fast(
  lavmodel = lavmodel,
  glist = glist,
  need_sigma = TRUE,
  need_mu = lavmodel@meanstructure,
  need_pi = lavmodel@conditional.x,
  need_th = lavmodel@categorical,
  delta = delta
)
 
sigma_hat <- implied_fast$sigma
mu_hat <- implied_fast$mu
slopes <- implied_fast$pi
th <- implied_fast$th

Inside the LISREL fast helper, the expensive shared product is computed once:

need_lambda_ib_inv <- (need_sigma && is.null(mm_wmat)) ||
  need_mu || need_th || need_pi
if (need_lambda_ib_inv) {
  if (is.null(mm_beta)) {
    lambda__ib_inv <- mm_lambda
  } else {
    ib_inv <- lav_lisrel_ibinv(mlist = mlist)
    lambda__ib_inv <- mm_lambda %*% ib_inv
  }
}

This helps ordinary ML fits because objective, gradient, bootstrap, PML, and simulation paths all ask for overlapping implied quantities. The broad effect shows up in nontrivial SEM scenarios: mediation_path_sem_ml improved from 41.78 ms to 10.83 ms (3.86x), growth_demo_ml improved from 65.55 ms to 17.31 ms (3.79x), and conditional_x_ml improved from 104.53 ms to 53.00 ms (1.97x).

4. Gradient Delta Work: Build Shared Setup Once

On master, lav_model_gradient_dd() rebuilt Delta plumbing separately for each target matrix:

delta_lambda <- lav_model_ddelta_dx(
  lavmodel, glist = g_list, target = "lambda"
)[[group]]
delta_tau <- lav_model_ddelta_dx(
  lavmodel, glist = g_list, target = "tau"
)[[group]]
delta_nu <- lav_model_ddelta_dx(
  lavmodel, glist = g_list, target = "nu"
)[[group]]
delta_theta <- lav_model_ddelta_dx(
  lavmodel, glist = g_list, target = "theta"
)[[group]]

On feature/fast-v6, the repeated setup is factored out and multiple Delta targets are produced from the same prepared indices:

deltas <- lav_model_ddelta_dx_many(
  lavmodel = lavmodel,
  glist = g_list,
  targets = c(
    "lambda", "tau", "nu", "theta",
    "beta", "psi", "alpha", "gamma"
  ),
  group = group
)
delta_lambda <- deltas$lambda
delta_tau <- deltas$tau
delta_nu <- deltas$nu
delta_theta <- deltas$theta

The lower-level setup now caches model-matrix indices, free-parameter indices, threshold indices, and the active equality-constraint shape once per group. That is especially relevant for gradient-heavy categorical and post-fit work. The WLSMV scenarios still remain heavier than simple ML, but they improved: ordinal_holzinger_wlsmv fell from 134.91 ms to 81.92 ms (1.65x), and ordinal_latent_regression_wlsmv fell from 239.02 ms to 156.73 ms (1.53x).

5. FIML Missing Patterns: Prepare Pattern Metadata Once

On master, missing-pattern likelihood code repeatedly reached into each pattern object and recomputed simple indexes:

pat_n <- length(yp)
p_1 <- length(yp[[1]]$var.idx)
 
for (p in seq_len(pat_n)) {
  var_idx <- yp[[p]]$var.idx
  na_idx <- which(!var_idx)
 
  p_log_2pi[p] <- sum(var_idx) * log_2pi * yp[[p]]$freq
  tt <- yp[[p]]$SY + tcrossprod(yp[[p]]$MY - mu[var_idx])
  dist_1[p] <- sum(sigma_inv * tt) * yp[[p]]$freq
}

On feature/fast-v6, sample statistics carry prepared missing-pattern metadata, and the likelihood avoids forming tcrossprod() only to multiply it by the inverse:

yp_prepared <- lav_mvnorm_missing_prepare_samplestats(yp)
pat_n <- yp_prepared$pat_n
 
for (p in seq_len(pat_n)) {
  var_idx <- yp_prepared$var.idx[[p]]
  na_idx <- yp_prepared$na.idx[[p]]
  freq <- yp_prepared$freq[p]
 
  p_log_2pi[p] <- yp_prepared$nvar[p] * log_2pi * freq
  diff_1 <- yp_prepared$MY[[p]] - mu[var_idx]
  dist_1[p] <- (
    sum(sigma_inv * yp_prepared$SY[[p]]) +
      sum(as.numeric(crossprod(diff_1, sigma_inv)) * diff_1)
  ) * freq
}

The preparation is attached when the missing sample statistics are built:

attr(yp, "prepared") <- lav_mvnorm_missing_prepare_samplestats(yp)

This is a clean FIML optimization because it does not change the likelihood; it changes how much list lookup, index reconstruction, and temporary matrix work is done per pattern. The FIML scenarios improved materially: fiml_holzinger_ml moved from 412.16 ms to 200.24 ms median latency (2.06x) with allocation falling from 42.92 MB to 30.55 MB, and fiml_political_democracy_sem_ml moved from 333.74 ms to 163.24 ms (2.04x) with allocation falling from 50.93 MB to 37.14 MB.
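
The claim that the likelihood is unchanged rests on a short identity: for symmetric matrices, tr(Sigma^-1 (S + d d')) equals sum(Sigma^-1 * S) plus the quadratic form d' Sigma^-1 d. The base R check below uses random stand-ins for one pattern's SY and MY - mu. It is a generic numerical check, not lavaan's code.

```r
# Generic check (not lavaan's code) that the rewritten distance term is the
# same number: tr(Sigma^-1 (S + d d')) = sum(Sigma^-1 * S) + d' Sigma^-1 d.
set.seed(3)
p <- 4
sigma <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)
sigma_inv <- solve(sigma)
sy <- crossprod(matrix(rnorm(p * p), p, p)) / p   # stand-in for a pattern's SY
d <- rnorm(p)                                     # stand-in for MY - mu

# Before: materialize the p x p outer product, then take the weighted sum.
before <- sum(sigma_inv * (sy + tcrossprod(d)))

# After: keep SY as is and evaluate the quadratic form from vectors only.
after <- sum(sigma_inv * sy) +
  sum(as.numeric(crossprod(d, sigma_inv)) * d)

all.equal(before, after)
```

The rewritten form replaces one p-by-p temporary per pattern with a length-p vector, which adds up when a dataset has many missingness patterns and the likelihood is evaluated at every optimizer step.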

6. Two-Level Likelihood: Accumulate Derivatives Directly

On master, the two-level derivative code allocated per-cluster matrices and then summed them at the end:

g_muy <- matrix(0, ncluster_sizes, length(mu_y))
g_sigma_w_1 <- matrix(0, ncluster_sizes, length(lav_matrix_vech(sigma_w_1)))
g_sigma_b_1 <- matrix(0, ncluster_sizes, length(lav_matrix_vech(sigma_b_1)))
 
for (clz in seq_len(ncluster_sizes)) {
  g_muy[clz, ] <- -2 * nj * as.numeric(yc %*% sigma_j_inv)
  g_sigma_w_1[clz, ] <- lav_matrix_vech(tmp)
  g_sigma_b_1[clz, ] <- lav_matrix_vech(tmp)
}
 
d_mu_y <- colSums(g_muy * cluster_size_ns)
d_sigma_w1 <- lav_matrix_vech_reverse(colSums(
  g_sigma_w_1 * cluster_size_ns
))

On feature/fast-v6, the same quantities are accumulated directly into weighted derivative vectors:

d_mu_y <- numeric(length(mu_y))
d_sigma_w1_vech <- numeric(length(lav_matrix_vech(sigma_w_1)))
d_sigma_b_vech <- numeric(length(lav_matrix_vech(sigma_b_1)))
 
for (clz in seq_len(ncluster_sizes)) {
  cluster_weight <- cluster_size_ns[clz]
 
  d_mu_y <- d_mu_y +
    cluster_weight * (-2 * nj * as.numeric(yc %*% sigma_j_inv))
  d_sigma_w1_vech <- d_sigma_w1_vech +
    cluster_weight * lav_matrix_vech(tmp)
  d_sigma_b_vech <- d_sigma_b_vech +
    cluster_weight * lav_matrix_vech(tmp)
}
 
d_sigma_w1 <- lav_matrix_vech_reverse(d_sigma_w1_vech)
d_sigma_b <- lav_matrix_vech_reverse(d_sigma_b_vech)

This is the most literal allocation reduction in the set: the code no longer needs a cluster-by-parameter matrix just to immediately collapse it. The two-level scenario improved from 342.44 ms to 220.71 ms median latency (1.55x), and allocation fell from 28.30 MB to 24.26 MB.
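
The equivalence of the two shapes can be checked on toy data. In this base R sketch, `per_size_grad()` is a hypothetical stand-in for whatever derivative vector a given cluster size contributes. It is not lavaan's code.

```r
# Generic sketch (not lavaan's code): weighted sum of per-cluster-size
# derivative vectors, computed two ways.
n_sizes <- 5
n_par <- 3
weights <- c(10, 4, 7, 2, 1)                           # stand-in for cluster_size_ns
per_size_grad <- function(i) sin(i * seq_len(n_par))   # any deterministic vector

# Before: fill a (sizes x parameters) matrix, then collapse it.
g <- matrix(0, n_sizes, n_par)
for (i in seq_len(n_sizes)) {
  g[i, ] <- per_size_grad(i)
}
collapsed <- colSums(g * weights)   # recycling multiplies row i by weights[i]

# After: accumulate directly into one vector of length n_par.
accumulated <- numeric(n_par)
for (i in seq_len(n_sizes)) {
  accumulated <- accumulated + weights[i] * per_size_grad(i)
}

all.equal(collapsed, accumulated)
```

Both versions do the same arithmetic. The second simply never holds more than one length-`n_par` accumulator per quantity, so its memory no longer scales with the number of distinct cluster sizes.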

Full Matched Scenario Table

| Scenario | Baseline median | V6 median | Speedup | Median reduction | Allocation reduction |
| --- | --- | --- | --- | --- | --- |
| cfa_holzinger_ml | 44.45 ms | 10.02 ms | 4.44x | 77.5% | 41.8% |
| bootstrap_mediation_path_sem_ml | 685.90 ms | 171.28 ms | 4.00x | 75.0% | n/a |
| mediation_path_sem_ml | 41.78 ms | 10.83 ms | 3.86x | 74.1% | 42.6% |
| growth_demo_ml | 65.55 ms | 17.31 ms | 3.79x | 73.6% | 59.9% |
| sample_cov_holzinger_ml | 43.06 ms | 12.14 ms | 3.55x | 71.8% | 47.6% |
| ri_clpm_synthetic_panel_ml | 75.12 ms | 21.38 ms | 3.51x | 71.5% | 55.0% |
| piecewise_growth_synthetic_ml | 63.97 ms | 18.72 ms | 3.42x | 70.7% | 38.4% |
| multigroup_holzinger_loadings | 92.27 ms | 27.35 ms | 3.37x | 70.4% | 37.5% |
| bifactor_synthetic_battery_ml | 47.31 ms | 14.03 ms | 3.37x | 70.3% | 38.7% |
| parallel_process_growth_ml | 55.41 ms | 16.73 ms | 3.31x | 69.8% | 41.7% |
| apim_verbal_performance_sem_ml | 44.54 ms | 13.70 ms | 3.25x | 69.2% | 34.8% |
| multigroup_holzinger_invariance_postfit | 269.04 ms | 89.31 ms | 3.01x | 66.8% | 39.0% |
| sem_political_democracy_ml | 56.99 ms | 19.12 ms | 2.98x | 66.5% | 40.5% |
| wisc_higher_order_cfa_ml | 52.79 ms | 17.99 ms | 2.93x | 65.9% | 43.4% |
| mimic_holzinger_covariates_ml | 92.41 ms | 32.17 ms | 2.87x | 65.2% | 44.2% |
| mtmm_trait_method_sample_cov_ml | 62.73 ms | 22.24 ms | 2.82x | 64.5% | 39.7% |
| synthetic_large_cfa_ml | 80.27 ms | 29.49 ms | 2.72x | 63.3% | 47.5% |
| robust_clinical_path_mlr | 59.69 ms | 24.16 ms | 2.47x | 59.5% | 20.2% |
| postfit_political_democracy_sem_ml | 153.10 ms | 67.22 ms | 2.28x | 56.1% | n/a |
| fiml_holzinger_ml | 412.16 ms | 200.24 ms | 2.06x | 51.4% | 28.8% |
| fiml_political_democracy_sem_ml | 333.74 ms | 163.24 ms | 2.04x | 51.1% | 27.1% |
| conditional_x_ml | 104.53 ms | 53.00 ms | 1.97x | 49.3% | 18.8% |
| ordinal_holzinger_wlsmv | 134.91 ms | 81.92 ms | 1.65x | 39.3% | 5.6% |
| twolevel_demo_ml | 342.44 ms | 220.71 ms | 1.55x | 35.5% | 14.3% |
| ordinal_latent_regression_wlsmv | 239.02 ms | 156.73 ms | 1.53x | 34.4% | 2.4% |

Threats to Validity

The evidence is strong enough to support a performance claim, but it has boundaries.

First, this is a run-pair comparison on one environment. Both runs used R 4.6.0 and the same benchmark harness shape, which helps comparability, but the result should still be replicated before treating the exact percentages as universal.

Second, the PR is broad. Because many fast paths landed together, the benchmark does not identify how much of the total speedup belongs to baseline-model construction versus implied moments versus matrix helpers versus missing-data preparation. The evidence supports the combined intervention.

Third, the benchmark scenarios are representative, not exhaustive. The fast paths deliberately fall back to existing general code for unsupported cases. That is good engineering, but it means the measured benefit will depend on how often real user workloads hit the newly supported shapes.

Fourth, profiler sampling has finite resolution. The transcript is useful for identifying major hotspots and diagnostic rank changes, but the strongest claims in this post come from matched latency and allocation tables rather than from individual sampled call stacks.

Conclusion

This is a good example of performance work that respects both a mature codebase and the future direction of lavaan. It does not replace lavaan's modeling language or statistical estimators, so the public API remains unchanged. What it does change is a good deal of internal structure: cache what the code already knows, compute related implied quantities together, avoid dense intermediates, and reserve the fully general path for models that need it.

The measured result is large. Across 25 matched scenarios, median latency fell from an average of 146.13 ms to 60.44 ms, with no median-latency regressions in the matched suite. Allocation also moved in the right direction, falling by 29.9% in aggregate across scenarios where both runs recorded allocation data.

For SEM users, this should show up as a nicer feedback loop: fits return sooner, bootstrap and post-fit workflows spend less time on repeated internal work, and common model families benefit without asking the user to write different model syntax. For a PR, that is the sweet spot: the same interface, less waiting.