Fast Paths in lavaan: A Performance Study
A research-style account of a lavaan performance PR, using matched profiling runs to connect internal fast paths with measured latency and allocation reductions.

Abstract
This post studies my PR
N3uralN3twork/lavaan#4,
feature/fast-v6, as a performance intervention in the lavaan structural
equation modeling library. The largest change is not a single algorithmic
replacement, but a set of coordinated fast paths: direct baseline-model
construction, cached partable metadata, unified model-implied moment
calculation, cheaper matrix operations, lower-allocation missing-data
likelihood code, and more efficient MIIV/PML linear algebra.
I evaluated the code from the PR against the master branch using a benchmark suite of diverse scenarios.
| Run | Branch | lavaan version | Scenarios | Timed iterations | Timing context |
|---|---|---|---|---|---|
| baseline | master | 0.6.22.2653 | 25 | 1000 each | isolated authoritative |
| V6-Baseline | feature/fast-v6 | 0.6.22.2658 | 25 | 1000 each | isolated authoritative |
Across the 25 matched scenarios, the PR reduced aggregate mean latency by 56.0%, aggregate median latency by 58.6%, and aggregate p95 latency by 48.7%. All matched scenarios improved on median latency. Among the scenarios with allocation measurements in both runs, aggregate allocation fell by 29.9%.
Research Question
The practical question was:
Can a broad internal fast-path PR reduce routine
lavaan fitting overhead across diverse SEM scenarios without narrowing the public modeling surface?
That question matters because SEM software spends much of its time in repeated bookkeeping: building parameter tables, deriving observed and latent variable names, producing model-implied covariance and mean structures, differentiating those implied quantities, and constructing baseline models for fit indices. If those operations allocate too much or recompute the same internal state, a user-facing model can feel slow even when the statistical work is modest.
The PR attacks that cost by preserving the existing interface and changing the shape of internal work. It favors cached metadata, direct construction for common cases, fused calculations, and linear algebra helpers that avoid materializing large intermediate matrices.
Methods
The evidence comes from the profiling artifacts in:
FastLavaan\benchmark\runs\baseline
FastLavaan\benchmark\runs\V6-Baseline

Each run contains a profiling-summary.xlsx workbook and a self-contained
transcript.html. The benchmark harness used callr, ran each scenario in an
isolated authoritative timing context, used orchestration workers, and
measured 1000 timed iterations after warmup iterations per scenario.
The analysis below compares only scenario labels present in both runs. For a
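scenario s, the median speedup is the baseline median latency divided by the V6 median latency:
$$\text{speedup}_s = \frac{\operatorname{median}_{\text{baseline}}(s)}{\operatorname{median}_{\text{V6}}(s)}$$
The aggregate reduction for a latency metric m is computed from the sum of matched scenario latencies over the matched set S:
$$\text{reduction}_m = 1 - \frac{\sum_{s \in S} m_{\text{V6}}(s)}{\sum_{s \in S} m_{\text{baseline}}(s)}$$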
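As a quick arithmetic check, both quantities can be computed in base R from a few rows of the median-latency tables in this post. The data frame and its column names are illustrative and are not the harness's schema. Only three scenarios are included, so the aggregate printed here describes this subset, not the full suite.

```r
# Median latencies (ms) for three matched scenarios, copied from this post.
# Column names are illustrative, not the benchmark harness's schema.
runs <- data.frame(
  scenario = c("cfa_holzinger_ml", "fiml_holzinger_ml", "twolevel_demo_ml"),
  baseline_ms = c(44.45, 412.16, 342.44),
  v6_ms = c(10.02, 200.24, 220.71),
  stringsAsFactors = FALSE
)

# Per-scenario speedup: baseline median divided by V6 median.
runs$speedup <- runs$baseline_ms / runs$v6_ms

# Per-scenario reduction: share of the baseline latency that disappeared.
runs$reduction <- 1 - runs$v6_ms / runs$baseline_ms

# Aggregate reduction over the subset: one minus the ratio of summed latencies.
aggregate_reduction <- 1 - sum(runs$v6_ms) / sum(runs$baseline_ms)

print(round(runs$speedup, 2))
print(round(100 * aggregate_reduction, 1))
```

Because the aggregate is a ratio of sums, slow scenarios weigh more than fast ones, which is why it can differ from the median of the per-scenario reductions.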
This is intentionally a benchmark comparison, not a proof that every individual code edit caused a specific share of the speedup.
Intervention
The PR changes can be read as six coordinated interventions.
1. Baseline Models Stop Taking the Scenic Route
The new file R/lav_lavaan_baseline_utils.R adds a direct path for standard ML
independence baseline models. It validates whether the model is eligible,
constructs the baseline partable from sample moments, computes the baseline
objective from log determinants, and returns the standard chi-square test
without forcing the full baseline model through the same expensive path as a
general SEM.
There are separate payload builders for simple single-group ML,
conditional-x ML, and generated independence models. The key design is
conservative: return NULL when the assumptions are not met, and let the
existing lavaan machinery handle unsupported cases.
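The "eligible or NULL" contract can be sketched in a few lines of base R. This is not the PR's helper: baseline_chisq_fast() and its arguments are hypothetical. The sketch assumes a single-group, complete-data, covariance-only ML model, where the independence model reproduces the sample variances exactly and the discrepancy reduces to a difference of log determinants. Scaling by N rather than N - 1 is also an assumption.

```r
# Hypothetical helper for illustration; this is not lavaan's implementation.
baseline_chisq_fast <- function(S, N) {
  # Eligibility: a complete, square, symmetric sample covariance matrix.
  if (!is.matrix(S) || nrow(S) != ncol(S) || anyNA(S)) {
    return(NULL)
  }
  if (!isSymmetric(unname(S))) {
    return(NULL)
  }
  # Eligibility: positive definite. chol() fails otherwise, so signal
  # "unsupported" with NULL instead of raising an error.
  R <- tryCatch(chol(S), error = function(e) NULL)
  if (is.null(R)) {
    return(NULL)
  }
  p <- nrow(S)
  log_det_s <- 2 * sum(log(diag(R)))   # log|S| from the Cholesky factor
  log_det_indep <- sum(log(diag(S)))   # log|diag(S)|
  # Under independence Sigma = diag(S), so tr(S Sigma^-1) - p is zero and the
  # ML discrepancy is the difference of the two log determinants.
  f_min <- log_det_indep - log_det_s
  list(chisq = N * f_min, df = p * (p - 1) / 2)
}

S <- matrix(c(1.0, 0.5, 0.5, 1.0), nrow = 2)
fast <- baseline_chisq_fast(S, N = 100)

# A caller would fall back to the general path whenever the result is NULL.
not_pd <- matrix(c(1, 2, 2, 1), nrow = 2)
fallback_needed <- is.null(baseline_chisq_fast(not_pd, N = 100))
```

Returning NULL instead of raising an error keeps the decision with the caller, which can then run the existing general fit unchanged.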
2. Partables and Variable Names Become Cacheable Evidence
Several changes reduce repeated scans over the parameter table:
| Area | Change |
|---|---|
| Observed-variable discovery | lav_lavaan_step01_ovnames_group_fast() handles simple single-group CFA models directly. |
| Partable construction | lav_lavaan_step04_partable_cfa_fast() constructs a common CFA partable without walking the full general machinery. |
| Cached attributes | lav_partable_set_cache() now updates attributes in bulk, and lav_partable_remove_cache() keeps only the intended attributes. |
| Block lookup | lav_partable_block_values() can reuse block.values. |
| Name lookup | lav_partable_vnames() can serve cached block, group, and level queries. |
| Model setup | lav_model() detects already-complete cached partables and avoids redundant completion work. |
This matters because variable-name and partable operations are classic constant-factor costs. They are individually small, but repeated in almost every model construction, baseline construction, post-check, and downstream inspection path.
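The caching idea can be shown on a toy parameter table. Everything in the sketch is hypothetical: scan_vnames(), set_vnames_cache(), and get_vnames() are illustrative names, and treating every right-hand side of a loading as an observed variable is a simplification that would not hold for higher-order models.

```r
# Hypothetical illustration; none of these functions are lavaan's own.
# A tiny parameter table: two factors with three indicators each.
partable <- data.frame(
  lhs = rep(c("visual", "textual"), each = 3),
  op = "=~",
  rhs = paste0("x", 1:6),
  stringsAsFactors = FALSE
)

scan_count <- 0L

# Slow path: scan the table every time a name family is requested.
scan_vnames <- function(pt) {
  scan_count <<- scan_count + 1L
  is_loading <- pt$op == "=~"
  list(
    lv = unique(pt$lhs[is_loading]),
    ov = unique(pt$rhs[is_loading])
  )
}

# Attach the full result once as an attribute of the table.
set_vnames_cache <- function(pt) {
  attr(pt, "vnames") <- scan_vnames(pt)
  pt
}

# Serve from the cache when present, otherwise fall back to a scan.
get_vnames <- function(pt, type) {
  cache <- attr(pt, "vnames")
  if (is.null(cache)) {
    cache <- scan_vnames(pt)
  }
  cache[[type]]
}

partable <- set_vnames_cache(partable)
lv_names <- get_vnames(partable, "lv")
ov_names <- get_vnames(partable, "ov")
print(scan_count)
```

The counter shows the point of the design: after the cache is attached, further name lookups do not trigger another scan.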
3. Implied Moments Are Computed Once, Then Shared
Before this PR, many callers asked separately for model-implied covariance
matrices, means, thresholds, and slopes. The PR introduces
lav_model_implied_fast() and state-aware use of the same idea throughout the
objective, gradient, bootstrap, simulation, and PML paths.
The LISREL and RAM representations gained specialized helpers:
| Representation | New fast helper | Purpose |
|---|---|---|
| LISREL | lav_lisrel_implied_fast() | Computes requested sigma, mu, thresholds, and slopes while reusing shared matrix products. |
| LISREL | lav_lisrel_vy_diag() | Computes only the observed variance diagonal when the full covariance is unnecessary. |
| LISREL | lav_lisrel_beta_dag_order() | Detects DAG orderings that make (I - BETA)^-1 cheaper. |
| RAM | lav_ram_implied_fast() | Computes requested observed covariance and mean quantities with one inverse. |
| RAM | lav_ram_sigmahat_diag() | Computes covariance diagonals without constructing the whole observed covariance. |
The same fast implied-moment path is then used in bootstrap transformations, old simulation code, pairwise likelihood ratio testing, objective evaluation, and gradient evaluation.
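The diagonal-only helpers rest on a standard identity: the diagonal of A P A' equals the row sums of (A P) multiplied elementwise by A, so the full product never has to be formed. Here is a minimal base-R sketch using made-up matrices rather than a lavaan model. The variable names are illustrative only.

```r
set.seed(1)
n_obs <- 6   # observed variables
n_lat <- 2   # latent variables

# Stand-ins for a loading-like matrix, a latent covariance, and residual variances.
A <- matrix(rnorm(n_obs * n_lat), n_obs, n_lat)
P <- crossprod(matrix(rnorm(n_lat * n_lat), n_lat, n_lat))
theta_diag <- runif(n_obs, 0.2, 1)

# Full route: build the whole n_obs x n_obs covariance, then keep the diagonal.
sigma_full <- A %*% P %*% t(A) + diag(theta_diag)
diag_from_full <- diag(sigma_full)

# Diagonal-only route: only n_obs x n_lat temporaries are created.
diag_direct <- rowSums((A %*% P) * A) + theta_diag

print(all.equal(diag_from_full, diag_direct))
```

The saving grows with the number of observed variables, since the full route allocates an n_obs by n_obs matrix that is discarded immediately.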
4. Matrix Work Is Rewritten Around the Shape Actually Needed
The PR adds small matrix helpers that show up repeatedly in hot paths:
| Helper | Why it matters |
|---|---|
| lav_matrix_diag_prepost() | Replaces explicit diag(d) %*% A %*% diag(d) construction. |
| lav_matrix_rowcol_idx() | Builds matrix-vector indices directly from row, column, and value triples. |
| lav_matrix_symmetric_inverse_chol_first() | Attempts Cholesky inversion before falling back to the existing symmetric inverse machinery. |
| lav_matrix_delta_A_delta() | Names and centralizes the Delta' A Delta information-matrix product. |
These are modest individually, but they remove unnecessary dense diagonal matrices, repeated index reconstruction, and general-purpose inverse calls in places where the matrix shape is already known.
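Two of these ideas are easy to reproduce in base R. The functions below are hypothetical stand-ins written for this post, not the PR's helpers, and they assume a square matrix and a scaling vector of matching length.

```r
# Hypothetical stand-ins for the helpers described above, not the PR's code.

# Scale rows and columns of a square matrix by d, i.e. d_i * A_ij * d_j,
# without building diag(d) or doing two matrix multiplications.
diag_prepost <- function(A, d) {
  d * A * rep(d, each = nrow(A))
}

# Try the Cholesky route first; fall back to a general solve when the
# matrix is not positive definite.
symmetric_inverse_chol_first <- function(S) {
  R <- tryCatch(chol(S), error = function(e) NULL)
  if (is.null(R)) {
    return(solve(S))
  }
  chol2inv(R)
}

set.seed(2)
A <- crossprod(matrix(rnorm(40), 10, 4)) + diag(4)   # symmetric positive definite
d <- 1 / sqrt(diag(A))

scaled_dense <- diag(d) %*% A %*% diag(d)
scaled_fast <- diag_prepost(A, d)

A_inv <- symmetric_inverse_chol_first(A)

indefinite <- matrix(c(1, 2, 2, 1), 2)               # symmetric, not positive definite
indefinite_inv <- symmetric_inverse_chol_first(indefinite)

print(all.equal(scaled_dense, scaled_fast))
```

With d set to the inverse standard deviations, the scaled matrix is the correlation matrix, which is the typical use of this kind of pre/post scaling.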
5. Gradients, Missing Data, and Two-Level Likelihoods Allocate Less
The gradient path now uses cached model-matrix group indices through the new
mm.idx slot on lavModel. It also assembles free-parameter gradients more
directly, caches conditional-x sample quantities, and factors
lav_model_ddelta_dx() into setup, matrix, group, and many-target helpers.
Missing-data likelihood code now prepares pattern metadata once through
lav_mvnorm_missing_prepare_samplestats() and stores that prepared object on
sample statistics. Pattern likelihoods and derivatives reuse prepared indices,
frequencies, means, and covariance summaries. The two-level multivariate normal
derivative path similarly accumulates weighted derivative vectors directly
instead of first allocating per-cluster derivative matrices and summing them
afterward.
6. MIIV, PML, and Utility Paths Avoid General Work
The MIIV utilities replace repeated solve() calls and full duplication-matrix
products with Cholesky solves, chol2inv(chol(...)), vech weights, and a helper
that applies WLS weights as an operation rather than materializing the whole
Kronecker-derived matrix. PML code now requests sigma, mu, and thresholds
from lav_model_implied_fast() in one pass.
Smaller changes follow the same pattern: lav_snake_case() uses a cached map
and avoids regex work when names are already simple, option normalization uses
chartr(), FMG test detection uses direct prefix checks for scalar input,
version lookup is cached by lav_version(), and correlation checks avoid
pairwise-complete handling when the numeric data contain no missing values.
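The MIIV changes combine two general linear algebra habits, sketched below in base R. This shows the underlying identities, not the PR's helper: the matrices are made up, and whether the actual WLS weight has exactly this Kronecker structure is an assumption. The first identity is (B %x% A) vec(X) = vec(A X B'), and the second replaces an explicit inverse with two triangular solves.

```r
set.seed(3)
p <- 4
W <- crossprod(matrix(rnorm(10 * p), 10, p)) + diag(p)   # symmetric positive definite
X <- matrix(rnorm(p * p), p, p)

# Materialized route: build the p^2 x p^2 Kronecker product explicitly.
big <- kronecker(W, W)
y_dense <- as.vector(big %*% as.vector(X))

# Operator route: (W %x% W) vec(X) = vec(W X W'), so only p x p matrices appear.
y_operator <- as.vector(W %*% X %*% t(W))

# Cholesky solve: W z = b through the factor W = R'R instead of solve(W) %*% b.
b <- rnorm(p)
R <- chol(W)                                    # R is upper triangular
z_chol <- backsolve(R, forwardsolve(t(R), b))   # solve R'y = b, then R z = y
z_inverse <- as.vector(solve(W) %*% b)

print(all.equal(y_dense, y_operator))
print(all.equal(z_chol, z_inverse))
```

For p observed variables the materialized matrix has p^4 entries, so the operator form matters long before p becomes large.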
Results
The headline result is not confined to one scenario. The matched suite improved across simple CFA, mediation, growth, multigroup, FIML, ordinal, two-level, and synthetic workloads.
| Metric | Baseline average | V6 average | Aggregate reduction | Median per-scenario reduction |
|---|---|---|---|---|
| Mean latency | 151.75 ms | 66.74 ms | 56.0% | 63.2% |
| Median latency | 146.13 ms | 60.44 ms | 58.6% | 66.5% |
| p95 latency | 196.27 ms | 100.78 ms | 48.7% | 53.5% |
| Allocated bytes | 15.87 MB | 11.13 MB | 29.9% | 39.0% |
The strongest median-latency improvements were concentrated in common ML and SEM workloads where baseline construction, parameter table setup, implied moments, and gradient work are done repeatedly.
| Scenario | Baseline median | V6 median | Speedup | Median reduction |
|---|---|---|---|---|
| cfa_holzinger_ml | 44.45 ms | 10.02 ms | 4.44x | 77.5% |
| bootstrap_mediation_path_sem_ml | 685.90 ms | 171.28 ms | 4.00x | 75.0% |
| mediation_path_sem_ml | 41.78 ms | 10.83 ms | 3.86x | 74.1% |
| growth_demo_ml | 65.55 ms | 17.31 ms | 3.79x | 73.6% |
| sample_cov_holzinger_ml | 43.06 ms | 12.14 ms | 3.55x | 71.8% |
| ri_clpm_synthetic_panel_ml | 75.12 ms | 21.38 ms | 3.51x | 71.5% |
| piecewise_growth_synthetic_ml | 63.97 ms | 18.72 ms | 3.42x | 70.7% |
| multigroup_holzinger_loadings | 92.27 ms | 27.35 ms | 3.37x | 70.4% |
| bifactor_synthetic_battery_ml | 47.31 ms | 14.03 ms | 3.37x | 70.3% |
| parallel_process_growth_ml | 55.41 ms | 16.73 ms | 3.31x | 69.8% |
There were no median-latency regressions in the matched scenario set. The smallest median-latency improvement was still material: the ordinal latent regression WLSMV scenario improved from 239.02 ms to 156.73 ms, a 34.4% reduction.
Mechanism
The profiler summaries help explain why a broad PR produced broad speedups. In both runs, the highest diagnostic scores remained around the same families of work: baseline model construction, gradient calculation, sample statistics, and optimization. But the latencies associated with those scenarios fell.
For example, fiml_holzinger_ml had a baseline median of 412.16 ms and a V6
median of 200.24 ms. twolevel_demo_ml moved from 342.44 ms to 220.71 ms. The
PR did not erase these scenarios as bottlenecks, but it lowered their absolute
cost. That is the signature of constant-factor engineering: the ranking of hard
workloads can remain similar while every row gets cheaper.
The best-supported mechanism is not one isolated change. The PR removes repetition across several layers:
| Repeated work before the PR | PR-level response |
|---|---|
| Recompute partable-derived names and block metadata | Cache partable attributes and serve cached lav_partable_vnames() queries. |
| Build simple CFA partables through the general path | Add simple CFA fast paths for observed names and partable construction. |
| Ask separately for implied covariance, mean, thresholds, and slopes | Add lav_model_implied_fast() and representation-specific fused helpers. |
| Construct dense diagonal matrices for pre/post scaling | Replace with elementwise lav_matrix_diag_prepost(). |
| Materialize full WLS/duplication matrices in MIIV utilities | Apply vech weights and Cholesky solves directly. |
| Re-scan missing-data pattern objects inside likelihood derivatives | Prepare pattern metadata once and reuse it. |
| Allocate per-cluster derivative matrices before summing | Accumulate weighted derivatives directly. |
This pattern is visible in allocation measurements. Among scenarios with allocation data in both runs (23/25 scenarios), aggregate allocated bytes fell by 29.9%. The largest measured allocation reductions were:
| Scenario | Baseline allocation | V6 allocation | Reduction |
|---|---|---|---|
| growth_demo_ml | 7.80 MB | 3.13 MB | 59.9% |
| ri_clpm_synthetic_panel_ml | 10.05 MB | 4.53 MB | 55.0% |
| sample_cov_holzinger_ml | 2.23 MB | 1.17 MB | 47.6% |
| synthetic_large_cfa_ml | 57.63 MB | 30.23 MB | 47.5% |
| mimic_holzinger_covariates_ml | 11.35 MB | 6.34 MB | 44.2% |
| wisc_higher_order_cfa_ml | 3.26 MB | 1.85 MB | 43.4% |
| mediation_path_sem_ml | 0.91 MB | 0.52 MB | 42.6% |
| cfa_holzinger_ml | 1.61 MB | 0.94 MB | 41.8% |
The allocation result is especially useful because it is a mechanism check. A latency-only benchmark can be noisy; a latency improvement paired with lower allocation across many scenarios is stronger evidence that the internal work changed in the intended direction.
Code Receipts: Before and After
The benchmark numbers above are the headline, but the PR is easier to
understand when the code changes are visible. These examples compare
master to feature/fast-v6 and focus on changes inside lavaan/R/ that
most clearly explain the latency reductions.
1. Baseline Fit: Try a Direct Independence Payload First
On master, the baseline step always reached for the general independence
model fit when baseline tests were requested:

    current_verbose <- lav_verbose()
    lav_verbose(FALSE)
    fit_indep <- try(lav_object_independence(
      object = NULL,
      lavsamplestats = lavsamplestats,
      lavdata = lavdata,
      lavcache = lavcache,
      lavoptions = lavoptions,
      lavpartable = lavpartable,
      lavh1 = lavh1
    ), silent = TRUE)
    lav_verbose(current_verbose)
    lavbaseline <- list(
      partable = fit_indep@ParTable,
      test = fit_indep@test
    )

On feature/fast-v6, lav_lavaan_step15_baseline() first attempts a direct
fast payload and only falls back to lav_object_independence() when the model
shape is unsupported:

    lavbaseline <- lav_lavaan_step15_baseline_fast(
      lavoptions = lavoptions,
      lavsamplestats = lavsamplestats,
      lavdata = lavdata,
      lavpartable = lavpartable,
      lavh1 = lavh1
    )
    if (!is.null(lavbaseline)) {
      if (lav_verbose()) {
        cat(" done.\n")
      }
      return(lavbaseline)
    }
    fit_indep <- try(lav_object_independence(...), silent = TRUE)

The new helper checks conservative ML eligibility, builds the independence
partable from sample moments, and computes the standard chi-square payload
directly. This is a good showcase because fit indices touch baseline models
even when the user only asked for the main SEM fit. In the matched workbooks,
cfa_holzinger_ml improved from 44.45 ms to 10.02 ms median latency
(4.44x), while postfit_political_democracy_sem_ml improved from
153.10 ms to 67.22 ms (2.28x).
2. Simple CFA Setup: Skip General Partable Machinery
On master, observed-name discovery repeatedly called the general
lav_partable_vnames() path for each name family:

    attr(flat_model, "vnames") <- lav_partable_vnames(flat_model, type = "*")
    ov_names <- unique(unlist(lav_partable_vnames(flat_model, type = "ov")))
    ov_names_y <- unique(unlist(lav_partable_vnames(
      flat_model, type = "ov.nox"
    )))
    ov_names_x <- unique(unlist(lav_partable_vnames(flat_model, type = "ov.x")))
    lv_names <- unique(unlist(lav_partable_vnames(flat_model, type = "lv")))

On feature/fast-v6, simple single-group CFA models get a narrow fast path,
and the fallback computes the full vnames cache once before slicing it:

    fast <- lav_lavaan_step01_ovnames_group_fast(flat_model, ngroups)
    if (!is.null(fast)) {
      return(fast)
    }
    flat_vnames <- lav_partable_vnames(flat_model, type = "*")
    attr(flat_model, "vnames") <- flat_vnames
    ov_names <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "ov")
    ov_names_y <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "ov.nox")
    ov_names_x <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "ov.x")
    lv_names <- lav_lavaan_step01_ovnames_from_vnames(flat_vnames, "lv")

The same idea appears in step 4, where lav_lavaan_step04_partable_cfa_fast()
constructs a standard CFA partable directly and returns before the general
completion path:

    lavpartable <- lav_lavaan_step04_partable_cfa_fast(
      slot_par_table = slot_par_table,
      flat_model = flat_model,
      lavoptions = lavoptions,
      lavdata = lavdata,
      constraints = constraints
    )
    if (!is.null(lavpartable)) {
      return(list(
        lavoptions = lavoptions,
        lavpartable = lavpartable
      ))
    }

This removes repeated metadata scans from a very common modeling shape. The
benchmark evidence lines up with that target: sample_cov_holzinger_ml moved
from 43.06 ms to 12.14 ms median latency (3.55x), and
synthetic_large_cfa_ml moved from 80.27 ms to 29.49 ms (2.72x) while
allocation fell from 57.63 MB to 30.23 MB (47.5%).
3. Implied Moments: Compute Related Quantities Together
On master, lav_model_implied() requested each model-implied quantity
through a separate public helper:

    sigma_hat <- lav_model_sigma(
      lavmodel = lavmodel, glist = glist,
      delta = delta
    )
    if (lavmodel@meanstructure) {
      mu_hat <- lav_model_mu(lavmodel = lavmodel, glist = glist)
    }
    if (lavmodel@conditional.x) {
      slopes <- lav_model_pi(lavmodel = lavmodel, glist = glist)
    }
    if (lavmodel@categorical) {
      th <- lav_model_th(lavmodel = lavmodel, glist = glist)
    }

On feature/fast-v6, one request advertises exactly which pieces are needed
and shares the representation-level work:

    implied_fast <- lav_model_implied_fast(
      lavmodel = lavmodel,
      glist = glist,
      need_sigma = TRUE,
      need_mu = lavmodel@meanstructure,
      need_pi = lavmodel@conditional.x,
      need_th = lavmodel@categorical,
      delta = delta
    )
    sigma_hat <- implied_fast$sigma
    mu_hat <- implied_fast$mu
    slopes <- implied_fast$pi
    th <- implied_fast$th

Inside the LISREL fast helper, the expensive shared product is computed once:

    need_lambda_ib_inv <- (need_sigma && is.null(mm_wmat)) ||
      need_mu || need_th || need_pi
    if (need_lambda_ib_inv) {
      if (is.null(mm_beta)) {
        lambda__ib_inv <- mm_lambda
      } else {
        ib_inv <- lav_lisrel_ibinv(mlist = mlist)
        lambda__ib_inv <- mm_lambda %*% ib_inv
      }
    }

This helps ordinary ML fits because objective, gradient, bootstrap, PML, and
simulation paths all ask for overlapping implied quantities. The broad effect
shows up in nontrivial SEM scenarios: mediation_path_sem_ml improved from
41.78 ms to 10.83 ms (3.86x), growth_demo_ml improved from
65.55 ms to 17.31 ms (3.79x), and conditional_x_ml improved from
104.53 ms to 53.00 ms (1.97x).
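The shared-product idea can be reproduced outside lavaan with the textbook all-y LISREL expressions, Sigma = Lambda (I - B)^-1 Psi (I - B)^-T Lambda' + Theta and mu = nu + Lambda (I - B)^-1 alpha. The function implied_sketch() below is hypothetical and much narrower than lav_model_implied_fast(): it ignores thresholds, slopes, and multiple groups, and the example matrices are made up.

```r
# Hypothetical sketch of a fused request; not lav_model_implied_fast() itself.
implied_sketch <- function(lambda, beta, psi, theta, nu, alpha,
                           need_sigma = TRUE, need_mu = TRUE) {
  # Shared product Lambda (I - B)^-1, computed once for every requested piece.
  if (is.null(beta)) {
    lambda_ib_inv <- lambda
  } else {
    lambda_ib_inv <- lambda %*% solve(diag(nrow(beta)) - beta)
  }
  out <- list(sigma = NULL, mu = NULL)
  if (need_sigma) {
    out$sigma <- lambda_ib_inv %*% psi %*% t(lambda_ib_inv) + theta
  }
  if (need_mu) {
    out$mu <- as.numeric(nu + lambda_ib_inv %*% alpha)
  }
  out
}

# Two factors with two indicators each; the second factor regresses on the first.
lambda <- matrix(c(1, 0.8, 0, 0, 0, 0, 1, 0.7), nrow = 4, ncol = 2)
beta <- matrix(c(0, 0.5, 0, 0), nrow = 2, ncol = 2)
psi <- diag(c(1, 0.6))
theta <- diag(0.3, 4)
nu <- rep(0, 4)
alpha <- c(0.2, 0.1)

both <- implied_sketch(lambda, beta, psi, theta, nu, alpha)
sigma_only <- implied_sketch(lambda, beta, psi, theta, nu, alpha, need_mu = FALSE)

print(both$mu)
```

The need_* flags keep the helper from computing pieces nobody asked for, while the shared product keeps the pieces that are requested from repeating the inverse.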
4. Gradient Delta Work: Build Shared Setup Once
On master, lav_model_gradient_dd() rebuilt Delta plumbing separately for
each target matrix:

    delta_lambda <- lav_model_ddelta_dx(
      lavmodel, glist = g_list, target = "lambda"
    )[[group]]
    delta_tau <- lav_model_ddelta_dx(
      lavmodel, glist = g_list, target = "tau"
    )[[group]]
    delta_nu <- lav_model_ddelta_dx(
      lavmodel, glist = g_list, target = "nu"
    )[[group]]
    delta_theta <- lav_model_ddelta_dx(
      lavmodel, glist = g_list, target = "theta"
    )[[group]]

On feature/fast-v6, the repeated setup is factored out and multiple Delta
targets are produced from the same prepared indices:

    deltas <- lav_model_ddelta_dx_many(
      lavmodel = lavmodel,
      glist = g_list,
      targets = c(
        "lambda", "tau", "nu", "theta",
        "beta", "psi", "alpha", "gamma"
      ),
      group = group
    )
    delta_lambda <- deltas$lambda
    delta_tau <- deltas$tau
    delta_nu <- deltas$nu
    delta_theta <- deltas$theta

The lower-level setup now caches model-matrix indices, free-parameter indices,
threshold indices, and the active equality-constraint shape once per group.
That is especially relevant for gradient-heavy categorical and post-fit work.
The WLSMV scenarios still remain heavier than simple ML, but they improved:
ordinal_holzinger_wlsmv fell from 134.91 ms to 81.92 ms (1.65x), and
ordinal_latent_regression_wlsmv fell from 239.02 ms to 156.73 ms
(1.53x).
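The "many targets, one setup" shape can be illustrated with a toy index table. The sketch is hypothetical throughout: free_map, prepare_setup(), index_one(), and index_many() are invented for this post and only mimic the calling pattern, not the derivative computation itself.

```r
# Hypothetical sketch; these names are illustrative, not lavaan internals.
# Each row says which model matrix and cell a free parameter lives in.
free_map <- data.frame(
  mat = c("lambda", "lambda", "theta", "theta", "psi"),
  row = c(2L, 3L, 1L, 2L, 1L),
  col = c(1L, 1L, 1L, 2L, 1L),
  free = 1:5,
  stringsAsFactors = FALSE
)

setup_calls <- 0L

# Shared setup: group the free-parameter cells by model matrix.
prepare_setup <- function(map) {
  setup_calls <<- setup_calls + 1L
  split(map[, c("row", "col", "free")], map$mat)
}

# One target per call: the setup is rebuilt on every request.
index_one <- function(map, target) {
  prepare_setup(map)[[target]]
}

# Many targets per call: the setup is built once and then sliced.
index_many <- function(map, targets) {
  prepare_setup(map)[targets]
}

targets <- c("lambda", "theta", "psi")

one_by_one <- lapply(targets, function(target) index_one(free_map, target))
names(one_by_one) <- targets
calls_one_by_one <- setup_calls

setup_calls <- 0L
all_at_once <- index_many(free_map, targets)
calls_all_at_once <- setup_calls

print(c(one_by_one = calls_one_by_one, all_at_once = calls_all_at_once))
```

Both routes return the same indices; the only difference is how many times the shared setup runs, which is the cost the refactoring targets.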
5. FIML Missing Patterns: Prepare Pattern Metadata Once
On master, missing-pattern likelihood code repeatedly reached into each
pattern object and recomputed simple indexes:

    pat_n <- length(yp)
    p_1 <- length(yp[[1]]$var.idx)
    for (p in seq_len(pat_n)) {
      var_idx <- yp[[p]]$var.idx
      na_idx <- which(!var_idx)
      p_log_2pi[p] <- sum(var_idx) * log_2pi * yp[[p]]$freq
      tt <- yp[[p]]$SY + tcrossprod(yp[[p]]$MY - mu[var_idx])
      dist_1[p] <- sum(sigma_inv * tt) * yp[[p]]$freq
    }

On feature/fast-v6, sample statistics carry prepared missing-pattern
metadata, and the likelihood avoids forming tcrossprod() only to multiply it
by the inverse:

    yp_prepared <- lav_mvnorm_missing_prepare_samplestats(yp)
    pat_n <- yp_prepared$pat_n
    for (p in seq_len(pat_n)) {
      var_idx <- yp_prepared$var.idx[[p]]
      na_idx <- yp_prepared$na.idx[[p]]
      freq <- yp_prepared$freq[p]
      p_log_2pi[p] <- yp_prepared$nvar[p] * log_2pi * freq
      diff_1 <- yp_prepared$MY[[p]] - mu[var_idx]
      dist_1[p] <- (
        sum(sigma_inv * yp_prepared$SY[[p]]) +
          sum(as.numeric(crossprod(diff_1, sigma_inv)) * diff_1)
      ) * freq
    }

The preparation is attached when the missing sample statistics are built:

    attr(yp, "prepared") <- lav_mvnorm_missing_prepare_samplestats(yp)

This is a clean FIML optimization because it does not change the likelihood; it
changes how much list lookup, index reconstruction, and temporary matrix work
is done per pattern. The FIML scenarios improved materially:
fiml_holzinger_ml moved from 412.16 ms to 200.24 ms median latency
(2.06x) with allocation falling from 42.92 MB to 30.55 MB, and
fiml_political_democracy_sem_ml moved from 333.74 ms to 163.24 ms
(2.04x) with allocation falling from 50.93 MB to 37.14 MB.
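The claim that the likelihood is unchanged follows from a small identity: summing Sigma^-1 elementwise against diff diff' gives the quadratic form diff' Sigma^-1 diff, so the p by p outer product is never needed. A base-R check with made-up inputs (the matrices below are random stand-ins, not lavaan sample statistics):

```r
set.seed(4)
p <- 5

# Random stand-ins for one missing-data pattern.
sigma <- crossprod(matrix(rnorm(20 * p), 20, p)) / 20 + diag(p)
sigma_inv <- chol2inv(chol(sigma))
sy <- crossprod(matrix(rnorm(20 * p), 20, p)) / 20   # pattern covariance
diff_1 <- rnorm(p)                                   # pattern mean minus implied mean

# master-style: form SY + diff diff' as a p x p temporary, then sum elementwise.
tt <- sy + tcrossprod(diff_1)
dist_master <- sum(sigma_inv * tt)

# fast-v6-style: keep the covariance term and the quadratic form separate.
dist_fast <- sum(sigma_inv * sy) +
  sum(as.numeric(crossprod(diff_1, sigma_inv)) * diff_1)

print(all.equal(dist_master, dist_fast))
```

The two routes agree up to floating-point rounding, so any difference in fitted values would come from accumulation order, not from a change in the objective.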
6. Two-Level Likelihood: Accumulate Derivatives Directly
On master, the two-level derivative code allocated per-cluster matrices and
then summed them at the end:

    g_muy <- matrix(0, ncluster_sizes, length(mu_y))
    g_sigma_w_1 <- matrix(0, ncluster_sizes, length(lav_matrix_vech(sigma_w_1)))
    g_sigma_b_1 <- matrix(0, ncluster_sizes, length(lav_matrix_vech(sigma_b_1)))
    for (clz in seq_len(ncluster_sizes)) {
      g_muy[clz, ] <- -2 * nj * as.numeric(yc %*% sigma_j_inv)
      g_sigma_w_1[clz, ] <- lav_matrix_vech(tmp)
      g_sigma_b_1[clz, ] <- lav_matrix_vech(tmp)
    }
    d_mu_y <- colSums(g_muy * cluster_size_ns)
    d_sigma_w1 <- lav_matrix_vech_reverse(colSums(
      g_sigma_w_1 * cluster_size_ns
    ))

On feature/fast-v6, the same quantities are accumulated directly into
weighted derivative vectors:

    d_mu_y <- numeric(length(mu_y))
    d_sigma_w1_vech <- numeric(length(lav_matrix_vech(sigma_w_1)))
    d_sigma_b_vech <- numeric(length(lav_matrix_vech(sigma_b_1)))
    for (clz in seq_len(ncluster_sizes)) {
      cluster_weight <- cluster_size_ns[clz]
      d_mu_y <- d_mu_y +
        cluster_weight * (-2 * nj * as.numeric(yc %*% sigma_j_inv))
      d_sigma_w1_vech <- d_sigma_w1_vech +
        cluster_weight * lav_matrix_vech(tmp)
      d_sigma_b_vech <- d_sigma_b_vech +
        cluster_weight * lav_matrix_vech(tmp)
    }
    d_sigma_w1 <- lav_matrix_vech_reverse(d_sigma_w1_vech)
    d_sigma_b <- lav_matrix_vech_reverse(d_sigma_b_vech)

This is the most literal allocation reduction in the set: the code no longer needs
a cluster-by-parameter matrix just to immediately collapse it. The
two-level scenario improved from 342.44 ms to 220.71 ms median latency
(1.55x), and allocation fell from 28.30 MB to 24.26 MB.
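The equivalence of the two accumulation styles is easy to confirm on synthetic inputs. In this sketch the per-cluster-size contributions are random numbers standing in for the real derivative terms, and all names are illustrative.

```r
set.seed(5)
n_sizes <- 4                        # number of distinct cluster sizes
n_par <- 3                          # length of the derivative vector
cluster_size_ns <- c(10, 7, 3, 1)   # how many clusters have each size

# Stand-in for the derivative contribution of one cluster size.
contributions <- matrix(rnorm(n_sizes * n_par), n_sizes, n_par)
contribution <- function(clz) contributions[clz, ]

# Matrix route: store every row, weight the rows, then collapse with colSums().
g <- matrix(0, n_sizes, n_par)
for (clz in seq_len(n_sizes)) {
  g[clz, ] <- contribution(clz)
}
d_matrix <- colSums(g * cluster_size_ns)

# Accumulation route: keep one running vector of length n_par.
d_accum <- numeric(n_par)
for (clz in seq_len(n_sizes)) {
  d_accum <- d_accum + cluster_size_ns[clz] * contribution(clz)
}

print(all.equal(d_matrix, d_accum))
```

The running vector needs storage proportional to the number of parameters only, whereas the matrix route also scales with the number of distinct cluster sizes.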
Full Matched Scenario Table
| Scenario | Baseline median | V6 median | Speedup | Median reduction | Allocation reduction |
|---|---|---|---|---|---|
| cfa_holzinger_ml | 44.45 ms | 10.02 ms | 4.44x | 77.5% | 41.8% |
| bootstrap_mediation_path_sem_ml | 685.90 ms | 171.28 ms | 4.00x | 75.0% | n/a |
| mediation_path_sem_ml | 41.78 ms | 10.83 ms | 3.86x | 74.1% | 42.6% |
| growth_demo_ml | 65.55 ms | 17.31 ms | 3.79x | 73.6% | 59.9% |
| sample_cov_holzinger_ml | 43.06 ms | 12.14 ms | 3.55x | 71.8% | 47.6% |
| ri_clpm_synthetic_panel_ml | 75.12 ms | 21.38 ms | 3.51x | 71.5% | 55.0% |
| piecewise_growth_synthetic_ml | 63.97 ms | 18.72 ms | 3.42x | 70.7% | 38.4% |
| multigroup_holzinger_loadings | 92.27 ms | 27.35 ms | 3.37x | 70.4% | 37.5% |
| bifactor_synthetic_battery_ml | 47.31 ms | 14.03 ms | 3.37x | 70.3% | 38.7% |
| parallel_process_growth_ml | 55.41 ms | 16.73 ms | 3.31x | 69.8% | 41.7% |
| apim_verbal_performance_sem_ml | 44.54 ms | 13.70 ms | 3.25x | 69.2% | 34.8% |
| multigroup_holzinger_invariance_postfit | 269.04 ms | 89.31 ms | 3.01x | 66.8% | 39.0% |
| sem_political_democracy_ml | 56.99 ms | 19.12 ms | 2.98x | 66.5% | 40.5% |
| wisc_higher_order_cfa_ml | 52.79 ms | 17.99 ms | 2.93x | 65.9% | 43.4% |
| mimic_holzinger_covariates_ml | 92.41 ms | 32.17 ms | 2.87x | 65.2% | 44.2% |
| mtmm_trait_method_sample_cov_ml | 62.73 ms | 22.24 ms | 2.82x | 64.5% | 39.7% |
| synthetic_large_cfa_ml | 80.27 ms | 29.49 ms | 2.72x | 63.3% | 47.5% |
| robust_clinical_path_mlr | 59.69 ms | 24.16 ms | 2.47x | 59.5% | 20.2% |
| postfit_political_democracy_sem_ml | 153.10 ms | 67.22 ms | 2.28x | 56.1% | n/a |
| fiml_holzinger_ml | 412.16 ms | 200.24 ms | 2.06x | 51.4% | 28.8% |
| fiml_political_democracy_sem_ml | 333.74 ms | 163.24 ms | 2.04x | 51.1% | 27.1% |
| conditional_x_ml | 104.53 ms | 53.00 ms | 1.97x | 49.3% | 18.8% |
| ordinal_holzinger_wlsmv | 134.91 ms | 81.92 ms | 1.65x | 39.3% | 5.6% |
| twolevel_demo_ml | 342.44 ms | 220.71 ms | 1.55x | 35.5% | 14.3% |
| ordinal_latent_regression_wlsmv | 239.02 ms | 156.73 ms | 1.53x | 34.4% | 2.4% |
Threats to Validity
The evidence is strong enough to support a performance claim, but it has boundaries.
First, this is a run-pair comparison on one environment. Both runs used R 4.6.0 and the same benchmark harness shape, which helps comparability, but the result should still be replicated before treating the exact percentages as universal.
Second, the PR is broad. Because many fast paths landed together, the benchmark does not identify how much of the total speedup belongs to baseline-model construction versus implied moments versus matrix helpers versus missing-data preparation. The evidence supports the combined intervention.
Third, the benchmark scenarios are representative, not exhaustive. The fast paths deliberately fall back to existing general code for unsupported cases. That is good engineering, but it means the measured benefit will depend on how often real user workloads hit the newly supported shapes.
Fourth, profiler sampling has finite resolution. The transcript is useful for identifying major hotspots and diagnostic rank changes, but the strongest claims in this post come from matched latency and allocation tables rather than from individual sampled call stacks.
Conclusion
This is a good example of performance work that respects both a mature codebase and the future direction of lavaan.
It does not replace lavaan's modeling language or statistical estimators, which means the public API remains unchanged.
My work does change quite a bit of the internal structure, and it follows four rules: cache what the code already
knows, compute related implied quantities together, avoid dense intermediates,
and reserve the fully general path for models that need it.
The measured result is large. Across matched scenarios, median latency fell from an average of 146.13 ms to 60.44 ms, with no median-latency regressions in the matched suite. Allocation also moved in the right direction, falling by 29.9% in aggregate across scenarios where both runs recorded allocation data.
For SEM users, this should show up as a nicer feedback loop: fits return sooner, bootstrap and post-fit workflows spend less time on repeated internal work, and common model families benefit without asking the user to write different model syntax. For a PR, that is the sweet spot: the same interface, less waiting.