Calculating Local Moran’s I for Infectious Disease Outbreaks
This guide solves one narrow but high-stakes problem: running a defensible, audit-ready Local Moran’s I (LISA) computation on infectious disease incidence so an outbreak response team can declare statistically significant clusters without generating false alarms. It is part of the Global & Local Moran’s I Implementation method family within the broader Disease Clustering & Spatial Statistical Modeling pipeline.
Problem Context & Constraints
Exploratory tutorials run esda.Moran_Local(y, w) on a clean GeoDataFrame and plot the result. That naive path fails the moment surveillance data hits it, for four concrete reasons specific to outbreak response:
- Uncorrected per-tract p-values. A LISA test runs once per spatial unit. Across 600 census tracts at α = 0.05 you expect roughly 30 false-positive “significant” tracts by chance alone. Mapping those as outbreak clusters triggers wasted field investigations and erodes trust in the surveillance product. Multiple-testing control is not optional here — it is the difference between a hotspot and a coin flip.
- Geographic coordinates breaking neighbor topology. Incidence layers frequently arrive in WGS84 (EPSG:4326). Queen/Rook contiguity does not care about units, but any distance- or k-nearest-neighbor weighting computed on degrees produces a topology that is wrong by an amount that varies with latitude. The fix is to enforce a metric projection up front; see how to align WGS84 and UTM for county health data for the canonical reprojection pattern.
- Island polygons. Rural jurisdictions, coastal tracts, and fragmented health districts routinely yield polygons with no contiguity neighbor. A single island makes a row-standardized weights matrix undefined for that row, and
esdawill either error or silently emit a meaningless statistic. - Raw counts masquerading as rates. Local Moran’s I on raw case counts mostly rediscovers where people live. Inputs must be an age-standardized or population-adjusted incidence rate before the statistic is meaningful, or the result is ecological fallacy with a p-value attached.
The pipeline below addresses each constraint with an explicit gate, so every cluster declaration carries a reproducible justification.
Prerequisites
Pin these versions; LISA quadrant codes and the Moran_Local permutation API have shifted across releases, and reproducibility is a compliance requirement, not a nicety.
# Pinned stack (tested):
# geopandas==0.14.4
# libpysal==4.10
# esda==2.5.1
# statsmodels==0.14.2
# shapely==2.0.4
# numpy==1.26.4
Input data requirements:
- A polygon
GeoDataFrameof spatial units (tracts, ZIP code tabulation areas, or health districts). - A numeric
incidence_ratecolumn that is already age-standardized or population-adjusted — not raw counts. - A stable, unique identifier column (e.g.
geoid) used to sort the frame deterministically before weight construction. Spatial weight builders are order-sensitive; without a stable sort the same input can yield different permutation p-values across runs.
Step-by-Step Solution
The workflow isolates four discrete, testable stages: validated ingestion, weights construction with island fallback, LISA computation with False Discovery Rate (FDR) correction, and quadrant classification with audit export. Each stage emits structured logs so the run is reconstructable from the log alone.
1. Ingest, validate geometry, and enforce compliance
import geopandas as gpd
import numpy as np
import logging
from shapely.validation import make_valid
logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
def ingest_and_validate_spatial_data(
gdf: gpd.GeoDataFrame,
id_col: str = "geoid",
target_epsg: int = 32617, # UTM zone for the surveillance AOI
rate_col: str = "incidence_rate",
) -> gpd.GeoDataFrame:
"""Enforce CRS alignment, repair geometries, strip PHI, and sort deterministically."""
# Deterministic ordering — required for reproducible permutation p-values
gdf = gdf.sort_values(id_col).reset_index(drop=True)
# Geometry validation & repair
invalid_mask = ~gdf.geometry.is_valid
if invalid_mask.any():
gdf.loc[invalid_mask, "geometry"] = gdf.loc[invalid_mask, "geometry"].apply(make_valid)
logging.warning("Repaired %d invalid geometries.", invalid_mask.sum())
# CRS enforcement: contiguity is unit-agnostic, but any KNN fallback needs metric coords
if gdf.crs is None or gdf.crs.to_epsg() == 4326:
gdf = gdf.to_crs(epsg=target_epsg)
logging.info("Projected to EPSG:%d for metric spatial operations.", target_epsg)
elif gdf.crs.to_epsg() != target_epsg:
gdf = gdf.to_crs(epsg=target_epsg)
# PHI/PII compliance gate — drop direct identifiers before any analysis
sensitive_patterns = ["ssn", "name", "phone", "email", "patient_id", "mrn"]
sensitive_cols = [c for c in gdf.columns if any(p in c.lower() for p in sensitive_patterns)]
if sensitive_cols:
gdf = gdf.drop(columns=sensitive_cols)
logging.info("Stripped PII/PHI columns: %s", sensitive_cols)
# Zero-variance guard — LISA is undefined when the rate is constant
if gdf[rate_col].std() == 0:
raise ValueError("Incidence rate has zero variance; Local Moran's I cannot compute.")
return gdf
2. Build row-standardized weights with island fallback
The weights matrix W defines neighborhood topology and therefore controls LISA sensitivity directly. Start with Queen contiguity for areal incidence data; fall back to k-nearest-neighbor (KNN) only when contiguity produces islands, so the topology stays defined for every row. Row-standardization (transform = "r") is mandatory — it makes the local statistic comparable across units with different neighbor counts.
import libpysal
from libpysal.weights import Queen, KNN
def construct_spatial_weights(gdf: gpd.GeoDataFrame, k: int = 4) -> libpysal.weights.W:
"""Row-standardized weights; switch Queen -> KNN when islands appear."""
W = Queen.from_dataframe(gdf, use_index=True)
if W.islands:
logging.warning("Queen weights have %d island(s); switching to KNN(k=%d).", len(W.islands), k)
W = KNN.from_dataframe(gdf, k=k, use_index=True)
W.transform = "r" # row-standardization is required for LISA comparability
logging.info("Weights built: n=%d, transform='%s'.", W.n, W.transform)
return W
3. Compute Local Moran’s I with FDR correction
Local Moran’s I decomposes the global statistic into a per-unit contribution. With 999 conditional permutations, esda returns a pseudo p-value (p_sim) per unit. Feed every p_sim through Benjamini-Hochberg FDR control so the significant set holds its false-discovery rate at α across the whole map, not per individual test.
The local statistic for unit (i) is
where (w_{ij}) are the row-standardized weights and (\bar{x}) is the mean incidence rate. The sum of all (I_i) is proportional to the global Moran’s I, which is why the two are reported together.
import esda
from statsmodels.stats.multitest import multipletests
from typing import Tuple
def compute_local_moran_with_fdr(
gdf: gpd.GeoDataFrame,
W: libpysal.weights.W,
rate_col: str = "incidence_rate",
permutations: int = 999,
alpha: float = 0.05,
seed: int = 42,
) -> Tuple[gpd.GeoDataFrame, dict]:
"""Local Moran's I with permutation inference + Benjamini-Hochberg FDR."""
np.random.seed(seed) # log the seed; permutation p-values are stochastic
y = gdf[rate_col].to_numpy(dtype=float)
lisa = esda.Moran_Local(y, W, permutations=permutations)
# FDR control across all per-unit tests
_, p_fdr, _, _ = multipletests(lisa.p_sim, alpha=alpha, method="fdr_bh")
gdf = gdf.copy()
gdf["lisa_i"] = lisa.Is
gdf["p_sim"] = lisa.p_sim
gdf["p_fdr"] = p_fdr
gdf["significant_fdr"] = p_fdr < alpha
logging.info("LISA done. Significant units (FDR<%.2f): %d", alpha, int(gdf["significant_fdr"].sum()))
# Global Moran's I for the summary header (reported, not used for classification)
g = esda.Moran(y, W, permutations=permutations)
return gdf, {"global_moran_i": float(g.I), "global_p_sim": float(g.p_sim), "seed": seed}
4. Classify quadrants and export an audit-ready layer
Each significant unit falls into one of four quadrants. esda exposes lisa.q directly (1=HH, 2=LH, 3=LL, 4=HL), which is more reliable than re-deriving the quadrant by hand. High-High units are the operational signal — elevated incidence surrounded by elevated incidence — and are what an outbreak team acts on first.
def classify_and_export(
gdf: gpd.GeoDataFrame,
lisa_q,
output_path: str = "lisa_outbreak_clusters.parquet",
) -> gpd.GeoDataFrame:
"""Map esda quadrant codes, gate on FDR significance, export GeoParquet."""
quad_map = {1: "HH", 2: "LH", 3: "LL", 4: "HL"}
gdf = gdf.copy()
gdf["cluster_type"] = np.where(gdf["significant_fdr"], np.vectorize(quad_map.get)(lisa_q), "NS")
gdf["is_significant_cluster"] = gdf["significant_fdr"] & (gdf["cluster_type"] != "NS")
export_cols = ["lisa_i", "p_sim", "p_fdr", "cluster_type", "is_significant_cluster", "geometry"]
gdf[export_cols].to_parquet(output_path, index=False) # requires geopandas >= 0.8
logging.info("Audit-ready LISA layer exported to %s", output_path)
return gdf[gdf["is_significant_cluster"]]
Wired together, the four stages take a raw surveillance layer to a de-identified, FDR-controlled cluster layer ready for interagency handoff:
gdf = ingest_and_validate_spatial_data(raw_gdf)
W = construct_spatial_weights(gdf)
gdf, summary = compute_local_moran_with_fdr(gdf, W)
clusters = classify_and_export(gdf, esda.Moran_Local(
gdf["incidence_rate"].to_numpy(float), W, permutations=999).q)
logging.info("Global Moran's I=%.3f (p=%.3f); HH clusters=%d",
summary["global_moran_i"], summary["global_p_sim"],
int((clusters["cluster_type"] == "HH").sum()))
Validation & Edge Cases
Three failure modes account for nearly every bad LISA map in surveillance practice. Each has a diagnostic you can assert on before publishing results.
1. Silent island contamination. If Queen contiguity slips through with islands, affected rows return lisa_i near zero with p_sim ≈ 1.0 — they look “not significant” but are actually undefined. Assert the topology before computing:
assert not W.islands, f"Islands present: {W.islands[:5]} — KNN fallback did not trigger."
# Log line you want to see instead:
# WARNING | Queen weights have 3 island(s); switching to KNN(k=4).
2. Counts mistaken for rates. A LISA map dominated by a few enormous HH units in the densest tracts is the classic signature of feeding raw counts. Diagnose by checking the rank correlation between the input and population — a near-perfect correlation means you are mapping population, not risk:
from scipy.stats import spearmanr
rho, _ = spearmanr(gdf["incidence_rate"], gdf["population"])
if abs(rho) > 0.9:
logging.error("Input tracks population (rho=%.2f). Use an age-standardized rate.", rho)
3. Over-permissive significance. Compare the raw vs. FDR-corrected counts every run. A large gap is expected and is exactly the false alarms you are suppressing; a zero gap means FDR was never applied:
raw_sig = int((gdf["p_sim"] < 0.05).sum())
fdr_sig = int((gdf["p_fdr"] < 0.05).sum())
logging.info("Significant: raw=%d, FDR=%d (suppressed %d false positives).",
raw_sig, fdr_sig, raw_sig - fdr_sig)
For very large frames (N > 50k spatial units) the dense permutation step dominates runtime and memory; cap permutations at 999, ensure W is held sparse (the libpysal default), and shard by jurisdiction rather than loading a national layer at once.
Compliance Notes
For regulatory defensibility, the run is only as good as what it logs. Persist the following to a JSON audit trail alongside the output layer:
- Random seed used for the permutation inference — pseudo p-values are stochastic and irreproducible without it.
- CRS EPSG code the analysis ran in, plus confirmation the input was reprojected from its source CRS.
- Weights provenance: which scheme actually ran (Queen or the KNN fallback) and the
kvalue, since this changes every downstream p-value. - FDR method and α, plus the raw-vs-corrected significant counts, so a reviewer can see how many candidate clusters were suppressed.
- Library versions and a SHA-256 hash of the input layer to lock data lineage.
- Confirmation that the PHI/PII gate ran and which columns it dropped — never let direct identifiers reach the exported cluster layer that goes to interagency partners.
Frequently Asked Questions
Should I correct LISA p-values with FDR or Bonferroni?
Use Benjamini-Hochberg FDR (method="fdr_bh") for surveillance. Bonferroni controls the family-wise error rate and is so conservative that real localized outbreaks get discarded; FDR controls the expected proportion of false discoveries and preserves sensitivity to true clusters.
Why use Queen contiguity instead of distance-based weights for outbreak data? For areal incidence (tracts, districts) Queen contiguity respects administrative adjacency and is unit-agnostic, so it survives an un-projected CRS. Distance and KNN weights are appropriate when units vary wildly in size or when contiguity yields islands — which is exactly the documented fallback trigger above.
Which quadrant do I act on first? High-High (HH) units: elevated incidence surrounded by elevated incidence. These are the spatially coherent hotspots. High-Low (HL) units are isolated spikes worth a data-quality check, while Low-Low (LL) units mark consistently low-burden areas.
Can I run this on raw case counts? No. Feed an age-standardized or population-adjusted incidence rate. Raw counts make LISA rediscover population density; the Spearman check in the validation section flags this before you publish.
Related Topics
- Global & Local Moran’s I Implementation — the parent method family and shared weights/inference standards.
- Optimizing Bandwidth for Getis-Ord Gi* Heatmaps — the sibling local statistic when you need distance-based hotspot intensity rather than HH/LL classification.
- How to Align WGS84 and UTM for County Health Data — the CRS enforcement step this pipeline depends on.