Handling API Timeouts in Batch OSM Routing

In spatial epidemiology and public health infrastructure planning, calculating drive-time accessibility across large patient cohorts requires deterministic, fault-tolerant network analysis. When scaling origin-destination (OD) queries to tens of thousands of facility-patient pairs, API timeouts become a critical engineering constraint. Unhandled timeouts introduce spatial sampling bias, compromise healthcare access equity metrics, and violate data integrity requirements for regulatory reporting. Production-grade pipelines must implement stateful retry architectures, topology-aware payload partitioning, and audit-compliant error tracking. This operational rigor aligns directly with established practices in Healthcare Access & Network Analysis Automation, where statistical validity depends on complete spatial coverage and reproducible query execution.

Root Cause Analysis in Routing Engines

OSM-derived routing engines (e.g., OSRM, Valhalla) execute graph traversals whose cost grows non-linearly with coordinate count, edge complexity, and turn restrictions. Timeouts typically arise from three sources: server-side query saturation, client-side serialization overhead, and transient network degradation. In public health workflows, unvalidated coordinate arrays and mismatched coordinate reference systems (CRS) add avoidable load: OSM-based engines generally expect WGS84 longitude/latitude input, so projected or out-of-range coordinates are rejected or snapped to the wrong part of the network, and the wasted requests and retries inflate overall latency. Additionally, batch matrices that approach or exceed engine-specific size limits can be queued, rejected outright, or run long enough to return a hard 504 Gateway Timeout. Understanding these failure modes is essential for designing resilient Batch Routing & Error Handling architectures that preserve spatial accuracy under sustained computational load.
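A pre-flight check can catch projected or malformed coordinates before they ever reach the engine. The sketch below is illustrative only: `find_invalid_coords` is a hypothetical helper, and it assumes coordinates arrive as `[lon, lat]` pairs in degrees. It cannot detect swapped pairs that happen to fall inside valid ranges.

```python
def find_invalid_coords(coords):
    """Return indices of [lon, lat] pairs that cannot be valid WGS84 positions."""
    bad = []
    for i, pair in enumerate(coords):
        try:
            lon, lat = float(pair[0]), float(pair[1])
        except (TypeError, ValueError, IndexError):
            bad.append(i)  # missing, non-numeric, or truncated pair
            continue
        # NaN fails both comparisons, so it is flagged here as well
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            bad.append(i)
    return bad

sample = [
    [-73.9857, 40.7484],     # plausible lon/lat in degrees
    [583960.0, 4507523.0],   # projected metres, not degrees
    [40.7484, -173.9857],    # latitude out of range
    [None, 40.0],            # missing value
]
print(find_invalid_coords(sample))
```

Flagged indices can be written to the audit log and excluded from the batch, so a bad row costs nothing in retries.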

Retry Architecture & Compliance Logging

Transient failures require stateful retry mechanisms rather than naive polling loops. Production systems should implement exponential backoff with randomized jitter to prevent thundering-herd effects on shared routing infrastructure. The retry strategy must strictly differentiate between recoverable HTTP status codes (429, 502, 504) and terminal failures (400, 404, invalid geometries). A circuit breaker pattern halts requests when consecutive failures exceed a defined threshold, preventing cascading pipeline degradation.
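The classification and circuit-breaker logic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: `classify`, `backoff_delay`, and `CircuitBreaker` are hypothetical names, the threshold of 3 is arbitrary, and any status outside the recoverable set is treated as terminal.

```python
import random

RECOVERABLE = {429, 502, 504}  # recoverable status codes named in the text

def classify(status_code):
    """Map an HTTP status (or None for a network timeout) to an outcome label."""
    if status_code == 200:
        return "ok"
    if status_code is None or status_code in RECOVERABLE:
        return "retry"
    return "terminal"  # 400, 404 and anything else unexpected

def backoff_delay(attempt, base=2.0, cap=30.0):
    """Full-jitter backoff: uniform delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Opens once `threshold` consecutive recoverable failures have been seen."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.threshold

    def record(self, outcome):
        if outcome == "retry":
            self.consecutive_failures += 1
        elif outcome == "ok":
            self.consecutive_failures = 0
        # Terminal outcomes reflect a bad request, not engine health, so they are ignored

breaker = CircuitBreaker(threshold=3)
for status in [504, 200, 504, 429, None]:
    breaker.record(classify(status))
print(breaker.is_open)
```

A caller would check `breaker.is_open` before each request and divert the chunk to a fallback queue instead of sleeping for `backoff_delay(attempt)` and retrying.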

Each attempt must log a deterministic payload hash, UTC timestamp, and spatial bounding box. Raw patient identifiers must never be persisted in retry logs, ensuring alignment with HIPAA minimum necessary standards and GDPR data minimization principles. The Python logging module should be configured with structured JSON formatters to enable downstream audit parsing and compliance verification.
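One way to meet these logging requirements with the standard library alone is a custom formatter that emits one JSON object per record. The sketch below is an assumption-laden example: `JsonAuditFormatter`, `bounding_box`, the logger name, and the field names are all hypothetical, and the hash value is a placeholder rather than real output.

```python
import json
import logging
from datetime import datetime, timezone

class JsonAuditFormatter(logging.Formatter):
    """Render each record as one JSON object with a UTC timestamp."""

    def format(self, record):
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record
        for key in ("payload_hash", "bbox", "attempt"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, sort_keys=True)

def bounding_box(coords):
    """Return [min_lon, min_lat, max_lon, max_lat] for a list of [lon, lat] pairs."""
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]
    return [min(lons), min(lats), max(lons), max(lats)]

audit_logger = logging.getLogger("routing_audit_example")
handler = logging.StreamHandler()
handler.setFormatter(JsonAuditFormatter())
audit_logger.addHandler(handler)
audit_logger.setLevel(logging.INFO)

coords = [[-73.99, 40.73], [-73.95, 40.78]]
audit_logger.warning(
    "RECOVERABLE",
    extra={"payload_hash": "3f2a9c1b7d44", "bbox": bounding_box(coords), "attempt": 1},
)
```

Only the hash, bounding box, and attempt number are written; individual coordinates and patient identifiers never enter the record. For very small chunks the bounding box itself may be identifying, so coarser rounding of the box may be appropriate.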

The decision logic that routes each engine response to one of four outcomes (backed-off retry, fallback, manual review, or a validated result) is the core of the error-handling layer:

Figure: Routing Response Classification & Circuit Breaker. A routing response enters a status-code classifier. Network timeout, 429, 502 or 504 responses are recoverable: they pass to a circuit-breaker check that compares consecutive failures against a threshold. Below the threshold, the request is retried with exponential backoff and jitter and re-enters the classifier; at or above the threshold the circuit opens and the chunk is flagged for fallback. Status 400, 404 or an invalid geometry is a terminal failure that logs and flags the chunk for manual review. A 200 response with a valid durations matrix passes spatial validation and is appended to the results set. Every response resolves to one outcome (retry, fallback, append, or review), so no chunk is silently dropped.

A single OD chunk flows through the retry layer, which classifies failures and re-issues recoverable requests with backoff before returning a validated matrix:

Figure: Batch Routing Retry Sequence. A sequence diagram with three participants: Pipeline, Retry layer and Routing engine. The Pipeline submits an origin-destination chunk to the Retry layer. The Retry layer sends a Table API request (attempt 1) to the Routing engine, which returns a 504 Gateway Timeout. After exponential backoff with jitter, the Retry layer sends attempt 2, the engine returns 200 OK with a durations matrix, and the Retry layer returns a validated travel-time matrix to the Pipeline. A timed-out routing call is retried with backoff until a valid matrix returns.

Payload Optimization & CRS Alignment

Batch routing efficiency depends on strategic request partitioning. Monolithic coordinate matrices should be replaced with topology-aware chunks based on spatial proximity and network boundaries. Pre-processing must enforce consistent coordinate precision (typically six decimal places for ~0.11 m resolution at the equator) and validate geometries against the routing engine’s expected CRS (OSRM and most OSM-based engines expect EPSG:4326). Spatial indexing via geopandas enables efficient chunk generation that minimizes cross-boundary route fragmentation. Implementing these optimizations reduces payload serialization overhead and keeps individual requests within engine timeout thresholds, as documented in the OSRM Table API specifications.
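The proximity-based partitioning described above can be approximated without any spatial library by bucketing points into a regular grid. This is a simplified stand-in for a real spatial index, not the method the pipeline necessarily uses: `grid_chunks`, the 0.05-degree cell size, and the sample points are illustrative assumptions.

```python
from collections import defaultdict

def grid_chunks(coords, cell_deg=0.05, chunk_size=25):
    """Group [lon, lat] pairs by grid cell, then split each cell into bounded chunks."""
    cells = defaultdict(list)
    for lon, lat in coords:
        # Fix precision at six decimals so payloads (and their hashes) are reproducible
        lon, lat = round(lon, 6), round(lat, 6)
        key = (int(lon // cell_deg), int(lat // cell_deg))
        cells[key].append([lon, lat])

    chunks = []
    for key in sorted(cells):  # sorted keys give a stable chunk order across runs
        members = cells[key]
        for i in range(0, len(members), chunk_size):
            chunks.append(members[i:i + chunk_size])
    return chunks

points = [
    [-73.9857, 40.7484], [-73.9851, 40.7490],  # neighbours in the same cell
    [-118.2437, 34.0522],                      # far away, lands in its own cell
    [-73.9860, 40.7480],
]
for chunk in grid_chunks(points, cell_deg=0.05, chunk_size=2):
    print(chunk)
```

Points that share a cell stay together, so no single request spans distant parts of the network, while `chunk_size` still caps the matrix dimension per request. A grid ignores network boundaries such as rivers or administrative borders, which a topology-aware scheme would need to handle separately.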

Production-Ready Python Implementation

The following pipeline demonstrates bounded retry logic, CRS normalization, and hash-based audit logging using tenacity for backoff management and requests for HTTP execution.

import hashlib
import json
import logging
import time

import geopandas as gpd
import requests
from requests.exceptions import ConnectionError, HTTPError, Timeout
from tenacity import (
    retry, retry_if_exception_type, retry_if_result,
    stop_after_attempt, wait_exponential, wait_random
)

# Configure audit logging with UTC timestamps (pipe-delimited here for brevity)
logging.Formatter.converter = time.gmtime
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
    handlers=[logging.FileHandler("routing_audit.log")]
)
logger = logging.getLogger("osm_batch_router")

def generate_payload_hash(coords):
    """Deterministic hash so audit logs can reference a chunk without storing raw coordinates."""
    return hashlib.sha256(json.dumps(coords, sort_keys=True).encode()).hexdigest()[:12]

def _is_recoverable(response):
    """Return True if the result is an HTTP response with a retryable status code."""
    return hasattr(response, 'status_code') and response.status_code in {429, 502, 504}

@retry(
    retry=(retry_if_exception_type((Timeout, ConnectionError)) | retry_if_result(_is_recoverable)),
    # Exponential backoff plus up to 2 s of random jitter to avoid synchronized retries
    wait=wait_exponential(multiplier=1, min=2, max=30) + wait_random(0, 2),
    stop=stop_after_attempt(5),
    reraise=True
)
def fetch_route_batch(engine_url, chunk, timeout=15):
    """
    Request a duration matrix from the OSRM Table service.
    engine_url is the service root, e.g. http://localhost:5000/table/v1/driving.
    chunk holds "coordinates" ([lon, lat] pairs) plus "sources" and "destinations" index lists.
    """
    payload_hash = generate_payload_hash(chunk["coordinates"])
    # OSRM's HTTP API is GET-based: coordinates go in the URL path as lon,lat;lon,lat
    coord_str = ";".join(f"{lon},{lat}" for lon, lat in chunk["coordinates"])
    sources = ";".join(str(i) for i in chunk["sources"])
    destinations = ";".join(str(i) for i in chunk["destinations"])
    url = f"{engine_url}/{coord_str}?sources={sources}&destinations={destinations}"

    try:
        resp = requests.get(url, timeout=timeout)
        if resp.status_code == 200:
            logger.info(f"SUCCESS | hash={payload_hash} | duration={resp.elapsed.total_seconds():.2f}s")
            return resp.json()
        if _is_recoverable(resp):
            logger.warning(f"RECOVERABLE | hash={payload_hash} | status={resp.status_code}")
            return resp  # tenacity will trigger retry via retry_if_result
        # Terminal failure: the response body may echo URL fragments, so it is not logged
        logger.error(f"TERMINAL | hash={payload_hash} | status={resp.status_code}")
        resp.raise_for_status()
        raise HTTPError(f"Unexpected status {resp.status_code}", response=resp)
    except (Timeout, ConnectionError) as e:
        # Log only the exception type: the message embeds the URL, which contains coordinates
        logger.warning(f"NETWORK_ERROR | hash={payload_hash} | detail={type(e).__name__}")
        raise

def chunk_od_pairs(gdf_origins, gdf_destinations, chunk_size=25):
    """
    Split origins into batches and pair each batch with every destination.
    Batches follow input order, so sort origins spatially beforehand for proximity-based chunks.
    Assumes one origin batch plus all destinations fits within the engine's table-size limit.
    """
    # OSRM expects EPSG:4326 (longitude, latitude order)
    gdf_origins = gdf_origins.to_crs("EPSG:4326")
    gdf_destinations = gdf_destinations.to_crs("EPSG:4326")

    # Round to 6 decimals (~0.11 m) for deterministic payloads
    origins_coords = [[round(g.x, 6), round(g.y, 6)] for g in gdf_origins.geometry]
    destinations_coords = [[round(g.x, 6), round(g.y, 6)] for g in gdf_destinations.geometry]

    chunks = []
    for start in range(0, len(origins_coords), chunk_size):
        batch = origins_coords[start:start + chunk_size]
        chunks.append({
            "coordinates": batch + destinations_coords,
            "sources": list(range(len(batch))),
            "destinations": list(range(len(batch), len(batch) + len(destinations_coords))),
            "origin_offset": start,  # maps matrix rows back to rows of gdf_origins
        })
    return chunks

def run_batch_routing(engine_url, origins_path, destinations_path):
    origins = gpd.read_file(origins_path)
    destinations = gpd.read_file(destinations_path)
    chunks = chunk_od_pairs(origins, destinations)

    results, failed_chunks = [], []
    for i, chunk in enumerate(chunks):
        try:
            results.append((i, fetch_route_batch(engine_url, chunk)))
        except Exception as e:
            logger.critical(f"CHUNK_{i}_FAILED | detail={type(e).__name__}")
            # Record the index so the chunk can go to manual review or a secondary engine
            failed_chunks.append(i)
    return results, failed_chunks

Spatial Validation & Audit Readiness

Post-processing must verify spatial completeness before calculating accessibility indices. Missing routes should be explicitly flagged rather than imputed to prevent bias in spatial equity metrics. Implement validation checks that verify returned travel times fall within physically plausible ranges (e.g., greater than 0 and less than 24 hours for distinct origin and destination points) and cross-reference route geometries against known facility catchments.
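The range check described above might look like the following sketch. It assumes the engine returns durations in seconds as a list of rows with `None` for pairs that have no route, which matches OSRM's Table response; `validate_durations` and the flag labels are hypothetical names.

```python
MAX_SECONDS = 24 * 3600  # upper plausibility bound taken from the text

def validate_durations(durations, allow_zero=False):
    """Split a durations matrix into valid entries and explicitly flagged ones."""
    valid, flagged = [], []
    for i, row in enumerate(durations):
        for j, seconds in enumerate(row):
            if seconds is None:
                flagged.append((i, j, "no_route"))  # flag, never impute
            elif seconds < 0 or seconds >= MAX_SECONDS or (seconds == 0 and not allow_zero):
                flagged.append((i, j, "implausible"))
            else:
                valid.append((i, j, seconds))
    return valid, flagged

matrix = [
    [612.4, None],    # second destination unreachable
    [0, 90000.0],     # zero and more than 24 h are both suspicious
]
valid, flagged = validate_durations(matrix)
print(valid)
print(flagged)
```

Keeping flagged pairs in a separate list, rather than dropping them, lets downstream accessibility indices report coverage alongside the estimate. Set `allow_zero=True` when an origin and a destination can legitimately share a location.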

Audit trails must support full reproducibility for public health reporting and peer review. Store chunk hashes, retry counts, and final status codes in a version-controlled metadata table. This pattern ensures compliance with federal spatial data standards while maintaining the statistical integrity required for epidemiological modeling and resource allocation decisions.
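A plain CSV file is one simple form such a metadata table could take, since it diffs cleanly under version control. The sketch below is illustrative: `ChunkAuditRecord`, `write_audit_table`, and the column names are assumptions, and the hashes and timestamps are placeholder values, not real output.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass
class ChunkAuditRecord:
    chunk_hash: str      # truncated payload hash, never raw coordinates
    retry_count: int
    final_status: str    # e.g. "200", "504", "TIMEOUT"
    completed_utc: str   # ISO 8601 timestamp in UTC

def write_audit_table(records, stream):
    """Write audit records as CSV with a fixed column order."""
    columns = [f.name for f in fields(ChunkAuditRecord)]
    writer = csv.DictWriter(stream, fieldnames=columns)
    writer.writeheader()
    for rec in records:
        writer.writerow(asdict(rec))

# Placeholder records for illustration only
records = [
    ChunkAuditRecord("3f2a9c1b7d44", 1, "200", "2024-01-01T00:00:00+00:00"),
    ChunkAuditRecord("a91e07c2f5b0", 4, "504", "2024-01-01T00:00:12+00:00"),
]
buffer = io.StringIO()
write_audit_table(records, buffer)
print(buffer.getvalue())
```

In practice the stream would be a file committed alongside the analysis code, so each published result can be traced to the exact set of chunks and retries that produced it.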