INTEL DATA PIPELINE
Automated multi-stage intelligence pipeline running on Cloudflare's global edge. Collects, deduplicates, classifies, fact-checks, and indexes conflict news every hour.
SYSTEM ARCHITECTURE
PIPELINE FLOW
Each run is a durable Cloudflare Workflow instance. Steps checkpoint automatically — if a step fails mid-run, only that step retries (not the entire pipeline).
Scheduled trigger fires every hour at :00 via Cloudflare's built-in cron system. On-demand runs can also be triggered by webhook (POST /api/cron/trigger with a Bearer token). Either trigger starts a new durable Workflow instance.
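The Bearer-token gate on the on-demand trigger could look roughly like the sketch below. The helper name `isAuthorizedTrigger` and the way the expected token is passed in are assumptions for illustration; the source only states that POST /api/cron/trigger requires a Bearer token.

```typescript
// Hypothetical helper: decides whether a trigger request may start a run.
// The Worker's real routing and secret storage are not shown in the source.
function isAuthorizedTrigger(
  method: string,
  authorizationHeader: string | null,
  expectedToken: string,
): boolean {
  if (method !== "POST") return false;
  if (expectedToken.length === 0) return false; // refuse to run with an unset secret
  if (authorizationHeader === null) return false;

  const prefix = "Bearer ";
  if (!authorizationHeader.startsWith(prefix)) return false;
  const presented = authorizationHeader.slice(prefix.length);

  // Compare every character so the loop does not exit early on the first mismatch.
  if (presented.length !== expectedToken.length) return false;
  let diff = 0;
  for (let i = 0; i < presented.length; i++) {
    diff |= presented.charCodeAt(i) ^ expectedToken.charCodeAt(i);
  }
  return diff === 0;
}
```

A request handler would call this before creating the Workflow instance and answer 401 when it returns false.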
Gemini 3.1 Pro with real-time Google Search grounding fetches the latest Iran-Israel-US conflict news as a structured JSON array. Thinking mode (high) enables deep multi-step reasoning for better article synthesis. Auto-fallback chain: 3.1 Pro → 2.5 Flash (thinkingBudget 4096) → 2.0 Flash.
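The fallback chain described above can be sketched as a loop that tries each model in order and returns the first success. This is an illustration only: the model identifier strings are placeholders that mirror the names in the text, and `ModelCall` stands in for whatever function actually calls the Gemini API.

```typescript
// Hypothetical caller: takes a model name, resolves with the result or rejects.
type ModelCall<T> = (model: string) => Promise<T>;

// Placeholder identifiers mirroring the chain in the text, not confirmed API model IDs.
const FALLBACK_CHAIN = ["gemini-3.1-pro", "gemini-2.5-flash", "gemini-2.0-flash"];

async function collectWithFallback<T>(
  call: ModelCall<T>,
  models: string[] = FALLBACK_CHAIN,
): Promise<{ model: string; result: T }> {
  const errors: string[] = [];
  for (const model of models) {
    try {
      // The first model that succeeds ends the chain.
      return { model, result: await call(model) };
    } catch (err) {
      errors.push(`${model}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  // Every model failed, so surface all collected errors to the caller.
  throw new Error(`All models failed: ${errors.join("; ")}`);
}
```

Returning the model name alongside the result lets a run record which tier of the chain actually produced the articles.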
Each article's title + first 500 characters are vectorized into a 384-dimensional dense embedding. Runs entirely on Cloudflare's edge inference — zero external network calls, sub-second latency for the full batch. Step timeout: 30 seconds.
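A small sketch of how the text passed to the embedding model might be assembled. It assumes the 500 characters are taken from the article body and that title and body are joined with a newline; neither detail is stated in the source, and `RawArticle` is a hypothetical type.

```typescript
// Hypothetical shape of a collected article before embedding.
interface RawArticle {
  title: string;
  content: string;
}

// Builds the embedding input: the title plus the first `maxChars` characters of the body.
// The newline separator is an assumption.
function embeddingInput(article: RawArticle, maxChars = 500): string {
  return `${article.title}\n${article.content.slice(0, maxChars)}`;
}
```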
Each new embedding is queried against the persistent Vectorize index (growing over time). Articles scoring ≥ 0.85 cosine similarity to any previously seen article are silently dropped as duplicates. Only genuinely novel events advance. Step timeout: 15 seconds.
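The deduplication rule itself is simple to state in code. In the pipeline the similarity score comes back from the Vectorize query; the sketch below computes cosine similarity locally purely to illustrate the ≥ 0.85 threshold, and the function names are hypothetical.

```typescript
// Cosine similarity between two dense vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat a zero vector as matching nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// An article is a duplicate if it is close enough to ANY previously seen embedding.
function isDuplicate(candidate: number[], seen: number[][], threshold = 0.85): boolean {
  return seen.some((v) => cosineSimilarity(candidate, v) >= threshold);
}
```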
Each novel article is classified and dual-language summarized in a single LLM call: category (Military / Diplomatic / Nuclear / Economic), severity rating (1–5), tag array, English summary, Chinese summary (summary_zh), Chinese title (title_zh), and event_type. Step timeout: 5 minutes.
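Because the classification arrives as model-generated JSON, it is worth validating before use. The sketch below checks the fields listed above; the `Classification` type, the `summary` field name for the English summary, and the decision to reject (rather than repair) malformed output are assumptions.

```typescript
const CATEGORIES = ["Military", "Diplomatic", "Nuclear", "Economic"] as const;
type Category = (typeof CATEGORIES)[number];

// Hypothetical shape of the single-call classification result.
interface Classification {
  category: Category;
  severity: number; // integer from 1 to 5
  tags: string[];
  summary: string;
  summary_zh: string;
  title_zh: string;
  event_type: string;
}

// Returns a validated Classification, or null if the model output is unusable.
function parseClassification(raw: string): Classification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // the model returned something that is not JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const { category, severity, tags, summary, summary_zh, title_zh, event_type } =
    data as Record<string, unknown>;

  const validCategory = CATEGORIES.find((c) => c === category);
  if (validCategory === undefined) return null;
  if (typeof severity !== "number" || !Number.isInteger(severity)) return null;
  if (severity < 1 || severity > 5) return null;
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) return null;
  if (typeof summary !== "string" || typeof summary_zh !== "string") return null;
  if (typeof title_zh !== "string" || typeof event_type !== "string") return null;

  return { category: validCategory, severity, tags, summary, summary_zh, title_zh, event_type };
}
```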
A targeted Brave Search query (article title + 'Iran Israel US military 2026') is executed for each article. Results from independent news sources are collected as corroborating evidence — Brave uses its own independent index, not Google or Bing.
LLM cross-references the original article against Brave Search results. Returns: status (verified / uncertain / disputed), confidence score (0–100), human-readable verification notes, and corroborating URLs. Disputed articles are silently dropped from the pipeline — never stored. Step timeout: 3 minutes.
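The drop rule for disputed articles can be expressed as a single filter. The type and field names here (`FactCheckResult`, `factCheck`, `corroboratingUrls`) are hypothetical; only the three status values and the 0–100 confidence range come from the text.

```typescript
type FactCheckStatus = "verified" | "uncertain" | "disputed";

// Hypothetical shape of the fact-check step's output for one article.
interface FactCheckResult {
  status: FactCheckStatus;
  confidence: number; // 0 to 100
  notes: string;
  corroboratingUrls: string[];
}

// Keeps verified and uncertain articles; disputed ones never reach storage.
function dropDisputed<T extends { factCheck: FactCheckResult }>(articles: T[]): T[] {
  return articles.filter((a) => a.factCheck.status !== "disputed");
}
```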
Articles that pass fact-checking (status verified or uncertain) are written to D1 with all metadata. Embeddings are upserted to Vectorize so future runs can deduplicate against them. A live_events record is also created for the Timeline view. Data is immediately available via the /api/intel REST endpoint. Step timeout: 30 seconds.
TECHNOLOGY STACK
Serverless compute at the edge. The entire backend runs as a single Workers bundle — zero server management, global distribution.
Long-running durable pipeline with automatic step-level retry and checkpointing. If a step fails, only that step retries — not the whole pipeline.
On-edge ML inference runs directly inside the Worker runtime. No external API calls for embeddings or classification — single-digit millisecond overhead.
Persistent vector database grows with every processed article. Enables semantic deduplication — not just exact-match but concept-level duplicate detection.
Edge-native SQLite. Stores all verified articles, analytics events, live timeline events, and oil/market data snapshots.
Google's frontier model with real-time web access for news collection. Thinking mode produces higher-quality article analysis and structured JSON output.
Independent search index (not Google/Bing-derived) used for fact corroboration. Provides unbiased, real-time corroborating sources for each article.
SPA polls /api/intel every 30 seconds. Articles cached in localStorage for offline resilience. PWA-enabled — installable as a native-like app.
ARTICLE DATA SCHEMA
D1 table articles: each row is a fact-checked, deduplicated, AI-processed news article.
| FIELD | TYPE | DESCRIPTION |
|---|---|---|
| id | TEXT (UUID) | Unique article identifier |
| title | TEXT | Original article headline |
| content | TEXT | Full article body text from source |
| summary | TEXT | AI-generated English summary (2–3 sentences) |
| summary_zh | TEXT | AI-generated Traditional Chinese summary |
| title_zh | TEXT | AI-translated Chinese headline |
| category | TEXT | Military / Diplomatic / Nuclear / Economic |
| event_type | TEXT | airstrike / missile / diplomatic / nuclear / sanction / ... |
| tags | JSON [] | Array of keyword tags (e.g. ['Iran', 'IRGC', 'nuclear']) |
| severity | INT 1–5 | Geopolitical significance score (5 = war-changing) |
| source | TEXT | News outlet name (e.g. AP News, Al Jazeera) |
| source_url | TEXT | Original article URL |
| published_at | DATETIME | Original publication timestamp |
| fact_check_status | TEXT | verified / uncertain (disputed = dropped) |
| fact_check_notes | TEXT | AI verification reasoning and caveats |
| brave_sources | JSON [] | Corroborating URLs from Brave Search |
| created_at | DATETIME | Database insertion timestamp |
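The schema above maps naturally onto a TypeScript type plus a parameterized INSERT. The sketch below is an assumption about how a row might be prepared, not the project's actual code: it serializes the two JSON array columns to strings and builds positional placeholders, which could then be passed to D1's `prepare(sql).bind(...params).run()`.

```typescript
// Mirrors the columns of the articles table; DATETIME values are assumed to be ISO strings.
interface ArticleRow {
  id: string;
  title: string;
  content: string;
  summary: string;
  summary_zh: string;
  title_zh: string;
  category: "Military" | "Diplomatic" | "Nuclear" | "Economic";
  event_type: string;
  tags: string[];
  severity: 1 | 2 | 3 | 4 | 5;
  source: string;
  source_url: string;
  published_at: string;
  fact_check_status: "verified" | "uncertain";
  fact_check_notes: string;
  brave_sources: string[];
  created_at: string;
}

const ARTICLE_COLUMNS: readonly (keyof ArticleRow)[] = [
  "id", "title", "content", "summary", "summary_zh", "title_zh", "category",
  "event_type", "tags", "severity", "source", "source_url", "published_at",
  "fact_check_status", "fact_check_notes", "brave_sources", "created_at",
];

// Builds a parameterized INSERT; array columns are stored as JSON text.
function buildArticleInsert(a: ArticleRow): { sql: string; params: (string | number)[] } {
  const params = ARTICLE_COLUMNS.map((col) => {
    const value = a[col];
    return Array.isArray(value) ? JSON.stringify(value) : value;
  });
  const placeholders = ARTICLE_COLUMNS.map(() => "?").join(", ");
  const sql = `INSERT INTO articles (${ARTICLE_COLUMNS.join(", ")}) VALUES (${placeholders})`;
  return { sql, params };
}
```

Binding values positionally, rather than interpolating them into the SQL string, keeps article text from being interpreted as SQL.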
LIVE DATA SOURCES
Four external APIs feed this platform — news intelligence, financial signals, vessel tracking, and fact verification.
Provides Brent crude oil spot price (BZ=F) and 22 defense stock quotes. Uses the undocumented v8/finance/chart endpoint — no API key required. Each ticker is fetched in parallel via Promise.allSettled so a single failure doesn't block the rest. Oil price is sanity-checked ($20–$300 range). 8-second timeout per request.
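The parallel fetch with per-request timeout and the oil-price sanity check can be sketched as follows. `QuoteFetcher` is a hypothetical stand-in for the real HTTP call, which is not shown in the source; only the use of `Promise.allSettled`, the 8-second timeout, and the $20–$300 range come from the text.

```typescript
// Hypothetical fetcher: resolves with the latest price for one ticker.
type QuoteFetcher = (ticker: string) => Promise<number>;

// Rejects if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Fetches all tickers in parallel; failed or slow tickers are simply left out of the result.
async function fetchQuotes(
  tickers: string[],
  fetchQuote: QuoteFetcher,
  timeoutMs = 8000,
): Promise<Map<string, number>> {
  const settled = await Promise.allSettled(
    tickers.map((t) => withTimeout(fetchQuote(t), timeoutMs)),
  );
  const quotes = new Map<string, number>();
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") quotes.set(tickers[i], result.value);
  });
  return quotes;
}

// Sanity check from the text: reject oil prices outside the $20 to $300 range.
function isPlausibleOilPrice(usd: number): boolean {
  return Number.isFinite(usd) && usd >= 20 && usd <= 300;
}
```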
Independent web search index (not derived from Google or Bing) used in Pipeline step 05. A targeted query is fired for each article — title + 'Iran Israel US military 2026' — with freshness:pw (past week). Returns up to 8 corroborating results with title, URL, and snippet. Disputed articles are silently dropped.
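The query described above could be built as shown below. The endpoint and the `q`, `freshness`, and `count` parameters follow Brave's public Web Search API; the pipeline's actual request code is not shown in the source, so the function name and structure are assumptions.

```typescript
// Builds the corroboration query URL for one article title.
// The request itself would also need an X-Subscription-Token header carrying the API key.
function buildBraveSearchUrl(title: string): string {
  const params = new URLSearchParams({
    q: `${title} Iran Israel US military 2026`,
    freshness: "pw", // past week
    count: "8", // up to 8 results
  });
  return `https://api.search.brave.com/res/v1/web/search?${params.toString()}`;
}
```

`URLSearchParams` handles the percent-encoding, so titles containing quotes, ampersands, or non-ASCII characters do not break the query string.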
Real-time AIS (Automatic Identification System) vessel data via WebSocket. Subscribes to the Strait of Hormuz bounding box (25–27.5°N, 55–57.5°E), collects for 15 seconds, then returns unique vessels by MMSI. Browser CORS blocks direct access — data is fetched by the Cloudflare Worker backend. Frontend falls back to a deterministic simulation engine when live data is unavailable.
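The collect-then-deduplicate step can be illustrated with a small function that keeps one position per MMSI inside the Strait of Hormuz bounding box. The `VesselPosition` shape is hypothetical (real AIS messages carry many more fields), and keeping the most recent message per vessel is an assumption.

```typescript
// Hypothetical, simplified AIS position report.
interface VesselPosition {
  mmsi: number;
  lat: number;
  lon: number;
}

// Bounding box from the text: 25–27.5°N, 55–57.5°E.
const HORMUZ_BOX = { minLat: 25, maxLat: 27.5, minLon: 55, maxLon: 57.5 };

// Returns one entry per vessel (keyed by MMSI), using the last message seen in the box.
function uniqueVesselsInBox(messages: VesselPosition[], box = HORMUZ_BOX): VesselPosition[] {
  const latest = new Map<number, VesselPosition>();
  for (const m of messages) {
    const inside =
      m.lat >= box.minLat && m.lat <= box.maxLat &&
      m.lon >= box.minLon && m.lon <= box.maxLon;
    if (inside) latest.set(m.mmsi, m); // later messages overwrite earlier ones
  }
  return Array.from(latest.values());
}
```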
Google's Gemini 3.1 Pro with real-time Google Search grounding is the primary news collection engine (Pipeline step 01). Thinking mode (high) enables multi-step reasoning for better article synthesis. A 3-model fallback chain protects against outages: Gemini 3.1 Pro → 2.5 Flash (thinkingBudget 4096) → 2.0 Flash. Article classification and bilingual summary generation (step 04) run through Workers AI Llama 3.1-8B.
REST API ENDPOINTS
Base URL: https://iran-war-intel.672rmwysbs.workers.dev
/api/intel
/api/insights/oil
/api/insights/market
/api/live-events
/api/cron/trigger (AUTH)
/api/analytics/dashboard