UAP / Declassified Release Analyzer

> OpenClaw port note: this is an OpenClaw-compatible skill adapted from Chase Wang's MIT-licensed ckpxgfnksd-max/uap-release-analyzer. Use the bundled scripts from the installed skill directory, e.g. python3 ~/.openclaw/workspace/skills/uap-release-analyzer/scripts/run_all.py . If your OpenClaw skills directory is customized, locate the installed uap-release-analyzer directory and run the same scripts/run_all.py entrypoint there.

This skill turns a folder of declassified UAP/UFO documents into a structured analytic report. It was built from a real workflow against the May 2026 war.gov/UFO/ "PURSUE" tranche (162 files, 4,000+ pages, mixed FBI/DOW/NASA/DOS/NARA sources), so it's tuned to the quirks of that universe — but it generalizes to any tranche of FOIA-released government PDFs.

When to use

Trigger on prompts like "analyze the UFO files I just downloaded", "build me a report on this UAP release", "what's in ~/Downloads/release_01/?", "compare release 1 and release 2", "find redaction patterns in these FBI files", "summarize this AARO PDF", or whenever the user references a directory of declassified documents and wants any kind of summary, inventory, or pattern surfacing. Also trigger if the user just dumps a path and asks "what's interesting in here?" — this skill is the right tool.

Why a skill

The work has a fixed shape that repeats across every new tranche:

Inventory — what files came down, sizes, page counts, which agency.
Text extraction — pull text where there is a text layer; flag the files (often the majority) that are scanned and need OCR.
Entity surfacing — locations, agencies, phenomena vocabulary, named people.
Redaction pattern analysis — which FOIA exemptions show up where, which files are most redacted.
Cross-document patterns — year clusters, agency × location heatmap, names that appear in 5+ files.
A standardized report the user can read in ten minutes.

Redoing this from scratch for every tranche wastes effort and produces inconsistent outputs. The bundled scripts make every tranche analyzable the same way.

The standard workflow

Run scripts in this order. Each writes intermediate artifacts that the next step consumes. They are idempotent and incremental — re-running on the same folder skips work that's already done.

release_root/
  release_NN/                 # the actual PDFs/PNGs/JPGs (input)
  text/                       # extracted text per PDF (created)
  inventory.csv               # one row per file (created)
  analytics/                  # aggregated outputs (created)
    top_terms.csv
    terms_by_agency.csv
    entities.json
    per_file_digest.csv
    cross_doc.json
  REPORT.md                   # human-readable analytic writeup (created)
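
Because every step writes a known artifact, a quick existence check shows how far a previous run got. The following is a minimal sketch, not part of the bundled scripts: the helper name missing_artifacts is hypothetical, and the paths are copied from the layout above.

```python
from pathlib import Path

# Artifact paths copied from the layout above, relative to release_root.
EXPECTED = [
    "inventory.csv",
    "analytics/top_terms.csv",
    "analytics/terms_by_agency.csv",
    "analytics/entities.json",
    "analytics/per_file_digest.csv",
    "analytics/cross_doc.json",
    "REPORT.md",
]


def missing_artifacts(release_root):
    """Return the expected artifacts that do not exist yet under release_root."""
    root = Path(release_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

An empty list suggests all four steps have already run; a list that starts with inventory.csv suggests nothing has run yet.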

Step 1 — Inventory. Run scripts/inventory.py . This walks the release directory, classifies each file by filename prefix (see references/agency_vocab.md), reads PDF page counts, and writes inventory.csv. Don't write inventory by hand — the script handles encrypted PDFs, weird filenames with spaces or em-dashes, and files that pypdf can't open.
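
To make the failure handling concrete, here is a rough sketch of what an inventory pass can look like. It is not the bundled scripts/inventory.py: the function names and the CSV columns (file, bytes, pages) are assumptions for illustration, and it only shows the idea of recording an empty page count instead of crashing when pypdf cannot open a file.

```python
import csv
from pathlib import Path


def page_count(path):
    """Return the PDF page count, or None when the file cannot be read."""
    try:
        from pypdf import PdfReader  # third-party; assumed to be installed
        return len(PdfReader(str(path)).pages)
    except Exception:
        # Encrypted, corrupt, or pypdf unavailable: record "unknown", keep going.
        return None


def write_inventory(release_dir, out_csv):
    """Write one row per file under release_dir; return the number of rows."""
    rows = []
    for path in sorted(Path(release_dir).rglob("*")):
        if not path.is_file():
            continue
        pages = page_count(path) if path.suffix.lower() == ".pdf" else None
        rows.append({
            "file": path.name,  # spaces and unusual characters are kept as-is
            "bytes": path.stat().st_size,
            "pages": "" if pages is None else pages,
        })
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "bytes", "pages"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```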

Step 2 — Text extraction. Run scripts/extract_text.py [start] [end]. Extracts text via pdfplumber, writing one .txt per PDF into text/. Skips files that already have a non-empty .txt. Many FBI / NARA / older photo-PDFs have no text layer and will produce 0-char files. That is expected: the analytics treat them as "scanned, OCR needed". The optional [start] [end] slice arguments let you process in chunks if your sandbox has a per-call timeout. The war.gov FBI sections are 200+ pages each, so extract them in batches of ~25 when running in a 45-second-call environment.

Run scripts in the foreground of your turn, not via background-and-end-turn patterns. The pipeline is fast enough (a few minutes from cold) that you can stay in-turn. If a single extract_text.py call would actually time out, prefer the [start] [end] chunking pattern over backgrounding — chunked calls each finish quickly, the script is idempotent, and progress is visible.
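
The chunking pattern can be sketched as follows. This is an illustration under stated assumptions, not the bundled script: the helper names are hypothetical, and treating [end] as exclusive (a half-open slice) is an assumption about how extract_text.py interprets its arguments.

```python
from pathlib import Path


def needs_extraction(txt_path):
    """True when no non-empty .txt exists yet for a PDF (the skip rule)."""
    p = Path(txt_path)
    return not p.is_file() or p.stat().st_size == 0


def batches(n_files, batch_size=25):
    """Yield (start, end) pairs covering n_files; end is assumed exclusive."""
    for start in range(0, n_files, batch_size):
        yield start, min(start + batch_size, n_files)
```

For a 60-file tranche, list(batches(60)) gives [(0, 25), (25, 50), (50, 60)], so three short foreground calls replace one long one. Because the skip rule only checks for a non-empty .txt, re-running a batch that was interrupted is safe.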

Step 3 — Analytics. Run scripts/analyze.py . Reads the extracted text + inventory, then writes the contents of analytics/. This is fast even on 800K+ characters of text.
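
Two of the cross-document patterns named earlier, year clusters and names that appear in 5+ files, reduce to simple counting. The sketch below shows one possible approach and is not taken from scripts/analyze.py: the function names are hypothetical, and the 1940-2029 year range is an assumption.

```python
import re
from collections import Counter

# Assumed range of plausible years; adjust for the tranche at hand.
YEAR_RE = re.compile(r"\b(19[4-9]\d|20[0-2]\d)\b")


def year_counts(text):
    """Count standalone four-digit years in the assumed range."""
    return Counter(int(y) for y in YEAR_RE.findall(text))


def names_in_many_files(names_by_file, min_files=5):
    """Return names that occur in at least min_files distinct files."""
    doc_freq = Counter()
    for names in names_by_file.values():
        doc_freq.update(set(names))  # count each name once per file
    return {name: n for name, n in doc_freq.items() if n >= min_files}
```

Counting each name once per file keeps a single name-heavy document from looking like a cross-document pattern.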

Step 4 — Report. Run scripts/build_report.py . Reads inventory + analytics and writes a REPORT.md with the sections listed under "Report structure" below.

When the user just says "analyze the release at <path>", run all four in sequence with that path. When they ask a narrower question ("how many files?", "which file is most redacted?"), call only the relevant script or read the existing artifacts directly.
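
For the narrower questions, reading inventory.csv directly is often enough. A minimal sketch, assuming the file has a header row; the helper names are hypothetical, and any column name passed to count_by (for example an agency column) is an assumption to check against the actual header.

```python
import csv
from collections import Counter


def count_files(inventory_csv):
    """Answer "how many files?" by counting data rows."""
    with open(inventory_csv, newline="", encoding="utf-8") as fh:
        return sum(1 for _ in csv.DictReader(fh))


def count_by(inventory_csv, column):
    """Count rows per distinct value of one column."""
    with open(inventory_csv, newline="", encoding="utf-8") as fh:
        return Counter(row.get(column, "") for row in csv.DictReader(fh))
```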

Report structure

Always use this exact section order in REPORT.md so reports across tranches stay comparable. If a section has no data for this tranche, leave a one-line "no data" note — don't omit the heading.

# <Release name> — Raw Analytics
**Source:** ... · **Cleared for release:** ...
**Files in this analysis:** N of M (note any gaps)

## 1. Inventory                    — counts, total size, page counts, by agency
## 2. What's actually in the release  — narrative summary of the major buckets
## 3. Where the activity is concentrated  — top locations
## 4. Phenomena terminology         — UAP/craft/orb/disc/etc. with counts
## 5. Agency cross-references       — agencies named in text
## 6. Year clusters                 — when is this material from
## 7. Redactions                    — top markers + most-redacted files
## 8. Notable individual files
## 9. Cross-document patterns
## 10. What's missing / caveats     — OCR gaps, files we couldn't pull, etc.
## 11. Files in this analysis       — paths to inventory.csv / analytics/*

The "What's missing" section matters — it's what makes the report honest. Always call out files we couldn't OCR, files referenced on a source page but not downloaded, and heuristic limits of the entity extraction.
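
The fixed section order and the "no data" rule can be enforced mechanically. The sketch below is illustrative and not the bundled build_report.py: render_sections is a hypothetical helper, the section titles are copied from the template above, and the wording of the "no data" note is an assumption.

```python
# Section titles in the fixed order given by the report template.
SECTION_TITLES = [
    "Inventory",
    "What's actually in the release",
    "Where the activity is concentrated",
    "Phenomena terminology",
    "Agency cross-references",
    "Year clusters",
    "Redactions",
    "Notable individual files",
    "Cross-document patterns",
    "What's missing / caveats",
    "Files in this analysis",
]


def render_sections(bodies):
    """Render all eleven headings in order; empty sections get a one-line note."""
    out = []
    for number, title in enumerate(SECTION_TITLES, start=1):
        out.append(f"## {number}. {title}")
        text = bodies.get(title, "").strip()
        out.append(text if text else "_No data for this tranche._")
        out.append("")
    return "\n".join(out)
```

Driving the output from one ordered list means a section can be empty but never missing or out of place, which keeps reports comparable across tranches.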

Agency classification

Files are classified by filename prefix. The full vocabulary is in references/agency_vocab.md. The high-confidence prefixes from the war.gov universe:

65_hs1, fbi-photo-, usper-, serial, 2024-04-30-* → FBI
dow-uap, western_us_event → DOW (Department of War)
nasa-uap* → NASA
dos-uap, 059uap → DOS (State)
18_, 38_, 59_, 255, 331_, 341_, 342_* → NARA (record-group prefixes)
otherwise → OTHER (flag for the user; might be a new bucket worth adding to the vocab)

If you encounter a tranche with prefixes not in the vocab, add them to the table in references/agency_vocab.md and to PREFIX_RULES in scripts/inventory.py and scripts/analyze.py rather than scattering inline filename checks across scripts. A useful threshold: if OTHER exceeds ~3% of files in any tranche, treat that as a signal that the vocab needs extending, not that the data is weird.
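
A single rule table plus a threshold check might look like the sketch below. The prefixes come from the list above, but the structure of this PREFIX_RULES, the case-insensitive matching, and the helper names are assumptions; the real PREFIX_RULES in the bundled scripts may be shaped differently.

```python
# (prefixes, agency) pairs; prefixes copied from the list above.
PREFIX_RULES = [
    (("65_hs1", "fbi-photo-", "usper-", "serial", "2024-04-30-"), "FBI"),
    (("dow-uap", "western_us_event"), "DOW"),
    (("nasa-uap",), "NASA"),
    (("dos-uap", "059uap"), "DOS"),
    (("18_", "38_", "59_", "255", "331_", "341_", "342_"), "NARA"),
]


def classify(filename):
    """Map a filename to an agency by prefix; unknown prefixes become OTHER."""
    name = filename.lower()  # case-insensitive matching is an assumption
    for prefixes, agency in PREFIX_RULES:
        if name.startswith(prefixes):
            return agency
    return "OTHER"


def other_share(filenames):
    """Fraction of files that fall into the OTHER bucket."""
    names = list(filenames)
    if not names:
        return 0.0
    return sum(classify(n) == "OTHER" for n in names) / len(names)


def vocab_needs_extending(filenames, threshold=0.03):
    """Apply the ~3% rule of thumb to a list of filenames."""
    return other_share(filenames) > threshold
```

Running vocab_needs_extending over the filenames right after the inventory step surfaces an oversized OTHER bucket before the report is built.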

When bootstrapping a brand-new tranche (e.g., the user has just downloaded release_02/ and asks "what's the fastest way to a written report?"), surface this vocab-extension workflow in your reply alongside run_all.py. Otherwise the user will discover the OTHER bucket only after the fact.

FOIA / classification markers

references/foia_codes.md lists the FOIA exemptions and classification stamps to look for. Most of the meaningful redaction signal in modern tranches comes from (b)(1) (national security), (b)(3) (statutory), (b)(6) (personal privacy), and the classification banners SECRET//NOFORN, REL TO USA, CUI, FOUO. The analyzer counts these by file so the report can name the most-redacted documents.
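
Counting these markers per file is a small regex exercise. The following is a sketch only, not the analyzer's actual implementation: the patterns cover just the markers named above, tolerate optional whitespace inside exemption codes such as "(b) (6)", and the function names are hypothetical.

```python
import re
from collections import Counter

# One pattern per marker named above; extend from references/foia_codes.md.
MARKER_PATTERNS = {
    "(b)(1)": re.compile(r"\(b\)\s*\(1\)", re.IGNORECASE),
    "(b)(3)": re.compile(r"\(b\)\s*\(3\)", re.IGNORECASE),
    "(b)(6)": re.compile(r"\(b\)\s*\(6\)", re.IGNORECASE),
    "SECRET//NOFORN": re.compile(r"SECRET//NOFORN"),
    "REL TO USA": re.compile(r"REL TO USA"),
    "CUI": re.compile(r"\bCUI\b"),
    "FOUO": re.compile(r"\bFOUO\b"),
}


def count_markers(text):
    """Count occurrences of each marker in one file's extracted text."""
    return Counter({label: len(pattern.findall(text))
                    for label, pattern in MARKER_PATTERNS.items()})


def most_redacted(texts_by_file, top_n=5):
    """Rank files by total marker count, highest first."""
    totals = {name: sum(count_markers(text).values())
              for name, text in texts_by_file.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Note that scanned files with 0-char text will always score zero here, so a low count can mean "not yet OCR'd" and not "lightly redacted".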