Methodology

How 3,352 recordings were recovered, processed, and published.

Overview

The archive was built with a fully automated Python pipeline. No recording was manually transcribed or altered: the original audio files are preserved exactly as captured by the Wayback Machine, along with their original metadata where available.

The pipeline

  1. Discovery

    The Internet Archive's Wayback Machine CDX API was queried for captures of orgasmsoundlibrary.com audio files. Over 3,000 unique audio URLs were identified.

  2. Harvest

    Each audio file was downloaded from its Wayback Machine capture. A SHA-256 hash was computed for every file for integrity verification, and download state was tracked in a SQLite database.

  3. Processing

    Metadata was extracted from CSV manifests and merged with the download state. Duration, format, tags, and provenance data were normalized into a unified catalog.

  4. Publishing

    Audio files are served from Cloudflare R2. The catalog site is a static Astro build with 3,352 individual recording pages, deployed to Cloudflare Pages.
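The Discovery and Harvest steps above can be sketched roughly as follows. This is a minimal illustration, not the archive's actual code: the function names, the `downloads` table layout, the MIME-type filter, and the use of the `id_` raw-capture URL form are all assumptions. Only the CDX endpoint, SHA-256 hashing, and SQLite state tracking come from the description above. No network request is made here; the sketch only builds the query URL and records state for bytes that have already been fetched.

```python
from __future__ import annotations

import hashlib
import sqlite3
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain: str) -> str:
    """Build a CDX query URL listing audio captures under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",          # include subdomains
        "output": "json",               # JSON arrays, header row first
        "filter": "mimetype:audio/.*",  # assumed filter: audio MIME types only
        "collapse": "urlkey",           # one row per unique URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows: list[list[str]]) -> list[dict[str, str]]:
    """Turn CDX JSON output (header row plus data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def raw_capture_url(timestamp: str, original: str) -> str:
    """URL of a capture; the id_ suffix requests the unmodified bytes."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def record_download(conn: sqlite3.Connection, url: str, data: bytes) -> str:
    """Hash downloaded bytes and store the result as download state."""
    digest = hashlib.sha256(data).hexdigest()
    # Hypothetical state table: one row per source URL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        "url TEXT PRIMARY KEY, sha256 TEXT NOT NULL, size INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO downloads (url, sha256, size) VALUES (?, ?, ?)",
        (url, digest, len(data)),
    )
    conn.commit()
    return digest
```

In a real run, the JSON returned by the query URL would be passed to `parse_cdx_rows`, each row's `timestamp` and `original` fields would feed `raw_capture_url`, and the downloaded bytes would go to `record_download`.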

Technical stack

  • Data pipeline: Python 3, SQLite (state tracking), CSV manifests
  • Audio storage: Cloudflare R2 object storage
  • Site: Astro 5.5, static output, TypeScript
  • Hosting: Cloudflare Pages (CDN-served static files)
  • Audio source: Wayback Machine CDX API + direct file download
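As a rough illustration of how the CSV manifests and the SQLite state listed above might come together in the Processing step, the sketch below joins manifest rows to download state by URL. The column names (`url`, `title`, `tags`, `duration`), the semicolon tag separator, the `downloads` table, and the output field names are hypothetical; the real manifests and schema may differ.

```python
import csv
import io
import sqlite3


def merge_manifest_with_state(manifest_csv: str, conn: sqlite3.Connection) -> list:
    """Join manifest rows to download state by URL.

    Manifest entries without a recorded download are skipped.
    """
    # Map each downloaded URL to its stored SHA-256 hash.
    state = {
        url: sha256
        for url, sha256 in conn.execute("SELECT url, sha256 FROM downloads")
    }
    catalog = []
    for row in csv.DictReader(io.StringIO(manifest_csv)):
        sha256 = state.get(row["url"])
        if sha256 is None:
            continue  # no download recorded for this manifest entry
        catalog.append({
            "title": row["title"].strip(),
            # Assumed convention: tags separated by ";" in the manifest.
            "tags": [t.strip().lower() for t in row["tags"].split(";") if t.strip()],
            "duration_seconds": float(row["duration"]),
            "source_url": row["url"],
            "sha256": sha256,
        })
    return catalog
```

Keying the join on the source URL keeps the manifest and the state table independent, so either can be regenerated without touching the other.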

Integrity

Every audio file was checksummed with SHA-256 immediately after download. The hash is stored in the catalog and displayed on each recording's detail page. To verify that a file you downloaded matches the archived copy, compute its SHA-256 hash and compare it with the hash shown on the page.
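One way to run that check locally is sketched below. The function names are illustrative and not part of the archive's tooling; the only assumption about the site is that the displayed hash is a hexadecimal SHA-256 digest.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large audio files are never fully loaded in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_displayed_hash(path, displayed_hash: str) -> bool:
    """Compare a file's digest with a hash copied from a detail page."""
    # Normalize whitespace and letter case picked up when copying the hash.
    return sha256_of_file(path) == displayed_hash.strip().lower()
```

For example, `matches_displayed_hash("recording.mp3", "<hash from the page>")` returns `True` only when the local file is byte-for-byte identical to the file that was hashed.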

Why static?

A static site means no server-side code, no database, and no runtime dependencies. The entire site is a folder of HTML, CSS, JavaScript, and a single JSON file. It can be hosted anywhere, mirrored trivially, and should keep working for decades. This is the right architecture for a permanent public archive.

Catalog format

The catalog is a single JSON file (catalog.json) containing an array of 3,352 recording objects. Each object holds all metadata about one recording, including title, tags, format, duration, SHA-256 hash, provenance URLs, and a playable audio URL. The schema is documented in src/types.ts.
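Because the catalog is plain JSON, it can be read with any language's standard library. The Python sketch below shows one way to load it and filter by tag. The field names in `Recording` are guesses based on the list above, not the actual schema; src/types.ts remains the authoritative definition, and the helper names are hypothetical.

```python
from __future__ import annotations

import json
from typing import TypedDict


class Recording(TypedDict):
    # Illustrative field names only; see src/types.ts for the real schema.
    title: str
    tags: list[str]
    format: str
    duration: float
    sha256: str
    provenance_urls: list[str]
    audio_url: str


def load_catalog(text: str) -> list[Recording]:
    """Parse the contents of catalog.json into a list of recordings."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("catalog.json should contain a top-level array")
    return data


def recordings_with_tag(catalog: list[Recording], tag: str) -> list[Recording]:
    """Return the recordings whose tag list contains the given tag."""
    return [rec for rec in catalog if tag in rec["tags"]]
```

A mirror or research script could read the file with `open("catalog.json", encoding="utf-8").read()`, pass the text to `load_catalog`, and work entirely offline from there.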