384,000 trains, one SQLite

May 2026 · RailCast pipeline, phase 1+2

In early March I had a simple question: which of the eleven train stations I regularly use is actually the most unreliable? A quick Google search yields "it depends," and there is no more honest answer: Deutsche Bahn aggregates its own delay statistics at a monthly level and publishes only a global number. Per-station data doesn't exist publicly.

Eight weeks later, I have it myself. RailCast is a Python daemon that queries Deutsche Bahn's IRIS-TTS interface every minute, attaches Open-Meteo weather data every 10 minutes, and writes everything into a single SQLite file. As of today: 384,059 stops, 46,298 weather observations, roughly 425 MB on disk. Growing by about 8,000 stops per day.

That's a dataset I can work with — and I am. This text describes what the pipeline looks like, why it's intentionally small, and what I've already learned without training any model.

What gets collected

Deutsche Bahn runs an XML interface at iris.noncd.db.de/iris-tts/timetable that has been reachable for years without an auth layer and without rate limits. Two endpoints matter:

plan/{eva}/{date}/{hour}   the planned timetable for one station (EVA number) and one hour slice
fchg/{eva}                 all currently known changes for that station: delays, platform changes, cancellations

My collector pulls plan every 5 minutes for the current and next hour, fchg every 60 seconds. From the XML I extract per stop: train number, line, route start/end, planned and actual arrival/departure times, platform, cancellation flags, disruption messages.
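
For a feel of the parsing, here's a minimal sketch of one plan fetch. The helper name and field subset are mine, and the real collector writes straight into SQLite instead of returning dicts; the attribute names (pt for planned time, pp for planned platform) are how I read the IRIS XML:

import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://iris.noncd.db.de/iris-tts/timetable"

def fetch_plan(eva, when):
    # IRIS slices plans by station, date (YYMMDD) and hour; one <s> element per stop.
    url = f"{BASE}/plan/{eva}/{when:%y%m%d}/{when:%H}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    stops = []
    for s in root.iter("s"):
        tl = s.find("tl")                     # train label: category "c", number "n"
        ar, dp = s.find("ar"), s.find("dp")   # either may be missing at terminus/origin stops
        stops.append({
            "stop_id":  s.get("id"),          # unique per train and station
            "train":    f"{tl.get('c', '')} {tl.get('n', '')}".strip() if tl is not None else None,
            "plan_arr": ar.get("pt") if ar is not None else None,   # YYMMDDHHMM
            "plan_dep": dp.get("pt") if dp is not None else None,
            "platform": dp.get("pp") if dp is not None else (ar.get("pp") if ar is not None else None),
        })
    return stops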

Stations currently:

NRW:        Köln Hbf, Köln Messe/Deutz, Düsseldorf Hbf, Bonn Hbf, Wuppertal Hbf
Ruhrgebiet: Essen Hbf, Dortmund Hbf
Major hubs: Frankfurt(Main)Hbf, Berlin Hbf, Hamburg Hbf, München Hbf

That's a mix of personal interest (Wuppertal — where I live) and systematic spread, so I have commuter routes (NRW), node stations (Berlin, Frankfurt), and well-served main stations (München, Hamburg). The 11 stations net me about 8,000 stops per day — enough for statistically meaningful aggregations, small enough to coexist on a t3.small VPS with 2 GB of RAM without bothering other services.

Why SQLite

First question I get when I show the setup: Why not Postgres? Or TimescaleDB? Or Parquet on S3?

Answer: because it isn't necessary. SQLite with write-ahead logging writes a few hundred rows per minute without breaking a sweat. Concurrent reads from the EDA notebook while the collector runs work transparently. The file is 425 MB, the entire backup model is cp railcast.db railcast-backup.db. If I want to migrate the data somewhere else, that's a VACUUM INTO and an aws s3 cp away.
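
The entire database setup is a handful of lines. A sketch with an illustrative schema; the real column set follows the fields listed above:

import sqlite3

conn = sqlite3.connect("railcast.db")
conn.execute("PRAGMA journal_mode=WAL")    # readers (the EDA notebook) don't block the single writer
conn.execute("PRAGMA synchronous=NORMAL")  # safe in combination with WAL, far fewer fsyncs
conn.execute("""
    CREATE TABLE IF NOT EXISTS stops (
        train_id   TEXT NOT NULL,      -- IRIS stop ID
        fchg_seq   INTEGER NOT NULL,   -- running number of the fchg update that produced this row
        station    TEXT NOT NULL,
        plan_arr   TEXT, actual_arr TEXT,
        plan_dep   TEXT, actual_dep TEXT,
        platform   TEXT,
        cancelled  INTEGER DEFAULT 0
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS stop_hist ON stops (train_id, fchg_seq)")
# For a consistent snapshot while the collector is writing, VACUUM INTO
# beats a plain cp (which can miss the -wal file):
# conn.execute("VACUUM INTO 'railcast-backup.db'")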

Postgres would have cost me two configuration files, another systemd service, a backup cron, and a connection-pool conversation — for a single-writer workload that comfortably stays under 100 inserts per minute. That's the wrong complexity.

The rule of thumb I use for projects like this: SQLite, until you can prove SQLite is hurting you. So far it isn't. If I get to phase-3 ML and run parallel training jobs with large I/O spikes, I'll reconsider. Not today.

The collector

A single Python file, around 600 lines, runs as a systemd service on the zoopa VPS. No external dependencies beyond urllib, xml.etree, sqlite3 — all stdlib. That was deliberate: I didn't want to debate a pip dependency on the first crash.

The control flow is trivial:

import signal
import time
from datetime import datetime, timedelta, timezone

PLAN_INTERVAL    = timedelta(minutes=5)
CHANGE_INTERVAL  = timedelta(seconds=60)
WEATHER_INTERVAL = timedelta(minutes=10)

stop_signal = False                # flipped by SIGTERM so systemd can stop us cleanly

def on_sigterm(signum, frame):
    global stop_signal
    stop_signal = True

signal.signal(signal.SIGTERM, on_sigterm)

last_plan_fetch = last_fchg_fetch = last_weather_fetch = datetime.min.replace(tzinfo=timezone.utc)

while not stop_signal:
    now = datetime.now(timezone.utc)

    if now - last_plan_fetch > PLAN_INTERVAL:
        for eva in STATIONS:
            fetch_plan(eva, now)                       # current hour
            fetch_plan(eva, now + timedelta(hours=1))  # next hour, lookahead
        last_plan_fetch = now

    if now - last_fchg_fetch > CHANGE_INTERVAL:
        for eva in STATIONS:
            fetch_fchg(eva)                            # all changes for the station
        last_fchg_fetch = now

    if now - last_weather_fetch > WEATHER_INTERVAL:
        fetch_weather_for_all_stations()
        last_weather_fetch = now

    time.sleep(15)
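
The weather leg of the loop boils down to one keyless HTTP call per station against Open-Meteo's free forecast endpoint. A sketch of the per-station helper, assuming station coordinates are stored next to the EVA numbers; the parameter choice is illustrative:

import json
import urllib.request

def fetch_weather(lat, lon):
    # Hourly temperature and precipitation are enough to join
    # against the stop data later.
    url = ("https://api.open-meteo.com/v1/forecast"
           f"?latitude={lat}&longitude={lon}"
           "&hourly=temperature_2m,precipitation&forecast_days=1")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["hourly"]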

The dataset grows with each fchg update: whenever a train gets a new actual time relative to the plan, I write a new row. train_id is the IRIS stop ID, unique per train and station; with a (train_id, fchg_seq) index I can reconstruct the delay history of an individual stop across all fchg updates.
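
Reconstructing that history is then one ordered scan. A sketch, reusing conn and the illustrative stops table from the SQLite snippet above (stop_id is a placeholder variable):

# How did this one stop's times evolve across fchg updates?
history = conn.execute(
    """
    SELECT fchg_seq, plan_dep, actual_dep
    FROM stops
    WHERE train_id = ?
    ORDER BY fchg_seq
    """,
    (stop_id,),
).fetchall()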

Health metrics (fetch latency, HTTP errors per 5 minutes, rows per minute) go into a second table, collector_log. If the daemon ever died I'd see it, because nothing would be coming in. In the last eight weeks that has happened exactly once (a VPS reboot); systemd brought it back up before I noticed.

What I already know without training a model

Phase 2 (EDA in a Jupyter notebook) surfaced a few things I didn't know or only suspected:

Wuppertal Hbf has roughly a 78% on-time rate, defined as less than 6 minutes of departure delay, across all train types. That's noticeably worse than DB's group-wide reported average (~92%) and noticeably better than my gut feeling of 60%, which I'd voiced repeatedly.

Delays cascade strongly along lines. If the RE1 in Köln Hbf is 12 minutes late, the probability that the same RE1 doesn't recover two stations later in Wuppertal is around 80%. That's intuitive, but I now have it as a correlation plot.

Weather correlations are weaker than expected. At precipitation >5mm/h I see a delay increase of ~3% relative to dry conditions. There isn't enough snow in my dataset yet to be statistically meaningful. Main delay drivers are line-internal effects, not weather events.

Thursday and Friday evenings are the worst on-time slots. Rush hour plus weekend travel. Sunday evening is surprisingly punctual.
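
For the curious: the first of those numbers is a short aggregation in the phase-2 notebook. A pandas sketch against the illustrative schema from earlier; for brevity it skips deduplicating to the latest fchg_seq per stop:

import sqlite3

import pandas as pd

conn = sqlite3.connect("railcast.db")
df = pd.read_sql_query(
    "SELECT station, plan_dep, actual_dep FROM stops "
    "WHERE plan_dep IS NOT NULL AND actual_dep IS NOT NULL",
    conn,
)
fmt = "%y%m%d%H%M"  # IRIS time encoding
delay_min = (
    pd.to_datetime(df["actual_dep"], format=fmt)
    - pd.to_datetime(df["plan_dep"], format=fmt)
).dt.total_seconds() / 60
df["on_time"] = delay_min < 6   # same definition as above: <6 minutes departure delay
print(df.groupby("station")["on_time"].mean().sort_values())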

These are correlations, not models. Phase 3 (a stop-level delay model with line, station, day-of-week, and weather features) is next month's task.

What I learned

Collecting your own data changes your relationship to the dataset. When I load a Kaggle dataset I don't ask where the edge cases come from. When I parse the source data, I know the edge cases — trains with plan_arr but no plan_dep (terminus stops), trains with actual_dep but no actual_arr (someone let the train leave late without recording the arrival), trains with no route information because IRIS-TTS doesn't expose it for certain operator contracts.

I don't have to anticipate those quirks in model code; they're already modeled in the schema, because I saw them in the XML.

A single Python file is a valid architecture. I used to think "real pipelines" had to be orchestrated with Airflow or Prefect. That's true — when you have parallel workloads with dependencies and retry strategies. For a single-writer daemon polling one interface, while not stop_signal: … is the right answer. Add systemd. Done.
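
The "add systemd" part is a unit file of about ten lines. A sketch; paths and names here are illustrative, not the real deployment:

[Unit]
Description=RailCast collector
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/railcast/collector.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Restart=always is what turned the one VPS reboot into a non-event.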

One million stops is closer than you think. At 8,000 per day, the remaining ~616,000 stops take about 77 days; I'll cross 1 million around mid-July. By then phase 3 should be done. Datasets grow faster than models mature; that'll be a recurring pattern, I suspect.


The repo is private and the dataset isn't publicly available yet. When phase 3 lands I'll likely publish both, the code and an anonymized sample export of the data: the raw IRIS data occasionally contains dispatcher notes I'd rather not publish unfiltered.