# Reproducing Allesupers research editions

<!-- PUBLIC MIRROR: this file is published verbatim at
     https://allesupers.nl/onderzoek/reproducibility.md via
     frontend/public/onderzoek/reproducibility.md (CHANGE-387 D-C-2/3).
     Edit THIS file, then re-copy to the frontend mirror in the same
     commit — the two must stay byte-identical below this comment block. -->

Audience: journalists, statisticians, peer reviewers.

Every monthly **Allesupers Prijsmonitor** edition (`/onderzoek/prijsmonitor/<edition-slug>`) is computed from a frozen public dataset using a versioned SQL filter that lives in this repository. This page tells you how to reproduce the headline numbers from scratch.

Scope of this page (#01 release): **Set A — intersection-basket of A-brands stocked by all in-scope retailers.** Set B (huismerk-paren) will be added in a follow-up section once that publication ships.

---

## 1. What is published

Each edition has three artefacts:

| Artefact | Where | What it contains |
|---|---|---|
| **Public dataset CSV (Set A)** | `https://allesupers.nl/onderzoek/dataset/<YYYY-MM-DD>-sample-NL.csv` | One row per (canonical product × retailer × date) drawn from the A-brand editorial sample, with brand, GTIN, quantity, unit, and observed retail price. Header rows carry `snapshot_date`, `region`, `methodology_version`, `n_rows`, `csv_sha256`. |
| **Set B sibling CSVs** (edition #02+) | `<YYYY-MM-DD>-setB-anchors-NL.csv` + `<YYYY-MM-DD>-setB-peers-NL.csv` in the same directory | The house-brand basket. The anchors file lists the 30–80 HB canonical products picked per CBS row; the peers file lists every candidate peer at every retailer that the cheapest-cell rule chooses from. See §6 for the rule. |
| **Snapshot SQL filters** | `docs/research/snapshots/<edition-slug>.sql` + `<edition-slug>-setB.sql` (if Set B is published) in this repo | Frozen SQL that selects rows from the dataset, applies the editorial scope (e.g. retailers in scope, intersection rule), and emits the `headline_numbers` JSON object that the page renders. One file per edition (Set A); one sibling file per edition that publishes Set B. Never edited after publish. |
| **Edition record** | `edition_snapshots` table in the production database | The exact JSON payload (`headline_numbers`) the editor froze at publish time. Read via `GET /api/v1/research/edition/<slug>` or directly in psql. From edition #02 onward `headline_numbers.set_b` carries the house-brand basket alongside Set A's fields. |

The CSV files (Set A + Set B sibling) are the **citable primary source**. The SQL files are the **citable transformations**. The edition record is the **citable result**.

---

## 2. How to find the dataset version for an edition

Every edition is pinned to exactly one `research_dataset_snapshots` row. There are two equivalent ways to read the pin, depending on when the edition was published.

### 2.1 Editions published from #02 onward (FK pinning)

The publisher writes the dataset snapshot ID into the edition row at publish time. Read it directly:

```sql
SELECT edition_slug,
       methodology_version,
       research_dataset_snapshot_id,
       published_at
FROM edition_snapshots
WHERE edition_slug = '<slug>';
```

Then resolve the snapshot:

```sql
SELECT id, snapshot_date, region, methodology_version, n_rows, csv_sha256, filename
FROM research_dataset_snapshots
WHERE id = <research_dataset_snapshot_id>;
```

`snapshot_date` is the calendar date the dataset was frozen. `csv_sha256` lets you verify the published CSV byte-for-byte.

### 2.2 Edition #01 (pre-FK pinning)

Edition #01 was published before the FK column existed. The pin is encoded in the header comment of the snapshot SQL file:

```
-- Edition: 2026-05
-- Snapshot: research_dataset_snapshots.snapshot_date = 2026-05-13
-- Region: NL
-- Methodology: v1 - intersection-basket (CPs stocked by all retailers in scope)
```

The triple `(region, snapshot_date, methodology_version)` is unique in `research_dataset_snapshots` (enforced by the `uq_snapshot` constraint), so it resolves to a single `id`.

---

## 3. Set A SQL filter (copy-paste)

The frozen SQL for any edition is at `docs/research/snapshots/<slug>.sql` in this repository. The full text of `2026-05.sql` is reproduced below for convenience and reference; the canonical, audit-trailed copy is the file in the repo.

The query must return **exactly one row with one column named `headline_numbers` of type JSONB**. That is the contract enforced by `publish_edition.py` and validated against the `EditionSnapshotHeadlineNumbers` Pydantic schema.

```sql
-- Edition: 2026-05
-- Snapshot: research_dataset_snapshots.snapshot_date = 2026-05-13
-- Region: NL
-- Methodology: v1 - intersection-basket (CPs stocked by all retailers in scope)

WITH sample AS (
  SELECT r.*
  FROM research_dataset_rows r
  JOIN research_dataset_snapshots s ON s.id = r.snapshot_id
  WHERE s.region = 'NL'
    AND s.snapshot_date = '2026-05-13'::date
    AND s.methodology_version = 'v1'
    AND r.price IS NOT NULL
    AND r.price > 0
    AND r.retailer_id <> 'aldi'   -- ISS-303 scope exclusion (see note 4.3)
),
retailers_in_scope AS (
  SELECT DISTINCT retailer_id FROM sample
),
cp_coverage AS (
  SELECT canonical_product_id, COUNT(DISTINCT retailer_id) AS n_retailers
  FROM sample
  GROUP BY canonical_product_id
),
intersection_cps AS (
  SELECT canonical_product_id
  FROM cp_coverage
  WHERE n_retailers = (SELECT COUNT(*) FROM retailers_in_scope)
),
intersection_sample AS (
  SELECT s.*
  FROM sample s
  JOIN intersection_cps ic USING (canonical_product_id)
),
retailer_basket AS (
  SELECT retailer_id,
         SUM(price) AS basket_eur,
         COUNT(*)   AS n_products
  FROM intersection_sample
  GROUP BY retailer_id
),
spread AS (
  SELECT MAX(basket_eur) AS max_b,
         MIN(basket_eur) AS min_b
  FROM retailer_basket
)
SELECT jsonb_build_object(
  'products_tracked',    (SELECT COUNT(*) FROM intersection_cps),
  'stores_measured',     (SELECT COUNT(*) FROM retailers_in_scope),
  'datapoints_total',    (SELECT COUNT(*) FROM intersection_sample),
  'hero_conclusion_pct', (SELECT ROUND((-100.0 * (max_b - min_b) / NULLIF(max_b, 0))::numeric, 1) FROM spread),
  'basket_savings_eur',  (SELECT ROUND((max_b - min_b)::numeric, 2) FROM spread),
  'ranking', (
    SELECT jsonb_agg(
             jsonb_build_object(
               'retailer',   rb.retailer_id,
               'pct',        ROUND(((rb.basket_eur - sp.min_b) / NULLIF(sp.min_b, 0) * 100)::numeric, 1),
               'basket_eur', ROUND(rb.basket_eur::numeric, 2)
             )
             ORDER BY rb.basket_eur ASC
           )
    FROM retailer_basket rb, spread sp
  )
) AS headline_numbers;
```

The full file in the repo also computes `meetperiode_start`/`meetperiode_end`, `findings`, and a per-CP `basket` block (CHANGE-386b). Those are out of scope for Set A reproducibility — see §5.

---

## 4. Worked re-derivation — edition `2026-05`

### 4.1 Prerequisites

You need a PostgreSQL database that contains the rows from the published CSV. Two paths:

1. **Load from the public CSV.** Create `research_dataset_snapshots` and `research_dataset_rows` tables matching the schema in `backend/alembic/versions/`, then `\copy` the rows from `2026-05-13-sample-NL.csv` after stripping its `#`-prefixed header rows.
2. **Sync a copy of the Allesupers research dataset.** If you are an Allesupers operator, a local sync of the production `research_dataset_*` tables is sufficient.

### 4.2 Re-derive `ranking`

The SQL above is the exact filter used at publish time. Run it against the loaded data:

```bash
psql -d <your-db> -f docs/research/snapshots/2026-05.sql
```

The `ranking` array in the resulting `headline_numbers` JSON object reproduces the published per-retailer basket totals, byte-for-byte, for edition `2026-05`.

To compare against the published value:

```bash
curl -s https://allesupers.nl/api/v1/research/edition/2026-05 \
  | jq '.headline_numbers.ranking'
```

The two arrays are identical when the data load is correct.

### 4.3 Scope notes that affect interpretation

These are editorial decisions baked into the SQL. They are not bugs, but you should know about them:

- **Aldi is excluded** for edition `2026-05` (ISS-303). Of 75 NL canonical products in the 2026-05-13 snapshot, only 12 are stocked by all 7 NL retailers — too few for a defensible basket. Aldi carries the smallest A-brand assortment of the seven, so any all-retailer intersection collapses to "products Aldi happens to stock". Dropping Aldi raises the intersection to 46 products over 6 retailers. The `WHERE r.retailer_id <> 'aldi'` clause is the implementation of that decision. Aldi re-enters the methodology via Set B (CHANGE-384 huismerk-paren basket) in a later edition.
- **`meetperiode_start = 2026-05-01`, `meetperiode_end = 2026-05-13`** are static literals in the SQL file. They describe the calendar window the dataset prices cover; they are not derived from the data. If you load a different snapshot date you must update those literals to match.

---

## 5. What this page does NOT reproduce

In line with the publish contract, this Set A SQL reproduces the **quantitative** parts of `headline_numbers`: `products_tracked`, `stores_measured`, `datapoints_total`, `hero_conclusion_pct`, `basket_savings_eur`, and the `ranking` array.

Three published fields are operator-curated, not data-derived, and therefore cannot be re-derived from the dataset alone:

- **`findings`** — five editorial claim sentences (the claim-tier framework, CHANGE-369). Operator-written, fact-checked against the data, but not produced by the SQL aggregation.
- **`basket`** (CHANGE-386b) — per-CP per-retailer prices serialised for the editorial deep-link block on the edition page. Re-derivable in principle from `intersection_sample`, but the published form is a presentation artefact, not a citation surface.
- **`meetperiode_start` / `meetperiode_end`** — calendar-window literals (see §4.3).

If you want to verify a `findings` claim, the source data for the underlying number is in the `headline_numbers` JSON or in the per-edition snapshot SQL. The text of the claim itself is reviewed at publish time and is not a data artefact.

---

## 6. Set B (huismerk-gelijkwaardig mandje) reproducibility

Set B is a second basket published alongside Set A from edition `2026-06` onward (CHANGE-384). Where Set A picks one A-brand canonical-product per CBS row and compares each retailer's price for that same CP, Set B picks one **house-brand** canonical-product per CBS row as the anchor and compares each retailer's cheapest **EQUIVALENT or COMPARABLE peer** to that anchor — the winning peer may itself be a house brand OR a national brand (whichever admissible product is cheapest at that retailer). Both baskets cite the same dataset snapshot and meetperiode_end.

The Set B basket numbers (`anchors_in_basket`, `cheapest_basket_eur`, `spread_eur`, the `ranking` array, the per-retailer `tier_mix`) are re-derivable from a single public CSV.

### 6.1 What you download

For each edition that publishes a Set B basket, three files are released alongside the Set A CSV in the same `/onderzoek/dataset/` directory:

| File | Content |
|---|---|
| `<date>-sample-NL.csv` | Set A — unchanged; one row per (anchor CP × retailer) for the 75 A-brand anchors. |
| `<date>-setB-anchors-NL.csv` | Set B anchors — one row per HB anchor CP admitted to the basket (typically 30–80 rows). Columns: `snapshot_id, cbs_row, anchor_cp_id, anchor_name, anchor_brand, anchor_retailer_owner, cbs_weight_share`. |
| `<date>-setB-peers-NL.csv` | Set B peer cells — one row per (anchor × retailer × candidate peer) BEFORE the cheapest-cell rule applies (typically anchors × 7 retailers × 1–4 candidates). Columns: `snapshot_id, anchor_cp_id, retailer_id, peer_cp_id, peer_match_tier, peer_match_source, peer_price, peer_price_per_kg, peer_price_per_l, peer_package_size_value, peer_package_size_unit, peer_on_offer_flag, measurement_date`. |

The peers file is the one you aggregate from. The anchors file is the human-readable index of *which 75 HB products* the basket compares.

The dataset listing API (`GET /api/v1/research/snapshots?region=NL`) returns the Set B filenames as `setb_anchors_filename` + `setb_peers_filename` on each snapshot row.

### 6.2 Re-derivation rule (cheapest-cell)

For each anchor × retailer cell:

1. From the peers CSV, take all rows with that `(anchor_cp_id, retailer_id)` where `peer_match_tier ∈ {EQUIVALENT, COMPARABLE}` and `peer_price > 0`.
2. Pick **one winning row** by sorting on three keys (ascending): tier rank (EQUIVALENT = 0, COMPARABLE = 1), then `peer_price`, then `peer_cp_id`. ALTERNATIVE rows must not appear in the published CSV; if they do, ignore them.
3. The winning row contributes its `peer_price` to that retailer's basket total.

**Equal-basket restriction (since 2026-06-11, SPIKE-444a):** before summing, keep only
**full-coverage anchors** — anchors with a winning cell at EVERY retailer in scope. The
anchor admission (a 7→6→5 coverage waterfall since June 2026) allows anchors without a
peer at every retailer; summing those would make the per-retailer totals cover different
product sets and the ranking meaningless. The excluded anchors stay published in the
anchors/peers CSVs; the payload discloses both counts.

Then (over full-coverage anchors only):

- `cheapest_basket_eur` = MIN over retailers of (SUM of winning peer_price per anchor).
- `spread_eur` = MAX − MIN of the per-retailer basket totals.
- `ranking[i].pct` = `(basket_i − min_basket) / min_basket × 100`, rounded to one decimal.
- `tier_mix[retailer].n_equivalent` = count of winning rows with tier = EQUIVALENT at that retailer (same for `n_comparable`).
- `anchors_in_basket` = number of FULL-COVERAGE anchors (the identical basket every ranked retailer sums).
- `anchors_admitted` = number of distinct `anchor_cp_id` appearing in any winning cell (the waterfall-admitted total; the page renders "N van M toegelaten").

The same rule is captured in machine form in the per-edition SQL file `<slug>-setB.sql`, published alongside the CSV at `https://allesupers.nl/onderzoek/dataset/<slug>-setB.sql`. The SQL reads from `research_dataset_setb_peers`; the rule is identical to the rule above.

### 6.3 Reproducer command (Python only, no Postgres)

We ship a journalist-runnable Python script that performs the aggregation above and diffs the result against the published `headline_numbers.set_b`. It needs only Python ≥ 3.11 + the public CSV:

```bash
# Clone the public repo (read-only) and check out the edition's commit:
git clone https://github.com/allesupers/mychoice && cd mychoice/backend

# Download the published peers CSV for the edition (example: 2026-06):
curl -O https://allesupers.nl/onderzoek/dataset/2026-06-01-setB-peers-NL.csv

# Reconcile against the published JSONB via the public API:
uv run python scripts/research/reconcile_setb_published.py \
    --slug 2026-06 \
    --peers-csv 2026-06-01-setB-peers-NL.csv \
    --api-base https://allesupers.nl
```

Expected output on a clean run:

```
INFO Reconciliation PASS for slug=2026-06 — CSV aggregation matches
     published headline_numbers.set_b for the four numeric fields,
     ranking, and tier_mix.
```

Any disagreement prints a per-key diff (basket_eur, ranking entry, tier_mix retailer, etc.) and exits non-zero. The script also exposes the pure aggregator function `aggregate_from_peers_csv(path) → dict` if you want to embed it in your own pipeline.

### 6.4 Worked example — 2026-05-20 Set B dry-run (pre-publication)

> **Status note (2026-06-11):** the numbers below come from the 2026-05-20 Set B
> *dry-run* (52 anchors), captured before any Set B edition was published. They
> demonstrate the aggregation mechanics; they are **not** published edition
> figures. When edition `2026-06` ships (the first with Set B), this section
> will be replaced with that edition's actual published numbers, against which
> the reconciler must match byte-for-byte.

The peers CSV for the 2026-05-20 dry-run contained 364 rows (52 anchors × 7 retailers). The cheapest-cell aggregation produced:

| Retailer | Basket (EUR) | Spread vs cheapest (%) |
|---|---|---|
| Aldi | 68.39 | 0.0 |
| Dirk | 78.77 | 15.2 |
| Hoogvliet | 83.97 | 22.8 |
| Plus | 84.43 | 23.5 |
| Albert Heijn | 87.30 | 27.7 |
| Jumbo | 87.87 | 28.5 |
| Vomar | 92.19 | 34.8 |

`cheapest_basket_eur = 68.39`, `spread_eur = 23.80`, `anchors_in_basket = 52`, all 7 NL retailers in scope — dry-run values, see the status note above. For published Set B editions, running the reconciler script on the public CSV must reproduce the edition's `headline_numbers.set_b.ranking` byte-for-byte. (For pre-2026-06 editions the basket is Set A only — `headline_numbers.set_b = null`.)

### 6.5 Scope notes specific to Set B

- **CBS coverage**: early Set B dry-runs (May 2026) could not admit `Vlees`, `Fruit`, `Drogisterij` and `Vis` under the original all-7-retailer admission rule. Since June 2026 the admission uses a 7→6→5 coverage waterfall (floor 5 retailers per anchor), which restored all 12 CBS rows; per-anchor retailer coverage is disclosed in the anchors CSV.
- **Tiebreak determinism**: the cheapest-cell rule sorts on (tier_rank, price, peer_cp_id). The deterministic third sort key matters when two candidate peers at the same retailer have the same tier and price.
- **Anchor selection algorithm not re-derivable from the CSVs**: the 75 anchors in `setB-anchors-NL.csv` are the *result* of the sampler's CBS-row enumeration + admission filter + 1-hop dedup + Hare quota + brand cap. The CSV is the audit-trail of which anchors made it in, not a step-by-step trace of the sampling. CHANGE-384 SD §7 documents the algorithm; the per-edition SQL `<slug>-setB.sql` does NOT regenerate anchors — it only re-aggregates the cells.
- **What the rule excludes**: ALTERNATIVE-tier peers (the optimizer's third tier) never appear in Set B's basket. The published cart-optimizer on the consumer site is allowed to drop to ALTERNATIVE; Set B is not.

---

## 7. Errors, corrections, mailbox

If you find that a published number cannot be reproduced from this procedure, please write to `pers@allesupers.nl` with the edition slug and the value you saw. We log every correction at `/onderzoek/correcties` (CHANGE-374).
