Semijoin Duplicate Weedout — dedup via temp table

Parameters (Semijoin Duplicate Weedout)

Outer table rows (customers): 1000

Number of rows in the outer (driving) table.

Avg inner matches per outer row: 5

Average orders per customer matching the WHERE clause. This creates duplicates.

Row size (bytes): 200

Size of each row in the outer table.

Example query this animation executes

-- IN subquery → semijoin → duplicate weedout
SELECT * FROM customers c
WHERE  c.id IN (
  SELECT o.customer_id
  FROM   orders o
  WHERE  o.total > 1000
);

MySQL rewrites the IN subquery as an inner join, which may produce duplicate customer rows. DuplicateWeedout uses a temp table keyed on c.rowid to remove duplicates.

What you'll see in the animation

Phase 1 — inner join: MySQL rewrites the IN subquery as a normal inner join between customers and orders. Each customer with multiple matching orders appears multiple times (e.g. Alice appears twice because she has two orders > $1000).
Phase 2 — weedout: for each join result row, MySQL tries to INSERT the outer-table rowid into a temporary table with a unique key. If the insert succeeds (✓ green), the row is new — emit it. If it fails (✗ red), the rowid was already seen — discard the duplicate.
The temp table is keyed on the outer-table rowid (8 bytes). If unique_rows × 8 fits in tmp_table_size, the temp table stays in memory. Otherwise it spills to disk.
The duplicate counter on the right tracks how many rows were discarded. The deduplicated result set at the bottom shows only the unique customers that survived.
Total work = join_rows = outer_rows × inner_matches. Every row must be checked against the temp table. Higher fan-out means more wasted work on duplicates.

0.0s 0.0s

Cost readout (Semijoin Duplicate Weedout)

Join rows (before dedup) ?Total rows produced by the inner join = outer_rows × inner_matches. All of these must be checked against the temp table.

—

Unique rows (after dedup) ?At most outer_rows survive after weedout. Each unique outer rowid appears exactly once in the result.

—

Duplicates discarded ?join_rows − unique_rows. These rows were produced by the inner join but rejected because their outer rowid was already in the temp table.

—

Temp table inserts ?Every join row triggers an INSERT attempt into the weedout temp table. Successful inserts mean new row; failed inserts mean duplicate.

—

Duplication factor ?inner_matches — how many times each outer row is duplicated on average. Higher = more wasted work.

—

Temp table location ?If unique_rows × 8 bytes fits in tmp_table_size (16 MiB default), the temp table stays in memory. Otherwise it spills to disk.

—

Weedout work vs materialization (log–log)

Learn more — DuplicateWeedout among the four semijoin strategies

MySQL’s optimizer can rewrite IN (SELECT …) and EXISTS (SELECT …) subqueries as semijoins. It then chooses among four execution strategies:

1. FirstMatch — as soon as the first inner row matches an outer row, stop scanning the inner table for that outer row. Works well when the subquery is correlated and selective.

2. LooseScan — scans the inner table’s index and skips duplicate key values, feeding only distinct keys to the outer join. Requires a suitable index on the inner side.

3. DuplicateWeedout (this lesson) — runs the full inner join, then removes duplicates using a temporary table keyed on the outer-table rowid. The most general strategy — works even when FirstMatch and LooseScan can’t.

4. Materialization — materializes the subquery into a temp table once, then probes it for each outer row. Good when the subquery result is small and reusable.

DuplicateWeedout is controlled by optimizer_switch=duplicateweedout=on (enabled by default). Its cost depends on the join fan-out (how many inner rows match each outer row) and whether the weedout temp table fits in memory.

When the fan-out is high, DuplicateWeedout does significant wasted work processing duplicate rows. In those cases, Materialization or FirstMatch may be cheaper — but DuplicateWeedout is the fallback that always works.

Sources: MySQL 8.4 Reference Manual §8.2.2.1 “Optimizing IN and EXISTS Subquery Predicates with Semijoin Transformations”; MariaDB Knowledge Base “Semijoin Subquery Optimizations”.