Parameters (Semijoin Duplicate Weedout)
Example query this animation executes
-- IN subquery → semijoin → duplicate weedout SELECT * FROM customers c WHERE c.id IN ( SELECT o.customer_id FROM orders o WHERE o.total > 1000 );
MySQL rewrites the IN subquery as an inner join, which may produce duplicate customer rows. DuplicateWeedout uses a temp table keyed on c.rowid to remove duplicates.
What you'll see in the animation
- Phase 1 — inner join: MySQL rewrites the IN subquery as a normal inner join between customers and orders. Each customer with multiple matching orders appears multiple times (e.g. Alice appears twice because she has two orders > $1000).
- Phase 2 — weedout: for each join result row, MySQL tries to INSERT the outer-table rowid into a temporary table with a unique key. If the insert succeeds (✓ green), the row is new — emit it. If it fails (✗ red), the rowid was already seen — discard the duplicate.
- The temp table is keyed on the outer-table rowid (8 bytes). If unique_rows × 8 fits in tmp_table_size, the temp table stays in memory. Otherwise it spills to disk.
- The duplicate counter on the right tracks how many rows were discarded. The deduplicated result set at the bottom shows only the unique customers that survived.
- Total work = join_rows = outer_rows × inner_matches. Every row must be checked against the temp table. Higher fan-out means more wasted work on duplicates.
Cost readout (Semijoin Duplicate Weedout)
Join rows (before dedup) ?Total rows produced by the inner join = outer_rows × inner_matches. All of these must be checked against the temp table.
—
Unique rows (after dedup) ?At most outer_rows survive after weedout. Each unique outer rowid appears exactly once in the result.
—
Duplicates discarded ?join_rows − unique_rows. These rows were produced by the inner join but rejected because their outer rowid was already in the temp table.
—
Temp table inserts ?Every join row triggers an INSERT attempt into the weedout temp table. Successful inserts mean new row; failed inserts mean duplicate.
—
Duplication factor ?inner_matches — how many times each outer row is duplicated on average. Higher = more wasted work.
—
Temp table location ?If unique_rows × 8 bytes fits in tmp_table_size (16 MiB default), the temp table stays in memory. Otherwise it spills to disk.
—
Weedout work vs materialization (log–log)
Learn more — DuplicateWeedout among the four semijoin strategies
MySQL’s optimizer can rewrite IN (SELECT …) and
EXISTS (SELECT …) subqueries as semijoins. It then
chooses among four execution strategies:
1. FirstMatch — as soon as the first inner row matches an outer row, stop scanning the inner table for that outer row. Works well when the subquery is correlated and selective.
2. LooseScan — scans the inner table’s index and skips duplicate key values, feeding only distinct keys to the outer join. Requires a suitable index on the inner side.
3. DuplicateWeedout (this lesson) — runs the full inner join, then removes duplicates using a temporary table keyed on the outer-table rowid. The most general strategy — works even when FirstMatch and LooseScan can’t.
4. Materialization — materializes the subquery into a temp table once, then probes it for each outer row. Good when the subquery result is small and reusable.
DuplicateWeedout is controlled by
optimizer_switch=duplicateweedout=on (enabled by default).
Its cost depends on the join fan-out (how many inner rows match each
outer row) and whether the weedout temp table fits in memory.
When the fan-out is high, DuplicateWeedout does significant wasted work processing duplicate rows. In those cases, Materialization or FirstMatch may be cheaper — but DuplicateWeedout is the fallback that always works.
Sources: MySQL 8.4 Reference Manual §8.2.2.1 “Optimizing IN and EXISTS Subquery Predicates with Semijoin Transformations”; MariaDB Knowledge Base “Semijoin Subquery Optimizations”.