What is the difference between WHERE and HAVING in SQL?

WHERE filters rows BEFORE grouping; HAVING filters groups AFTER aggregation. So WHERE can reference any column in the table but cannot reference aggregate functions like COUNT or SUM. HAVING is the only place you can filter on aggregates — for example, HAVING COUNT(*) > 5. The execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT, which is also why ORDER BY can reference column aliases but WHERE cannot.

Why does GROUP BY require every non-aggregated column in SELECT?

GROUP BY collapses many input rows into one output row per group. If your SELECT mentions a column that's neither in GROUP BY nor inside an aggregate function, the engine doesn't know which value from the group to return. Strict modes (Postgres, modern MySQL with only_full_group_by, BigQuery) reject this as an error. Lenient older MySQL silently picked an arbitrary value, which produced wrong results that were hard to debug. The rule: every column in SELECT must either be in GROUP BY or wrapped in an aggregate.

How do I count rows that meet a condition in SQL?

Use conditional aggregation: SUM(CASE WHEN condition THEN 1 ELSE 0 END) or, more concisely, COUNT(CASE WHEN condition THEN 1 END). Both work because COUNT ignores NULL — the ELSE branch returns NULL by default. This pattern lets you compute multiple conditional counts in a single query: SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped, SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) AS pending. Faster than running multiple queries with WHERE clauses.

What does ROLLUP do in SQL?

ROLLUP adds subtotal rows to a GROUP BY result. GROUP BY ROLLUP(country, city) returns: (1) every (country, city) combination as normal, (2) per-country subtotals with city = NULL, and (3) a grand total with both NULL. CUBE adds every possible combination of subtotals, including per-city subtotals across all countries. GROUPING SETS lets you specify exactly which subtotals you want. ROLLUP is the most-used of the three because hierarchical totals are common in reports.

Can I use GROUP BY without an aggregate function?

Yes — GROUP BY without an aggregate is functionally equivalent to SELECT DISTINCT for the grouped columns. SELECT department FROM employees GROUP BY department returns the same set as SELECT DISTINCT department FROM employees. Modern query planners optimize them to similar execution plans. Use DISTINCT when you only want unique rows; use GROUP BY when you'll add aggregates later. Conceptually GROUP BY signals 'I'm computing per-group metrics' even when there's only one column.

What is conditional aggregation and why is it useful?

Conditional aggregation is wrapping a CASE expression inside an aggregate function — SUM(CASE WHEN x > 100 THEN amount ELSE 0 END), or AVG(CASE WHEN status = 'active' THEN response_time END). It's the cleanest way to compute multiple per-condition metrics in a single query. The alternative — multiple subqueries with different WHERE clauses, joined together — is slower, harder to read, and easier to break. Conditional aggregation also pivots data: turning row-shaped status counts into a column-shaped status table is a CASE-inside-SUM operation.

SQL GROUP BY Tutorial — Aggregation, HAVING, and the 5 Mistakes Most Analysts Make

Published May 2026 · 13 min read · Drill 12+ aggregation challenges →

GROUP BY is where most SQL learners cross from "I can write a SELECT" to "I can answer real business questions." It's also where most early bugs land — every analyst has produced a wrong number because they put a non-aggregated column in SELECT, filtered an aggregate in WHERE, or forgot that NULL is its own group. This guide walks the mental model, the grouping rule, HAVING vs WHERE, and the five mistakes that defeat most analysts — with runnable examples and a clear practice path.

Contents

The mental model — collapsing rows into groups
The grouping rule (the one rule)
HAVING vs WHERE — the most-asked interview question
GROUP BY on multiple columns
The five aggregates and how they handle NULL
The 5 mistakes that defeat most analysts
Conditional aggregation — SUM(CASE WHEN)
ROLLUP, CUBE, GROUPING SETS
Performance notes
The GROUP BY checklist
Practice with real challenges

The mental model — collapsing rows into groups

Think of GROUP BY as a two-step process the engine runs in your head.

Step 1 — partition. The engine reads every row from the source and drops it into a bucket based on the GROUP BY columns. All rows with the same department value land in the same bucket. All rows with the same (department, year) pair land together if you grouped by both.
Step 2 — collapse. Each bucket becomes exactly one output row. The aggregate functions in your SELECT — COUNT, SUM, AVG, MIN, MAX — fold each bucket's rows down into single values. The output table has one row per bucket.

Once that picture is in your head, the grouping rule explains itself: any column you SELECT must either be in the GROUP BY (so every row in the bucket has the same value, and there's one obvious answer) or wrapped in an aggregate (so the bucket's many rows fold into one value).

Toy example: per-department headcount and avg salary

SELECT department,
       COUNT(*) AS headcount,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

Three departments in the table → three output rows. The engine partitions on department, then for each bucket it computes COUNT(*) and AVG(salary). The headcount and avg_salary columns aren't in GROUP BY because they're aggregates — they collapse the bucket. The department column is allowed because every row in a single bucket has the same department value.

The grouping rule (the one rule)

Every column in SELECT must be in GROUP BY, or wrapped in an aggregate function. No exceptions.

This is the single rule that, if you internalize it, prevents 80% of GROUP BY bugs. The reason: a single output row has to have a single value in every column, but a group of input rows often has many different values in a non-grouped column. The engine has no way to pick "the right one" — so it errors, or worse, it picks arbitrarily and lies to you.

Bug: name is neither grouped nor aggregated

SELECT department,
       name,                       -- ⚠ ambiguous: which name?
       COUNT(*) AS headcount
FROM employees
GROUP BY department;
-- Postgres / SQL Server / BigQuery: ERROR
-- Old MySQL: returns ARBITRARY name from each bucket (silent bug)

The lenient-MySQL trap: MySQL versions before 5.7 (and 5.7+ without only_full_group_by) silently picked an arbitrary value from the group, producing wrong results that looked plausible. Modern MySQL enables only_full_group_by by default. If you're inheriting an old codebase, audit every GROUP BY query for non-aggregated, non-grouped columns — they're hidden bugs.

The fix: either group by name as well (creating per-(department, name) buckets), or replace name with an aggregate that defines which name you want — MIN(name), MAX(name), or STRING_AGG(name, ', ') to concatenate them.

HAVING vs WHERE — the most-asked interview question

If you've ever interviewed for a SQL role, you've been asked the difference between WHERE and HAVING. Here's the answer that interviewers want, in two sentences:

WHERE filters rows BEFORE grouping. It can reference any column in the source table but cannot reference aggregates because the groups don't exist yet.
HAVING filters groups AFTER aggregation. It's the only place you can put a condition on COUNT, SUM, AVG, etc.

Departments with more than 5 high-earners

SELECT department,
       COUNT(*) AS high_earners
FROM employees
WHERE salary > 100000          -- pre-group: drop low-salary rows
GROUP BY department
HAVING COUNT(*) > 5;            -- post-group: drop small departments

Knowing the execution order makes this stick. SQL evaluates clauses in this order, regardless of how you write them:

Step	Clause	What happens
1	FROM	Read source tables, apply joins
2	WHERE	Filter rows (no aggregates allowed)
3	GROUP BY	Partition surviving rows into buckets
4	HAVING	Filter buckets (aggregates allowed)
5	SELECT	Compute output columns + aggregates
6	ORDER BY	Sort (can reference SELECT aliases)
7	LIMIT	Truncate rows

That's also why WHERE can't reference column aliases (the SELECT hasn't run yet) but ORDER BY can. And it's why HAVING COUNT(*) > 5 works while WHERE COUNT(*) > 5 errors.

A common shortcut question: "Can I always use HAVING instead of WHERE?" Technically yes for non-aggregate predicates — HAVING department = 'Eng' works — but it's slower because the engine groups every row and then throws away the wrong groups, instead of filtering rows before grouping. Use WHERE for row predicates and HAVING for group predicates. Performance and intent both line up.

GROUP BY on multiple columns

Adding columns to GROUP BY makes the buckets finer-grained. GROUP BY department produces one bucket per department; GROUP BY (department, year) produces one bucket per (department, year) pair.

Headcount per (department, hire-year)

SELECT department,
       EXTRACT(YEAR FROM hire_date) AS hire_year,
       COUNT(*) AS hires
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hire_date)
ORDER BY department, hire_year;

Note the GROUP BY repeats the EXTRACT expression — most databases require the exact expression rather than the alias. Postgres, SQLite, and modern MySQL also accept positional references like GROUP BY 1, 2 (the first and second SELECT columns), but expressions are clearer and survive column reordering.

The five aggregates and how they handle NULL

Five aggregates cover ~95% of real work. Each ignores NULL except where noted.

Function	Returns	NULL behavior
COUNT(*)	Row count	Counts every row, including all-NULL ones
COUNT(col)	Non-NULL count of col	Ignores NULLs in col
SUM(col)	Total	Skips NULLs; SUM of all-NULL is NULL
AVG(col)	Mean	Skips NULLs; AVG of all-NULL is NULL (not 0)
MIN/MAX(col)	Extreme value	Skips NULLs

The COUNT(*) vs COUNT(col) gotcha: COUNT(*) answers "how many rows" while COUNT(email) answers "how many rows have a non-NULL email." If 30% of users have no email, the two queries return different numbers and that's the bug. Most analysts mean COUNT(*) 95% of the time.

Two more aggregates worth knowing: STRING_AGG (Postgres / SQL Server) or GROUP_CONCAT (MySQL / SQLite) concatenates a column's values per group, often with a separator. ARRAY_AGG (Postgres) packs them into an array. Both are useful for collapsing many rows into a single comma-separated value or array per group.

The 5 mistakes that defeat most analysts

Mistake 1: SELECTing a non-aggregated, non-grouped column

Already covered above — the grouping rule violation. Modern engines error; old MySQL silently lies. Fix: add the column to GROUP BY, wrap in an aggregate, or remove from SELECT.

Mistake 2: Filtering an aggregate in WHERE

⚠ WRONG — aggregate in WHERE

SELECT department, COUNT(*) AS headcount
FROM employees
WHERE COUNT(*) > 5            -- ERROR: aggregates not allowed in WHERE
GROUP BY department;

Fix: move the aggregate predicate into HAVING. The grouping rule and the WHERE-vs-HAVING rule together explain almost every "my query won't run" complaint when first learning GROUP BY.

Mistake 3: Forgetting NULL is its own group

If you GROUP BY country and 50 rows have NULL in country, those 50 rows form a single NULL group with one output row. Beginners often expect NULL rows to be silently dropped — they aren't. Fix: filter explicitly with WHERE country IS NOT NULL if you want them excluded, or use COALESCE(country, 'Unknown') to bucket them with a label.

Mistake 4: Sorting with positional GROUP BY then renaming a column

Brittle: positional GROUP BY breaks if you reorder SELECT

SELECT hire_year, department, COUNT(*)
FROM employees
GROUP BY 1, 2;            -- breaks if you swap column order in SELECT

Positional GROUP BY (numeric references to SELECT positions) is a tool for ad-hoc work. In production code or anything you'll edit later, name the columns or expressions explicitly. Future-you will swap two columns in SELECT and not realize the GROUP BY semantics changed silently.

Mistake 5: Using DISTINCT inside an aggregate to "fix" double counting

If your JOIN structure is wrong and rows are duplicated, COUNT(DISTINCT user_id) patches the symptom but hides a real bug. The right fix is usually to find the unintended fan-out — usually a many-to-many JOIN that should have been a subquery — and eliminate the duplication at its source. DISTINCT inside an aggregate is also slower because the engine has to deduplicate before counting. Use it when DISTINCT is genuinely the requirement (count of unique users), not as a JOIN bug bandage.

Conditional aggregation — SUM(CASE WHEN), the workhorse pattern

Conditional aggregation is putting a CASE expression inside an aggregate function. It's the single most useful pattern in business reporting because it lets you compute multiple per-condition metrics in a single query.

Per-department: total, shipped count, avg priority of pending

SELECT department,
       COUNT(*) AS total_orders,
       SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_n,
       SUM(CASE WHEN status = 'pending' THEN amount END) AS pending_revenue,
       AVG(CASE WHEN status = 'pending' THEN priority END) AS avg_pending_priority
FROM orders
GROUP BY department;

Three things to notice. First, the CASE expression returns a value (or NULL) per row, then the aggregate folds it across the group. Second, when there's no ELSE clause the false branch returns NULL — and aggregates skip NULL — so AVG(CASE WHEN active THEN response_time END) averages only active rows. Third, you can mix conditional and unconditional aggregates in the same query; COUNT(*) totals all rows while the CASE-wrapped sums apply only to subsets.

The alternative — running multiple queries with different WHERE clauses and joining the results — is slower, harder to read, and more bug-prone. Conditional aggregation is also how SQL pivots row-shaped data into column-shaped data, before databases added explicit PIVOT operators.

ROLLUP, CUBE, GROUPING SETS — subtotals without UNION

Once you've used GROUP BY a while, you'll hit a need for subtotals: per-(country, city) row counts plus per-country totals plus a grand total. Three modifiers handle this:

ROLLUP — adds hierarchical subtotals. GROUP BY ROLLUP(country, city) gives you (country, city) rows, per-country subtotals (city = NULL), and a grand total (both NULL).
CUBE — adds every possible combination of subtotals. GROUP BY CUBE(country, city) gives you per-(country, city), per-country, per-city across all countries, and the grand total.
GROUPING SETS — explicit list of grouping combinations. GROUP BY GROUPING SETS ((country, city), (country)) gives only those two levels, no others.

ROLLUP: subtotals in one query

SELECT country, city, COUNT(*) AS users
FROM users
GROUP BY ROLLUP(country, city)
ORDER BY country NULLS LAST, city NULLS LAST;
-- Output: per-(country,city) rows, per-country subtotals, grand total

The subtotal rows have NULL in the rolled-up columns, which can collide with actual NULL values in your data. Use the GROUPING(col) function to tell them apart — GROUPING(city) = 1 means "this is a subtotal row" while GROUPING(city) = 0 with NULL city means "the original data has NULL here." Rare in practice but worth knowing.

Engine support note: Postgres, SQL Server, Oracle, BigQuery, Snowflake, and recent MySQL all support ROLLUP/CUBE/GROUPING SETS. SQLite doesn't — fall back to UNION ALL. Most real reports lean on ROLLUP and skip CUBE because hierarchical totals (region → country → city) are more common than every-combination subtotals.

Performance notes

GROUP BY is one of the more expensive operations in SQL. Two patterns cover most performance work:

Index the GROUP BY columns. A B-tree index on (department) lets the engine skip the sort step before grouping. Composite indexes help multi-column groups when the index column order matches the GROUP BY column order.
Filter before grouping with WHERE. Reducing row count with WHERE is dramatically cheaper than grouping all rows and HAVING-filtering the result. Most "slow GROUP BY" queries are actually "GROUP BY operating on too many rows" queries.

For very large fact tables, materialized views or pre-aggregated rollup tables become necessary — refreshing once an hour is much cheaper than computing the same aggregate per query. That's a separate topic; the index + WHERE pattern handles 90% of GROUP BY performance issues you'll hit before then.

The GROUP BY checklist (paste into code review)

Before merging any GROUP BY query, verify:

Every column in SELECT is either in GROUP BY or wrapped in an aggregate
Aggregate predicates are in HAVING, not WHERE
Row predicates are in WHERE, not HAVING (cheaper)
NULL groups are handled — either filtered out, COALESCE-bucketed, or intentional
COUNT(*) vs COUNT(col) — you want the right one
No COUNT(DISTINCT) hiding a JOIN fan-out bug
GROUP BY columns are named, not positional (in production code)
If subtotals are needed, ROLLUP / GROUPING SETS, not 5 UNION ALLs

Practice with real challenges

Reading is the easy part. The patterns above stick when you write them cold five times — that's why every SQL Quest aggregation challenge gives you a wrong-answer diagnostic instead of just "incorrect." Here are 4 challenges that drill GROUP BY at increasing difficulty:

📊 Aggregation track — 12+ challenges

Then the related deep-dives:

Read the rule, write it cold five times.

The five mistakes above show up in nearly every analyst's first month on the job. Drilling GROUP BY in 12 progressive challenges burns the patterns into muscle memory. The wrong-answer diagnostic catches what the rule didn't.

Start the Aggregation Track — Free

Found a typo or want a topic covered? Email us.

SQL GROUP BY Tutorial — Aggregation, HAVING, and the 5 Mistakes Most Analysts Make

The mental model — collapsing rows into groups

The grouping rule (the one rule)

HAVING vs WHERE — the most-asked interview question

GROUP BY on multiple columns

The five aggregates and how they handle NULL

The 5 mistakes that defeat most analysts

Mistake 1: SELECTing a non-aggregated, non-grouped column

Mistake 2: Filtering an aggregate in WHERE

Mistake 3: Forgetting NULL is its own group

Mistake 4: Sorting with positional GROUP BY then renaming a column

Mistake 5: Using DISTINCT inside an aggregate to "fix" double counting

Conditional aggregation — SUM(CASE WHEN), the workhorse pattern

ROLLUP, CUBE, GROUPING SETS — subtotals without UNION

Performance notes

The GROUP BY checklist (paste into code review)

Practice with real challenges

Related reading

Read the rule, write it cold five times.