Stats Common pitfalls

Stats mistakes are usually not arithmetic mistakes. They are meaning mistakes: mixing up what the number is allowed to say.

A good stats habit is to pause before calculating and ask:

What generated the data, what uncertainty is still present, and what claim am I trying to justify?

If those three are fuzzy, the equations can be perfectly executed and still mog you.

Probability vs likelihood

Common mistake:

“The likelihood is the probability that the parameter is true.”

Nope. Probability and likelihood often use the same mathematical expression, but they point in different directions.

A probability model says:

$$P(x \mid θ)$$

meaning: if the parameter/model value $θ$ were fixed, how probable is the data $x$?

A likelihood says:

$$L(θ \mid x)$$

meaning: after observing data $x$, which parameter values make this data look least surprising?

Same expression, different variable treated as movable.

Analogy: imagine footprints in the snow.

Probability: “If it was a fox, how likely are these footprints?”
Likelihood: “Given these footprints, which animal hypotheses fit well? Fox? Dog? Cat?”

The footprints do not assign an automatic probability to foxes unless you also include prior information about how common foxes, dogs, and cats are. That is why likelihood alone is not the same as posterior probability.
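
A minimal sketch of the footprint analogy in Python. Every number here is made up for illustration: the `likelihood` values stand for P(footprints | animal) and the `prior` values for how common each animal is assumed to be.

```python
# Hypothetical numbers for the footprint analogy (not from any real survey).
likelihood = {"fox": 0.60, "dog": 0.30, "cat": 0.05}  # P(footprints | animal)
prior = {"fox": 0.05, "dog": 0.70, "cat": 0.25}       # P(animal) before looking

# Likelihood alone ranks fox first, but the values need not sum to 1
# across hypotheses, so they are not probabilities of the animals.
best_by_likelihood = max(likelihood, key=likelihood.get)

# Bayes' rule: posterior is proportional to likelihood times prior.
unnormalised = {a: likelihood[a] * prior[a] for a in likelihood}
evidence = sum(unnormalised.values())
posterior = {a: unnormalised[a] / evidence for a in unnormalised}
best_by_posterior = max(posterior, key=posterior.get)

print(best_by_likelihood, best_by_posterior)
print({a: round(p, 3) for a, p in posterior.items()})
```

With these assumed priors the fox has the highest likelihood but the dog has the highest posterior, because dogs are taken to be far more common.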

$p$-values are not the probability the null is true

Common mistake:

$$p = P(H_{0} \text{ true} \mid \text{data}).$$

That is not what a $p$-value is.

A $p$-value is closer to:

$$p = P(\text{data at least this extreme} \mid H_{0} \text{ is true}).$$

So the null hypothesis is assumed for the calculation. The $p$-value asks how surprising the observed result would be inside the null-hypothesis world.

Analogy: a smoke alarm.

$p$-value question: “If there were no fire, how often would this alarm be this loud or louder?”
What people want: “Given the alarm is loud, how likely is there a fire?”

Those are not the same. To answer the second question, you need base rates and false-alarm behaviour.

A tiny $p$-value says “this data is weird under the null.” It does not by itself say “the alternative is definitely true,” “the effect is large,” or “the model assumptions are fine.”
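
To make “data at least this extreme, assuming the null” concrete, here is a small sketch with a hypothetical coin: 15 heads in 20 flips, with a fair coin as the null. The function name and the example counts are illustrative assumptions.

```python
from math import comb

def binomial_p_value_upper(heads: int, flips: int, p_null: float = 0.5) -> float:
    """P(at least `heads` heads in `flips` flips | null coin with P(heads) = p_null)."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# One-sided p-value: how often a fair coin gives 15 or more heads in 20 flips.
p = binomial_p_value_upper(heads=15, flips=20)
print(round(p, 4))
```

The result is a statement about the fair-coin world only. It does not say how probable it is that the coin is fair.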

Statistical significance vs practical importance

A result can be statistically significant and practically boring.

With a huge sample size, tiny effects can produce tiny $p$-values. If you measure enough people, you might detect that one teaching method improves exam scores by 0.03 marks on average. Statistically real? Maybe. Useful? Probably lol no.

Think of significance as a microscope. A powerful enough microscope can reveal a scratch on a tank. That does not mean the tank is broken.

Always ask:

How large is the effect?
What are the units?
Is the effect large compared with natural variation?
Would it change a decision?

Report effect sizes and uncertainty, not just stars like p < 0.05.
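
A rough sketch of the 0.03-mark example. It assumes a two-sample z-test with a known common standard deviation of 10 marks and equal group sizes. Both are simplifications chosen for illustration, not values from any real study.

```python
from math import erfc, sqrt

def two_sample_z(diff: float, sd: float, n_per_group: int) -> tuple[float, float]:
    """Return (z, two-sided p) for a difference in means, assuming a known
    common standard deviation and equal group sizes (a simplification)."""
    se = sd * sqrt(2.0 / n_per_group)
    z = diff / se
    p_two_sided = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_two_sided

diff, sd = 0.03, 10.0  # 0.03 marks; the sd of 10 marks is a made-up value
for n in (1_000, 2_000_000):
    z, p = two_sample_z(diff, sd, n)
    print(n, round(z, 2), p, "standardised effect:", diff / sd)
```

The standardised effect stays at 0.003 in both runs. Only the sample size changes, and that alone moves the $p$-value from unremarkable to “significant”.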

Correlation is not causation

Correlation measures association. It does not prove mechanism.

If ice cream sales and drowning deaths are correlated, ice cream is not necessarily drowning people. A third variable — hot weather — can increase both.

Three classic traps:

Confounding: $X$ and $Y$ move together because $Z$ affects both.
Reverse causation: maybe $Y$ causes $X$, not $X$ causes $Y$.
Selection effects: the dataset only includes a biased slice of reality.

Analogy: two shadows moving together on a wall. The shadows are correlated, but one shadow is not dragging the other. Something outside the wall is moving both.

To argue causation, you usually need design: randomisation, natural experiments, causal assumptions, controls, or a physically credible mechanism.
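
A small simulation of confounding, as a sketch. The data are synthetic: `z` plays the hot weather, and `x` and `y` each depend on `z` plus their own noise, with no link between `x` and `y` themselves. The noise levels and the seed are arbitrary choices.

```python
import random
from math import sqrt

def pearson(a: list[float], b: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

rng = random.Random(0)
n = 5_000
z = [rng.gauss(0, 1) for _ in range(n)]    # confounder, e.g. hot weather
x = [zi + rng.gauss(0, 0.5) for zi in z]   # driven by z only
y = [zi + rng.gauss(0, 0.5) for zi in z]   # driven by z only, never by x

raw = pearson(x, y)
# Subtract z's contribution (known here only because we built the data).
adjusted = pearson([xi - zi for xi, zi in zip(x, z)],
                   [yi - zi for yi, zi in zip(y, z)])
print(round(raw, 2), round(adjusted, 2))
```

The raw correlation is strong, and it collapses towards zero once the confounder is removed. In real data you rarely know the confounder this cleanly, which is why design matters.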

Forgetting base rates

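Base-rate neglect is judging a positive result by the test's accuracy alone while ignoring how rare the event is. When the event is rare enough, most positives can still be false positives, even for a test that sounds accurate.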

Example:

Disease prevalence: 1 in 1000.
Test sensitivity: 99%.
False positive rate: 1%.

Test 100,000 people:

About 100 actually have the disease.
About 99 of those test positive.
About 99,900 do not have the disease.
1% of those gives about 999 false positives.

So among positive tests, only about

$$\frac{99}{99 + 999} \approx 9\%$$

actually have the disease.

The test sounds good, but the rare-event base rate dominates. This is the “where did all these false positives come from?” goblin.

Natural-frequency tables often make Bayes’ theorem much easier than symbolic probability notation.
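
The same calculation as a short function, in case you want to try other prevalences. The function and parameter names are illustrative; the numbers are the ones from the example.

```python
def positive_predictive_value(prevalence: float, sensitivity: float,
                              false_positive_rate: float) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(prevalence=1 / 1000, sensitivity=0.99,
                                false_positive_rate=0.01)
print(f"{ppv:.1%}")  # about 9%, matching 99 / (99 + 999)
```

Raising the prevalence to 1 in 10 with the same test pushes the answer above 90%, which shows how much of the result is carried by the base rate rather than the test.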

Assuming independence when the data are linked

Many tidy formulae assume independent observations. Real data often violate this.

Examples:

Daily stock returns are time-ordered.
Weather measurements near each other are spatially correlated.
Repeated measurements from the same person are related.
Pixels in an image are not independent little universes.
Simulation samples from the same random seed or trajectory may share structure.

If observations are positively dependent, which is the usual case, the effective sample size is smaller than the row count.

Analogy: copying the same friend’s opinion 100 times into a spreadsheet does not give 100 independent opinions. It gives one opinion with a fake moustache 99 times.

Dependence mainly causes overconfidence: standard errors get too small, intervals get too tight, and results look more certain than they are.
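
One way to put a number on this, as a sketch: for the mean of a series whose neighbouring values have lag-1 autocorrelation `rho`, a common large-sample approximation under an AR(1) model is n(1 − ρ)/(1 + ρ). The AR(1) assumption is introduced here for illustration and will not fit every kind of dependence.

```python
def effective_sample_size_ar1(n: int, rho: float) -> float:
    """Approximate number of independent-equivalent observations for the mean
    of an AR(1) series with lag-1 autocorrelation rho (large-n approximation)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)

for rho in (0.0, 0.5, 0.9):
    print(rho, round(effective_sample_size_ar1(1000, rho), 1))
```

Under this approximation, 1000 strongly autocorrelated rows (ρ = 0.9) carry roughly as much information about the mean as about 53 independent ones.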

Population variance vs sample variance

Population variance uses

$$σ^{2} = \frac{1}{N} \sum_{i=1}^{N} (x_{i} - μ)^{2}.$$

Sample variance commonly uses

$$s^{2} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2}.$$

The $n - 1$ appears because the sample mean $\bar{x}$ was estimated from the same data. Once the sample mean is fixed, the deviations cannot all vary freely: they must sum to zero.

Analogy: if three friends split a bill and you know the total plus two people’s contributions, the third contribution is forced. You do not have three free pieces of information anymore.

This is called a degrees-of-freedom correction. It is not arbitrary pedantry; it compensates for estimating the centre from the sample.
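
A quick check with Python's standard `statistics` module, where `pvariance` divides by $N$ and `variance` divides by $n - 1$. The data values are made up.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

pop_var = statistics.pvariance(data)   # divides by N
samp_var = statistics.variance(data)   # divides by n - 1

# Deviations from the sample mean always sum to zero, which is the
# constraint that removes one degree of freedom.
mean = statistics.mean(data)
deviation_sum = sum(x - mean for x in data)

print(pop_var, samp_var, deviation_sum)
```

The squared deviations sum to 32 here, so the two results are 32/8 and 32/7. The gap between them shrinks as the sample grows.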

Assuming every histogram is normal

The Normal distribution is important, but it is not the default truth of the universe.

Real data can be:

skewed, like income;
heavy-tailed, like financial losses;
bounded, like percentages;
discrete, like counts;
multimodal, like heights from a mixed adult/child sample;
zero-inflated, like number of insurance claims.

Analogy: not every hill is a bell curve. Some distributions are cliffs, plateaus, mountain ranges, or cursed dragon backs.

Before using normal-based formulae, look at the data and think about the generating process. Measurement noise often becomes normal-ish. Counts, waiting times, extremes, and mixtures often do not.
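
A cheap first look, sketched on simulated data: compare the mean with the median and compute a moment-based skewness. The lognormal sample is only a stand-in for something income-like, and the seed is arbitrary.

```python
import random
import statistics

rng = random.Random(1)
# Simulated right-skewed data; the lognormal choice is only an illustration.
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]

mean = statistics.fmean(sample)
median = statistics.median(sample)
sd = statistics.pstdev(sample)
# Moment-based skewness: near 0 for symmetric data, positive for a
# long right tail.
skewness = sum(((x - mean) / sd) ** 3 for x in sample) / len(sample)

print(round(mean, 2), round(median, 2), round(skewness, 2))
```

A mean well above the median and a clearly positive skewness are hints that normal-based formulae deserve a second thought. They do not replace actually plotting the histogram.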

Confusing standard deviation and standard error

Standard deviation describes spread of individual data points:

$$σ \approx \text{typical scatter of observations}.$$

Standard error describes uncertainty in an estimate, often the mean:

$$SE(\bar{x}) = \frac{s}{\sqrt{n}}.$$

Analogy: throwing darts.

Standard deviation: how spread out the darts are on the board.
Standard error: how uncertain you are about the centre of the dart cloud.

If you throw more darts, the cloud might stay equally wide, but you learn its centre more accurately. That is why standard error shrinks with $n$, while the underlying standard deviation need not.
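
A simulated version of the dart board, as a sketch. Throws are drawn from a standard normal (an assumption made for the demo), and the standard error of the mean is computed with the usual $s / \sqrt{n}$ formula.

```python
import random
import statistics
from math import sqrt

rng = random.Random(42)

def sd_and_se(n: int) -> tuple[float, float]:
    """Draw n dart positions from N(0, 1) and return (sample SD, SE of the mean)."""
    throws = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s = statistics.stdev(throws)
    return s, s / sqrt(n)

for n in (25, 400, 10_000):
    s, se = sd_and_se(n)
    print(n, round(s, 3), round(se, 4))
```

The sample SD hovers around 1 at every sample size, while the SE keeps falling as more darts are thrown.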

Reporting a point estimate without uncertainty

A point estimate alone is a lonely little number. It says what your best guess is, but not how fragile that guess is.

Bad:

The mean is 12.4.

Better:

The mean is 12.4 with a 95% confidence interval of [10.9, 13.8].

Even better if relevant:

The estimate is sensitive to outliers / model choice / small sample size.

Analogy: saying “the treasure is at this coordinate” without saying whether your map is accurate to 1 metre or 100 km. The coordinate is not enough.

Uncertainty is not weakness; it is part of the answer.
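
A minimal sketch of attaching an interval to a mean. It uses a normal approximation, which is an assumption: for a sample this small a t-based interval would be somewhat wider. The measurements are invented.

```python
import statistics
from math import sqrt

def mean_confidence_interval(data: list[float], level: float = 0.95):
    """Normal-approximation confidence interval for the mean.
    For small samples a t-based interval would be wider."""
    n = len(data)
    mean = statistics.fmean(data)
    se = statistics.stdev(data) / sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return mean, (mean - z * se, mean + z * se)

measurements = [11.2, 13.1, 12.8, 10.9, 14.0, 12.3, 11.7, 13.4]  # made-up data
mean, (low, high) = mean_confidence_interval(measurements)
print(f"mean = {mean:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```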

Training on the test set

In machine learning, the test set is supposed to simulate new unseen data. If you use it to choose features, tune hyperparameters, pick models, or retry until it looks good, it is no longer a test set.

That is information leakage.

Analogy: taking a mock exam, checking the mark scheme, changing your answers, and then claiming the score measures exam readiness. Kek. It measures exposure to the answers.

Use separate roles:

training set: fit parameters;
validation set: tune choices;
test set: final honest estimate.

If the test set influenced your decisions, call it validation and get a new test set.
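
The three roles can be sketched as a single shuffle followed by disjoint slices. The function name, fractions, and seed are illustrative choices.

```python
import random

def three_way_split(rows: list, val_frac: float = 0.15, test_frac: float = 0.15,
                    seed: int = 0):
    """Shuffle once, then carve out disjoint train / validation / test parts."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))
```

A plain random split like this assumes rows are independent. With time series or repeated measurements from the same person, splitting by time or by person is usually the safer choice.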

Overfitting: learning the noise goblins

Overfitting happens when a model learns accidental quirks of the sample rather than stable structure.

A high-degree polynomial can pass through every noisy training point, but between points it may wiggle like a possessed shoelace. Low training error does not guarantee good prediction.

Symptoms:

excellent training performance;
poor validation/test performance;
unstable conclusions when data are perturbed;
coefficients or feature importances that change wildly.

Good antidotes:

held-out validation data;
cross-validation;
regularisation;
simpler models;
more data;
checking residuals.
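
A deterministic toy version of the wiggly polynomial, as a sketch. The data are hypothetical: the true relation is y = 2x, and the “noise” is a hand-picked alternating ±1 so the run is reproducible without a random seed. The interpolating polynomial is evaluated with the Lagrange formula and compared against a least-squares line.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through all (xs, ys) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        total += yj * basis
    return total

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mse(pred, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# Hypothetical data: true relation y = 2x plus hand-picked alternating "noise".
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2 * x + e for x, e in zip(train_x, [1, -1, 1, -1, 1, -1])]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]   # unseen points between the training ones
test_y = [2 * x for x in test_x]     # noise-free truth for a clean comparison

slope, intercept = fit_line(train_x, train_y)
poly_train = mse([lagrange_predict(train_x, train_y, x) for x in train_x], train_y)
poly_test = mse([lagrange_predict(train_x, train_y, x) for x in test_x], test_y)
line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)
print(poly_train, poly_test, line_train, line_test)
```

The polynomial has zero training error because it passes through every point, yet its error between the points is far larger than the line's. The line does the opposite: worse on the training points, much better on the unseen ones.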

Ignoring residuals

A residual is observed minus predicted:

$$r_{i} = y_{i} - \hat{y}_{i}.$$

Residuals are where the model’s lies leave fingerprints.

If residuals show a pattern, the model missed structure. For example:

curved residual pattern: linear model is too simple;
increasing residual spread: non-constant variance;
clusters of residuals: missing groups or dependence;
extreme residuals: outliers or heavy tails.

This connects to Residuals in NF2 and PINNs too: residuals are not just statistics housekeeping; they are a general idea of “leftover equation/model error.”

Analogy: if you sweep dust under a rug, residual plots are the lump in the rug.
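
A tiny sketch of the “curved residual pattern” case: fit a straight line to data that are actually quadratic and look at what is left over. The data are invented for the demo.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]   # truly curved data, invented for the demo

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. They sum to zero, so the average error looks fine, but the U shape shows that the line missed the curvature.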

Simpson’s paradox

Simpson’s paradox happens when a trend appears in several groups but reverses when the groups are combined.

Example vibe:

Treatment A works better than B for mild cases.
Treatment A works better than B for severe cases.
But overall B appears better because B was given to many more mild cases.

The group mixture changed the aggregate.

Analogy: comparing two players’ batting averages without noticing one faced easy pitches and the other faced nightmares. The pooled number hides context.

When a result looks surprisingly clean, ask whether grouping variables are hiding underneath.
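
Here is the reversal with concrete numbers, as a sketch. The counts are hypothetical and chosen only so that A wins inside each group while B wins in the pooled total.

```python
# Hypothetical (recovered, treated) counts, invented to show the reversal.
outcomes = {
    "A": {"mild": (90, 100), "severe": (300, 500)},
    "B": {"mild": (425, 500), "severe": (50, 100)},
}

def rate(recovered: int, treated: int) -> float:
    """Recovery rate as a fraction."""
    return recovered / treated

for treatment, groups in outcomes.items():
    per_group = {g: rate(*counts) for g, counts in groups.items()}
    total_recovered = sum(r for r, _ in groups.values())
    total_treated = sum(t for _, t in groups.values())
    print(treatment, per_group, "overall:",
          round(rate(total_recovered, total_treated), 3))
```

A beats B for mild cases (90% vs 85%) and for severe cases (60% vs 50%), yet B looks better overall (about 79% vs 65%) because most of B's patients were mild.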

Multiple comparisons: enough fishing catches something

If you test 100 unrelated hypotheses at the 5% level and every null is actually true, you should expect about 5 false positives by chance.

That does not mean the tests are malicious. It means repeated searching creates opportunities for random noise to look meaningful.

Analogy: if you shake enough cereal boxes, one will sound like it contains a prize.

Be wary of:

trying many outcomes;
trying many subgroups;
trying many model specifications;
only reporting the significant ones.

Corrections, preregistration, validation, and honest reporting help control this.
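
The arithmetic behind the fishing problem, as a sketch. It assumes every null is true and the tests are independent. The Bonferroni cutoff is shown as one simple example of a correction; it is not the only option.

```python
def multiple_testing_summary(n_tests: int, alpha: float = 0.05):
    """Assumes every null is true and the tests are independent."""
    expected_false_positives = n_tests * alpha
    prob_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    bonferroni_threshold = alpha / n_tests   # per-test cutoff under Bonferroni
    return expected_false_positives, prob_at_least_one, bonferroni_threshold

expected, p_any, cutoff = multiple_testing_summary(100)
print(expected, round(p_any, 3), cutoff)
```

With 100 tests at the 5% level, at least one false positive is close to guaranteed under these assumptions, which is why a single significant result from a large search deserves suspicion.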

Confusing prediction with explanation

A model can predict well without explaining causally.

Example: a model might predict hospital mortality partly from whether a patient received aggressive treatment. But treatment can be a marker of already being very sick. The model predicts risk; it does not prove treatment causes death.

Prediction asks:

What will happen for new cases like these?

Explanation/causation asks:

What would happen if we intervened and changed one thing?

Those are different questions. Do not use a prediction model as a causal story unless the study design supports that interpretation.

Final stats sanity checklist

Before trusting a result, ask:

What is the data-generating process?
What is the target quantity?
Are the observations independent?
What assumptions does the method need?
Are effect sizes practically meaningful?
Where are the uncertainty intervals?
What do residuals look like?
Could base rates, confounding, or selection effects explain this?
Was the test set kept clean?
Would the conclusion survive new data?

Stats is not just number-crunching. It is disciplined uncertainty management.
