Statistics and Data Science Glossary

Alphabetical Statistical Glossary

ANOVA (Analysis of Variance): Hypothesis testing procedure for comparing means across multiple groups by partitioning the observed variance into between-group and within-group components.
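
For illustration, a minimal one-way ANOVA in Python, assuming SciPy is available (the group data are invented):

    from scipy import stats

    # Three groups of measurements (illustrative data)
    group_a = [23, 25, 21, 22, 24]
    group_b = [30, 28, 29, 31, 27]
    group_c = [22, 24, 23, 25, 21]

    # One-way ANOVA: tests H0 that all group means are equal
    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")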

ARIMA Models: Autoregressive Integrated Moving Average; class of time series models for analyzing and forecasting series with complex dependency structures.
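
As a sketch of typical usage, assuming the statsmodels package (the series and model order are invented for the example):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Simulate an AR(1) series: y_t = 0.7 * y_(t-1) + noise
    rng = np.random.default_rng(0)
    y = np.zeros(200)
    for t in range(1, 200):
        y[t] = 0.7 * y[t - 1] + rng.normal()

    # Fit an ARIMA(1, 0, 0) model and forecast five steps ahead
    result = ARIMA(y, order=(1, 0, 0)).fit()
    print(result.forecast(steps=5))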

Arithmetic Mean: Measure of central tendency, calculated as the sum of all values divided by their count; the commonly known “average.”

Autocorrelation: Measure in time series analysis that describes the correlation of a time series with time-lagged versions of itself.

Bayesian Estimation: Estimation method from Bayesian statistics that incorporates prior knowledge (the prior distribution) into parameter estimation.

Bayesian Statistics: Statistical approach that uses Bayes' theorem to combine prior knowledge with observed data to draw conclusions.

Big Data: Very large, complex datasets that are difficult to process with traditional methods due to their volume, complexity, and rate of change, together with the modern methods used to analyze them.

Binomial Distribution: Discrete probability distribution that describes the number of successes in a fixed number of independent trials with constant probability of success.
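
A small sketch of evaluating binomial probabilities, assuming SciPy:

    from scipy import stats

    # Probability of exactly 3 successes in 10 trials with p = 0.5
    print(stats.binom.pmf(3, n=10, p=0.5))  # ~0.117

    # Probability of at most 3 successes (cumulative probability)
    print(stats.binom.cdf(3, n=10, p=0.5))  # ~0.172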

Block Design: Experimental design where similar experimental units are grouped into blocks to reduce variance and increase statistical power.

Bootstrapping: Modern resampling method for estimating the sampling distribution by repeatedly drawing with replacement from the original sample.
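
A minimal bootstrap sketch, assuming NumPy (the data and replicate count are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3])

    # Draw 10,000 resamples with replacement; record each resample mean
    boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
                  for _ in range(10_000)]

    # Percentile-based 95% confidence interval for the mean
    print(np.percentile(boot_means, [2.5, 97.5]))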

Canonical Correlation: Multivariate procedure for analyzing relationships between two groups of variables.

Chi-Square Distribution: Probability distribution that describes the sum of squares of independent standard normally distributed random variables.

Chi-Square Test: Hypothesis test for independence in contingency tables or for testing the goodness of fit of distributions.
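
A minimal independence test, assuming SciPy (the 2x2 contingency table is invented):

    from scipy import stats

    # Contingency table: rows = treatment, columns = outcome
    table = [[20, 30],
             [35, 15]]

    # Chi-square test of independence
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")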

Cluster Analysis: Multivariate procedure that groups similar objects into clusters based on their characteristics.

Cluster Sample: Sampling design where natural groups (clusters) are selected instead of individual elements.

Conditional Probability: Concept from probability theory that describes the probability of an event occurring given that another event has occurred.

Confidence Interval: Range in estimation theory and sampling theory that contains the true value of a parameter with a specified probability.
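
A short sketch of a t-based 95% confidence interval for a mean, assuming SciPy (the sample is invented):

    import numpy as np
    from scipy import stats

    sample = np.array([5.1, 4.9, 5.6, 5.3, 5.8, 5.2, 5.4])

    # 95% confidence interval for the population mean
    m = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean
    low, high = stats.t.interval(0.95, df=sample.size - 1, loc=m, scale=sem)
    print(f"[{low:.2f}, {high:.2f}]")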

Correlation: Measure from probability theory for the strength and direction of the linear relationship between two variables; standardized form of covariance.

Covariance: Measure from probability theory for the joint variability of two random variables.
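
A short sketch relating the two measures above, assuming NumPy: correlation is covariance standardized by both standard deviations.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Off-diagonal entry of the covariance matrix is cov(x, y)
    cov_xy = np.cov(x, y)[0, 1]

    # Standardizing by both standard deviations gives the correlation
    r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
    print(cov_xy, r, np.corrcoef(x, y)[0, 1])  # r matches corrcoef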

Crossover Design: Experimental design where each subject receives multiple treatments in different sequences to separate treatment effects from individual differences.

Cyclical Component: Element of time series analysis that describes medium-term fluctuations with variable length.

Data Mining: Modern method for discovering patterns in large datasets using statistical methods, machine learning, and database systems.

Data Types: Classification of data according to their information content and permissible statistical operations:

  • Nominal: Categorical data without natural order (e.g., hair color)
  • Ordinal: Categorical data with natural ranking (e.g., school grades)
  • Metric: Numerical data, either on an interval scale (defined intervals, no natural zero) or on a ratio scale (natural zero point)

Descriptive Statistics: Basic statistical methods for summarizing, presenting, and describing data without further inference.

Discriminant Analysis: Multivariate procedure for classifying objects into predefined groups based on their characteristics.

Expected Value: Concept from probability theory; the “average value” of a random variable, calculated as the mean of all possible values weighted by their probabilities.

Exponential Distribution: Continuous probability distribution that models the time between events in a Poisson process.

F-Distribution: Probability distribution of the ratio of two independent chi-square distributed random variables, each divided by its degrees of freedom; important for analysis of variance.

Factor Analysis: Multivariate procedure for identifying underlying factors that influence multiple observed variables.

Factorial Design: Experimental design that examines multiple factors and their interactions in a single experiment.

Geometric Mean: Measure of central tendency, calculated as the nth root of the product of n values; particularly useful for growth rates and ratios.

Goodness of Fit: Assessment of how well a statistical model fits a set of observations.

Harmonic Mean: Measure of central tendency, calculated as the reciprocal of the arithmetic mean of the reciprocals; particularly useful for average speeds.
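
The three means above can be compared directly with Python's standard-library statistics module (Python 3.8+ for geometric_mean; illustrative data; for positive values, harmonic <= geometric <= arithmetic):

    from statistics import geometric_mean, harmonic_mean, mean

    data = [2, 4, 8]
    print(mean(data))            # arithmetic: 14/3, ~4.67
    print(geometric_mean(data))  # cube root of 2*4*8 = 4.0
    print(harmonic_mean(data))   # 3 / (1/2 + 1/4 + 1/8), ~3.43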

Heteroscedasticity: Property in regression analysis where the variance of the residuals varies systematically with the independent variables.

Hypothesis Test: Statistical procedure for testing assumptions about population parameters based on sample data.

Independence: Concept from probability theory where the occurrence of one event does not influence the probability of another.

Inferential Statistics: Fundamental area of statistics that includes methods to draw conclusions about population properties from sample results.

Interquartile Range: Measure of dispersion, calculated as the difference between the 3rd and 1st quartiles; encompasses the middle 50% of the data.

Interval Estimation: Estimation method for determining an interval in which a parameter lies with a certain probability.

Irregular Component: Element of time series analysis that describes the random, unexplainable fluctuations in a time series.

Kruskal-Wallis Test: Non-parametric hypothesis test as an alternative to one-way ANOVA for independent samples.

Latin Square: Experimental design where each treatment appears exactly once in each row and column.

Linear Regression: Regression analysis method for modeling the linear relationship between a dependent variable and an independent variable.
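
A minimal simple-regression sketch, assuming SciPy (the data points are invented):

    from scipy import stats

    x = [1, 2, 3, 4, 5]
    y = [2.0, 4.1, 5.9, 8.2, 9.9]

    # Least-squares fit of y = slope * x + intercept
    fit = stats.linregress(x, y)
    print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
    print(f"R^2 = {fit.rvalue ** 2:.3f}")  # see R-squared below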

Logistic Regression: Regression analysis method for binary dependent variables that models the probability of an event.

Machine Learning: Modern methods and algorithms that learn from data and make predictions without being explicitly programmed for the task.

Mann-Whitney U Test: Non-parametric hypothesis test for independent samples; alternative to the t-test when the normal distribution assumption is violated.

Maximum Likelihood Estimation: Estimation method for parameter estimation that maximizes the likelihood of the observed data.
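
A small numerical sketch, assuming SciPy: estimating the rate of an exponential distribution by minimizing the negative log-likelihood (here the closed-form answer is 1/mean, so the result can be checked):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=2.0, size=500)  # true rate = 0.5

    # Negative log-likelihood of an exponential with rate lam
    def nll(lam):
        return -np.sum(np.log(lam) - lam * data)

    result = minimize_scalar(nll, bounds=(1e-6, 10.0), method="bounded")
    print(result.x, 1 / data.mean())  # both estimates agree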

Mean Absolute Deviation: Measure of dispersion, calculated as the average of the absolute deviations from the mean or median.

Measures of Central Tendency: Statistical measures that describe the typical or central value of a distribution (e.g., mean, median, mode).

Measures of Dispersion: Statistical measures that describe the variation or spread of the data (e.g., variance, standard deviation, interquartile range).

Median: Measure of central tendency, defined as the middle value of a data series ordered by size; divides the data into two equal halves.

Mode: Measure of central tendency, defined as the most frequently occurring value in a dataset.

Monte Carlo Simulation: Modern simulation technique that solves complex problems through repeated random sampling.
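
The classic illustration, estimating pi from random points, assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Draw random points uniformly in the unit square
    x, y = rng.random(n), rng.random(n)

    # The fraction falling inside the quarter circle approximates pi/4
    inside = x**2 + y**2 <= 1.0
    print(4 * inside.mean())  # ~3.14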

Multicollinearity: Problem in regression analysis where strong correlations exist between independent variables.

Multidimensional Scaling: Multivariate procedure for visualizing similarities between objects as distances in a low-dimensional space.

Multiple Regression: Extension of linear regression in regression analysis to multiple independent variables.

Normal Distribution/Gaussian Distribution: Symmetric, bell-shaped probability distribution; many natural phenomena approximately follow it.

Null and Alternative Hypothesis: Opposing assumptions in hypothesis tests that are examined to draw statistical conclusions.

p-value: Metric in hypothesis tests that indicates the probability of obtaining a result at least as extreme as the observed result under the assumption that the null hypothesis is true.
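
A short sketch, assuming SciPy: a one-sample t-test whose reported p-value carries exactly this interpretation.

    from scipy import stats

    sample = [5.1, 4.9, 5.6, 5.3, 5.8, 5.2, 5.4]

    # H0: the population mean equals 5.0
    t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    # Reject H0 at the 5% significance level only if p < 0.05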

Point Estimation: Estimation method for determining a single value as the best estimate of a parameter.

Poisson Distribution: Discrete probability distribution that models the number of events in a fixed time or space interval.

Population: Fundamental concept in statistics for the complete set of all study units about which statements are to be made.

Power: Measure in hypothesis tests for the probability of correctly rejecting a false null hypothesis; equals 1 minus the probability of a Type II error.

Probability: Fundamental concept of probability theory; numerical measure (between 0 and 1) for the chance of an event occurring.

Probability Distribution: Concept of probability theory that describes the assignment of probabilities to all possible values of a random variable.

Probability Theory: Mathematical foundation of statistics that deals with modeling chance and uncertainty.

Quantiles: Measures of position that divide an ordered dataset into equal parts (see the sketch after this list):

  • Quartiles: Divide into four parts (25%, 50%, 75%)
  • Deciles: Divide into ten parts (10%, 20%, …, 90%)
  • Percentiles: Divide into one hundred parts (1%, 2%, …, 99%)
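
A short sketch of computing these quantiles, assuming NumPy:

    import numpy as np

    data = np.arange(1, 101)  # 1, 2, ..., 100

    # Quartiles: 25th, 50th (median), and 75th percentiles
    print(np.percentile(data, [25, 50, 75]))

    # Deciles are the 10th, 20th, ..., 90th percentiles
    print(np.percentile(data, [10, 20, 30, 40, 50, 60, 70, 80, 90]))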

R-squared: Coefficient of determination in regression analysis that indicates the proportion of variance in the dependent variable explained by the model.

Random Sample: Sampling design where each element of the population has an equal chance of being included in the sample.

Random Variable: Concept of probability theory for a variable whose values depend on chance and occur with certain probabilities.

Randomization: Basic principle of experimental design where study units are randomly assigned to experimental conditions.

Regression Analysis: Statistical method for investigating relationships between a dependent variable and one or more independent variables.

Regression Coefficient: Parameter in regression analysis that quantifies the influence of an independent variable on the dependent variable.

Residuals: Differences in regression analysis between observed values and values predicted by the model.

Sample: Fundamental concept in statistics for a subset of the population that is used for investigations.

Sampling Error: Deviation in sampling theory between sample values and the true values of the population.

Seasonal Component: Element of time series analysis that describes regular, periodically recurring fluctuations within a year.

Significance Level: Predetermined maximum acceptable probability of a Type I error in a hypothesis test; typically 5% or 1%.

Split-Plot Design: Experimental design with different randomization units for different factors.

Standard Deviation: Important measure of dispersion, calculated as the square root of the variance; has the same unit as the data.

Standard Error: Measure in sampling theory for the standard deviation of the sampling distribution of a statistic.

Stationarity: Property in time series analysis of a series whose statistical characteristics (mean, variance, autocorrelation) remain constant over time.

Statistical Programming: Modern method for using programming languages and software (such as R, Python, SAS, SPSS) for statistical analysis.

Stratified Sample: Sampling design where the population is divided into strata from which separate random samples are drawn.

Systematic Sample: Sampling design where after selecting a starting element, additional elements are selected at regular intervals.

t-Distribution: Probability distribution, similar to the normal distribution but with heavier tails; important for small samples.

t-Test: Hypothesis test for mean comparisons with normally distributed data, especially useful for small samples.

Time Series Analysis: Statistical method for examining data collected in a temporal sequence.

Trend: Long-term development tendency of a time series in time series analysis.

Type I Error: Error in hypothesis tests where a true null hypothesis is erroneously rejected.

Type II Error: Error in hypothesis tests where a false null hypothesis is erroneously retained.

Unbiased Estimator: Estimator whose expected value equals the true parameter value.

Variable: Fundamental concept in statistics for a measurable property or characteristic that can vary among the study units.

Variance: Important measure of dispersion, calculated as the mean of the squared deviations from the arithmetic mean.

Variation Coefficient: Relative measure of dispersion, calculated as the standard deviation divided by the mean; allows comparison of dispersion across datasets with different units or scales.
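
A short sketch tying the dispersion measures above together, assuming Python's standard-library statistics module (illustrative data):

    from statistics import mean, pstdev, pvariance

    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

    # Population variance: mean squared deviation from the mean
    print(pvariance(data))  # 4.0

    # Standard deviation: square root of the variance, same unit as data
    print(pstdev(data))     # 2.0

    # Variation coefficient: standard deviation divided by the mean
    print(pstdev(data) / mean(data))  # 0.4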

Weighted Mean: Measure of central tendency where certain values are given more weight than others during calculation.

Wilcoxon Test: Non-parametric hypothesis test (signed-rank test) for paired samples; alternative to the paired t-test.