Statistics is the science of collecting and interpreting numerical data. It is closely related to mathematics and uses mathematical methods based on probability theory, but it has empirical roots and applications ranging from economics to biology, epidemiology, genetics, sociology and even literary text analysis.
The term “statistics” itself is used to describe two related but different sets of techniques and measures. Descriptive statistics are used to summarize the characteristics of a set of data. Inferential statistics are used to generalize from a sample to the population at large.
General terms used in statistics
Population is the entire set of items under consideration. A population can be composed of almost any items, as long as all of them share at least one measurable characteristic. For a sociologist, a population might consist of all workers in a company or inhabitants of a town; educational research might be concerned with schools, while quality control will check bolts, nuts or whole cars. Populations can be small (children in a family) or very large (trees in the Amazon rainforest). While descriptive statistics measure and describe all units in a population, sampling is used to select some units in order to draw inferences about the population.
Sample is the subset of units selected from the population and used as a basis for inferential statistics. Sampling is used when the population is too large for all of its units to be measured in practice. There are several types of sample, among which a random sample is the most desirable (if hard to achieve) for the purposes of statistical analysis. The goal is to select a sample that is representative of all units in the population.
Random sample is a specific type of sample and a crucial one for inferential statistics. A true random sample is a sample in which each element, as it is drawn, has an equal probability of being selected: the items are chosen purely by chance. In practice, drawing truly random samples can pose many problems and, especially with large populations, other sampling methods are often used instead.
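As a minimal sketch of drawing a simple random sample, the following Python snippet uses the standard library's random module. The population of numbered units, the sample size and the fixed seed are assumptions made purely for illustration.

```python
import random

# Hypothetical population: ID numbers for 1,000 units (e.g. marriage records).
population = list(range(1, 1001))

# A fixed seed is used only so that the example is repeatable.
rng = random.Random(42)

# Simple random sample without replacement: every unit has the same
# chance of ending up in the sample, and no unit is picked twice.
sample = rng.sample(population, k=50)

print(f"sample size: {len(sample)}, distinct units: {len(set(sample))}")
```

Sampling without replacement is the usual choice when each unit should be measured at most once.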
Parameter is a value of a certain measure that describes a population. Mean age at marriage, as recorded by the Registry Office, is a parameter of the population consisting of all people who get married in the UK in a year.
Statistic is a value of a certain measurement in a sample. Mean age at marriage in a random sample of marriage records retrieved from the Registry Office database is a statistic. The main purpose of calculating statistics is to enable estimation of the value of the parameter in the population.
Estimation is a process of working out the value of a parameter in the population on the basis of the value of the statistic in the sample.
Estimate is the predicted (indicated) value in the population, obtained in the process of estimation. Estimates can be given as “point estimates”, where one value is given, or “confidence intervals”, where a range of values is given within which the population value is likely to fall.
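The relationship between a parameter, a statistic and a point estimate can be sketched in Python as follows. The population here is simulated (made-up ages at marriage, not Registry Office data), and the sample size of 400 is an arbitrary assumption.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed only for repeatability

# Hypothetical population: ages at marriage for 100,000 people (simulated).
population = [rng.gauss(31, 6) for _ in range(100_000)]

# Parameter: a measure that describes the whole population.
# In real research it is usually unknown; here we can compute it.
parameter = statistics.fmean(population)

# Statistic: the same measure computed on a random sample.
sample = rng.sample(population, k=400)
statistic = statistics.fmean(sample)

# The sample mean serves as the point estimate of the population mean.
print(f"population mean (parameter):  {parameter:.2f}")
print(f"sample mean (point estimate): {statistic:.2f}")
```

The two values will generally be close but not identical; the gap between them is the sampling error that inferential statistics tries to quantify.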
Variable is a characteristic that comprises a set of different, but logical, attributes. A variable can be measured on different scales of measurement. These measurement scales are extremely important, as they determine which statistical techniques can be used to analyze the data. The measurement levels are defined as: nominal (labels only, e.g. sex), ordinal (ordered from high to low, e.g. birth order), interval (equal distances between attributes but arbitrary zero, e.g. Centigrade temperature) and ratio (equal distances and non-arbitrary zero, e.g. age, weight, bank balance).
Categorical variables are variables that can take only a limited set of attributes (these are usually measured on a nominal or ordinal, sometimes interval scale) while continuous variables can take any value (usually ratio scale).
Normal distribution is a specific type of probability distribution which is typical for continuous random variables. A graph of a normal probability distribution takes the familiar “bell curve” shape. Very high and very low values have a low probability of occurrence while middle values have a high probability (in a perfectly normal distribution, the mean, the median and the mode are equal). Many statistical techniques are based on the assumption that the variable tested or estimated has a normal distribution in the population. Such tests can give misleading results for variables that don't conform to that assumption, although “normal-like” or “almost-normal” distributions are often considered to be satisfactory; and the requirement for the variable to be a truly random variable measured on a ratio scale is also often relaxed in practice, especially for larger populations.
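To illustrate how probability is concentrated around the middle of a normal distribution, the sketch below uses Python's statistics.NormalDist to compute the share of values that fall within one, two and three standard deviations of the mean. The standard normal distribution (mean 0, standard deviation 1) is chosen here only for convenience.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

# Probability of a value falling within k standard deviations of the mean,
# computed as the difference of two cumulative probabilities.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd of the mean: {p:.4f}")
```

The results are roughly 0.68, 0.95 and 0.997, which is why values far from the mean are rare.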
Descriptive statistics: the main terms
Descriptive statistics are used to describe important features of a set of data. This is particularly useful for large data sets, where presenting hundreds (or even millions) of individual values would result in too much noise and make any analysis impossible. Most of the main terms used in descriptive statistics are the names of measures that summarize data.
Central tendency measures include mean (the average value, calculated by adding all the individual values and dividing by the number of cases), median (the middle value, defining the point that splits the data into two equal halves) and mode (the most frequently occurring value). Mean and median are commonly used, although the mean is most informative for symmetrical, and ideally normal, distributions, because a few extreme values can pull it away from the bulk of the data. Median is a safer measure of central tendency to use, although there are fewer tests that can utilize it for further inferences.
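The three central tendency measures can be compared on a small, deliberately skewed data set. The figures below are hypothetical weekly pay values with one very high earner, chosen to show how an extreme value affects the mean but not the median.

```python
import statistics

# Hypothetical data: weekly pay in a small firm, with one very high earner.
pay = [300, 320, 320, 350, 380, 400, 2500]

mean = statistics.mean(pay)      # sum of values divided by number of cases
median = statistics.median(pay)  # middle value of the sorted data
mode = statistics.mode(pay)      # most frequently occurring value

print(f"mean:   {mean:.2f}")
print(f"median: {median}")
print(f"mode:   {mode}")
```

Here the mean (about 652.86) is pulled far above the median (350) by the single large value, which is the situation in which the median is the safer summary.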
Spread measures include range (distance between highest and lowest values), standard deviation (describes how tightly the data points are concentrated around the mean: the larger the standard deviation, the bigger the spread, or variability of the data) and variance (standard deviation squared).
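A short sketch of the spread measures, using a small made-up data set. Python's statistics module offers two versions of each measure: pvariance and pstdev treat the data as a whole population (dividing by n), while variance and stdev treat the data as a sample (dividing by n - 1).

```python
import statistics

# Made-up data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the highest and lowest values.
data_range = max(data) - min(data)

# Population versions: the data are treated as the entire population.
pop_variance = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Sample versions: the data are treated as a sample from a larger population.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

print(f"range: {data_range}")
print(f"population variance: {pop_variance}, population sd: {pop_sd}")
print(f"sample variance: {sample_variance:.3f}, sample sd: {sample_sd:.3f}")
```

In both versions the variance is the standard deviation squared; the sample versions are slightly larger because of the n - 1 divisor.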
Shape measures include skewness (symmetry, or a lack thereof: the larger the skewness, the longer one “tail” of the distribution graph is than the other) and kurtosis (how peaked or flat the distribution graph is in its central part and its tails).
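Skewness and kurtosis can be computed in several slightly different ways. The sketch below assumes the simple moment-based definitions (the mean of the cubed and fourth-power standardized values); statistical packages often apply additional sample-size corrections. The function names and data sets are illustrative only.

```python
import statistics

def standardized(values):
    """Return each value as its distance from the mean in population sd units."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(x - m) / s for x in values]

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    return statistics.fmean([z ** 3 for z in standardized(values)])

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution scores 0."""
    return statistics.fmean([z ** 4 for z in standardized(values)]) - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(f"skewness (symmetric):    {skewness(symmetric):.3f}")
print(f"skewness (right-skewed): {skewness(right_skewed):.3f}")
print(f"excess kurtosis (symmetric): {excess_kurtosis(symmetric):.3f}")
```

The symmetric data set has zero skewness, while the one with a single large value has positive skewness, reflecting its longer right tail.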
Inferential statistics: the main terms
Statistical inference is used to draw inferences about a “population” from a “sample” (see above). These two concepts themselves form the basis for all of statistical inference. All statistical inference tools make assumptions about the sample and the population and all are meaningless if those assumptions are not fulfilled.
Statistical test is an essential part of statistical inference. It tests a hypothesis (often about a difference between two groups, or a parameter value) against a “null hypothesis” (which states a lack of difference). All statistical tests are based on mathematical characteristics of various distributions (of which the normal distribution is the most commonly used) and all perform inferences with a certain degree of certainty, defined by the significance level.
Significance level is perhaps the most misinterpreted and misunderstood statistical term. For a statistical test of a hypothesis against a null hypothesis, the significance level defines the probability of wrongly rejecting the null hypothesis (which supposes a lack of difference) if it is in fact true. Thus, a claim of a “statistically significant” difference (or any other result) says nothing about the scale or meaningfulness of the difference, only that it is unlikely to be due to chance alone. A difference between two groups that is significant at the 5% level (often reported as a 95% confidence level), for example, is unlikely to be due to random chance (if there were really no difference, a result at least this large would be expected in fewer than 5 of every 100 tests performed), but the difference itself might be clinically or practically meaningless. In large samples even very small differences achieve statistical significance, but that doesn't make them any more meaningful.
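The point about large samples can be illustrated with a simple two-sample z-test. This is a sketch under stated assumptions: two groups of equal size, a known common standard deviation, and made-up numbers (a difference of 0.5 points on a scale with a standard deviation of 15). The function name two_sided_p is introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(diff, sd, n):
    """Two-sided p-value for a difference in means between two groups of
    size n each, assuming a known common standard deviation (z-test)."""
    standard_error = sd * sqrt(2 / n)
    z = abs(diff) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

alpha = 0.05        # significance level
diff, sd = 0.5, 15  # the same tiny difference in both scenarios

p_small = two_sided_p(diff, sd, n=100)
p_large = two_sided_p(diff, sd, n=1_000_000)

print(f"n = 100 per group:       p = {p_small:.3f}, significant: {p_small < alpha}")
print(f"n = 1,000,000 per group: p = {p_large:.3g}, significant: {p_large < alpha}")
```

The difference is identical in both cases and equally small in practical terms; only the sample size changes whether it counts as statistically significant.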
Confidence interval is a way of estimating the unknown population parameter as a range of values. The wider the confidence interval, the less precise the estimate is. It is usually reported together with a confidence level (one minus the significance level); thus an estimate might specify that, based on the sample, the percentage of children in a given school who will pass the reading test for their age is 70% plus/minus 3%, at a confidence level of 95%. This means that it's very likely that the actual value in the population is somewhere between 67% and 73%: intervals constructed this way would contain the true value in about 95 of every 100 samples.
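An interval like the one in the reading test example can be sketched with the normal approximation for a proportion. The sample size is not given in the text, so the figures below (630 passes out of 900 children tested) are assumptions chosen to reproduce roughly 70% plus/minus 3%.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample: 630 of 900 children passed the reading test.
passed, n = 630, 900
confidence = 0.95

p_hat = passed / n  # point estimate of the pass rate

# Critical value for a two-sided interval (about 1.96 for 95% confidence).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error under the normal approximation.
margin = z * sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - margin, p_hat + margin

print(f"{p_hat:.0%} plus/minus {margin:.1%} -> ({low:.1%}, {high:.1%})")
```

With these assumed figures the interval runs from about 67% to 73%. A smaller sample would give a wider interval, and a higher confidence level would also widen it.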
Statistics offers crucial tools for the analysis of experimental and other scientific research data. It is also widely used in forecasting and quality control, business management, industrial production and government.
Knowing and understanding the basic statistical concepts and terminology is useful not only for anybody wishing to learn how to use statistical techniques or to understand technical reports and research journal articles, but also for anybody who wants to understand how scientific and survey data are reported (and often misinterpreted) in the mass media.
Sources and further reading:
Internet Glossary of Statistical Terms. Retrieved on 17 Jan 2011 from http://www.animatedsoftware.com/statglos/statglos.htm#index
Easton V. & McColl J. Statistics Glossary. Retrieved on 17 Jan 2011 from http://www.stats.gla.ac.uk/steps/glossary/
Engineering statistics handbook. Retrieved on 17 Jan 2011 from http://www.itl.nist.gov/div898/handbook/index.htm