8.2. Descriptive Statistics

8.2.1. Mean

The sample mean is defined as:

\[\mu = \frac{1}{n} \sum_i x_i\]
  • The mean of a sample is the summary statistic computed with the formula above, where \(n\) is the sample size and \(x_i\) are the individual values.

  • The mean is one way to describe the central tendency of the data.

  • An average is any one of many summary statistics one might choose to describe the typical value, or central tendency, of a sample; the mean is one such average.
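The formula above can be sketched directly in Python. This is a minimal illustration, not a reference implementation; the function name `mean` and the sample values are made up for the example.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("mean requires at least one value")
    # (1 / n) * sum of x_i
    return sum(values) / len(values)


sample = [2, 4, 4, 5, 10]  # hypothetical sample
print(mean(sample))  # 5.0
```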

8.2.2. Variance

The variance of a sample is defined as:

\[\sigma^2 = \frac{1}{n-1} \sum_i (x_i - \mu)^2\]
  • \(x_i - \mu\) is called the deviation from the mean, so the variance is (up to the \(n-1\) divisor) the mean squared deviation.

  • The square root of the variance, \(\sigma\), is called the standard deviation.
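As a rough sketch, the variance formula above (with its \(n-1\) divisor) and the standard deviation could be computed as follows. The function names and the sample values are illustrative assumptions.

```python
import math


def variance(values):
    """Sample variance with the n - 1 divisor used in the formula above."""
    values = list(values)
    n = len(values)
    if n < 2:
        raise ValueError("variance requires at least two values")
    mu = sum(values) / n
    # Sum of squared deviations from the mean, divided by n - 1.
    return sum((x - mu) ** 2 for x in values) / (n - 1)


def std_dev(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(values))


sample = [2, 4, 4, 5, 10]  # hypothetical sample, mean 5
print(variance(sample))  # 9.0
print(std_dev(sample))   # 3.0
```

Here the deviations are -3, -1, -1, 0 and 5, their squares sum to 36, and dividing by \(n-1 = 4\) gives 9.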

8.2.3. Distribution

  • Summary statistics are concise but dangerous, because they obscure the data. An alternative is to look at the distribution, which describes how often each value appears.

  • A histogram is a graph that shows the frequency or probability of each value.

  • Probability, in this context, is a frequency expressed as a fraction of the sample size.

  • The process of converting frequencies to probabilities is called normalization.

  • A normalized histogram is called a PMF, or probability mass function.

  • The most common value in a distribution is called its mode.

  • The mode is also a summary statistic. In certain cases, it does a very good job of describing the typical value.

  • Outliers are values that are far from the central tendency.

  • It is difficult to compare two histograms when the samples differ in size, because the raw frequencies are not on the same scale. PMFs avoid this problem because they are normalized.
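The ideas in this list (histogram, normalization, PMF and mode) can be illustrated with a short sketch built on `collections.Counter`. The helper names and the sample are assumptions made for this example.

```python
from collections import Counter


def histogram(values):
    """Map each value to its frequency (number of occurrences)."""
    return Counter(values)


def pmf(values):
    """Normalize a histogram: divide each frequency by the sample size."""
    hist = histogram(values)
    n = sum(hist.values())
    return {value: freq / n for value, freq in hist.items()}


def mode(values):
    """Return the most frequent value (ties resolved by first occurrence)."""
    return histogram(values).most_common(1)[0][0]


sample = [1, 2, 2, 3, 5]  # hypothetical sample
print(histogram(sample)[2])  # 2   (frequency of the value 2)
print(pmf(sample)[2])        # 0.4 (2 out of 5 values)
print(mode(sample))          # 2
```

Because the probabilities in a PMF sum to 1 regardless of sample size, two PMFs can be compared directly even when the underlying samples have different sizes.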

8.2.4. Outliers

  • Outliers are values far from the central tendency.
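The text does not say how far is "far". One common heuristic, used here purely as an assumption, is to flag values more than `k` sample standard deviations from the mean. The function name, the threshold `k=2.0` and the data are all hypothetical.

```python
import statistics


def find_outliers(values, k=2.0):
    """Return values more than k sample standard deviations from the mean.

    The k-sigma rule is an assumed heuristic, not a definition from the text.
    """
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [x for x in values if abs(x - mu) > k * sigma]


sample = [9, 10, 10, 11, 10, 9, 11, 50]  # hypothetical data with one extreme value
outliers = find_outliers(sample)
trimmed = [x for x in sample if x not in outliers]  # "trim": remove the outliers
print(outliers)  # [50]
print(trimmed)   # [9, 10, 10, 11, 10, 9, 11]
```

Whether a flagged value should actually be removed depends on the data: an outlier may be a recording error or a correct but unusual measurement.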

8.2.5. Relative Risk

  • Relative risk is the ratio of two probabilities.

Example

  • The probability that a first baby is born early is 18.2%.

  • The probability that other babies are born early is 16.8%.

  • The relative risk is 18.2 / 16.8, or about 1.08 (a ratio, not a percentage).

  • First babies are therefore about 8% more likely to be born early.
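The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative; the two probabilities are the ones quoted above.

```python
p_first_early = 0.182  # probability that a first baby is born early
p_other_early = 0.168  # probability that other babies are born early

# Relative risk is the ratio of the two probabilities.
relative_risk = p_first_early / p_other_early
print(round(relative_risk, 2))  # 1.08

# Expressed as a percentage increase over the baseline group.
percent_more_likely = (relative_risk - 1) * 100
print(round(percent_more_likely))  # 8
```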

8.2.6. Conditional Probability

  • A conditional probability is a probability computed under the assumption that some condition holds.
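One way to compute conditional probabilities from a PMF is to keep only the values that satisfy the condition and then renormalize. The sketch below assumes a PMF stored as a plain dictionary; the function name and the pregnancy-length probabilities are invented for illustration and are not real data.

```python
def conditional_pmf(pmf, condition):
    """Return the PMF restricted to values satisfying `condition`, renormalized."""
    kept = {value: prob for value, prob in pmf.items() if condition(value)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("the condition has zero probability")
    # Divide by the probability of the condition so the result sums to 1.
    return {value: prob / total for value, prob in kept.items()}


# Hypothetical PMF of pregnancy length in weeks (made-up numbers).
weeks_pmf = {37: 0.1, 38: 0.2, 39: 0.4, 40: 0.2, 41: 0.1}

# Probability of each length, given that the baby was not born before week 39.
given_not_early = conditional_pmf(weeks_pmf, lambda week: week >= 39)
print(given_not_early[39])  # about 0.571, i.e. 0.4 / 0.7
```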

Glossary

Central tendency

A characteristic of a sample or population; intuitively, it is the most typical value.

Spread

A characteristic of a sample or population; intuitively, it describes how much variability there is.

Variance

A summary statistic often used to quantify spread.

Standard deviation

The square root of variance, also used as a measure of spread.

Frequency

The number of times a value appears in a sample.

Histogram

A mapping from values to frequencies or a graph that shows this mapping.

Probability

A frequency expressed as a fraction of the sample size.

Normalization

The process of dividing a frequency by a sample size to get a probability.

Distribution

A summary of the values that appear in a sample and the frequency, or probability, of each.

PMF

Probability mass function: a representation of a distribution as a function that maps from values to probabilities.

Mode

The most frequent value in a sample.

Outlier

A value far from the central tendency.

Trim

To remove outliers from a dataset.

Bin

A range used to group nearby values.

Relative risk

A ratio of two probabilities, often used to measure a difference between distributions.

Conditional probability

A probability computed under the assumption that some condition holds.

Clinically significant

A result, such as a difference between groups, that is relevant in practice.

8.2.7. Reference

Change log

Last Modified

$Id: descriptive.rst 249 2012-08-05 06:17:57Z shailesh $