chickadee » statistics

Statistics

This library is a port of Larry Hunter's Lisp statistics library to chicken scheme.

The library provides a number of formulae and methods taken from the book "Fundamentals of Biostatistics" by Bernard Rosner (5th edition).

Statistical Distributions

To use this library, you need to understand the underlying statistics. In brief:

The Binomial distribution is used when counting discrete events in a series of trials, each of which events has a probability p of producing a positive outcome. An example would be tossing a coin n times: the probability of a head is p, and the distribution gives the expected number of heads in the n trials. The binomial distribution is defined as B(n, p).

The Poisson distribution is used to count discrete events which occur with a known average rate. A typical example is the decay of radioactive elements. A poisson distribution is defined Pois(mu).

The Normal distribution is used for real-valued events which cluster around a specific mean with a symmetric variance. A typical example would be the distribution of people's heights. A normal distribution is defined N(mean, variance).

Provided Functions

Utilities

(average-rank value sorted-values) procedure

returns the average position of given value in the list of sorted values: the rank is based from 1.

> (average-rank 2 '(1 2 2 3 4))
5/2
(beta-incomplete x a b) procedure
(bin-and-count items n) procedure

Divides the range of the list of items into n bins, and returns a vector of the number of items which fall into each bin.

> (bin-and-count '(1 1 2 3 3 4 5) 5)
#(2 1 2 1 1)
(combinations n k) procedure

returns the number of ways to select k items from n, where the order does not matter.

(factorial n) procedure

returns the factorial of n.

(find-critical-value p-function p-value #:increasing?) procedure

given a monotonic function p-function taking a single value x to y, returns the value of x which makes (p-function x) closest to p-value. A boolean keyword parameter #:increasing? determines if function should be increasing or decreasing (the default).

(fisher-z-transform r) procedure

returns the transformation of a correlation coefficient r into an approximately normal distribution.

(gamma-incomplete a x) procedure
(gamma-ln x) procedure
(permutations n k) procedure

returns the number of ways to select k items from n, where the order does matter.

(random-normal mean sd) procedure

returns a random number distributed with specified mean and standard deviation.

(random-pick items) procedure

returns a random item from the given list of items.

(random-sample n items) procedure

returns a random sample from the list of items without replacement of size n.

(random-weighted-sample n items weights) procedure

returns a random sample from the list of items without replacement of size n, where each sample has a defined probability of selection (weight).

(sign n) procedure

returns 0, 1 or -1 according to if n is zero, positive or negative.

(square n) procedure
(cumsum sequences) procedure

returns the cumulative sum of a sequence.

Descriptive statistics

These functions provide information on a given list of numbers, the items. Note, the list does not have to be sorted.

(mean items) procedure

returns the arithmetic mean of the items (the sum of the numbers divided by the number of numbers).

(mean '(1 2 3 4 5)) => 3
(median items) procedure

returns the value which separates the upper and lower halves of the list of numbers.

(median '(1 2 3 4)) => 5/2
(mode items) procedure

returns two values. The first is a list of the modes and the second is the frequency. (A mode of a list of numbers is the most frequently occurring value.)

> (mode '(1 2 3 4))
(1 2 3 4)
1
> (mode '(1 2 2 3 4))
(2)
2
> (mode '(1 2 2 3 3 4))
(2 3)
2
(geometric-mean items) procedure

returns the geometric mean of the items (the result of multiplying the items together and then taking the nth root, where n is the number of items).

(geometric-mean '(1 2 3 4 5)) => 2.60517108469735
(range items) procedure

returns the difference between the biggest and the smallest value from the list of items.

(range '(5 1 2 3 4)) => 4
(percentile items percent) procedure

returns the item closest to the percent value if the items are sorted into order; the returned item may be in the list, or the average of adjacent values.

(percentile '(1 2 3 4) 50) => 5/2
(percentile '(1 2 3 4) 67) => 3
(variance items) procedure
(standard-deviation items) procedure
(coefficient-of-variation items) procedure

returns 100 * (std-dev / mean) of the items.

(coefficient-of-variation '(1 2 3 4)) => 51.6397779494322
(standard-error-of-the-mean items) procedure

returns std-dev / sqrt(length items).

 (standard-error-of-the-mean '(1 2 3 4)) => 0.645497224367903
(mean-sd-n items) procedure

returns three values, one for the mean, one for the standard deviation, and one for the length of the list.

> (mean-sd-n '(1 2 3 4))
5/2
1.29099444873581
4

Distributional functions

(binomial-probability n k p) procedure

returns the probability that the number of positive outcomes for a binomial distribution B(n, p) is k.

> (do-ec (: i 0 11) 
         (format #t "i = ~d P = ~f~&" i (binomial-probability 10 i 0.5)))
i = 0 P = 0.0009765625
i = 1 P = 0.009765625
i = 2 P = 0.0439453125
i = 3 P = 0.1171875
i = 4 P = 0.205078125
i = 5 P = 0.24609375
i = 6 P = 0.205078125
i = 7 P = 0.1171875
i = 8 P = 0.0439453125
i = 9 P = 0.009765625
i = 10 P = 0.0009765625
(binomial-cumulative-probability n k p) procedure

returns the probability that less than k positive outcomes occur for a binomial distribution B(n, p).

> (do-ec (: i 0 11) 
         (format #t "i = ~d P = ~f~&" i (binomial-cumulative-probability 10 i 0.5)))
i = 0 P = 0.0
i = 1 P = 0.0009765625
i = 2 P = 0.0107421875
i = 3 P = 0.0546875
i = 4 P = 0.171875
i = 5 P = 0.376953125
i = 6 P = 0.623046875
i = 7 P = 0.828125
i = 8 P = 0.9453125
i = 9 P = 0.9892578125
i = 10 P = 0.9990234375
(binomial-ge-probability n k p) procedure

returns the probability of k or more positive outcomes for a binomial distribution B(n, p).

(binomial-le-probability n k p) procedure

returns the probability k or fewer positive outcomes for a binomial distribution B(n, p).

(poisson-probability mu k) procedure

returns the probability of k events occurring when the average is mu.

> (do-ec (: i 0 20) 
         (format #t "P(X=~2d) = ~,4f~&" i (poisson-probability 10 i)))
P(X= 0) = 0.0000
P(X= 1) = 0.0005
P(X= 2) = 0.0023
P(X= 3) = 0.0076
P(X= 4) = 0.0189
P(X= 5) = 0.0378
P(X= 6) = 0.0631
P(X= 7) = 0.0901
P(X= 8) = 0.1126
P(X= 9) = 0.1251
P(X=10) = 0.1251
P(X=11) = 0.1137
P(X=12) = 0.0948
P(X=13) = 0.0729
P(X=14) = 0.0521
P(X=15) = 0.0347
P(X=16) = 0.0217
P(X=17) = 0.0128 
P(X=18) = 0.0071
P(X=19) = 0.0037
(poisson-cumulative-probability mu k) procedure

returns the probability of less than k events occurring when the average is mu.

> (do-ec (: i 0 20) 
         (format #t "P(X=~2d) = ~,4f~&" i (poisson-cumulative-probability 10 i)))
P(X= 0) = 0.0000
P(X= 1) = 0.0000
P(X= 2) = 0.0005
P(X= 3) = 0.0028
P(X= 4) = 0.0103
P(X= 5) = 0.0293
P(X= 6) = 0.0671
P(X= 7) = 0.1301
P(X= 8) = 0.2202
P(X= 9) = 0.3328
P(X=10) = 0.4579
P(X=11) = 0.5830
P(X=12) = 0.6968
P(X=13) = 0.7916
P(X=14) = 0.8645
P(X=15) = 0.9165
P(X=16) = 0.9513
P(X=17) = 0.9730
P(X=18) = 0.9857
P(X=19) = 0.9928
(poisson-ge-probability mu k) procedure

returns the probability of k or more events occurring when the average is mu.

(normal-pdf x mean variance) procedure

returns the likelihood of x given a normal distribution with stated mean and variance.

> (do-ec (: i 0 11) 
         (format #t "~3d ~,4f~&" i (normal-pdf i 5 4)))
 0 0.0088
 1 0.0270
 2 0.0648
 3 0.1210
 4 0.1760
 5 0.1995
 6 0.1760
 7 0.1210
 8 0.0648
 9 0.0270
10 0.0088
(convert-to-standard-normal x mean variance) procedure

returns a value for x rescaling the given normal distribution to a standard N(0, 1).

> (convert-to-standard-normal 5 6 2)
-1/2
(phi x) procedure

returns the cumulative distribution function (CDF) of the standard normal distribution.

> (do-ec (: x -2 2 0.4)
         (format #t "~4,1f ~,4f~&" x (phi x)))
-2.0 0.0228
-1.6 0.0548
-1.2 0.1151
-0.8 0.2119
-0.4 0.3446
 0.0 0.5000
 0.4 0.6554
 0.8 0.7881
 1.2 0.8849
 1.6 0.9452
(z percentile) procedure

returns the inverse of the standard normal distribution. Input is a percentile, between 0.0 and 1.0.

( t-distribution degrees-of-freedom percentile) procedure

returns the point in the t-distribution given the degrees-of-freedom and percentile. degrees-of-freedom must be a positive integer, and percentile a value between 0.0 and 1.0.

(chi-square degrees-of-freedom percentile) procedure

returns the point at which chi-square distribution has percentile to its left, using given degrees-of-freedom.

(chi-square-cdf x degrees-of-freedom) procedure

returns the probability that a random variable is to the left of x using the chi-square distribution with given degrees-of-freedom.

Confidence intervals

These functions report bounds for an observed property of a distribution: the bounds are tighter as the confidence level, alpha, varies from 0.0 to 1.0.

(binomial-probability-ci n p alpha) procedure

returns two values, the upper and lower bounds on an observed probability p from n trials with confidence (1-alpha).

> (binomial-probability-ci 10 0.8 0.9)
0.724273681640625 
0.851547241210938
; 2 values
(poisson-mu-ci k alpha) procedure

returns two values, the upper and lower bounds on the poisson parameter if k events are observed; the bound is for confidence (1-alpha).

> (poisson-mu-ci 10 0.9)
8.305419921875
10.0635986328125
; 2 values
(normal-mean-ci mean standard-deviation k alpha) procedure

returns two values, the upper and lower bounds on the mean of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

> (normal-mean-ci 0.5 0.1 10 0.8)
0.491747852700165
0.508252147299835
; 2 values
(normal-mean-ci-on-sequence items alpha) procedure

returns two values, the upper and lower bounds on the mean of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

> (normal-mean-ci-on-sequence '(1 2 3 4 5) 0.9)
2.40860081649174
3.59139918350826
; 2 values
(normal-variance-ci standard-deviation k alpha) procedure

returns two values, the upper and lower bounds on the variance of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

(normal-variance-ci-on-sequence items alpha) procedure

returns two values, the upper and lower bounds on the variance of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

normal-sd-ci standard-deviation k alpha) procedure

returns two values, the upper and lower bounds on the standard deviation of the normal distibution of k events are observed; the bound is for confidence (1-alpha).

(normal-sd-ci-on-sequence sequence items) procedure

returns two values, the upper and lower bounds on the standard deviation of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).

Hypothesis testing

These functions report on the significance of an observed sample against a given distribution.

(parametric)
(z-test x-bar n #:mu #:sigma #:tails) procedure

Given x-bar the sample mean, n the number in the sample, #:mu the distribution mean (defaults to 0), #:sigma the distribution standard deviation (defaults to 1), and #:tails the significance to report on:

  • ':both, the probability of the difference between x-bar and #:mu
  • ':positive, the probability that observation is >= x-bar
  • ':negative, the probability that observation is <= x-bar

e.g. given a distribution with mean 50 and standard deviation 10

; probability that a single observation is <= 40
> (z-test 40 1 #:mu 50 #:sigma 10 #:tails ':negative)
0.158655
; probability that 10 observations are <= 40
> (z-test 40 10 #:mu 50 #:sigma 10 #:tails ':negative)
0.000783
; probability that 5 observations give a mean of 40
> (z-test 40 5 #:mu 50 #:sigma 10)
0.025347
(z-test-on-sequence observations #:mu #:sigma #:tails) procedure

As for z-test except x-bar and n are computed from given observations.

(t-test-one-sample x-bar sd n mu #:tails) procedure

Given observed data with mean x-bar, standard devation sd and number of observations n (n < 30), return the significance of the sample compared with the population mean mu. #:tails is one of:

  • ':both two-sided (default)
  • ':positive one-sided, x-bar >= mu
  • ':negative one-sided, x-bar <= mu
(t-test-one-sample-on-sequence observations mu #:tails) procedure

As for t-test-one-sample except x-bar, sd and n are computed from given observations.

(t-test-paired t-bar sd n #:tails) procedure

Computes the significance of the differences between two sequences of data: the differences are given as their mean, t-bar, standard deviation, sd, and number of measurements, n.

(t-test-paired-on-sequences before after #:tails) procedure

Computes the significance of the difference between two sequences of data: one before an experimental change and one after. #:tails is as for t-significance.

> (t-test-paired-on-sequences '(4 3 5) '(1 1 3))
0.0198039411803931
(t-test-two-sample mean-1 sd-1 n-1 mean-2 sd-2 n-2 #:variances-equal? #:variance-significance-cutoff #:tails) procedure

Computes the significance of the difference of two means given the sample standard deviations and sizes.

(t-test-two-sample-on-sequences sequence-1 sequence-2 #:tails) procedure

Significance of difference of two sequences of observations.

(f-test variance-1 n1 variance-2 n2 #:tails) procedure

Tests for the equality of two variances.

(chi-square-test-one-sample observed-variance sample-size test-variance #:tails) procedure

Tests for significance of difference between an observed and a test variance.

(binomial-test-one-sample p-hat n p #:tails #:exact?) procedure

Returns the significance of a one sample test with n observations, observed probability p-hat and expected probability p.

(binomial-test-two-sample p-hat-1 n-1 p-hat-2 n-2 #:tails #:exact?) procedure

Returns the significance of a two sample test.

(fisher-exact-test a b c d #:tails) procedure

Given a 2x2 contingency table, returns a p value using Fisher's exact test. a and b form the first row of the contingency table, c and d the second row.

(mcnemars-test a-discordant-count b-discordant-count #:exact?) procedure

For measuring effectiveness of, e.g., one treatment over another. a-discordant-count is the number of times when A worked, b-discordant-count the number of times B worked.

(poisson-test-one-sample observed mu #:tails #:approximate?) procedure

Computes significance of the number of observed events under a Poisson distribution against mu expected events.

(non parametric)
(sign-test plus-count minus-count #:exact? #:tails) procedure
(sign-test-on-sequence sequence-1 sequence-2 #:exact? #:tails) procedure

Takes two equal-sized sequences of observations, and reports if the entries of one are different to those in the other.

(wilcoxon-signed-rank-test differences #:tails) procedure

Given at least 16 differences, reports if the positive differences are significantly larger or smaller than the negative differences.

(wilcoxon-signed-rank-test-on-sequences sequence-1 sequence-2 #:tails) procedure

Given two sequences of at least 16 observations, computes wilcoxon-signed-rank-test on the differences.

(chi-square-test-rxc contingency-table) procedure

Given a contingency table (a SRFI-63 array), returns significance of relation between row and column variable.

(chi-square-test-for-trend row1-counts row2-counts) procedure

Returns p significance of trend, and prints a string to show if increasing or decreasing.

Sample size estimates

(t-test-one-sample-sse mean-1 mean-2 sigma-1 #:alpha #:1-beta #:tails) procedure

Returns the size of sample necessary to distinguish a normally distributed sample with mean-2 from a population mean-1 standard deviation sigma-1. The significance #:alpha (defaults to 0.05), power #:1-beta (0.95) and sides #:tails (':both) may be altered.

> (t-test-one-sample-sse 5.0 5.2 0.5)
163
(t-test-two-sample-sse mean-1 sigma-1 mean-2 sigma-2 #:alpha #:1-beta #:tails #:sample-ratio) procedure

Returns the size of sample necessary to distinguish a normally distributed sample N(mean-1, sigma-1) from a normally distributed sample N(mean-2, sigma-2). The significance #:alpha (defaults to 0.05), power #:1-beta (0.95), sides #:tails (':both) and sample-ratio #:sample-ratio (1) may be altered.

(t-test-paired-sse difference-mean difference-sigma #:alpha #:1-beta #:tails) procedure

Returns the size of sample to produce a given mean and standard deviation in the differences of two samples.

(binomial-test-one-sample-sse p-estimated p-null #:alpha #:1-beta #:tails) procedure

Returns the size of sample needed to test whether an observed probability is significantly different from a particular binomial null hypothesis with a significance alpha and a power 1-beta.

(binomial-test-two-sample-sse p-one p-two #:alpha #:1-beta #:tails #:sample-ratio) procedure

Returns the size of sample needed to test if given two binomial probabilities are significantly different. #:sample-ratio can be given if the two samples differ in size.

(binomial-test-paired-sse pd pa #:alpha #:1-beta #:tails) procedure

Sample size estimate for McNemar's discordant pairs test.

(correlation-sse rho #:alpha #:1-beta) procedure

Returns the size of sample necessary to find a correlation of value rho with significance #:alpha (defaults to 0.05) and power #:1-beta (defaults to 0.95).

> (correlation-sse 0.80 #:alpha 0.05 #:1-beta 0.9)
11

Correlation and regression

(linear-regression line-defn) procedure

Given a line definition as a list of point pairs, first prints to the terminal and then returns 5 values for the best fitting line through the points:

  • the y-intercept
  • the slope
  • the correlation coefficient, r
  • the square of the correlation coefficient, r^2
  • the significance of the difference of the slope from zero, p

(This is also called the Pearson correlation; used when relation expected to be linear. Also see spearman-rank-correlation.)

> (linear-regression '((1.0 0.1) (2.0 0.3) (3.0 0.8)))
Intercept = -0.3, slope = 0.35, r = 0.970725343394151, R^2 = 0.942307692307692, p = 0.154420958311267
-0.3
0.35
0.970725343394151
0.942307692307692
0.154420958311267
; 5 values
(correlation-coefficient line-defn) procedure

As above, but only returns the value of r:

> (correlation-coefficient '((1.0 0.1) (2.0 0.3) (3.0 0.8)))
0.970725343394151
(correlation-test-two-sample r1 n1 r2 n2 #:tails) procedure

Returns the significance of the similarity between two correlations. #:tails determines how the comparison is made: ':both measures the difference, ':negative if r1 < r2 and #':positive if r2 > r1.

(correlation-test-two-sample-on-sequences points-1 points-2 #:tails) procedure

As above, but computes the correlations from given lists of points.

(spearman-rank-correlation points) procedure

Returns two values, the Spearman Rank measure of correlation between given list of points, and the p-significance of the correlation. (This correlation is used for non-linear relations; compare with linear-regression.)

Significance test functions

(t-significance t-value degrees-of-freedom #:tails) procedure

returns the probability of t-value for given degrees-of-freedom. The keyword #:tails modifies the calculation to be two-sided (the default) with ':both, or one-sided, ':positive or ':negative.

> (t-significance 0.2 5)
0.849360513995829
> (t-significance 0.2 5 #:tails ':positive)
0.424680256997915
> (t-significance 0.2 5 #:tails ':negative)
0.575319743002086
(f-significance f-value numerator-dof denominator-dof #:one-tailed?) procedure

returns the probability of f-value for given numerator-dof and denominator-dof. The boolean keyword #:one-tailed? indicates if calculation is two-sided (the default) or not.

> (f-significance 1.5 8 2)
0.920449812578091
> (f-significance 1.5 8 2 #:one-tailed? #t)
0.460224906289046

Authors

Peter Lane wrote the Scheme version of this library. The original Lisp version was written by Larry Hunter.

License

GPL version 3.0.

Requirements

Needs srfi-1, srfi-25, srfi-63, srfi-69, vector-lib, numbers, extras, foreign, format

Uses the GNU scientific library for basic numeric processing, so requires libgsl, libgslcblas and the development files for libgsl.

Version History

Contents »