## Statistics

This library is a port of Larry Hunter's Lisp statistics library to chicken scheme.

The library provides a number of formulae and methods taken from the book "Fundamentals of Biostatistics" by Bernard Rosner (5th edition).

## TOC »

### Statistical Distributions

To use this library, you need to understand the underlying statistics. In brief:

The Binomial distribution is used when counting discrete events in a series of trials, each of which events has a probability p of producing a positive outcome. An example would be tossing a coin `n` times: the probability of a head is `p`, and the distribution gives the expected number of heads in the `n` trials. The binomial distribution is defined as B(n, p).

The Poisson distribution is used to count discrete events which occur with a known average rate. A typical example is the decay of radioactive elements. A poisson distribution is defined Pois(mu).

The Normal distribution is used for real-valued events which cluster around a specific mean with a symmetric variance. A typical example would be the distribution of people's heights. A normal distribution is defined N(mean, variance).

### Provided Functions

#### Utilities

`(average-rank value sorted-values)`procedurereturns the average position of given value in the list of sorted values: the rank is based from 1.

> (average-rank 2 '(1 2 2 3 4)) 5/2

`(beta-incomplete x a b)`procedure

`(bin-and-count items n)`procedureDivides the range of the list of

`items`into`n`bins, and returns a vector of the number of items which fall into each bin.> (bin-and-count '(1 1 2 3 3 4 5) 5) #(2 1 2 1 1)

`(combinations n k)`procedurereturns the number of ways to select

`k`items from`n`, where the order does not matter.

`(factorial n)`procedurereturns the factorial of

`n`.

`(find-critical-value p-function p-value #:increasing?)`proceduregiven a monotonic function

`p-function`taking a single value`x`to`y`, returns the value of`x`which makes`(p-function x)`closest to`p-value`. A boolean keyword parameter`#:increasing?`determines if function should be increasing or decreasing (the default).

`(fisher-z-transform r)`procedurereturns the transformation of a correlation coefficient

`r`into an approximately normal distribution.

`(gamma-incomplete a x)`procedure

`(gamma-ln x)`procedure

`(permutations n k)`procedurereturns the number of ways to select

`k`items from`n`, where the order does matter.

`(random-normal mean sd)`procedurereturns a random number distributed with specified mean and standard deviation.

`(random-pick items)`procedurereturns a random item from the given list of items.

`(random-sample n items)`procedurereturns a random sample from the list of items without replacement of size

`n`.

`(random-weighted-sample n items weights)`procedurereturns a random sample from the list of items without replacement of size

`n`, where each sample has a defined probability of selection (weight).

`(sign n)`procedurereturns 0, 1 or -1 according to if

`n`is zero, positive or negative.

`(square n)`procedure

`(cumsum sequences)`procedurereturns the cumulative sum of a sequence.

#### Descriptive statistics

These functions provide information on a given list of numbers, the `items`. Note, the list does not have to be sorted.

`(mean items)`procedurereturns the arithmetic mean of the

`items`(the sum of the numbers divided by the number of numbers).(mean '(1 2 3 4 5)) => 3

`(median items)`procedurereturns the value which separates the upper and lower halves of the list of numbers.

(median '(1 2 3 4)) => 5/2

`(mode items)`procedurereturns two

**values**. The first is a list of the*modes*and the second is the frequency. (A mode of a list of numbers is the most frequently occurring value.)> (mode '(1 2 3 4)) (1 2 3 4) 1 > (mode '(1 2 2 3 4)) (2) 2 > (mode '(1 2 2 3 3 4)) (2 3) 2

`(geometric-mean items)`procedurereturns the geometric mean of the

`items`(the result of multiplying the items together and then taking the nth root, where n is the number of items).(geometric-mean '(1 2 3 4 5)) => 2.60517108469735

`(range items)`procedurereturns the difference between the biggest and the smallest value from the list of

`items`.(range '(5 1 2 3 4)) => 4

`(percentile items percent)`procedurereturns the item closest to the

`percent`value if the`items`are sorted into order; the returned item may be in the list, or the average of adjacent values.(percentile '(1 2 3 4) 50) => 5/2 (percentile '(1 2 3 4) 67) => 3

`(variance items)`procedure

`(standard-deviation items)`procedure

`(coefficient-of-variation items)`procedurereturns 100 * (std-dev / mean) of the

`items`.(coefficient-of-variation '(1 2 3 4)) => 51.6397779494322

`(standard-error-of-the-mean items)`procedurereturns std-dev / sqrt(length items).

(standard-error-of-the-mean '(1 2 3 4)) => 0.645497224367903

`(mean-sd-n items)`procedurereturns three

**values**, one for the mean, one for the standard deviation, and one for the length of the list.> (mean-sd-n '(1 2 3 4)) 5/2 1.29099444873581 4

#### Distributional functions

`(binomial-probability n k p)`procedurereturns the probability that the number of positive outcomes for a binomial distribution B(n, p) is k.

> (do-ec (: i 0 11) (format #t "i = ~d P = ~f~&" i (binomial-probability 10 i 0.5))) i = 0 P = 0.0009765625 i = 1 P = 0.009765625 i = 2 P = 0.0439453125 i = 3 P = 0.1171875 i = 4 P = 0.205078125 i = 5 P = 0.24609375 i = 6 P = 0.205078125 i = 7 P = 0.1171875 i = 8 P = 0.0439453125 i = 9 P = 0.009765625 i = 10 P = 0.0009765625

`(binomial-cumulative-probability n k p)`procedurereturns the probability that less than

`k`positive outcomes occur for a binomial distribution B(n, p).> (do-ec (: i 0 11) (format #t "i = ~d P = ~f~&" i (binomial-cumulative-probability 10 i 0.5))) i = 0 P = 0.0 i = 1 P = 0.0009765625 i = 2 P = 0.0107421875 i = 3 P = 0.0546875 i = 4 P = 0.171875 i = 5 P = 0.376953125 i = 6 P = 0.623046875 i = 7 P = 0.828125 i = 8 P = 0.9453125 i = 9 P = 0.9892578125 i = 10 P = 0.9990234375

`(binomial-ge-probability n k p)`procedurereturns the probability of

`k`or more positive outcomes for a binomial distribution B(n, p).

`(binomial-le-probability n k p)`procedurereturns the probability

`k`or fewer positive outcomes for a binomial distribution B(n, p).

`(poisson-probability mu k)`procedurereturns the probability of

`k`events occurring when the average is`mu`.> (do-ec (: i 0 20) (format #t "P(X=~2d) = ~,4f~&" i (poisson-probability 10 i))) P(X= 0) = 0.0000 P(X= 1) = 0.0005 P(X= 2) = 0.0023 P(X= 3) = 0.0076 P(X= 4) = 0.0189 P(X= 5) = 0.0378 P(X= 6) = 0.0631 P(X= 7) = 0.0901 P(X= 8) = 0.1126 P(X= 9) = 0.1251 P(X=10) = 0.1251 P(X=11) = 0.1137 P(X=12) = 0.0948 P(X=13) = 0.0729 P(X=14) = 0.0521 P(X=15) = 0.0347 P(X=16) = 0.0217 P(X=17) = 0.0128 P(X=18) = 0.0071 P(X=19) = 0.0037

`(poisson-cumulative-probability mu k)`procedurereturns the probability of less than

`k`events occurring when the average is`mu`.> (do-ec (: i 0 20) (format #t "P(X=~2d) = ~,4f~&" i (poisson-cumulative-probability 10 i))) P(X= 0) = 0.0000 P(X= 1) = 0.0000 P(X= 2) = 0.0005 P(X= 3) = 0.0028 P(X= 4) = 0.0103 P(X= 5) = 0.0293 P(X= 6) = 0.0671 P(X= 7) = 0.1301 P(X= 8) = 0.2202 P(X= 9) = 0.3328 P(X=10) = 0.4579 P(X=11) = 0.5830 P(X=12) = 0.6968 P(X=13) = 0.7916 P(X=14) = 0.8645 P(X=15) = 0.9165 P(X=16) = 0.9513 P(X=17) = 0.9730 P(X=18) = 0.9857 P(X=19) = 0.9928

`(poisson-ge-probability mu k)`procedurereturns the probability of

`k`or more events occurring when the average is`mu`.

`(normal-pdf x mean variance)`procedurereturns the likelihood of

`x`given a normal distribution with stated mean and variance.> (do-ec (: i 0 11) (format #t "~3d ~,4f~&" i (normal-pdf i 5 4))) 0 0.0088 1 0.0270 2 0.0648 3 0.1210 4 0.1760 5 0.1995 6 0.1760 7 0.1210 8 0.0648 9 0.0270 10 0.0088

`(convert-to-standard-normal x mean variance)`procedurereturns a value for

`x`rescaling the given normal distribution to a standard N(0, 1).> (convert-to-standard-normal 5 6 2) -1/2

`(phi x)`procedurereturns the cumulative distribution function (CDF) of the standard normal distribution.

> (do-ec (: x -2 2 0.4) (format #t "~4,1f ~,4f~&" x (phi x))) -2.0 0.0228 -1.6 0.0548 -1.2 0.1151 -0.8 0.2119 -0.4 0.3446 0.0 0.5000 0.4 0.6554 0.8 0.7881 1.2 0.8849 1.6 0.9452

`(z percentile)`procedurereturns the inverse of the standard normal distribution. Input is a percentile, between 0.0 and 1.0.

`( t-distribution degrees-of-freedom percentile)`procedurereturns the point in the t-distribution given the

`degrees-of-freedom`and`percentile`.`degrees-of-freedom`must be a positive integer, and`percentile`a value between 0.0 and 1.0.

`(chi-square degrees-of-freedom percentile)`procedurereturns the point at which chi-square distribution has

`percentile`to its**left**, using given`degrees-of-freedom`.

`(chi-square-cdf x degrees-of-freedom)`procedurereturns the probability that a random variable is to the

**left**of`x`using the chi-square distribution with given`degrees-of-freedom`.

#### Confidence intervals

These functions report bounds for an observed property of a distribution: the bounds are tighter as the confidence level, alpha, varies from 0.0 to 1.0.

`(binomial-probability-ci n p alpha)`procedurereturns two values, the upper and lower bounds on an observed probability

`p`from`n`trials with confidence`(1-alpha)`.> (binomial-probability-ci 10 0.8 0.9) 0.724273681640625 0.851547241210938 ; 2 values

`(poisson-mu-ci k alpha)`procedurereturns two values, the upper and lower bounds on the poisson parameter if

`k`events are observed; the bound is for confidence`(1-alpha)`.> (poisson-mu-ci 10 0.9) 8.305419921875 10.0635986328125 ; 2 values

`(normal-mean-ci mean standard-deviation k alpha)`procedurereturns two values, the upper and lower bounds on the mean of the normal distibution of

`k`events are observed; the bound is for confidence`(1-alpha)`.> (normal-mean-ci 0.5 0.1 10 0.8) 0.491747852700165 0.508252147299835 ; 2 values

`(normal-mean-ci-on-sequence items alpha)`procedurereturns two values, the upper and lower bounds on the mean of the given

`items`, assuming they are normally distributed; the bound is for confidence`(1-alpha)`.> (normal-mean-ci-on-sequence '(1 2 3 4 5) 0.9) 2.40860081649174 3.59139918350826 ; 2 values

`(normal-variance-ci standard-deviation k alpha)`procedurereturns two values, the upper and lower bounds on the variance of the normal distibution of

`k`events are observed; the bound is for confidence`(1-alpha)`.

`(normal-variance-ci-on-sequence items alpha)`procedurereturns two values, the upper and lower bounds on the variance of the given

`items`, assuming they are normally distributed; the bound is for confidence`(1-alpha)`.

`normal-sd-ci standard-deviation k alpha)`procedurereturns two values, the upper and lower bounds on the standard deviation of the normal distibution of

`k`events are observed; the bound is for confidence`(1-alpha)`.

`(normal-sd-ci-on-sequence sequence items)`procedurereturns two values, the upper and lower bounds on the standard deviation of the given

`items`, assuming they are normally distributed; the bound is for confidence`(1-alpha)`.

#### Hypothesis testing

These functions report on the significance of an observed sample against a given distribution.

##### (parametric)

`(z-test x-bar n #:mu #:sigma #:tails)`procedureGiven

`x-bar`the sample mean,`n`the number in the sample,`#:mu`the distribution mean (defaults to 0),`#:sigma`the distribution standard deviation (defaults to 1), and`#:tails`the significance to report on:`':both`, the probability of the difference between`x-bar`and`#:mu``':positive`, the probability that observation is`>= x-bar``':negative`, the probability that observation is`<= x-bar`

e.g. given a distribution with mean 50 and standard deviation 10

; probability that a single observation is <= 40 > (z-test 40 1 #:mu 50 #:sigma 10 #:tails ':negative) 0.158655 ; probability that 10 observations are <= 40 > (z-test 40 10 #:mu 50 #:sigma 10 #:tails ':negative) 0.000783 ; probability that 5 observations give a mean of 40 > (z-test 40 5 #:mu 50 #:sigma 10) 0.025347

`(z-test-on-sequence observations #:mu #:sigma #:tails)`procedureAs for

`z-test`except`x-bar`and`n`are computed from given`observations`.

`(t-test-one-sample x-bar sd n mu #:tails)`procedureGiven observed data with mean

`x-bar`, standard devation`sd`and number of observations`n`(`n < 30`), return the significance of the sample compared with the population mean`mu`.`#:tails`is one of:`':both`two-sided (default)`':positive`one-sided,`x-bar >= mu``':negative`one-sided,`x-bar <= mu`

`(t-test-one-sample-on-sequence observations mu #:tails)`procedureAs for

`t-test-one-sample`except`x-bar`,`sd`and`n`are computed from given`observations`.

`(t-test-paired t-bar sd n #:tails)`procedureComputes the significance of the differences between two sequences of data: the differences are given as their mean,

`t-bar`, standard deviation,`sd`, and number of measurements,`n`.

`(t-test-paired-on-sequences before after #:tails)`procedureComputes the significance of the difference between two sequences of data: one before an experimental change and one after.

`#:tails`is as for`t-significance`.> (t-test-paired-on-sequences '(4 3 5) '(1 1 3)) 0.0198039411803931

`(t-test-two-sample mean-1 sd-1 n-1 mean-2 sd-2 n-2 #:variances-equal? #:variance-significance-cutoff #:tails)`procedureComputes the significance of the difference of two means given the sample standard deviations and sizes.

`(t-test-two-sample-on-sequences sequence-1 sequence-2 #:tails)`procedureSignificance of difference of two sequences of observations.

`(f-test variance-1 n1 variance-2 n2 #:tails)`procedureTests for the equality of two variances.

`(chi-square-test-one-sample observed-variance sample-size test-variance #:tails)`procedureTests for significance of difference between an observed and a test variance.

`(binomial-test-one-sample p-hat n p #:tails #:exact?)`procedureReturns the significance of a one sample test with

`n`observations, observed probability`p-hat`and expected probability`p`.

`(binomial-test-two-sample p-hat-1 n-1 p-hat-2 n-2 #:tails #:exact?)`procedureReturns the significance of a two sample test.

`(fisher-exact-test a b c d #:tails)`procedureGiven a 2x2 contingency table, returns a p value using Fisher's exact test.

`a`and`b`form the first row of the contingency table,`c`and`d`the second row.

`(mcnemars-test a-discordant-count b-discordant-count #:exact?)`procedureFor measuring effectiveness of, e.g., one treatment over another.

`a-discordant-count`is the number of times when A worked,`b-discordant-count`the number of times B worked.

`(poisson-test-one-sample observed mu #:tails #:approximate?)`procedureComputes significance of the number of observed events under a Poisson distribution against

`mu`expected events.

##### (non parametric)

`(sign-test-on-sequence sequence-1 sequence-2 #:exact? #:tails)`procedureTakes two equal-sized sequences of observations, and reports if the entries of one are different to those in the other.

`(wilcoxon-signed-rank-test differences #:tails)`procedureGiven at least 16 differences, reports if the positive differences are significantly larger or smaller than the negative differences.

`(wilcoxon-signed-rank-test-on-sequences sequence-1 sequence-2 #:tails)`procedureGiven two sequences of at least 16 observations, computes

`wilcoxon-signed-rank-test`on the differences.

`(chi-square-test-rxc contingency-table)`procedureGiven a contingency table (a SRFI-63 array), returns significance of relation between row and column variable.

`(chi-square-test-for-trend row1-counts row2-counts)`procedureReturns p significance of trend, and prints a string to show if increasing or decreasing.

#### Sample size estimates

`(t-test-one-sample-sse mean-1 mean-2 sigma-1 #:alpha #:1-beta #:tails)`procedureReturns the size of sample necessary to distinguish a normally distributed sample with

`mean-2`from a population`mean-1`standard deviation`sigma-1`. The significance`#:alpha`(defaults to 0.05), power`#:1-beta`(0.95) and sides`#:tails`(':both) may be altered.> (t-test-one-sample-sse 5.0 5.2 0.5) 163

`(t-test-two-sample-sse mean-1 sigma-1 mean-2 sigma-2 #:alpha #:1-beta #:tails #:sample-ratio)`procedureReturns the size of sample necessary to distinguish a normally distributed sample N(mean-1, sigma-1) from a normally distributed sample N(mean-2, sigma-2). The significance

`#:alpha`(defaults to 0.05), power`#:1-beta`(0.95), sides`#:tails`(':both) and sample-ratio`#:sample-ratio`(1) may be altered.

`(t-test-paired-sse difference-mean difference-sigma #:alpha #:1-beta #:tails)`procedureReturns the size of sample to produce a given mean and standard deviation in the differences of two samples.

`(binomial-test-one-sample-sse p-estimated p-null #:alpha #:1-beta #:tails)`procedureReturns the size of sample needed to test whether an observed probability is significantly different from a particular binomial null hypothesis with a significance alpha and a power 1-beta.

`(binomial-test-two-sample-sse p-one p-two #:alpha #:1-beta #:tails #:sample-ratio)`procedureReturns the size of sample needed to test if given two binomial probabilities are significantly different.

`#:sample-ratio`can be given if the two samples differ in size.

`(binomial-test-paired-sse pd pa #:alpha #:1-beta #:tails)`procedureSample size estimate for McNemar's discordant pairs test.

`(correlation-sse rho #:alpha #:1-beta)`procedureReturns the size of sample necessary to find a correlation of value

`rho`with significance`#:alpha`(defaults to 0.05) and power`#:1-beta`(defaults to 0.95).> (correlation-sse 0.80 #:alpha 0.05 #:1-beta 0.9) 11

#### Correlation and regression

`(linear-regression line-defn)`procedureGiven a line definition as a list of point pairs, first prints to the terminal and then returns 5

**values**for the best fitting line through the points:- the y-intercept
- the slope
- the correlation coefficient, r
- the square of the correlation coefficient, r^2
- the significance of the difference of the slope from zero, p

(This is also called the Pearson correlation; used when relation expected to be linear. Also see

`spearman-rank-correlation`.)> (linear-regression '((1.0 0.1) (2.0 0.3) (3.0 0.8))) Intercept = -0.3, slope = 0.35, r = 0.970725343394151, R^2 = 0.942307692307692, p = 0.154420958311267 -0.3 0.35 0.970725343394151 0.942307692307692 0.154420958311267 ; 5 values

`(correlation-coefficient line-defn)`procedureAs above, but only returns the value of

*r*:> (correlation-coefficient '((1.0 0.1) (2.0 0.3) (3.0 0.8))) 0.970725343394151

`(correlation-test-two-sample r1 n1 r2 n2 #:tails)`procedureReturns the significance of the similarity between two correlations.

`#:tails`determines how the comparison is made:`':both`measures the difference,`':negative`if`r1 < r2`and`#':positive`if`r2 > r1`.

`(correlation-test-two-sample-on-sequences points-1 points-2 #:tails)`procedureAs above, but computes the correlations from given lists of points.

`(spearman-rank-correlation points)`procedureReturns two

**values**, the Spearman Rank measure of correlation between given list of points, and the p-significance of the correlation. (This correlation is used for non-linear relations; compare with`linear-regression`.)

#### Significance test functions

`(t-significance t-value degrees-of-freedom #:tails)`procedurereturns the probability of

`t-value`for given`degrees-of-freedom`. The keyword`#:tails`modifies the calculation to be two-sided (the default) with`':both`, or one-sided,`':positive`or`':negative`.> (t-significance 0.2 5) 0.849360513995829 > (t-significance 0.2 5 #:tails ':positive) 0.424680256997915 > (t-significance 0.2 5 #:tails ':negative) 0.575319743002086

`(f-significance f-value numerator-dof denominator-dof #:one-tailed?)`procedurereturns the probability of

`f-value`for given`numerator-dof`and`denominator-dof`. The boolean keyword`#:one-tailed?`indicates if calculation is two-sided (the default) or not.> (f-significance 1.5 8 2) 0.920449812578091 > (f-significance 1.5 8 2 #:one-tailed? #t) 0.460224906289046

### Authors

Peter Lane wrote the Scheme version of this library. The original Lisp version was written by Larry Hunter.

### License

GPL version 3.0.

### Requirements

Needs srfi-1, srfi-25, srfi-63, srfi-69, vector-lib, numbers, extras, foreign, format

Uses the GNU scientific library for basic numeric processing, so requires libgsl, libgslcblas and the development files for libgsl.

### Version History

- 0.8: added cumsum and random-weighted-sample
- 0.5: fixed warning in compilation (thanks to Felix for pointing it out)
- 0.4: all functions should now be working
- 0.3: some error fixes and addition of tests for majority of functions
- 0.2: fixed some errors in keywords and find-critical-value
- 0.1: initial package