TOC »
Statistics
This library is a port of Larry Hunter's Lisp statistics library to chicken scheme.
The library provides a number of formulae and methods taken from the book "Fundamentals of Biostatistics" by Bernard Rosner (5th edition).
Statistical Distributions
To use this library, you need to understand the underlying statistics. In brief:
The Binomial distribution is used when counting discrete events in a series of trials, each of which events has a probability p of producing a positive outcome. An example would be tossing a coin n times: the probability of a head is p, and the distribution gives the expected number of heads in the n trials. The binomial distribution is defined as B(n, p).
The Poisson distribution is used to count discrete events which occur with a known average rate. A typical example is the decay of radioactive elements. A poisson distribution is defined Pois(mu).
The Normal distribution is used for real-valued events which cluster around a specific mean with a symmetric variance. A typical example would be the distribution of people's heights. A normal distribution is defined N(mean, variance).
Provided Functions
Utilities
- average-rank value sorted-valuesprocedure
returns the average position of given value in the list of sorted values: the rank is based from 1.
> (average-rank 2 '(1 2 2 3 4)) 5/2
- beta-incomplete x a bprocedure
- bin-and-count items nprocedure
Divides the range of the list of items into n bins, and returns a vector of the number of items which fall into each bin.
> (bin-and-count '(1 1 2 3 3 4 5) 5) #(2 1 2 1 1)
- combinations n kprocedure
returns the number of ways to select k items from n, where the order does not matter.
- factorial nprocedure
returns the factorial of n.
- (find-critical-value p-function p-value #:increasing?)procedure
given a monotonic function p-function taking a single value x to y, returns the value of x which makes (p-function x) closest to p-value. A boolean keyword parameter #:increasing? determines if function should be increasing or decreasing (the default).
- fisher-z-transform rprocedure
returns the transformation of a correlation coefficient r into an approximately normal distribution.
- gamma-incomplete a xprocedure
- gamma-ln xprocedure
- permutations n kprocedure
returns the number of ways to select k items from n, where the order does matter.
- random-normal mean sdprocedure
returns a random number distributed with specified mean and standard deviation.
- random-pick itemsprocedure
returns a random item from the given list of items.
- random-sample n itemsprocedure
returns a random sample from the list of items without replacement of size n.
- random-weighted-sample n items weightsprocedure
returns a random sample from the list of items without replacement of size n, where each sample has a defined probability of selection (weight).
- sign nprocedure
returns 0, 1 or -1 according to if n is zero, positive or negative.
- square nprocedure
- cumsum sequencesprocedure
returns the cumulative sum of a sequence.
Descriptive statistics
These functions provide information on a given list of numbers, the items. Note, the list does not have to be sorted.
- mean itemsprocedure
returns the arithmetic mean of the items (the sum of the numbers divided by the number of numbers).
(mean '(1 2 3 4 5)) => 3
- median itemsprocedure
returns the value which separates the upper and lower halves of the list of numbers.
(median '(1 2 3 4)) => 5/2
- mode itemsprocedure
returns two values. The first is a list of the modes and the second is the frequency. (A mode of a list of numbers is the most frequently occurring value.)
> (mode '(1 2 3 4)) (1 2 3 4) 1 > (mode '(1 2 2 3 4)) (2) 2 > (mode '(1 2 2 3 3 4)) (2 3) 2
- geometric-mean itemsprocedure
returns the geometric mean of the items (the result of multiplying the items together and then taking the nth root, where n is the number of items).
(geometric-mean '(1 2 3 4 5)) => 2.60517108469735
- range itemsprocedure
returns the difference between the biggest and the smallest value from the list of items.
(range '(5 1 2 3 4)) => 4
- percentile items percentprocedure
returns the item closest to the percent value if the items are sorted into order; the returned item may be in the list, or the average of adjacent values.
(percentile '(1 2 3 4) 50) => 5/2 (percentile '(1 2 3 4) 67) => 3
- variance itemsprocedure
- standard-deviation itemsprocedure
- coefficient-of-variation itemsprocedure
returns 100 * (std-dev / mean) of the items.
(coefficient-of-variation '(1 2 3 4)) => 51.6397779494322
- standard-error-of-the-mean itemsprocedure
returns std-dev / sqrt(length items).
(standard-error-of-the-mean '(1 2 3 4)) => 0.645497224367903
- mean-sd-n itemsprocedure
returns three values, one for the mean, one for the standard deviation, and one for the length of the list.
> (mean-sd-n '(1 2 3 4)) 5/2 1.29099444873581 4
Distributional functions
- binomial-probability n k pprocedure
returns the probability that the number of positive outcomes for a binomial distribution B(n, p) is k.
> (do-ec (: i 0 11) (format #t "i = ~d P = ~f~&" i (binomial-probability 10 i 0.5))) i = 0 P = 0.0009765625 i = 1 P = 0.009765625 i = 2 P = 0.0439453125 i = 3 P = 0.1171875 i = 4 P = 0.205078125 i = 5 P = 0.24609375 i = 6 P = 0.205078125 i = 7 P = 0.1171875 i = 8 P = 0.0439453125 i = 9 P = 0.009765625 i = 10 P = 0.0009765625
- binomial-cumulative-probability n k pprocedure
returns the probability that less than k positive outcomes occur for a binomial distribution B(n, p).
> (do-ec (: i 0 11) (format #t "i = ~d P = ~f~&" i (binomial-cumulative-probability 10 i 0.5))) i = 0 P = 0.0 i = 1 P = 0.0009765625 i = 2 P = 0.0107421875 i = 3 P = 0.0546875 i = 4 P = 0.171875 i = 5 P = 0.376953125 i = 6 P = 0.623046875 i = 7 P = 0.828125 i = 8 P = 0.9453125 i = 9 P = 0.9892578125 i = 10 P = 0.9990234375
- binomial-ge-probability n k pprocedure
returns the probability of k or more positive outcomes for a binomial distribution B(n, p).
- binomial-le-probability n k pprocedure
returns the probability k or fewer positive outcomes for a binomial distribution B(n, p).
- poisson-probability mu kprocedure
returns the probability of k events occurring when the average is mu.
> (do-ec (: i 0 20) (format #t "P(X=~2d) = ~,4f~&" i (poisson-probability 10 i))) P(X= 0) = 0.0000 P(X= 1) = 0.0005 P(X= 2) = 0.0023 P(X= 3) = 0.0076 P(X= 4) = 0.0189 P(X= 5) = 0.0378 P(X= 6) = 0.0631 P(X= 7) = 0.0901 P(X= 8) = 0.1126 P(X= 9) = 0.1251 P(X=10) = 0.1251 P(X=11) = 0.1137 P(X=12) = 0.0948 P(X=13) = 0.0729 P(X=14) = 0.0521 P(X=15) = 0.0347 P(X=16) = 0.0217 P(X=17) = 0.0128 P(X=18) = 0.0071 P(X=19) = 0.0037
- poisson-cumulative-probability mu kprocedure
returns the probability of less than k events occurring when the average is mu.
> (do-ec (: i 0 20) (format #t "P(X=~2d) = ~,4f~&" i (poisson-cumulative-probability 10 i))) P(X= 0) = 0.0000 P(X= 1) = 0.0000 P(X= 2) = 0.0005 P(X= 3) = 0.0028 P(X= 4) = 0.0103 P(X= 5) = 0.0293 P(X= 6) = 0.0671 P(X= 7) = 0.1301 P(X= 8) = 0.2202 P(X= 9) = 0.3328 P(X=10) = 0.4579 P(X=11) = 0.5830 P(X=12) = 0.6968 P(X=13) = 0.7916 P(X=14) = 0.8645 P(X=15) = 0.9165 P(X=16) = 0.9513 P(X=17) = 0.9730 P(X=18) = 0.9857 P(X=19) = 0.9928
- poisson-ge-probability mu kprocedure
returns the probability of k or more events occurring when the average is mu.
- normal-pdf x mean varianceprocedure
returns the likelihood of x given a normal distribution with stated mean and variance.
> (do-ec (: i 0 11) (format #t "~3d ~,4f~&" i (normal-pdf i 5 4))) 0 0.0088 1 0.0270 2 0.0648 3 0.1210 4 0.1760 5 0.1995 6 0.1760 7 0.1210 8 0.0648 9 0.0270 10 0.0088
- convert-to-standard-normal x mean varianceprocedure
returns a value for x rescaling the given normal distribution to a standard N(0, 1).
> (convert-to-standard-normal 5 6 2) -1/2
- phi xprocedure
returns the cumulative distribution function (CDF) of the standard normal distribution.
> (do-ec (: x -2 2 0.4) (format #t "~4,1f ~,4f~&" x (phi x))) -2.0 0.0228 -1.6 0.0548 -1.2 0.1151 -0.8 0.2119 -0.4 0.3446 0.0 0.5000 0.4 0.6554 0.8 0.7881 1.2 0.8849 1.6 0.9452
- z percentileprocedure
returns the inverse of the standard normal distribution. Input is a percentile, between 0.0 and 1.0.
- t-distribution degrees-of-freedom percentileprocedure
returns the point in the t-distribution given the degrees-of-freedom and percentile. degrees-of-freedom must be a positive integer, and percentile a value between 0.0 and 1.0.
- chi-square degrees-of-freedom percentileprocedure
returns the point at which chi-square distribution has percentile to its left, using given degrees-of-freedom.
- chi-square-cdf x degrees-of-freedomprocedure
returns the probability that a random variable is to the left of x using the chi-square distribution with given degrees-of-freedom.
Confidence intervals
These functions report bounds for an observed property of a distribution: the bounds are tighter as the confidence level, alpha, varies from 0.0 to 1.0.
- binomial-probability-ci n p alphaprocedure
returns two values, the upper and lower bounds on an observed probability p from n trials with confidence (1-alpha).
> (binomial-probability-ci 10 0.8 0.9) 0.724273681640625 0.851547241210938 ; 2 values
- poisson-mu-ci k alphaprocedure
returns two values, the upper and lower bounds on the poisson parameter if k events are observed; the bound is for confidence (1-alpha).
> (poisson-mu-ci 10 0.9) 8.305419921875 10.0635986328125 ; 2 values
- normal-mean-ci mean standard-deviation k alphaprocedure
returns two values, the upper and lower bounds on the mean of the normal distibution of k events are observed; the bound is for confidence (1-alpha).
> (normal-mean-ci 0.5 0.1 10 0.8) 0.491747852700165 0.508252147299835 ; 2 values
- normal-mean-ci-on-sequence items alphaprocedure
returns two values, the upper and lower bounds on the mean of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).
> (normal-mean-ci-on-sequence '(1 2 3 4 5) 0.9) 2.40860081649174 3.59139918350826 ; 2 values
- normal-variance-ci standard-deviation k alphaprocedure
returns two values, the upper and lower bounds on the variance of the normal distibution of k events are observed; the bound is for confidence (1-alpha).
- normal-variance-ci-on-sequence items alphaprocedure
returns two values, the upper and lower bounds on the variance of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).
- normal-sd-ciprocedure
returns two values, the upper and lower bounds on the standard deviation of the normal distibution of k events are observed; the bound is for confidence (1-alpha).
- normal-sd-ci-on-sequence sequence itemsprocedure
returns two values, the upper and lower bounds on the standard deviation of the given items, assuming they are normally distributed; the bound is for confidence (1-alpha).
Hypothesis testing
These functions report on the significance of an observed sample against a given distribution.
(parametric)
- (z-test x-bar n #:mu #:sigma #:tails)procedure
Given x-bar the sample mean, n the number in the sample, #:mu the distribution mean (defaults to 0), #:sigma the distribution standard deviation (defaults to 1), and #:tails the significance to report on:
- ':both, the probability of the difference between x-bar and #:mu
- ':positive, the probability that observation is >= x-bar
- ':negative, the probability that observation is <= x-bar
e.g. given a distribution with mean 50 and standard deviation 10
; probability that a single observation is <= 40 > (z-test 40 1 #:mu 50 #:sigma 10 #:tails ':negative) 0.158655 ; probability that 10 observations are <= 40 > (z-test 40 10 #:mu 50 #:sigma 10 #:tails ':negative) 0.000783 ; probability that 5 observations give a mean of 40 > (z-test 40 5 #:mu 50 #:sigma 10) 0.025347
- (z-test-on-sequence observations #:mu #:sigma #:tails)procedure
As for z-test except x-bar and n are computed from given observations.
- (t-test-one-sample x-bar sd n mu #:tails)procedure
Given observed data with mean x-bar, standard devation sd and number of observations n (n < 30), return the significance of the sample compared with the population mean mu. #:tails is one of:
- ':both two-sided (default)
- ':positive one-sided, x-bar >= mu
- ':negative one-sided, x-bar <= mu
- (t-test-one-sample-on-sequence observations mu #:tails)procedure
As for t-test-one-sample except x-bar, sd and n are computed from given observations.
- (t-test-paired t-bar sd n #:tails)procedure
Computes the significance of the differences between two sequences of data: the differences are given as their mean, t-bar, standard deviation, sd, and number of measurements, n.
- (t-test-paired-on-sequences before after #:tails)procedure
Computes the significance of the difference between two sequences of data: one before an experimental change and one after. #:tails is as for t-significance.
> (t-test-paired-on-sequences '(4 3 5) '(1 1 3)) 0.0198039411803931
- (t-test-two-sample mean-1 sd-1 n-1 mean-2 sd-2 n-2 #:variances-equal? #:variance-significance-cutoff #:tails)procedure
Computes the significance of the difference of two means given the sample standard deviations and sizes.
- (t-test-two-sample-on-sequences sequence-1 sequence-2 #:tails)procedure
Significance of difference of two sequences of observations.
- (f-test variance-1 n1 variance-2 n2 #:tails)procedure
Tests for the equality of two variances.
- (chi-square-test-one-sample observed-variance sample-size test-variance #:tails)procedure
Tests for significance of difference between an observed and a test variance.
- (binomial-test-one-sample p-hat n p #:tails #:exact?)procedure
Returns the significance of a one sample test with n observations, observed probability p-hat and expected probability p.
- (binomial-test-two-sample p-hat-1 n-1 p-hat-2 n-2 #:tails #:exact?)procedure
Returns the significance of a two sample test.
- (fisher-exact-test a b c d #:tails)procedure
Given a 2x2 contingency table, returns a p value using Fisher's exact test. a and b form the first row of the contingency table, c and d the second row.
- (mcnemars-test a-discordant-count b-discordant-count #:exact?)procedure
For measuring effectiveness of, e.g., one treatment over another. a-discordant-count is the number of times when A worked, b-discordant-count the number of times B worked.
- (poisson-test-one-sample observed mu #:tails #:approximate?)procedure
Computes significance of the number of observed events under a Poisson distribution against mu expected events.
(non parametric)
- (sign-test-on-sequence sequence-1 sequence-2 #:exact? #:tails)procedure
Takes two equal-sized sequences of observations, and reports if the entries of one are different to those in the other.
- (wilcoxon-signed-rank-test differences #:tails)procedure
Given at least 16 differences, reports if the positive differences are significantly larger or smaller than the negative differences.
- (wilcoxon-signed-rank-test-on-sequences sequence-1 sequence-2 #:tails)procedure
Given two sequences of at least 16 observations, computes wilcoxon-signed-rank-test on the differences.
- chi-square-test-rxc contingency-tableprocedure
Given a contingency table (a SRFI-63 array), returns significance of relation between row and column variable.
- chi-square-test-for-trend row1-counts row2-countsprocedure
Returns p significance of trend, and prints a string to show if increasing or decreasing.
Sample size estimates
- (t-test-one-sample-sse mean-1 mean-2 sigma-1 #:alpha #:1-beta #:tails)procedure
Returns the size of sample necessary to distinguish a normally distributed sample with mean-2 from a population mean-1 standard deviation sigma-1. The significance #:alpha (defaults to 0.05), power #:1-beta (0.95) and sides #:tails (':both) may be altered.
> (t-test-one-sample-sse 5.0 5.2 0.5) 163
- (t-test-two-sample-sse mean-1 sigma-1 mean-2 sigma-2 #:alpha #:1-beta #:tails #:sample-ratio)procedure
Returns the size of sample necessary to distinguish a normally distributed sample N(mean-1, sigma-1) from a normally distributed sample N(mean-2, sigma-2). The significance #:alpha (defaults to 0.05), power #:1-beta (0.95), sides #:tails (':both) and sample-ratio #:sample-ratio (1) may be altered.
- (t-test-paired-sse difference-mean difference-sigma #:alpha #:1-beta #:tails)procedure
Returns the size of sample to produce a given mean and standard deviation in the differences of two samples.
- (binomial-test-one-sample-sse p-estimated p-null #:alpha #:1-beta #:tails)procedure
Returns the size of sample needed to test whether an observed probability is significantly different from a particular binomial null hypothesis with a significance alpha and a power 1-beta.
- (binomial-test-two-sample-sse p-one p-two #:alpha #:1-beta #:tails #:sample-ratio)procedure
Returns the size of sample needed to test if given two binomial probabilities are significantly different. #:sample-ratio can be given if the two samples differ in size.
- (binomial-test-paired-sse pd pa #:alpha #:1-beta #:tails)procedure
Sample size estimate for McNemar's discordant pairs test.
- (correlation-sse rho #:alpha #:1-beta)procedure
Returns the size of sample necessary to find a correlation of value rho with significance #:alpha (defaults to 0.05) and power #:1-beta (defaults to 0.95).
> (correlation-sse 0.80 #:alpha 0.05 #:1-beta 0.9) 11
Correlation and regression
- linear-regression xs ysprocedure
Given a line definition as lists of point coordinates, first prints to the terminal and then returns 5 values for the best fitting line through the points:
- the y-intercept
- the slope
- the correlation coefficient, r
- the square of the correlation coefficient, r^2
- the significance of the difference of the slope from zero, p
(This is also called the Pearson correlation; used when relation expected to be linear. Also see spearman-rank-correlation.)
> (linear-regression '(1.0 2.0 3.0) '(0.1 0.3 0.8)) Intercept = -0.3, slope = 0.35, r = 0.970725343394151, R^2 = 0.942307692307692, p = 0.154420958311267 -0.3 0.35 0.970725343394151 0.942307692307692 0.154420958311267 ; 5 values
- correlation-coefficient xs ysprocedure
As above, but only returns the value of r:
> (correlation-coefficient '(1.0 2.0 3.0) '(0.1 0.3 0.8)) 0.970725343394151
- (correlation-test-two-sample r1 n1 r2 n2 #:tails)procedure
Returns the significance of the similarity between two correlations. #:tails determines how the comparison is made: ':both measures the difference, ':negative if r1 < r2 and #':positive if r2 > r1.
- (correlation-test-two-sample-on-sequences points-1 points-2 #:tails)procedure
As above, but computes the correlations from given lists of points.
- spearman-rank-correlation xs ysprocedure
Returns two values, the Spearman Rank measure of correlation between the given lists of point coordinates, and the p-significance of the correlation. (This correlation is used for non-linear relations; compare with linear-regression.)
Significance test functions
- (t-significance t-value degrees-of-freedom #:tails)procedure
returns the probability of t-value for given degrees-of-freedom. The keyword #:tails modifies the calculation to be two-sided (the default) with ':both, or one-sided, ':positive or ':negative.
> (t-significance 0.2 5) 0.849360513995829 > (t-significance 0.2 5 #:tails ':positive) 0.424680256997915 > (t-significance 0.2 5 #:tails ':negative) 0.575319743002086
- (f-significance f-value numerator-dof denominator-dof #:one-tailed?)procedure
returns the probability of f-value for given numerator-dof and denominator-dof. The boolean keyword #:one-tailed? indicates if calculation is two-sided (the default) or not.
> (f-significance 1.5 8 2) 0.920449812578091 > (f-significance 1.5 8 2 #:one-tailed? #t) 0.460224906289046
Authors
Peter Lane wrote the Scheme version of this library. The original Lisp version was written by Larry Hunter.
Repository
This egg is hosted on the CHICKEN Subversion repository:
https://anonymous@code.call-cc.org/svn/chicken-eggs/release/5/statistics
If you want to check out the source code repository of this egg and you are not familiar with Subversion, see this page.
License
GPL version 3.0.
Version History
- 0.12: removed GSL dependency
- 0.11: refactoring correlation and regression interface to take two separate dataset arguments
- 0.9: ported to CHICKEN 5
- 0.8: added cumsum and random-weighted-sample
- 0.5: fixed warning in compilation (thanks to Felix for pointing it out)
- 0.4: all functions should now be working
- 0.3: some error fixes and addition of tests for majority of functions
- 0.2: fixed some errors in keywords and find-critical-value
- 0.1: initial package