The reader is respectfully warned that (1) all of the examples on the present page have very small p-values before correction, (2) the present page does not help at all in finding parameters or models, and (3) a little knowledge of JavaScript (or Java or C or C++ or C#) is needed to use the programs on this page.
All but one of the examples are fictitious. That one is Dr. Arbuthnot’s sample.
The name “jsBonfer” means “JavaScript and Bonferroni.”
A fictitious example
Here is a simple example:
[ [1,3,1,3,1,3,1,1,2,2,1,2,1,3,1,4,3,3,1,1,2,1,2,1,1,2,2,2,1,1,2,1,3,1,1,1,3,1,2,1,1,3,2,4,5,2,2,2,1,2], "x" ]
The row of 50 numbers is the given sample. These numbers are the values of x. Perhaps the teacher says that the null hypothesis asserts a Poisson distribution with λ equal to 1. Having looked at the sample, the student sees that there are no zeroes, so it does not look like any kind of Poisson. Maybe the alternative probability expression should have a factor of x in it. That would work, because a Poisson with λ equal to 1 must have a mean value of 1, so that multiplying that Poisson by x will make an expression which sums to one, so that the product is a probability expression. Then the quotient formula is "x", where the quote marks are mandatory.
I respectfully invite the reader to select and copy the above array, including all the rows and square brackets, to move the mouse to the upper text area, to click on the “Clear” button if needed, to paste into the upper text area, and to click on the “jsBonfer” button. Four numbers will appear in the second text area. The first is the length of the formula, but I call it formulaLength because length is already in use in JavaScript.
This formulaLength is used to find the second number, the Bonferroni weight, by calculating 1/Math.pow(base,formulaLength) * (1-geometricR)/geometricR * Math.pow(geometricR,formulaLength), where base is 128 and geometricR is 128/129 . This idea I have partly copied from Solomonoff (1960). The idea is that there are 128 formulas having exactly one character, and 128*128 formulas having exactly two characters, and so on. (A formula having exactly zero characters is useless to us.) Then the total Bonferroni weight assigned to formulas having exactly formulaLength characters is (1-geometricR)/geometricR * Math.pow(geometricR,formulaLength), and the reader sees that these totals add up to unity, if formulaLength goes from one to infinity in whole numbers. We are merely summing a geometric series. The reason that I chose geometricR to be 128/129 is that this choice simplifies the Bonferroni weight. It simplifies to 1 / 128 / Math.pow(129,formulaLength) .
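The simplification can be checked numerically. What follows is only a sketch and not the source of jsBonfer itself: the function names weightLong, weightShort, and totalWeightUpTo are my own assumptions, while base, geometricR, and formulaLength are the names used in the text.

```javascript
// Hypothetical helper names; base and geometricR are the values given in the text.
const base = 128;
const geometricR = 128 / 129;

// The Bonferroni weight exactly as it is written out in the text.
function weightLong(formulaLength) {
  return 1 / Math.pow(base, formulaLength)
    * (1 - geometricR) / geometricR
    * Math.pow(geometricR, formulaLength);
}

// The simplified form: 1 / 128 / 129^formulaLength.
function weightShort(formulaLength) {
  return 1 / 128 / Math.pow(129, formulaLength);
}

// Total weight given to all formulas of lengths 1..maxLength.
// This is a partial sum of a geometric series, equal to 1 - geometricR^maxLength.
function totalWeightUpTo(maxLength) {
  let total = 0;
  for (let len = 1; len <= maxLength; len++) {
    total += (1 - geometricR) / geometricR * Math.pow(geometricR, len);
  }
  return total;
}

console.log(weightLong(1), weightShort(1));
console.log(totalWeightUpTo(5000)); // creeps up toward 1 from below
```

The two weight functions should agree up to floating-point rounding, and the running total should approach unity as maxLength grows.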
The Bonferroni weight is used to correct (divide) the upper bound on the uncorrected p-value, here 1 over the product of the individual likelihood quotients for the individual values of x. The third number is this upper bound on the uncorrected p-value. The fourth number is the upper bound on the Bonferroni-corrected p-value.
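The four numbers can be reproduced by hand for the first example. This is a minimal sketch under my own assumptions, not the program behind the “jsBonfer” button: the quotient is written as an ordinary function standing in for the quoted formula "x", and the variable names are invented for the illustration.

```javascript
// A sketch only: the page's own program is not reproduced here.
const sample = [1,3,1,3,1,3,1,1,2,2,1,2,1,3,1,4,3,3,1,1,2,1,2,1,1,2,2,2,1,1,2,1,3,1,1,1,3,1,2,1,1,3,2,4,5,2,2,2,1,2];
const formula = "x";
const quotient = (x) => x; // hand-written stand-in for evaluating the quoted formula

// First number: the length of the formula.
const formulaLength = formula.length;

// Second number: the Bonferroni weight, in its simplified form.
const weight = 1 / 128 / Math.pow(129, formulaLength);

// Third number: 1 over the product of the individual likelihood quotients.
let product = 1;
for (const x of sample) product *= quotient(x);
const uncorrected = 1 / product;

// Fourth number: the uncorrected bound divided by the weight.
const corrected = uncorrected / weight;

console.log(formulaLength, weight, uncorrected, corrected);
```

Because the weight is less than one, dividing by it always makes the corrected bound larger than the uncorrected one.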
The reader will notice that we are not using a fixed value of n, the size of the sample. Since the likelihood ratio is a martingale, we may use optional stopping.
The reader is respectfully warned that JavaScript thinks that integers beginning with a zero digit are in base 8. Non-integers with multiple leading zeroes on the left of the decimal point are not legal. This applies to both data and programs.
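The leading-zero trap can be seen in a small experiment. This sketch assumes a JavaScript engine that still accepts the old octal notation in non-strict code; the Function constructor is used because it compiles its body as non-strict code, and the variable names are my own.

```javascript
// A body compiled by new Function is non-strict, so legacy octal literals are accepted.
const legacyOctal = new Function("return 010;")();  // read as octal 10
const plainDecimal = new Function("return 10;")();  // read as decimal 10
console.log(legacyOctal, plainDecimal);

// Two leading zeroes before a decimal point should be refused by the parser.
let rejected = false;
try {
  new Function("return 00.5;");
} catch (e) {
  rejected = e instanceof SyntaxError;
}
console.log(rejected);
```

In strict mode the old octal notation is a syntax error, so the safest habit is to write no leading zeroes at all, in data or in programs.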
The formula between the quote marks must be legal in the JavaScript language, except that methods from the Math class need not have their Math. prefix. I am speaking of sin cos exp log pow sqrt floor round abs and the like. I did this by using the with(Math) of JavaScript. The characters used in the formula must have code numbers between zero and 127 inclusive. Any character out of this domain will cause an alert.
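One way such a formula could be turned into a function is sketched below. This is my own guess at the general technique and not the page's actual source: the names checkAscii and compileFormula are hypothetical, and an Error is thrown where the page raises an alert.

```javascript
// Hypothetical sketch; not the page's actual source.

// Refuse any character whose code number is above 127.
function checkAscii(formula) {
  for (let i = 0; i < formula.length; i++) {
    if (formula.charCodeAt(i) > 127) {
      throw new Error("character out of range at position " + i);
    }
  }
}

// Build a function of x, y, z from the quoted formula.
// A body compiled by new Function is non-strict, so with (Math) is permitted,
// and names such as abs, sqrt, and pow resolve without their Math. prefix.
function compileFormula(formula) {
  checkAscii(formula);
  return new Function("x", "y", "z", "with (Math) { return (" + formula + "); }");
}

const quotient = compileFormula("abs(4*x*(1-x/1e3)-y)<=1?1e3/3:0");
console.log(quotient(129, 449)); // a point close to the curve
console.log(quotient(129, 600)); // a point far from the curve
```

Evaluating a pasted string as a program is acceptable on a page where the user types her/his own formulas, but the same technique would be unsafe with text from an untrusted source.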
Hypotheses: The null hypothesis asserts that the probability at x is given by the Poisson distribution with λ equal to 1. The alternative hypothesis asserts that the probability at x is x multiplied by the probability given by that Poisson distribution.
Dr. Arbuthnot’s sample
All students of statistical inference know about “An argument for Divine Providence, taken from the constant Regularity observ’d in the Births of both Sexes,” by Dr. John Arbuthnott, Physitian in Ordinary to Her Majesty, and Fellow of the College of Physitians and the Royal Society, from Phil. Trans. (1710) 27, 186-90.
Dr. Arbuthnot looked at all the birth records of the city of London, 82 years’ worth, and he found that in every year there were more live male births than live female births. He was greatly astonished, and he used the binomial distribution to calculate the probability of such a thing in the sample if male and female had equal probabilities in the population. He said that the answer was the 82nd power of one half.
However, he looked at the sample before he decided to see how many years had more live male births. What if, instead, there had been more females for each year? Then he would have used a different hypothesis test, would he not? Let us be modern now. I use 1 to represent a year with more male, and 0 to represent a year with more female. The null hypothesis says that the probability at 0 is 1/2 and the probability at 1 is 1/2. The alternative hypothesis says that the probability at 0 is 0 and the probability at 1 is 1. The quotient will be 0 at 0 and 2 at 1. That is, the quotient is "2*x". The quote marks in "2*x" are required.
[ [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1], "2*x" ]The p-value for Dr. Arbuthnot’s sample is still astonishing, but much less astonishing than he thought. We are not using his binomial distribution, but we are now permitted to do optional stopping.
The user might get the impression that the power, or the efficiency, or something, has gone down to nearly zero. Maybe things are not that bad. Let us go back 29 years into the past:
var nn=82-29; var temp=[]; for( var jj=0;jj<nn;jj++ )temp[jj]=1; [ temp, "2*x" ]
The user is respectfully invited to select and copy this program, and so on. The Bonferroni corrected p-value for 82 years is a little smaller than the uncorrected p-value for 82-29 years.
Hypotheses: The null hypothesis asserts that the probabilities at 1 and at 0 are 1/2 and 1/2. The alternative hypothesis asserts that the probabilities at 1 and at 0 are respectively 1 and 0.
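The comparison between 82 corrected years and 82-29 uncorrected years can be worked out directly. The sketch below assumes the weight 1 / 128 / Math.pow(129,formulaLength) and a quotient of 2 for every year; the variable names are invented for the illustration.

```javascript
const formula = "2*x";
const weight = 1 / 128 / Math.pow(129, formula.length);

// Every year contributes a quotient of 2, so the bound is 1 over 2^82.
const uncorrected82 = Math.pow(2, -82);
const corrected82 = uncorrected82 / weight;

// The uncorrected bound for the shorter sample of 82-29 years.
const uncorrected53 = Math.pow(2, -(82 - 29));

console.log(corrected82, uncorrected53, corrected82 / uncorrected53);
```

The last number printed is the ratio of the two bounds, which comes out at roughly one half: the price of the correction for this three-character formula is a little less than 29 years of data.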
What if more females?
What if Dr. Arbuthnot had found more female live births for each of the 82 years? Then he would have put 82 zeroes in the x row. He would say that his alternative hypothesis placed probability 1 at 0 and probability 0 at 1, but his null hypothesis would as before place 1/2 at 0 and 1/2 at 1. His quotient would then be 2 at 0 and 0 at 1. This is "2-x*2". So we use the array
[ [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], "2-x*2" ]The user is respectfully invited as before. The Bonferroni-corrected p-value for 82 female years is much bigger than that for 82 male years, but it is still impressively small.
Hypotheses: The null hypothesis asserts that the probabilities at 1 and at 0 are 1/2 and 1/2. The alternative hypothesis asserts that the probabilities at 1 and at 0 are respectively 0 and 1.
A two-dimensional problem
The previous samples had only one dimension. Here is a fictitious sample in two dimensions. The first row of numbers is the x values of the points, and the second row is the y values.
[ [144,96,54,47,51,240,42,32,215,136,224,80,176,208,31,224,144,158,240,89,29,29,48,160,116,160,96,224,192,43,240,96,74,227,240,128,48,192,162,123,192,115,80,179,107,22,240,28,73,128,191,15,57,95,58,240,96,176,112,176,224,224,192,240,85,89,41,32,176,141], [2,47,208,208,176,136,176,240,224,192,89,15,46,150,160,219,12,192,92,176,80,192,40,35,192,73,18,32,156,48,220,66,128,240,50,40,22,116,224,128,38,128,57,208,160,128,49,128,144,41,208,48,160,96,160,101,68,101,17,107,106,179,98,65,208,128,224,0,28,176], "max(x,y)%16==0?16:0" ]
The values in the x row and the y row seem to lie on the integers between 0 and 255 inclusive, but the (x,y) points cannot be uniformly distributed in a square, as one might at first have thought, because the larger of x and y is always divisible by 16. The number of points having this property is less than 1/16 of the total number of points. The reader is respectfully invited to select and copy, and so on, as before.
Hypotheses: The null hypothesis asserts uniformity over the integers in a square with edge size 256, where the x and y are integers between 0 and 255 inclusive. The alternative hypothesis asserts uniformity over only those points of the square whose larger co-ordinate is divisible by 16. That is, the alternative probability on such a point is more than 16 times the null probability on such a point.
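The claim about “less than 1/16” can be checked by counting. The sketch below simply walks over the whole square; the variable names are my own.

```javascript
// Count the lattice points of the 256-by-256 square whose larger coordinate
// is divisible by 16.
const edge = 256;
let favored = 0;
for (let x = 0; x < edge; x++) {
  for (let y = 0; y < edge; y++) {
    if (Math.max(x, y) % 16 === 0) favored++;
  }
}
const total = edge * edge;

// favored is compared with one sixteenth of the square, and the last number
// is the exact likelihood quotient on a favored point.
console.log(favored, total / 16, total / favored);
```

The count is 3856, against 4096 for one sixteenth of the square, so the exact quotient on a favored point is nearly 17. The quoted formula uses 16 instead, which makes the alternative sum to a little less than one and keeps the test on the conservative side.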
More than three variables
The sample of this section is fictitious, but it is inspired by a shocking demonstration that I saw on the Web. The reader is respectfully invited to click on
http://www.cs.pitt.edu/~kirk/cs1501/animations/Random.html
and run the “3D” demonstration. Tilting the 3D graph with the mouse will show the Marsaglia planes. Returning to my fictitious sample, here are four rows each containing 85 whole numbers from 0 to 8 inclusive.
variables=[ "p","q","r","s" ]; [ [3,6,4,6,1,0,3,2,0,8,4,0,2,7,2,6,5,8,0,5,7,6,2,1,2,4,2,5,4,5,7,4,0,2,8,4,3,5,6,0,8,7,1,1,6,2,2,5,4,5,5,4,7,1,6,4,2,2,8,8,1,2,6,3,1,8,1,6,8,4,3,4,5,8,4,3,7,4,4,1,5,2,6,1,0], [1,5,6,3,0,2,3,4,6,6,7,3,7,3,5,0,5,8,0,5,8,3,4,7,8,0,6,3,3,3,6,7,2,7,8,5,2,0,2,2,8,0,7,6,3,4,1,2,0,8,6,4,6,0,4,2,7,6,2,3,0,3,8,4,8,1,1,6,2,7,4,6,7,6,5,8,4,8,6,2,1,0,6,3,8], [8,7,3,6,3,1,0,1,0,0,3,7,6,3,3,3,7,0,4,6,3,5,8,2,2,2,5,6,6,2,3,4,2,5,6,1,3,4,5,8,2,0,4,3,3,6,3,7,3,7,4,1,3,4,4,8,3,4,4,2,3,5,3,1,2,0,4,6,2,8,4,5,7,3,7,2,4,6,4,8,8,4,6,7,1], [6,0,5,3,5,6,3,2,3,4,4,8,3,5,8,0,1,2,5,2,0,4,4,8,6,3,5,4,5,8,2,3,5,4,5,8,1,0,5,8,0,2,6,8,6,6,3,4,2,7,3,0,2,4,4,4,6,6,4,5,5,8,1,1,7,0,3,0,6,8,7,3,8,1,2,5,3,0,4,7,4,3,0,7,0], "(p+q+r+s)%9==0?9:0" ]
Could one think that the 85 points are uniformly distributed in a four-dimensional hypercubical lattice? After all, the digits seem to be nearly uniformly distributed in each row, and each pair of rows seems to show nearly zero correlation. However, let no one be deceived: each vertical sum of four digits is exactly divisible by nine. That is, only one-ninth of the available points are in use, so their probability is nine times what one might have thought, and the probability of the other points is zero.
The new difficulty here is that the English alphabet has no letter after "z", so I have provided no variable for the fourth row. The user will instead choose her/his own variable names, each exactly one letter of course, and tell the program about them. The letters chosen here are "p", "q", "r", and "s". The global called variables is loaded with an array containing these names by a JavaScript statement preceding the array of data. Do please remember to use a semicolon at the end of the statement.
Hypotheses: The null hypothesis asserts that the probabilities for all the points in the hypercubical lattice have the same value. The alternative hypothesis asserts that points whose coordinates sum to an exact multiple of 9 have all the probability, nine times as much as they would in the null hypothesis, and the other points have zero probability.
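The effect of the user-chosen variable names, and the “one-ninth” claim, can both be illustrated in a few lines. This is a sketch of one possible technique, not the page's own source: it assumes that each name in variables becomes one parameter of a function built from the quoted formula.

```javascript
const variables = ["p", "q", "r", "s"];
const formula = "(p+q+r+s)%9==0?9:0";

// Hypothetical: one function parameter per declared variable name.
const quotient = new Function(...variables, "with (Math) { return (" + formula + "); }");

// Walk over the whole lattice of digits 0..8 in four dimensions and count
// the points that the alternative hypothesis keeps in use.
let inUse = 0;
let total = 0;
for (let p = 0; p < 9; p++) {
  for (let q = 0; q < 9; q++) {
    for (let r = 0; r < 9; r++) {
      for (let s = 0; s < 9; s++) {
        total++;
        if (quotient(p, q, r, s) > 0) inUse++;
      }
    }
  }
}
console.log(inUse, total, total / inUse);
console.log(quotient(3, 1, 8, 6)); // the first column of the sample
```

Exactly one lattice point in nine has a digit sum divisible by nine, so a quotient of 9 on those points makes the alternative sum to exactly one.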
Regression
The reader who plots the first and second rows of the following (fictitious) array will see that the curve appears to be parabolic. Indeed, it appears to be y=4*x*(1-x/1e3).
[ [129,125,561,103,743,325,892,492,529,686,552,241,574,499,247,830,719,291,610,541,232,38,709,588,369,155,172,492,105,2,740,960,468,321,61,148,42,917,986,552,175,853,442,665,382,648,223,232,205,264,651,165,549,156,134], [449,437,985,369,763,877,385,999,996,861,989,731,978,999,743,564,808,825,951,993,712,146,825,969,931,523,569,999,375,7,769,153,995,871,229,504,160,304,55,989,577,501,986,891,944,912,693,712,651,777,908,551,990,526,464], "abs(4*x*(1-x/1e3)-y)<=1?1e3/3:0" ]
Let the alternative hypothesis assert that the conditional likelihood for y given x is uniform on the integers between 4*x*(1-x/1e3)-1 and 4*x*(1-x/1e3)+1 inclusive. (There may be either two or three such integers, and three makes the more conservative test.) Let the null hypothesis assert that the conditional likelihood for y given x is uniform on the integers between zero and 999 inclusive. The jsBonfer method will not know that we are doing a conditional quotient instead of an unconditional quotient. The reader is respectfully invited to change to different quoted formulas to see what happens. It is legal to pick the formula most favorable to us, because we are doing a multiple inference with the help of the Bonferroni correction.
Hypotheses: The null hypothesis asserts that the conditional likelihood for y given x is uniform on the integers between zero and 999, inclusive. The alternative hypothesis asserts that the conditional likelihood for y given x is uniform on the integers within 1 unit of 4*x*(1-x/1e3), inclusive.
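The remark about “two or three such integers” can be made concrete. In the sketch below the function names curve and integersWithinOne are my own; the formula for the curve is the one in the text.

```javascript
// The parabola from the text.
function curve(x) {
  return 4 * x * (1 - x / 1e3);
}

// List the integers that lie within 1 unit of the curve at x, inclusive.
function integersWithinOne(x) {
  const c = curve(x);
  const found = [];
  for (let k = Math.ceil(c - 1); k <= Math.floor(c + 1); k++) {
    found.push(k);
  }
  return found;
}

console.log(integersWithinOne(129)); // the curve passes between two integers
console.log(integersWithinOne(250)); // the curve lands exactly on an integer
```

When the curve falls exactly on an integer there are three admissible values of y, and otherwise there are two. Dividing by three in every case, as the quoted formula does with 1e3/3, never overstates the alternative likelihood.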
A simple Markov chain
Every Markov chain problem can be converted into a regression problem by using the data more than once. Here is a fictitious sample:
175, 170, 145, 20, 103, 164, 115, 47, 61, 131, 127, 107, 7, 38, 16, 83, 64, 146, 25, 128, 112, 32, 163, 110, 22, 113, 37, 11, 58, 116, 52, 86, 79, 44, 46, 56, 106, 2, 13, 68, and 166. After some trial and error this is seen to be y==(5*x+3)%177 where x is the previous value and y is the present value. This is exactly true, without noise, for the numbers in the sample. To make the array, first use all the numbers except the rightmost, and then use all the numbers except the leftmost:
[ [175,170,145,20,103,164,115,47,61,131,127,107,7,38,16,83,64,146,25,128,112,32,163,110,22,113,37,11,58,116,52,86,79,44,46,56,106,2,13,68], [170,145,20,103,164,115,47,61,131,127,107,7,38,16,83,64,146,25,128,112,32,163,110,22,113,37,11,58,116,52,86,79,44,46,56,106,2,13,68,166], "y==(5*x+3)%177?177:0" ]
Hypotheses: The null hypothesis asserts that the numbers are distributed uniformly on the discrete domain from zero to 176, inclusive. The alternative hypothesis asserts that given x we must have y==(5*x+3)%177, and the probability of any other value of y must be zero.
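The two rows need not be typed by hand. A minimal sketch, with variable names of my own choosing, builds them from the chain by dropping one number at either end and then checks the recurrence on every pair.

```javascript
const chain = [175,170,145,20,103,164,115,47,61,131,127,107,7,38,16,83,64,146,25,128,112,32,163,110,22,113,37,11,58,116,52,86,79,44,46,56,106,2,13,68,166];

// x is the previous value: everything except the rightmost number.
const xRow = chain.slice(0, -1);
// y is the present value: everything except the leftmost number.
const yRow = chain.slice(1);

// Check the recurrence y==(5*x+3)%177 on every consecutive pair.
const allMatch = xRow.every((x, i) => yRow[i] === (5 * x + 3) % 177);

console.log(xRow.length, yRow.length, allMatch);
```

Each number except the first and the last is used twice, once as a y and once as an x, which is what is meant by using the data more than once.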
A less simple Markov chain
Some people might say that the preceding example is unrealistic. It is too simple and has no noise. Here is a (fictitious) less simple, more noisy chain: 717, 521, 715, 522, 713, 523, 711, 524, 709, 525, 707, 526, 705, 527, 703, 528, 701, 529, 699, 530, 697, 531, 695, 532, 693, 533, 691, 534, 689, 535, 687, 536, 685, 537, 683, 538, 681, 539, 679, 540, 677, 541, 675, 542, 673, 543, 671, 544, 669, 545, 667, and 546. This requires three rows instead of only two. The top row lacks two numbers on the right. The middle row lacks one number on the left and one on the right. The third row lacks two numbers on the left. These rows are, as usual, called x, y, and z. The z row is the present. The y row is the time before the present. The x row is the time before the time before the present. It seems that nearly z=.99*x+.01*y. Maybe we ought to allow one unit of error. Here is the array to use:
[ [717,521,715,522,713,523,711,524,709,525,707,526,705,527,703,528,701,529,699,530,697,531,695,532,693,533,691,534,689,535,687,536,685,537,683,538,681,539,679,540,677,541,675,542,673,543,671,544,669,545], [521,715,522,713,523,711,524,709,525,707,526,705,527,703,528,701,529,699,530,697,531,695,532,693,533,691,534,689,535,687,536,685,537,683,538,681,539,679,540,677,541,675,542,673,543,671,544,669,545,667], [715,522,713,523,711,524,709,525,707,526,705,527,703,528,701,529,699,530,697,531,695,532,693,533,691,534,689,535,687,536,685,537,683,538,681,539,679,540,677,541,675,542,673,543,671,544,669,545,667,546], "abs(.99*x+.01*y-z)<=1?1e3/3:0" ]
Hypotheses: The null hypothesis asserts that the conditional distribution of z given y and x is uniform on the integers from zero to 999. The alternative hypothesis asserts that the conditional distribution of z given y and x is uniform on the integers not farther from .99*x+.01*y than 1 unit.
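The three rows can likewise be cut from the chain by a program. In this sketch the chain is regenerated by a loop rather than typed out, on my own observation that the listed numbers alternate between a sequence falling by 2 from 717 and a sequence rising by 1 from 521; the variable names are also my own.

```javascript
// Rebuild the 52 numbers of the chain: 717, 521, 715, 522, ..., 667, 546.
const chain = [];
for (let i = 0; i < 26; i++) {
  chain.push(717 - 2 * i, 521 + i);
}

// x lacks two numbers on the right, y lacks one on each side,
// and z lacks two numbers on the left.
const xRow = chain.slice(0, -2);
const yRow = chain.slice(1, -1);
const zRow = chain.slice(2);

// The largest distance between .99*x+.01*y and z over the whole sample.
const worst = Math.max(
  ...xRow.map((x, i) => Math.abs(0.99 * x + 0.01 * yRow[i] - zRow[i]))
);

console.log(xRow.length, yRow.length, zRow.length, worst);
```

The largest error stays below one unit, so every column of the array earns the quotient 1e3/3 and none earns zero.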
Second thoughts about Dr. Arbuthnot
It occurs to me now that Dr. Arbuthnot did some of the statistical calculation in his head, namely, finding out for each year whether there were more male births or more female births. Really we ought to account for all the calculation. Here is an array showing for each column the year, the number of male births, and the number of female births.
[ [1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639, 1640, 1641, 1642, 1643, 1644, 1645, 1646, 1647, 1648, 1649, 1650, 1651, 1652, 1653, 1654, 1655, 1656, 1657, 1658, 1659, 1660, 1661, 1662, 1663, 1664, 1665, 1666, 1667, 1668, 1669, 1670, 1671, 1672, 1673, 1674, 1675, 1676, 1677, 1678, 1679, 1680, 1681, 1682, 1683, 1684, 1685, 1686, 1687, 1688, 1689, 1690, 1691, 1692, 1693, 1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703, 1704, 1705, 1706, 1707, 1708, 1709, 1710], [5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 5359, 5366, 5518, 5470, 5460, 4793, 4107, 4047, 3768, 3796, 3363, 3079, 2890, 3231, 3220, 3196, 3441, 3655, 3668, 3396, 3157, 3209, 3724, 4748, 5216, 5411, 6041, 5114, 4678, 5616, 6073, 6506, 6278, 6449, 6443, 6073, 6113, 6058, 6552, 6423, 6568, 6247, 6548, 6822, 6909, 7577, 7575, 7484, 7575, 7737, 7487, 7601, 7909, 7662, 7602, 7676, 6985, 7263, 7632, 8062, 8426, 7911, 7578, 8102, 8031, 7765, 6113, 8366, 7952, 8379, 8239, 7840, 7640], [4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 4952, 4784, 5332, 5200, 4910, 4617, 3997, 3919, 3395, 3536, 3181, 2746, 2722, 2840, 2908, 2959, 3179, 3349, 3382, 3289, 3013, 2781, 3247, 4107, 4823, 4881, 5681, 4858, 4319, 5322, 5560, 5829, 5719, 6061, 6120, 5822, 5738, 5717, 5847, 6203, 6033, 6041, 6299, 6533, 6744, 7158, 7127, 7246, 7119, 7214, 7101, 7167, 7302, 7392, 7316, 7483, 6647, 6713, 7229, 7767, 7626, 7452, 7061, 7514, 7656, 7683, 5738, 7779, 7417, 7687, 7623, 7380, 7288], "y>z?2:0" ]
The reader sees that the formula, "y>z?2:0", is somewhat longer than what was used before for Dr. Arbuthnot’s data, but the null hypothesis is still rejected.
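Let α be a real number strictly between 0 and 1. For some integers j let real numbers w_j be strictly positive, and let Σ_j( w_j ) ≤ 1. For the same integers j let real numbers p_j be p-values of hypothesis tests. Then, if every one of the null hypotheses is true, the probability that p_j ≤ α*w_j for at least one j is at most α.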
[ [.6], "abs(x-.6)<.5e-300?1e300:0" ]
I respectfully invite the reader to select and copy the above array, including all the rows and square brackets, to move the mouse to the upper text area, to click on the “Clear” button if needed, to paste into the upper text area, and to click on the “jsBonfer” button. The results will be
formulaLength=25
Bonferroni weight=1.3429111288008964e-55
uncorrected p<=1e-300
Bonferroni p<=7.446509143854616e-246
That Bonferroni-corrected p-value is of course ridiculous.
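These four results can be recomputed from the formulas given earlier on this page. The sketch assumes the weight 1 / 128 / Math.pow(129,formulaLength) and evaluates the quoted formula by hand for the single datum .6; the variable names are invented for the illustration.

```javascript
const formula = "abs(x-.6)<.5e-300?1e300:0";
const formulaLength = formula.length;

// The Bonferroni weight in its simplified form.
const weight = 1 / 128 / Math.pow(129, formulaLength);

// The quoted formula, evaluated by hand at the one datum x = .6.
const x = 0.6;
const quotient = Math.abs(x - 0.6) < 0.5e-300 ? 1e300 : 0;

// One datum only, so the product of quotients is just this one quotient.
const uncorrected = 1 / quotient;
const corrected = uncorrected / weight;

console.log("formulaLength=" + formulaLength);
console.log("Bonferroni weight=" + weight);
console.log("uncorrected p<=" + uncorrected);
console.log("Bonferroni p<=" + corrected);
```

The last digits printed may differ slightly from those shown above, depending on the order in which the floating-point operations are carried out.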
2/3
and move the mouse to the upper text area, and click on the “Clear” button, and paste into the text area, and click on the “Run a program” button. The answer will be printed in the second text area.
To add up the whole numbers from 17 to 53 inclusive, please select and copy
var sum=0; for( var jj=17;jj<=53;jj++ )sum+=jj; sum;
and move the mouse to the upper text area as before, and click on the “Clear” button, and paste into the text area, and click on the “Run a program” button. The answer will be printed in the second text area, as before.
Of course, the user’s own programs can be typed directly into the upper text area. Users who have not programmed in JavaScript before are warned that it has the third worst diagnostics in all computing.
Indebtedness
I am indebted to a paper by Rissanen (1983) for telling me about the idea of description length and for showing the need to have a fixed number of fractional digits in the binary representation of each datum. I have simplified this to the need to have discreteness of the sample space.
I am indebted to the book by Miller (1966) for my knowledge of the Bonferroni method.
Bibliography
Arbuthnot, John (1710), “An argument for Divine Providence, taken from the constant Regularity observ'd in the Births of both Sexes,” Philosophical Transactions of the Royal Society, Volume 27, pages 186-90. This is also on the Web at http://www.taieb.net/auteurs/Arbuthnot/arbuth.html
Babyak, Michael A. (2004), “What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models,” Psychosomatic Medicine, Volume 66, pages 411-421. This is also on the Web at http://www.psychosomaticmedicine.org/cgi/content/full/66/3/411
Doob, J. L., Stochastic Processes, John Wiley & Sons, Inc. New York, London, Sydney. 1953.
Feller, William, An Introduction to Probability Theory and Its Applications, Volume II, John Wiley & Sons, Inc., New York, 1966.
Goodman, Nelson, Fact, Fiction, and Forecast, Fourth Edition, Harvard University Press, 2006. For just grue and bleen see http://www-math.mit.edu/~tchow/grue.html
Loève, Michel, Probability Theory, second edition, D. van Nostrand Company, Inc., Princeton, New Jersey, 1960.
Miller, Rupert G., Jr., Simultaneous Statistical Inference, McGraw-Hill Book Company, New York, San Francisco, St. Louis, London, Toronto, Sydney, 1966.
Rissanen, J. (1983), “A Universal Prior for Integers and Estimation by Minimum Description Length,” Annals of Statistics, Volume 11, Number 2, pages 416-431. This is also on the Web at http://projecteuclid.org/euclid.aos/1176346150
Solomonoff, R. J. (1960), “A Preliminary Report on a General Theory of Inductive Inference.” This is on the Web at http://world.std.com/~rjs/z138.pdf
Taleb, Nassim Nicholas, The Black Swan: The Impact of the Highly Improbable, Random House, 2007.
License, revision date, and e-mail address
The data of Dr. Arbuthnot are perhaps copyrighted by the Philosophical Transactions, or perhaps by Elisabeth Millet, the transcriber. The remainder of the present file is in the public domain. It is revised 31 May 2008. Constructive and destructive remarks come to me, Harold Kaplan,
at dot
smtw2gh toadmail com