The following sections give some examples showing how the programs are used. The user is respectfully invited to try out these examples or to use any others. The only thing to remember is to follow the grammatical rules of JavaScript, because the “eval” method of JavaScript is used to pick up the data from the text area. Also, the user is respectfully reminded that an integer beginning with a zero digit will be understood to be in base eight.
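As a rough illustration of these two points, here is a minimal sketch of reading pasted text with eval. It is not the page's own code, which is not shown here, and the helper name readSample is hypothetical. An indirect eval is used so that the old leading-zero rule still applies if the snippet happens to be run in strict mode.

```javascript
// Hypothetical helper: turn pasted text into a JavaScript value with eval.
// (0, eval) is an indirect eval, so the text is evaluated as ordinary
// non-strict global code, where a leading zero still means base eight.
function readSample(text) {
  return (0, eval)(text);
}

// Arithmetic inside the array is evaluated, so fractions can be typed as written.
const fractions = readSample("[ 23+4/8-17-3/8, 21-20 ]");
console.log(fractions);   // [ 6.125, 1 ]

// An integer with a leading zero is read in base eight.
const octal = readSample("[010, 10]");
console.log(octal);       // [ 8, 10 ]
```

Because the text is evaluated as JavaScript, a sample may be typed as any expression that yields an array, not only as a literal list of numbers.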
Which browsers?
Modern browsers such as Safari 3, Microsoft Internet Explorer 6, Netscape 7, and Opera 8 can run this page correctly. Netscape 4 is out of date and cannot run this page correctly.
Central tendency.
Let us suppose, for example, that a sample of four numbers
[1,2,13,15]
is drawn from a population. Let us do “central tendency.” For what reason? We cross-validators say that the reason is to predict a fifth number not yet drawn from that same population. Which estimator will predict best: mean or median or mid-range? The way to guess which estimator, we say, is to try these three estimators on the sample that we already have, pretending that one of the numbers is not yet seen, and “predicting” it from the others. If, for example, we leave out the 1 from the sample, then the remaining numbers are [2,13,15]. The sample median of them is 13. The prediction error is 13-1=12. The square of 12 is 144. (Gauss said to use the square of the error, and I follow him.) Then we put the 1 back into the sample and leave out 2 instead, and so on. Here is a table of the calculations:
| left out | remaining | prediction | error | square of error |
| 1 | [2,13,15] | 13 | 13-1=12 | 12*12=144 |
| 2 | [1,13,15] | 13 | 13-2=11 | 11*11=121 |
| 13 | [1,2,15] | 2 | 2-13=-11 | (-11)*(-11)=121 |
| 15 | [1,2,13] | 2 | 2-15=-13 | (-13)*(-13)=169 |
Summing the numbers on the extreme right, we have 144+121+121+169=555.
That is a sum, and we wish to have the average. It is 555/4=138+3/4, or 138.75 in decimal fraction form. That calculation was for the median. It ought also to be done for the mean and the mid-range. Also, thinking vaguely of null hypothesis, let us do it for zero. Also, for each estimator we ought to use all four numbers and try to predict the fifth number. The arithmetic is a bit too much for a human, but a computer can do it easily. I respectfully invite the user to select and copy the sample
[1,2,13,15]
and move up to the top of the present file, and click on the “Clear” button if there is something in the upper text area, and paste the sample into that upper text area, and click on the “Central tendency” button. The computer will print
mean has average squared error 70.55555555555556 and predicts 7.75
median has average squared error 138.75 and predicts 7.5
midRange has average squared error 45.3125 and predicts 8
zero has average squared error 99.75 and predicts 0
into the lower text area. It looks as though midRange has a smaller average squared error than the others, so maybe we ought to predict “8” for the fifth number.
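The leave-one-out arithmetic described above can be sketched as follows. This is a minimal sketch and not the page's own program; the names estimators and centralTendency are illustrative. Each estimator is applied to the sample with one number left out, the squared prediction errors are averaged, and the estimator is then applied to the whole sample to predict the next number.

```javascript
// Leave-one-out comparison of four predictors of the next number.
// The names used here are illustrative, not taken from the page's program.
const estimators = {
  mean: (a) => a.reduce((s, v) => s + v, 0) / a.length,
  median: (a) => {
    const s = [...a].sort((p, q) => p - q);
    const h = Math.floor(s.length / 2);
    return s.length % 2 ? s[h] : (s[h - 1] + s[h]) / 2;
  },
  midRange: (a) => (Math.min(...a) + Math.max(...a)) / 2,
  zero: () => 0,
};

function centralTendency(sample) {
  const report = {};
  for (const [name, estimate] of Object.entries(estimators)) {
    let sumSq = 0;
    sample.forEach((leftOut, i) => {
      // Pretend the i-th number is unseen and predict it from the others.
      const remaining = sample.filter((_, j) => j !== i);
      const error = estimate(remaining) - leftOut;
      sumSq += error * error;
    });
    report[name] = {
      averageSquaredError: sumSq / sample.length,
      predicts: estimate(sample), // prediction for the next, undrawn number
    };
  }
  return report;
}

const result = centralTendency([1, 2, 13, 15]);
console.log(result.median);   // { averageSquaredError: 138.75, predicts: 7.5 }
```

For the sample [1,2,13,15] the sketch gives the same four average squared errors as the printed output above, up to rounding in the last digits.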
That little sample was fictitious. It may be more entertaining to use a real sample. I quote some numbers from Charles Darwin, The effects of cross and self-fertilisation in the vegetable kingdom. New York, D. Appleton & Co., 1892. [first published London, John Murray, 1876.] from the web site http://darwin-online.org.uk/content/frameset?itemID=F1249&viewtype=text&pageseq=414&keywords=zea%20mays. The following are from Darwin’s page 16, columns II and III, subtracting III from II. Here II is crossed and III is self-fertilized.
[ 23+4/8-17-3/8, 12-20-3/8, 21-20, 22-20, 19+1/8-18-3/8, 21+4/8-18-5/8, 22+1/8-18-5/8, 20+3/8-15-2/8, 18+2/8-16-4/8, 21+5/8-18, 23+2/8-16-2/8, 21-18, 22+1/8-12-6/8, 23-15-4/8, 12-18 ]
This is of course the famous experiment explained by Sir Ronald Fisher, but I am not assuming Fisher’s hypothesis or using his test. I respectfully invite the user to select and copy that sample, move to the top of the present file, click on the “Clear” button, paste into the upper text area, and click on the “Central tendency” button. The computer will print
mean has average squared error 23.8499681122449 and predicts 2.6166666666666667
median has average squared error 22.033333333333335 and predicts 3
midRange has average squared error 27.9234375 and predicts 0.5
zero has average squared error 27.622916666666665 and predicts 0
The zero and midRange estimators look worse (in their average squared error) than the median and mean. Also, the median looks better than the mean.
[ [1, 6, 3, 7, 4], [3, 11, 7, 13, 9] ]
and to move up to the upper text area, and to click the “Clear” button if need be, and to paste into the area, and to click on the “Least squares” button. The computer will print
general has average squared error 0.6995442676005313 and prediction formula y=1.5964912280701762*(x-4.2)+8.6
horizontal has average squared error 18.5 and prediction formula y=8.6
origin has average squared error 1.4417029948974134 and prediction formula y=1.954954954954955*x
zero has average squared error 85.8 and prediction formula y=0
Surely in this instance the general line is the one to use. However, it is easy to invent samples where some other line seems best. Here is one:
[ [ 1, 2, 3, 4, 5, 6], [12, 13, 15, 14, 13, 11] ]
Again I respectfully invite the user to try it. The computer says
general has average squared error 4.562661016299636 and prediction formula y=-0.17142857142857146*(x-3.5)+13
horizontal has average squared error 2.4 and prediction formula y=13
origin has average squared error 53.93493791316232 and prediction formula y=2.967032967032967*x
zero has average squared error 170.66666666666666 and prediction formula y=0
The horizontal line now seems best. Here is another fun sample:
[ [ -0.7559933604175628, -0.30452469011366007, -11.97588016679282, -1.7202512228121947, -0.09708418781061355, -1214.8091069909535, 0.1404723127750845, 2.04478700550015, 2.4752865344391455, -1.2807430831786153 ], [ 3.1199546932110485, 2.0056226789993685, -0.5875348207012748, 0.19449125144129242, -0.6051379450630253, 1.4461855036062097, -5.0707078525702425, 1.788018414857512, -0.7742942455630042, 0.5360292436125873 ] ]
This time the computer says
general has average squared error 187.0856248007983 and prediction formula y=-0.0011323640619636944*(x+122.62830378493645)+0.20526269218304724
horizontal has average squared error 5.675546574408692 and prediction formula y=0.20526269218304724
origin has average squared error 113.39142004038527 and prediction formula y=-0.0011875354985374168*x
zero has average squared error 4.639325498073273 and prediction formula y=0
No engineer ought to trust that general line, and the origin line is nearly as bad. The zero line seems best. Let us finish this section with an example where the origin line seems best:
[ [2,2,3,4,6], [1,2,3,4,5] ]
The computer says
general has average squared error 0.9699801098650603 and prediction formula y=0.892857142857142*(x-3.4)+3
horizontal has average squared error 3.125 and prediction formula y=3
origin has average squared error 0.33098900129960634 and prediction formula y=0.8840579710144928*x
zero has average squared error 11 and prediction formula y=0
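The four lines compared in this section can be sketched as follows. This is a minimal sketch under my reading of the printed formulas, not the page's own program: the first inner array is taken as the x values and the second as the y values, each (x,y) point is left out in turn, each line is refitted to the remaining points, and the squared error at the left-out point is averaged. The names lines and leastSquares are illustrative.

```javascript
// Each fitter takes training arrays xs, ys and returns a prediction function.
const sum = (a) => a.reduce((s, v) => s + v, 0);
const mean = (a) => sum(a) / a.length;

const lines = {
  // y = slope*(x - xBar) + yBar, the ordinary least squares line
  general: (xs, ys) => {
    const xBar = mean(xs), yBar = mean(ys);
    const sxy = sum(xs.map((x, i) => (x - xBar) * (ys[i] - yBar)));
    const sxx = sum(xs.map((x) => (x - xBar) ** 2));
    const slope = sxy / sxx;
    return (x) => slope * (x - xBar) + yBar;
  },
  // y = yBar, ignoring x
  horizontal: (xs, ys) => { const yBar = mean(ys); return () => yBar; },
  // y = slope*x, the least squares line forced through the origin
  origin: (xs, ys) => {
    const slope = sum(xs.map((x, i) => x * ys[i])) / sum(xs.map((x) => x * x));
    return (x) => slope * x;
  },
  // y = 0, whatever the data say
  zero: () => () => 0,
};

function leastSquares([xs, ys]) {
  const report = {};
  for (const [name, fit] of Object.entries(lines)) {
    let sumSq = 0;
    for (let i = 0; i < xs.length; i++) {
      const keep = (_, j) => j !== i;            // leave out point i
      const predict = fit(xs.filter(keep), ys.filter(keep));
      sumSq += (predict(xs[i]) - ys[i]) ** 2;
    }
    report[name] = sumSq / xs.length;            // average squared error
  }
  return report;
}

const errors = leastSquares([[1, 6, 3, 7, 4], [3, 11, 7, 13, 9]]);
console.log(errors);
```

For the first sample of this section the sketch agrees with the printed average squared errors up to rounding. It makes no attempt to handle a subset whose x values are all equal, where the general slope is undefined.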
[1,2,4]
and click on “Clear” if need be, and paste into the upper text area, and click on “Multinomial”. The computer will say
nonUniform has average squared error 0.7777777777777778 and predicts [0.14285714285714285, 0.2857142857142857, 0.5714285714285714]
uniform has average squared error 0.6666666666666667 and predicts 0.3333333333333333
The uniform predictor looks a little better. On the other hand,
[11,23,39]
gives
nonUniform has average squared error 0.6091820987654322 and predicts [0.1506849315068493, 0.3150684931506849, 0.5342465753424658]
uniform has average squared error 0.6666666666666667 and predicts 0.3333333333333333
so it seems better to use the nonUniform predictor. The user is respectfully invited to check this.
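One way to check these numbers is sketched below. The scoring rule is inferred from the printed output and is not stated on the page: each observation is left out in turn, the category probabilities are estimated from the remaining observations, and the error is the squared distance between the predicted probabilities and the indicator (1 for the category actually observed, 0 for the others). The function name multinomial is illustrative.

```javascript
// counts[k] is the number of observations seen in category k.
function multinomial(counts) {
  const n = counts.reduce((s, c) => s + c, 0);
  const k = counts.length;
  let nonUniform = 0;
  let uniform = 0;
  counts.forEach((count, left) => {
    if (count === 0) return;               // nothing to leave out in this category
    // Probabilities estimated from the n-1 observations that remain.
    const probs = counts.map((c, j) => (j === left ? c - 1 : c) / (n - 1));
    let errNon = 0, errUni = 0;
    for (let j = 0; j < k; j++) {
      const actual = j === left ? 1 : 0;   // indicator of the left-out category
      errNon += (actual - probs[j]) ** 2;
      errUni += (actual - 1 / k) ** 2;
    }
    // The same error occurs once for each observation in the category.
    nonUniform += count * errNon;
    uniform += count * errUni;
  });
  return { nonUniform: nonUniform / n, uniform: uniform / n };
}

const small = multinomial([1, 2, 4]);
const large = multinomial([11, 23, 39]);
console.log(small, large);
```

With this rule the sketch gives 7/9 against 2/3 for [1,2,4], and about 0.6092 against 2/3 for [11,23,39], matching the printed values.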
[ [1,0,0,0,1,2,0,0,1,0,1,0], [1,0,0,1,0,0,0,0,0,1,0,2], [1,0,0,0,2,1,0,0,0,0,0,1], [3,0,2,0,0,0,1,0,1,3,1,1], [2,1,1,1,1,1,1,1,1,1,1,0], [2,0,0,0,1,0,0,0,0,0,0,0], [2,0,2,1,0,0,0,0,1,1,1,2], [0,0,0,3,0,0,1,0,0,1,0,2], [0,0,0,1,1,0,0,0,0,0,1,0], [1,1,0,2,0,0,1,0,0,1,1,0], [0,1,1,1,2,0,0,2,0,1,1,0], [0,1,1,0,0,0,1,0,0,0,0,0] ]
Diaconis and Sturmfels say, on their page 363, “The classical rules of thumb for validity of the chi-square approximation (minimum 5 per cell) are badly violated here, and there are too many tables with these margins to permit exact enumeration.” They are speaking of doing a test of a null hypothesis, using either Pearson’s chi-squared approximation or else an exact test. For us cross-validators the problem is much easier. Ought we to build our predictor by multiplying the sample marginal probabilities, or ought we to use each sample cell probability separately? The user is respectfully invited to select and copy the table, move to the top of the present file, click on the “Clear” button, paste, and click on the “Contingency, perhaps with structural zeroes” button. The computer will then say
separate has average squared error 1.005639384240208
multiply has average squared error 0.994447958748549
bigCheck is 1.9212062496443139e-16
grand total=82
side is [0.07317073170731708, 0.060975609756097573, 0.060975609756097573, 0.14634146341463417, 0.14634146341463417, 0.03658536585365854, 0.12195121951219515, 0.0853658536585366, 0.03658536585365854, 0.0853658536585366, 0.10975609756097562, 0.03658536585365854]
bottom is [0.15853658536585363, 0.048780487804878044, 0.08536585365853656, 0.1219512195121951, 0.09756097560975609, 0.048780487804878044, 0.06097560975609755, 0.03658536585365853, 0.048780487804878044, 0.1097560975609756, 0.08536585365853656, 0.09756097560975609]
It seems that the multiply predictor is preferred. The bigCheck number is meant to give an idea of how well the algorithm did, and smaller is better. The grand total is the usual total of all the numbers in the rectangle. The side array is the side marginal sample probabilities, and the bottom array is the bottom marginal sample probabilities. The algorithm is the EM algorithm with random starting position, so if the button is clicked a second time the printed numbers will be a little different. (I remember that the EM is a hill-climber, and I fear that there may be more than one hill. So far, I have not found more than one.)
[ [5,0,0], [6,4,0], [6,5,7] ]
Copying and pasting and clicking the “Contingency, perhaps with structural zeroes” button gives, incorrectly,
separate has average squared error 0.880859375
multiply has average squared error 0.8928465409712358
bigCheck is 0
grand total=33
side is [0.15151515151515152, 0.30303030303030304, 0.5454545454545454]
bottom is [0.5151515151515151, 0.2727272727272727, 0.21212121212121213]
The separate prediction seems to be preferred, but this is a logical mistake. Those three zeroes above the diagonal did not just happen in this sample, but rather they are structural zeroes. We need a way of saying this. I have defined the JavaScript variable sz in a special way to say this:
[ [5,sz,sz], [6, 4,sz], [6, 5, 7] ]
The user is respectfully invited to select and copy this new array and move to the top of the file and clear the upper text area and paste into the area and click on the “Contingency, perhaps with structural zeroes” button. The computer will print something like
separate has average squared error 0.880859375
multiply has average squared error 0.8708238998688727
bigCheck is 9.512732268568413e-11
grand total=33
side is [0.29411764705406557, 0.33613445377990797, 0.36974789916602646]
bottom is [0.5151515151515151, 0.38636363636103216, 0.5736914600452244]
(Because of starting with random numbers, some of my printed digits may differ from what the user may get.) This time, the multiply prediction looks better. The user will notice that the bottom array numbers do not add to one. That is, they are not probabilities. It is impossible for both side and bottom to add to one, because the prediction number at (eye,j) is not just side[eye]*bottom[j] but rather side[eye]*bottom[j]*w[eye][j], where w[eye][j] is zero at structural zeroes and one otherwise. I arbitrarily chose to make the side numbers add to one. Of course, that does not make the side numbers probabilities. It is only a convenience.
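Table 5: Classification of Purum marriages
(rows: sib of wife; columns: sib of husband)
| Sib of wife | Marrim | Makan | Parpa | Thao | Kheyang |
| Marrim | [0] | 5 | 17 | [0] | 6 |
| Makan | 5 | [0] | 0 | 16 | 2 |
| Parpa | [0] | 2 | [0] | 10 | 11 |
| Thao | 10 | [0] | [0] | [0] | 9 |
| Kheyang | 6 | 20 | 8 | 0 | 1 |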
They in turn have copied the table from Das, T. (1945), The Purums: An Old Kuki Tribe of Manipur, Calcutta, University of Calcutta, and from White, H. C. (1963), An Anatomy of Kinship, p. 138, Englewood Cliffs, N. J., Prentice-Hall. Aoki and Takemura write [0] to signify a structural zero, but in JavaScript this will not work correctly, so I use my notation as before:
[ [sz, 5, 17, sz, 6], [ 5, sz, 0, 16, 2], [sz, 2, sz, 10, 11], [10, sz, sz, sz, 9], [ 6, 20, 8, 0, 1] ]
Selecting and copying and clearing and pasting and clicking as before will cause the computer to print
separate has average squared error 0.9202058404116807
multiply has average squared error 0.9447756959266765
bigCheck is 9.867436381871819e-11
grand total=128
side is [0.3586593251646624, 0.22062556730118016, 0.27873743889429287, 0.4718258426439395, 0.26057821624069105]
bottom is [0.17214837345257197, 0.2349035381026006, 0.23255277910801347, 0.2672904087882438, 0.14245393649757793]
or something close to it. One might say that this sample is not showing “quasi-independence.”
Table 6: Effects of decision alternatives on the verdicts and social perceptions of simulated jurors
(rows: alternative; columns: condition)
| alternative | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| first-degree | 11 | [0] | [0] | 2 | 7 | [0] | 2 |
| second-degree | [0] | 20 | [0] | 22 | [0] | 11 | 15 |
| manslaughter | [0] | [0] | 22 | [0] | 16 | 13 | 5 |
| not guilty | 13 | 4 | 2 | 0 | 1 | 0 | 2 |
They copy it from Vidmar, N. (1972), Effects of decision alternatives on the verdicts and social perceptions of simulated jurors, J. Personality and Social Psych., 22, pp. 211-218. Here is the table ready for JavaScript:
[ [11, sz, sz, 2, 7, sz, 2], [sz, 20, sz, 22, sz, 11, 15], [sz, sz, 22, sz, 16, 13, 5], [13, 4, 2, 0, 1, 0, 2] ]
The computer prints
separate has average squared error 0.5096325740307238
multiply has average squared error 0.5104330444848224
bigCheck is 8.938061575591849e-11
grand total=168
side is [0.12760005550672035, 0.9568328653925063, 0.6227911582949142, 0.09040342620713811]
bottom is [0.6552975288929133, 0.13641347612097152, 0.20030598375460854, 0.12159748309541292, 0.16990729490812684, 0.08554179325984283, 0.07946982477065141]
[ [1,2,3], [3,4,5,6], [5,6,7,8,9] ]
The algorithm leaves out each of the twelve measurements and tries to predict it from the remaining eleven measurements. If the user will be pleased to select and copy this array, and to click on the “Clear” button, and to paste in the upper text area, and to click on the “k-sample” button, then the computer will print
separate has average squared error 2.417824074074074 and predicts [2, 4.5, 7]
combined has average squared error 6.43801652892562 and predicts 4.916666666666667
That is, it seems better to predict in each population separately than to combine the samples as if there were only one population. However, a sample like
[ [1,2,3], [1,2,3,5], [1,2,4,5,7] ]
seems to go the other way:
separate has average squared error 4.640046296296297 and predicts [2, 2.75, 3.8]
combined has average squared error 3.96694214876033 and predicts 3
and combined seems to have a smaller average squared error than separate.
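The k-sample comparison can be sketched as follows. This is a minimal sketch consistent with the printed numbers, not the page's own program: the separate predictor uses the mean of the other measurements in the same sample, and the combined predictor uses the mean of all the other measurements. The function name kSample is illustrative, and every sample is assumed to hold at least two measurements.

```javascript
const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;

// samples is an array of arrays, one inner array per population.
// Assumes every sample has at least two measurements.
function kSample(samples) {
  const count = samples.flat().length;
  let separate = 0, combined = 0;
  samples.forEach((sample, s) => {
    sample.forEach((leftOut, i) => {
      // separate: predict from the rest of the same sample only
      const restOfSample = sample.filter((_, j) => j !== i);
      separate += (mean(restOfSample) - leftOut) ** 2;
      // combined: predict from every other measurement in every sample
      const restOfAll = samples.flatMap((other, t) =>
        t === s ? restOfSample : other);
      combined += (mean(restOfAll) - leftOut) ** 2;
    });
  });
  return { separate: separate / count, combined: combined / count };
}

const first = kSample([[1, 2, 3], [3, 4, 5, 6], [5, 6, 7, 8, 9]]);
const second = kSample([[1, 2, 3], [1, 2, 3, 5], [1, 2, 4, 5, 7]]);
console.log(first, second);
```

In the first example separate comes out smaller, and in the second combined comes out smaller, in agreement with the printed output.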
Gasoline Mileage for Various Cars
| compacts | 20.3 | 25.6 | 24.0 |
| intermediate 6’s | 21.2 | 24.7 | 23.1 |
| intermediate 8’s | 18.2 | 19.3 | 20.6 |
| full size 8’s | 18.6 | 19.3 | 19.8 |
| sports cars | 18.5 | 20.7 | 21.4 |
[ [20.3, 25.6, 24.0], [21.2, 24.7, 23.1], [18.2, 19.3, 20.6], [18.6, 19.3, 19.8], [18.5, 20.7, 21.4] ]
and to click on the “Clear” button, to paste into the upper text area, and to click on the “Blocks and treatments” button. The computer will print
separate has average squared error 17.358750000000004 and predicts [19.360000000000003, 21.919999999999998, 21.78]
combined has average squared error 20.55725000000001 and predicts 21.02
It appears that we ought to use the “separate” prediction.
How is it known in a problem like this which are the treatments and which are the blocks? Let us swap them by transposing the table:
var x= [ [20.3, 25.6, 24.0], [21.2, 24.7, 23.1], [18.2, 19.3, 20.6], [18.6, 19.3, 19.8], [18.5, 20.7, 21.4] ];
var y=transpose( x );
y;
As before, the user is respectfully invited to select and copy all this and to click on the “Clear” button, to paste into the upper text area, and to click on the “Blocks and treatments” button. The computer will print
separate has average squared error 21.835000000000008 and predicts [23.3, 23, 19.366666666666667, 19.233333333333334, 20.2]
combined has average squared error 34.05300000000001 and predicts 21.02
The question is different, so the answer is different.
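The page supplies its own transpose function, whose source is not shown here. For readers who want to see what such a function does, here is one possible version, under the assumption that the table is rectangular (every row the same length). The name transposeSketch is hypothetical, chosen so as not to be mistaken for the page's function.

```javascript
// A possible transpose for a rectangular array of rows; the page's own
// transpose function may differ in detail.
function transposeSketch(table) {
  return table[0].map((_, j) => table.map((row) => row[j]));
}

const x = [
  [20.3, 25.6, 24.0],
  [21.2, 24.7, 23.1],
  [18.2, 19.3, 20.6],
  [18.6, 19.3, 19.8],
  [18.5, 20.7, 21.4],
];
const y = transposeSketch(x);
console.log(y.length, y[0].length);  // 3 5
console.log(y[0]);                   // [ 20.3, 21.2, 18.2, 18.6, 18.5 ]
```

The five rows of three become three rows of five, so what were blocks are now treatments and the other way about.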
Of course, it is easy to invent a fictitious table where combined looks better than separate. Here is one:
[ [1,2,3,4], [2,3,4,1], [3,4,1,2], [4,1,2,3] ]
The computer will print
separate has average squared error 8.888888888888888 and predicts [2.5, 2.5, 2.5, 2.5]
combined has average squared error 5 and predicts 2.5
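The blocks-and-treatments comparison for a complete table can be sketched as follows. The rule is inferred from the printed numbers and is not the page's own program: each inner array is taken as one block, a whole block is left out at a time, each of its measurements is predicted either from the treatment (column) mean of the remaining blocks or from their grand mean, and the squared errors are summed over the left-out block and then averaged over blocks. The function name blocksAndTreatments is illustrative, and missing or repeated measurements are not handled.

```javascript
// table[b][t] is the measurement for block b (a row) and treatment t (a column).
// Assumes a complete table: no missing or repeated measurements.
const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;

function blocksAndTreatments(table) {
  let separate = 0, combined = 0;
  table.forEach((leftOutBlock, b) => {
    const rest = table.filter((_, r) => r !== b);   // leave out one whole block
    const grandMean = mean(rest.flat());
    leftOutBlock.forEach((actual, t) => {
      const treatmentMean = mean(rest.map((row) => row[t]));
      separate += (treatmentMean - actual) ** 2;
      combined += (grandMean - actual) ** 2;
    });
  });
  // Errors are summed within each left-out block and averaged over blocks.
  return {
    separate: separate / table.length,
    combined: combined / table.length,
  };
}

const latin = blocksAndTreatments([[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3]]);
const mileage = blocksAndTreatments([
  [20.3, 25.6, 24.0], [21.2, 24.7, 23.1], [18.2, 19.3, 20.6],
  [18.6, 19.3, 19.8], [18.5, 20.7, 21.4],
]);
console.log(latin, mileage);
```

Under this reading the sketch agrees with the printed average squared errors for both the fictitious table and the gasoline mileage table, up to rounding.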
[ [20.3, 25.6, 24.0], [ [], 24.7, 23.1], [18.2, 19.3, 20.6], [18.6, 19.3, 19.8], [18.5, 20.7, 21.4] ]
The algorithm in the present file is able to handle this missing measurement correctly. The computer will print
separate has average squared error 17.306250000000002 and predicts [18.9, 21.919999999999998, 21.78]
combined has average squared error 22.310456759149954 and predicts 21.00714285714286
Instead of shortage of funds, it is possible that a clerical blunder occurred, and the measurement for one place was done twice, independently, and the measurement for another place was never done at all. We group the two measurements done for the same place in square brackets with a comma between:
[ [20.3, 25.6, 24.0       ],
  [  [], 24.7, 23.1       ],
  [18.2, 19.3, [20.6,20.5]],
  [18.6, 19.3, 19.8       ],
  [18.5, 20.7, 21.4       ] ]
This time the computer prints
separate has average squared error 16.58303 and predicts [18.9, 21.919999999999998, 21.566666666666666]
combined has average squared error 20.683967891665453 and predicts 20.973333333333333
All those extra blanks to line up columns are unnecessary, but they are a good idea.
[ [1, 6, 3, 7, 4], [3, 11, 7, 13, 9] ]
The reader will recall that general had a much smaller error than horizontal. Let us now instead click the “Blocks and treatments” button. The computer says
separate has average squared error 106 and predicts [2, 8.5, 5, 10, 6.5]
combined has average squared error 137.8 and predicts 6.4
That is, separate has a smaller error than combined. The answers for least squares and for blocks and treatments seem to agree. In the second example for least squares, namely
[ [ 1, 2, 3, 4, 5, 6], [12, 13, 15, 14, 13, 11] ]
the horizontal line had a smaller error than the general line. Clicking the “Blocks and treatments” button gives
separate has average squared error 575 and predicts [6.5, 7.5, 9, 9, 9, 8.5]
combined has average squared error 555.25 and predicts 8.25
so that combined looks better than separate. Again the two different methods of inference seem to agree. Recall that least squares leaves out one (x,y) point but blocks and treatments leaves out one block, so they really are different inferences. Will they always agree? No. Here is an example:
[ [ 1, 2, 3, 4, 5, 6, 7, 9, 8], [ 9, 8, 7, 6, 5, 4, 3, 2, 1] ]
Clicking on “Blocks and treatments” will give
separate has average squared error 238 and predicts [5, 5, 5, 5, 5, 5, 5, 5.5, 4.5]
combined has average squared error 60 and predicts 5
but clicking on “Least squares” will give
general has average squared error 0.4766074015691956 and prediction formula y=-0.9833333333333334*(x-5)+5
horizontal has average squared error 8.4375 and prediction formula y=5
origin has average squared error 23.778275996878346 and prediction formula y=0.5824561403508772*x
zero has average squared error 31.666666666666668 and prediction formula y=0
so the two inferences, least squares on the one hand and blocks and treatments on the other hand, do not agree with each other at all. The reason seems to be that least squares is a two-sided inference for this kind of problem, but blocks and treatments is, for this kind of problem, one-sided.
[17,69,51,81,40,66,55,63,95,58,8,61,46,34,68,3,42,11,8,79,17,46,12,1,13,49,55,5,99,75,13,26,67,50,17,15,35,68,73,79,20,82,18,35,66,4,44,15,99,91,94,82,52,43,14,31,53,61,10,17,88,83,21,57,78,72,70,28,61,57,13,86,52,21,60,31,89,66,41,35,92,73,99,62,38,12,47,19,75,65,92,70,85,69,57,83,97,82,9,68]
The reader is respectfully invited to select and copy the table, to move to the upper text area, to clear if need be, to paste, and then to click on the “Interval estimation” button. The computer will print
Sample mean is predicting worse than simples which are between 47.003823656280076 and 55.07617634371993
This interval is much narrower than the usual frequentist confidence intervals or Bayesian intervals would be, because its meaning is much different.
var x=[]; for(var j=0;j<10;j++)x[j]=j; x;
The user is respectfully invited.
at dot
smtw2gh toadmail com
Harold Kaplan’s statistics.htm