Saturday, January 02, 2010

New Year's Puzzler

Ethon, at semester break, flew back from his aerie in the Front Range with a neat puzzler. Roger is all upset because the evidence for increased weather and climate change related damage keeps accumulating. The boy would make a great WW I general. The damage is starting to emerge from the noise and he is still stuck in the past.

The political scientist formerly known as Prometheus hit the International Disaster Data Base and brought the purple graph, showing how the number of floods in West Africa has increased a lot in the past three decades. P thought that the effect was caused by under-reporting in the past and therefore Munich Re was talking trash. Not unreasonable, thought Eli, although Ethon was looking under the table for cards that had been dropped on the floor. To satisfy the skeptical bird the bunny went to the data base and thought, hmm, under-reporting might not be such a bad problem in Western Europe, so he ran the figures. However, as Stephen Leacock would say, this has nothing to do with our bright and sunny puzzler** but is merely blog filler.

This week's puzzler comes from Pieter Vermeesch at UC London. The data comes from the US Geological Survey and shows the number of earthquakes in the ten year period starting in 1999 whose magnitude exceeded 4 on each day of the week.

The average bunny would tell you that the day of the week has nothing to do with the frequency of non-domestic earthquakes. True, Eli knows from experience that propinquity makes for large weekend blow ups with Ms. Rabett. Still Mother Earth IS NOT THAT KIND OF LADY.

That's gonna be the null hypothesis anyhow, and there are six degrees of freedom.

Rabett Labs hitched up the IBM computators bought cheap from the Manhattan Project, shanghaied recruited a bunch of young volunteers, and found that for this case Pearson's chi-square statistic is 94, which means that the probability of the null hypothesis being true is 4.5 x 10^-18 or about as likely as Ethon going vegan. (OK this is a blog, Eli exaggerates. Make something of it.)
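For readers who want to check the quoted number, the upper-tail probability of a chi-square statistic has a simple closed form when the degrees of freedom are even. The sketch below is only an illustration, not the code Rabett Labs ran; the function name `chi2_sf_even_df` is made up here. Strictly speaking it gives the probability of a statistic at least this large if the null hypothesis holds.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with even df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k      # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Statistic and degrees of freedom as quoted in the post.
p_value = chi2_sf_even_df(94.0, 6)
print(f"{p_value:.2e}")
```

With a statistic of 94 and six degrees of freedom this lands close to the 4.5 x 10^-18 quoted above.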

Eli will provide the link with the solution in a day or so. The question is why is the result wrong. It ain't the math.

**_IV. -- Gertrude the Governess: or, Simple Seventeen_

_Synopsis of Previous Chapters:_
_There are no Previous Chapters._

IT was a wild and stormy night on the West Coast of Scotland. This, however, is immaterial to the present story, as the scene is not laid in the West of Scotland. For the matter of that the weather was just as bad on the East Coast of Ireland.

But the scene of this narrative is laid in the South of England and takes place in and around Kmotacentinum Towers (pronounced as if written Monckton Taws), the seat of Lord Kmotacent (pronounced as if written Monkton) and his faithful servant Escrushium (pronounced as if written Scrotum).

But it is not necessary to pronounce either of these names in reading them.

(Thanks to Project Gutenberg for the stories online.)


silence said...

You gave us too much info. It's trivial to find Pieter's explanation.

Anonymous said...

Because NHST should be labelled Statistical Hypothesis Inference Testing?

Cramer's V = sqrt(chi^2/(df*N))

So effect size is negligible (ca. 0.01). Statistically but not practically significant. H0 are almost always false, and a very large sample will readily show significance for mere noise.
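A quick way to see the "ca. 0.01" is to plug in the numbers. This is a minimal sketch that uses the formula as written in this comment (chi-square divided by df times N); N = 118,425 is the event total quoted further down the thread, and the function name is just for illustration.

```python
import math

def cramers_v(chi2, n, df):
    """Effect size as used in this thread: sqrt(chi^2 / (df * N))."""
    return math.sqrt(chi2 / (df * n))

# chi-square = 94 with 6 degrees of freedom, N = 118,425 events.
v = cramers_v(94.0, 118425, 6)
print(round(v, 4))
```

The value is a little over 0.01, which is why the effect can be statistically detectable and still practically negligible.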

EliRabett said...

Yeah, well, no one reads the comments. . . .:). The point is to make a point to the statistical fascists. Probably tomorrow, besides, credit where credit is due.

Arthur said...

Uh, my background in statistics is extremely weak (theoretical physicist and all) and I haven't tried to google the explanation yet - but isn't there at least a minor issue here with correlation? I.e. when you get a large earthquake, there tend to be many aftershocks which could be over magnitude 4? So the actual uncorrelated sample size is at least somewhat smaller. I don't think that could possibly explain 94 sigma's though... Ok, I don't have the answer on this one, I'm sure of that...

Arthur said...

Ok, now I googled Pieter's explanation, and I'm still not sure I understand the point. It is that there *does* need to be an explanation, but not a geological one? But that still makes it an interesting test of statistical significance then - something's going on. Ok, yes, it's a relatively small effect, but still, that's 8.3% more (reported) earthquakes on Sundays than Fridays, that's not nothing...

NonyMouse 8.48 said...

The problem here is that when we should be theory-driven in our experiments, research will often uncover effects of pure sample noise (especially with big samples).

Also what I'd hope to see is a clear specific hypothesis prior to testing. 'You really think days of the week mediates a physical/geological effect on earthquakes, how and on which days?'. As noted, the null hypothesis is almost always false, random sample variation itself ensures this. Blind statistics sucks.

So even with statistical significance, the question should be signal or noise? If one wants to claim signal (i.e., real physical effect) then by what mechanism? Secondly, replication would be pretty essential (do we find the same effect 1989-1999? Are more earthquakes again found on sundays or fridays?). Even with replication, would it be more likely that the effect is a data reporting issue (i.e., data bias)?

NHST is a shitty hodge-podge method born of the Fisher/Neyman-Pearson clashes. Another related issue is familywise errors - using multiple tests simultaneously inflates type I errors (false positives). A great example in my own area is that of the fMRI methodology where such issues are still pervasive...

I see that Tamino is seeing strengths in Bayesian approaches.

"It is foolish to ask 'are the effects of A and B different?" They are always different - for some decimal place" Tukey, 1991

tamino said...

I don't think you need 10 years' earthquake counts to get this result -- I believe I got a "significant" (but not AS significant) result using just 2009 data.

The hypothesis that "day of the week has nothing to do with earthquake occurrence" seems to me to be the right hypothesis. But applying Pearson's chi-square assumes a lot more than that! It requires that the probability of an earthquake is unaffected by the occurrence of recent prior earthquakes. In that case, we'd expect the daily counts to follow a Poisson distribution. But it's easy to see, by looking at daily counts, that they don't. Not even close.

So the result from Pearson's chi-square is correct, and we can conclude that EITHER earthquake occurrence is related to day of the week, OR earthquake occurrences are correlated, or both. I'm going with correlation.
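The clustering argument can be illustrated with a toy simulation. Everything in the sketch below is an assumption made for illustration: the background rate of 30 events per day, the 30 burst days of 150 extra events each, and the choice to put each burst on a single day. It is not tamino's calculation and it uses no real catalogue data. It compares the variance-to-mean ratio of independent Poisson counts with that of bursty counts, and prints the day-of-week chi-square for each.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def simulate_daily_counts(n_days, lam, n_bursts, burst_size, seed):
    """Independent Poisson background plus a few burst days (made-up numbers)."""
    rng = random.Random(seed)
    counts = [poisson_draw(rng, lam) for _ in range(n_days)]
    for day in rng.sample(range(n_days), n_bursts):
        counts[day] += burst_size
    return counts

def dispersion_index(counts):
    """Sample variance divided by mean; close to 1 for Poisson counts."""
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return variance / mean

def weekday_chi_square(counts):
    """Pearson chi-square of totals by day index modulo 7 against an even split."""
    totals = [0] * 7
    for day, count in enumerate(counts):
        totals[day % 7] += count
    expected = sum(totals) / 7
    return sum((t - expected) ** 2 / expected for t in totals)

N_DAYS = 3654  # 522 whole weeks, so every weekday gets the same number of days
independent = simulate_daily_counts(N_DAYS, 30.0, 0, 0, seed=1)
clustered = simulate_daily_counts(N_DAYS, 30.0, 30, 150, seed=1)

for name, series in (("independent", independent), ("clustered", clustered)):
    print(name, round(dispersion_index(series), 2), round(weekday_chi_square(series), 1))
```

In runs like this the independent series has a dispersion index near 1 while the bursty one is several times larger, and the day-of-week chi-square for the bursty series tends to be inflated even though no weekday is special by construction.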

Holly Stick said...

The Viscount drained a dipper of whiskey neat and became again the perfect English upperclass twit.

Harrywr2 said...

As a 'layman'

I would posit

Given a brief perusal of the status of the USGS Global Monitoring network, not all stations are currently reporting.

I would then hypothesize that 'technical faults' which result in under reporting are a common random occurrence.

'Time to fix' technical glitches has a 'day of week' bias. There are numerous studies in productivity that conclude a 'day of week' bias. (Never buy a car built on Monday etc)

Hence there is a day of week bias in 'under reported' events.


Martin Vermeer said...

Four theories up till now... three -- the church bell theory, working week noise drowning out quakes, and time-to-fix bias -- all are related to workdays vs. weekend. Which is not what I see in the deviations from the average at all.

Tamino's theory seems most plausible: it would produce "random" deviations from the average, which seems to be what we see. And we know that seismic events cluster.

It should be readily testable using the data.

Anonymous said...

Oops we have two nonny mices.

An explanation I haven't seen. The world is round, in your part of the world it is still yesterday. Could the weekend affect where the quake is reported from and therefore the day? Or are the reports GMT adjusted?
Nonny Mouse (without a google account)

David B. Benson said...

Try it again with just magnitude 6 and up. That should eliminate almost all of the aftershocks.

David B. Benson said...

But not all of them:
Solomon Islands Earthquake: Two Quakes Hits South Pacific

Hank Roberts said...

I'm only making myself more confused, so let me see if I can share that confusion with others.

Annales Geophysicae (2002) 20: 1137–1142
Does the magnetosphere behave differently on weekends?

Midweek increase in US summer rain and storm heights
TL Bell, D Rosenfeld, KM Kim, JM Yoo, MI Lee, M … - J. Geophys. Res, 2008 -
... WASHINGTON – Rainfall data recorded from space show that summertime storms
in the southeastern United States shed more rainfall midweek than on weekends.

I can't imagine why the AI chose as its verification word "chiligno" -- no, I refuse to imagine any connection.

Arun said...

Those 118,425 earthquakes of magnitude 4 or greater occur on 3654 days. The minimum events on a day is 8. The average number of events is 32.4. The maximum events on a day is 306. There are about 30 "high-event days" - days with more than 100 events. If we divide up the days into regular and high-event days, I expect that there are enough regular event days to approximate - for these days - rather closely a uniform distribution of earthquakes by day of week. However the few high-event days can introduce large deviations of uniformity when folded in.

The problem is we have too small a sample of high event days.
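To put a number on how un-Poisson those high-event days are, here is a small sketch using only the summary figures quoted in this comment (mean 32.4, maximum 306, 3654 days). The Poisson model and the helper name `poisson_tail` are assumptions for illustration.

```python
import math

def poisson_tail(threshold, lam, terms=2000):
    """P(X > threshold) for X ~ Poisson(lam), summing the pmf term by term."""
    total = 0.0
    for k in range(threshold + 1, threshold + 1 + terms):
        # log pmf = -lam + k*log(lam) - log(k!)
        total += math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    return total

mean_per_day = 32.4
n_days = 3654

# Under a Poisson model the standard deviation is sqrt(mean).
sd = math.sqrt(mean_per_day)
z_max = (306 - mean_per_day) / sd

# Expected number of days with more than 100 events if days were Poisson.
tail = poisson_tail(100, mean_per_day)
expected_high_days = n_days * tail

print(round(sd, 2), round(z_max, 1), expected_high_days)
```

Under that model a 306-event day sits dozens of standard deviations out, and the expected number of days above 100 events is effectively zero, against the roughly 30 counted above.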

Arun said...

Re: the first comment by silence -- Pieter's explanation is not necessarily correct.

Re: the first anonymouse, the law of large numbers had better hold. Noise loses statistical significance with large samples, or the whole enterprise is screwed.

Re: Eli, yes, we read the comments.

Arun said...

Since opinion surveys with about 2000 responders are said to have errors of 3%, we expect a sample of 200,000 - hundred times - to have one-tenth (square root of 1/100) of the error, or 0.3%. The deviation from the mean events per day of week of the earthquake sample should be around 50 (17000 * .003). Instead it is as large as 800 or so; which may mean that some responders have many, many votes and are stuffing the ballot boxes with them.
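The same back-of-envelope check can be done with a binomial model instead of the polling analogy. This is a sketch under the assumption that every event independently falls on each weekday with probability 1/7; the Sunday count of 17,752 is taken from the table posted later in the thread.

```python
import math

n_events = 118425           # total quoted in the thread
p_day = 1.0 / 7.0           # null: each weekday equally likely

expected = n_events * p_day
# Binomial standard deviation of a single weekday's count.
sd = math.sqrt(n_events * p_day * (1.0 - p_day))

sunday = 17752              # Sunday total for 1999-2008 from the later table
z = (sunday - expected) / sd

print(round(expected, 1), round(sd, 1), round(z, 1))
```

The binomial scatter comes out somewhat larger than the rough figure of 50, at around 120 events, but the Sunday excess of about 800 is still several standard deviations out, so the ballot-stuffing picture holds either way.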

nony mouse 8.48 said...

However, Arun, the very notion of a null hypothesis meaning 'no difference' or 'no relationship' is ridiculous (i.e., a nil hypothesis). Once we accept that the null is almost always false, then finding statistical significance is only a case of sufficient power (i.e., large enough sample). NHST via p-values is inherently influenced by sample size. This issue has been long discussed in psychology (e.g., Jacob Cohen).

I guess the point I would make is what happens when signal is 0 (or approaches 0) and noise is prominent? Of course, it depends on defining what the signal is. What are we looking for here, we might take the null as important from a statistical point of view, but scientifically what are we looking for here? The alternate hypothesis would just be 'there is a difference in earthquake measures across days of the week'. Again, a trivial matter once we view the null as a nil hypothesis. I'd rather see science based on a real scientific hypothesis (than statistical). Do we think that earthquakes fall more so on certain days due to some geological process? Or is it a simple reporting issue? If, as Tamino suggests, it's a result of non-independence of the data (i.e., clustering) wouldn't this still mean there is still no real underlying relationship between 24 hour periods in some arbitrary human-defined concept such as 7 days a week (is there a logical reason for sundays and thursdays to be prominent? In another 10 year period would it now be mondays and wednesdays?). If we use the Javanese 5 day week calendar then what?

It really doesn't surprise me that there is mere statistical significance for such a massive sample. I suppose my next question to a researcher with such data would be so what? Show me the exact same outcome for 1989-1999 and I'll be mildly impressed (p <.05).

Anonymous said...

Steve L says:
You can't look at data, observe what looks like something interesting, and then decide to do a statistical test like this on it. I mean, there are a posteriori tests, but this ain't one of 'em.

By looking at a bunch of data and testing many hypotheses visually, then selecting the data that seem to show something for a more formal test, you have inflated your Type I error. You should correct for multiple tests. For this reason I agree with the last paragraph of nony mouse 8.48.

And of course the non-independence of events has to be taken into account....

Arthur said...

The multiple tests issue is a real one, but when you get 4.5x10^-18 probabilities, that means something else is going on (you are not doing 2x10^17 tests looking for one that is unusual in this case)!

It sounds like we're actually converging on correlation as the problem - non-independence of the events (what I raised tentatively in my first comment here). But I'm interested in what Eli thought Pieter's conclusion was, if not that?

Pieter said...

If you tally the earthquakes by hour instead of by day, the correlation problem effectively disappears, but the chi-square value is 134.8 which, with 23 degrees of freedom, still corresponds to a p-value of 5e-18.

Nony mouse 8:48 said...

Hey Arthur,

familywise errors are a problem, obviously not in this example. But the same problems apply. For example, in familywise errors we know that we are more likely to reject the null for multiple tests.

For NHST we are more likely to reject the null for a larger sample. Thus, rejecting the null in NHST p-value testing tells us that our research was powerful enough to detect a degree of difference/relationship in the sample (which was almost certainly true anyway, especially so for quasi-experiments).

Now, yeah, larger samples are generally better for robust findings. With a smaller sample we would expect to find more false positives. Equally, the smaller the effect size the more likely we find false positives.

What Pieter has done is effectively shown the inane approach that Gerd Gigerenzer has called 'mindless statistics': 1. set null as nil, make no prior alternate (form post-hoc); 2. use 5% convention; 3. always use this procedure. This is more common than we might like to think. And with large samples we will more likely show significance...

So the problem is not with the statistics (they do what they say on the tin), but in how they are generally used and interpreted.

Arthur said...

Pieter - why should the correlation problem disappear if you tally by hour? By second of the hour, maybe, but "aftershock" repetitions can be much closer than 1 hour apart.

And I really am not following the line of reasoning here. The "null hypothesis" is essentially no causative relation between one variable and the other - they should not be correlated. So if the null hypothesis is false (as these statistical tests prove - *if* the events were really independent and uncorrelated) then the statistics is showing there *is* some relationship. Not what that relationship is, but that still makes it an interesting scientific result. Perhaps not "important", if the size of the effect is small, but unexpected relationships between different things are the essence of good science, and I'm not sure why you seem to be dismissing them...??

nony mouse 8:48 said...

I can see what you're saying. But for most situations (you could argue that not so much for true experiments with excellent randomisation) the null is almost always false - be it the effect of interest or the myriad of non-interesting effects we might call noise (the 'crud factor'). Which Pieter does sort of point out.

So where does that leave us? I would assume in most experiments what we are really interested in is some general hypothesis (e.g., does tobacco cause cancer?). But the null, if viewed as nil, is almost always false. So the nil, as per Popper, would be 'there is no causal relationship between tobacco use and cancer'. But we can actually be pretty certain that the statistical null is already false - be it noise or the effect of interest, and that we only need sufficient power to tease it out. And so we might as well reject the nil null before testing anyway, lol. Not a very informative approach.

Another issue is that NHST doesn't really test the veracity of the null: 'probability of null given data'. It tests: 'probability of data (or more extreme) given null'. An important difference. So rather than the null is 'proven' true/false given the data, we can safely say the data is unlikely/likely given the null is true.

Even Fisher retreated from some of the awful interpretations (somewhat due to his lack of clarity) of his NHST approach.

Arthur said...

Hmm, I think what you're getting at is that falsifying the null hypothesis (or finding that the data is of very low likelihood given the null, which seems to me almost the same thing) is a useful thing to do only if your experimental dataset is of sufficient quality to suppress or control for all those "noise" effects. Which is really just the same issue scientists regularly have to deal with in analysis, of ensuring that they have sufficient control of the experimental system, so that the variables they are changing are the ones they care about, not irrelevant things.

Is that what you're getting at? I'm definitely uncomfortable with leaving unexplained correlations lying around, but if it's just a matter of acknowledging that the data gathering/analysis has a limited level of quality after which you're not getting any additional information, I guess that makes sense... Not sure this earthquake case is really a good example of that though!

nony mouse 8:48 said...

Yeah, my lack of clarity competes with Fisher's at times. After a little thought I think this might be the best way to highlight the problem as I see it.

Popper essentially picks out that we can't really positively 'prove' theories and hypotheses - only confirm and support (data consistent with etc). But we actually test this through the use of null hypotheses. This sort of roughly fits with the Popperian notion of falsifying scientific hypotheses and theories. The actual experimental hypotheses thus become the 'alternate', which we generally take to be the converse of the null. By rejecting the null we sort of use this as merely an indirect confirmation of the alternate - it survived falsification! And we add a notch to its bedpost.

But when we have a null hypothesis as the nil effect/relationship without defining an a priori alternate - whence falsification? We haven't even defined a research hypothesis.

Moreover, if we accept that in many styles of experiment (especially the messy quasiexperiments) the nil style of null is almost always false, and if we don't even define an a priori risky alternate...


nony mouse 8:48 said...

So take the earthquake example: the null is no relationship between days of the week and freq. of earthquakes. Without defining a specific alternate, we can take it as being 'there is a difference/relationship'. But we can very likely see the nil/null as a pure strawman we already know is probably false (we just need a big sample to find any old negligible effect).

If the veracity of a theory/hypothesis is a consequence of subjecting it to risky testing, then the 'mindless stats' approach is useless for testing hypotheses and theories (maybe good for career). Alternate hypotheses must be defined prior to testing and should be put at risk by being specific and clearly falsifiable.

By just taking a set of data, looking for no difference/relationship as the null without alternate, we find something much less than impressive or compelling. As a pure exploratory approach - perhaps. But not for theory-driven research.

So, now, we have a p-value which is very small and a minimal effect. We determine multiple possible hypotheses to explain the data (like we have here). We need to test again with a new set of data. We can't verify the hypotheses post-hoc with this data. Perhaps some specific testable/falsifiable predictions which people might like to support (all with associated nulls, nil if you want) would be:

1. If the finding is due to weekend data reporting issues (or church bell etc?) we'll find the exact same outcome 1989-1999 (low p, small phi, same days).

2. If the finding is the result of non-independence of the DV then we'll find difference but not exact outcome (low p, small phi, different days).

Would that be more impressive than just testing the same null for a different set of data without specifying the alternate? Otherwise we're just dealing a set of thousands of cards and depending on overfitting and hindsight bias. No risk or falsification, just post-hoc story-telling. It's almost like the creationists who get a thrill from doing post-hoc probability.

[cont. again!!]..

nony mouse 8:48 said...

In Popper-style, the riskier the test of falsification the stronger we can view the corroboration of the hypothesis/theory. The more tests of falsification, the stronger we can view the corroboration (it takes on a level of truthfulness/verisimilitude).

But the problem is when we test with a vacuous alternate with the same approach. If sample size increases the probability of finding an effect/relationship - including any old noise, then how robust is the test of falsification? If I test for a relationship between smoking and cancer using Chi-square with 150,000 participants, how sure can we really be that merely rejecting the null provides a degree of corroboration for the alternate? We know that there is a lot of noise in such messy quasiexperiments, the null is probably false anyway, and the massive sample size will only increase the likelihood of teasing out background noise as statistically significant.

I think that's the sort of thing that Pieter was trying to get across, but the example wasn't the best. There's been a couple of studies on this in psychology (e.g., Waller, 2004), and it appears that for such large sample quasiexperiments the probability of rejecting the nil null is ca. .5 no matter what the DV.

That's a problem. In later work, Fisher did state that the null need not be nil.

And I wouldn't say we would leave such findings unexplained, just that if we even use such methods with a proper alternate hypothesis the background noise is a problem if we find a small effect. Not really that impressive or informative. It is likely a real effect of something, but probably a baffling one and not one worthy of a postgrad student.

I used to be a chem labrat in a past life, so it was a surprise to move from standardised assay testing/HPLC etc with correlations of .999999999999 to depending on making inferences from messy samples with significant correlations of .2, lol. The inferences need to be much more tentative and replication is essential. So psychology has been moving to other approaches as additive to NHST (some even want to do away with it completely).

As an aside, NHST in the firm reject/accept decision-style is the sort of method where we would be compelled to reject the null for p = .049, r (effect size) = .05; but accept the null for p = .051 r = .6. As a method, it sort of sucks. Sometimes it's extremely hard to get a massive sample of people with lesions in a particular area of the brain etc leading to non-significance but big effect.

Hope that's clear enough. Cheers.

[that's all folks!]

Arthur said...

Thanks, that was a good explanation - I don't think theory-driven research is the be-all and end-all of science, so was a little surprised at this which seems a good example of exploration and finding something interesting (but not knowing what). But you're absolutely right, anything more requires a specific hypothesis and further tests, there are obviously many hypotheses that may agree with what we know so far about the problem, but have differing predictions about how other data (or other partitions of the same data) would behave. So I think we agree in the end...

Thanks for the instructive commentary!

nony mouse 8:48 said...

No worries! I just tend to fall into stream of consciousness babbling on the interdweebs.

But, yeah, we need to get our hypotheses from somewhere. So exploratory approaches, using simple anecdotal evidence/observation, prior paradoxical findings, pilot data, and even simple intuition are important enough for directing our attention. But data mining and ad-hoc/post-hoc approaches aren't helpful (although tempting in the publish or perish age).

Of course, Pieter wasn't making the exploratory point, lol. Best intentions blah blah.

nony mouse 8:48 said...

OK, after some munchies I'll try (lol) to show the issue with the difference between p(data|H0) vs p(H0|data)...

Now, the point is that they are not the same (after Cohen, 1994):

The incidence of schizophrenia is 2%, a test can determine schizophrenia with 95% accuracy and 97% for determining 'normality'.

H0 = case 'normal'
H1 = case of schizophrenia

p(test says normal|H0) = ca. 97% (3% false positive)
p(test says schizophrenia|H1) = 95%

p(data|H0): thus the probability of a positive test given being 'normal' is 3% (false positive; p < .05). We would reject the null and diagnose schizophrenia for a positive test.

p(H0|data): this actually requires a Bayesian approach using the incidence rate of 2% [P(H0) = .98]:

P(H0|data) = [(.98)*(.03)]/[(.98)*(.03) + (.02)*(.95)]

P(H0|data) = .607

Thus, the probability of being normal given a positive test is about .6.

P(data|H0) = .03

Here, the probability of a positive test given that a case is normal is .03.

A bit different (but better viewed as a 2x2 table). We can use Bayesian priors for assessing the probability of the null, but not so much p values.
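The arithmetic of this example is easy to check with a few lines of Bayes' rule. This is a minimal sketch using the figures stated above (2% incidence, 95% accuracy for schizophrenia, 97% for normality, hence a 3% false positive rate); the function and argument names are made up for illustration.

```python
def posterior_null(prior_null, p_pos_given_null, p_pos_given_alt):
    """P(H0 | positive test) via Bayes' rule for a two-hypothesis problem."""
    numerator = prior_null * p_pos_given_null
    denominator = numerator + (1.0 - prior_null) * p_pos_given_alt
    return numerator / denominator

# 98% of cases are 'normal'; the test flags 3% of them and 95% of true cases.
p_h0_given_positive = posterior_null(0.98, 0.03, 0.95)
print(round(p_h0_given_positive, 3))
```

So a result that looks "significant" as p(data|H0) still leaves the null more likely than not once the 2% base rate is taken into account.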

Arun said...

Am very puzzled.

Arun said...

Updated the stuff at the link

Computed an autocorrelation and it shows no signs of going to zero.

What is the physical source of these long-term correlations?

Arun said...

My guess is that most earthquakes are a byproduct of the movement of continental plates, which are long-term motions spanning millions of years. While randomized because of the non-uniform response of the crust to stress, this is otherwise not different from if we put down markers and observed the flow of a very viscous liquid as it passes the markers. That is the source of long-term correlation.

Arun said...

Whoa, I took data from 1980-2009 inclusive, and this is weird! Of course, Tuesday and Friday exchange their places.

# of quakes 1980-2009
Average 35555
Sun 36829
Thu 36101
Sat 35918
Mon 35423
Fri 35174
Wed 34774
Tue 34668

# of quakes 1999-2008 (Pieter Vermeesch set)
Average 16916
Sun 17752
Thu 17401
Sat 17019
Mon 16851
Tue 16552
Wed 16490
Fri 16349
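For anyone who wants to rerun the test on these totals, a minimal sketch follows. It assumes a uniform null (each weekday expects one seventh of the events) and uses the effect-size formula quoted earlier in the thread, sqrt(chi^2/(df*N)); the function names are illustrative only.

```python
import math

counts_1980_2009 = {"Sun": 36829, "Thu": 36101, "Sat": 35918, "Mon": 35423,
                    "Fri": 35174, "Wed": 34774, "Tue": 34668}
counts_1999_2008 = {"Sun": 17752, "Thu": 17401, "Sat": 17019, "Mon": 16851,
                    "Tue": 16552, "Wed": 16490, "Fri": 16349}

def pearson_chi_square(counts):
    """Pearson chi-square against an equal expected count in every category."""
    observed = list(counts.values())
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def cramers_v(chi2, n, df):
    """Effect size as used in this thread: sqrt(chi^2 / (df * N))."""
    return math.sqrt(chi2 / (df * n))

for label, counts in (("1980-2009", counts_1980_2009),
                      ("1999-2008", counts_1999_2008)):
    chi2 = pearson_chi_square(counts)
    n = sum(counts.values())
    df = len(counts) - 1
    print(label, round(chi2, 1), round(cramers_v(chi2, n, df), 4))
```

The 1999-2008 totals give a statistic just under 94, in line with the figure in the post, and the 1980-2009 totals give about 101.6, with an effect size around 0.01 in both cases.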

Arun said...

To keep you from having to do the arithmetic:


David B. Benson said...

I think Arun has it: long term autocorrelation.

nony mouse 8:48 said...

For the 1980-2009 data:

chi-square (6) = 101.596, p < .001, Cramer's V = .008

Again, minimal effect with sundays as most frequent day. It will likely be the temporal correlation, but perhaps the sunday bell ringing type thing has some validity, lol. As the same is in the 1980-1998 data.

Chi(6) = 43.726, p <.001, Cramer's V = .008

Sunday seems to pop through as the highest frequency category each set (overall, 1999-2009. Wyrd. The fact that two different data sets show the same outcome with sunday as most frequent day...

Would be nice to break it down 1980-1989, 1990-1999, 2000-2009 to see if sundays pop through in each. If so, I can't readily see why the autocorrelation would cause that.

nony mouse said...

Should be "(overall, 1999-2009, and 1980-1998)"...

Arun said...

nony mouse,
I binned the 1999-2008 data (Pieter Vermeesch set) for a culture that had nine-day weeks. (Jan 1, 1990 is day of week A.)

If Sunday is indeed special rather than a chance artifact, then the above should have a very different chi-square, Cramer's V, etc.

On the other hand, if long-term correlation simply means that a chance large fluctuation is slow to dissipate, then the above should be equally weird as the 7-day week.
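The re-binning itself is only a few lines. The sketch below assumes daily counts keyed by calendar date and uses the anchor stated above (Jan 1, 1990 is day A); the function names and the tiny sample dictionary are made up for illustration.

```python
from datetime import date

def week_label(day, week_length=9, anchor=date(1990, 1, 1)):
    """Letter for a date in a calendar with week_length-day weeks (anchor = 'A')."""
    index = (day - anchor).days % week_length
    return chr(ord("A") + index)

def bin_by_label(daily_counts, week_length=9, anchor=date(1990, 1, 1)):
    """Sum a {date: count} mapping into totals per made-up weekday letter."""
    totals = {}
    for day, count in daily_counts.items():
        label = week_label(day, week_length, anchor)
        totals[label] = totals.get(label, 0) + count
    return totals

# Hypothetical daily counts, only to show the call.
sample = {date(1990, 1, 1): 5, date(1990, 1, 2): 1, date(1990, 1, 10): 7}
print(bin_by_label(sample))
```

Changing `week_length` to 5 or 7 gives the other calendars mentioned in the thread, so the same totals can be fed to any chi-square routine for comparison.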


nony mouse 8:48 said...

Yup, it should, the 7-day week is pretty arbitrary.

Chi(8) = 29.687, p <.001, Cramer's V = .006

Also slightly reduces the effect size (perhaps the longer you stretch out the time categories the weaker the effect would become). Quite possible that reducing the categories would eliminate the effect altogether (seen a similar outcome with a dice example for large sample chi).

Good job, cheers.

David B. Benson said...

Maybe a (linear) trend to remove first?

David B. Benson said...

Eli stated "in a few days".

When do we see it?

Bob, Just Bob said...

I downloaded the data, then tried to sort it into "earthquake events". To do so (okay, these numbers are a stretch) I counted as an "aftershock" any quake that occurred within 20 degrees lat/long of the initial quake, and within 30 days of the initial quake. Doing so, the number of quakes becomes only 7,590, and chi^2 goes to 9.031, with a p-value of .250, which looks pretty good to me.

Now, I don't know much about earthquakes, so I don't know if bunching things by 30 days and 20 degrees is fair. 20 degrees is 1,400 miles, which seems a bit far.
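For concreteness, here is one way the grouping described above could be coded. It is a sketch under stated assumptions, not the spreadsheet logic actually used: "within 20 degrees lat/long" is read as both the latitude and the longitude difference being at most 20 degrees (with longitude wrapping at 180), events are (day, lat, lon) tuples, and the function name `decluster` is made up.

```python
def decluster(events, max_deg=20.0, max_days=30.0):
    """Split (day, lat, lon) events into initial quakes and aftershocks.

    An event counts as an aftershock if it falls within max_days after,
    and within max_deg of latitude and longitude of, an earlier initial quake.
    """
    initial, aftershocks = [], []
    for day, lat, lon in sorted(events):
        for m_day, m_lat, m_lon in initial:
            dlon = abs(lon - m_lon) % 360.0
            dlon = min(dlon, 360.0 - dlon)   # shortest way around the globe
            if (0 <= day - m_day <= max_days
                    and abs(lat - m_lat) <= max_deg
                    and dlon <= max_deg):
                aftershocks.append((day, lat, lon))
                break
        else:
            # No earlier initial quake is close enough in time and space.
            initial.append((day, lat, lon))
    return initial, aftershocks

# Hypothetical events, only to show the call.
events = [(0, 10, 100), (1, 12, 105), (2, 0, 179), (3, 5, -175),
          (5, -40, 20), (29, 15, 95), (40, 10, 100), (41, 11, 101)]
initial, aftershocks = decluster(events)
print(len(initial), len(aftershocks))
```

The day-of-week tally would then be run on the initial quakes only. Shrinking `max_deg` or `max_days` is the obvious knob to turn if 20 degrees and 30 days feel too generous.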

It also doesn't explain why aftershocks should group so very badly (chi^2 for aftershock day-of-the-week is 94.439), except they are certainly dependent to some degree on the initial quake, so certainly you wouldn't want to lump initial quakes and aftershocks together (i.e. aftershocks are not entirely independent of initial quakes).

As far as the day of the week thing... the moon takes 27.33 days to orbit the earth. Do tidal forces influence earthquakes, in particular aftershocks, such that even quarters of 28 (7!!!) come into play?