This is the final instalment in a series of posts documenting dubious scholarship and unattributed sources in the background chapter of the touchstone of climate contrarians known as the Wegman Report. That report has been touted as Exhibit A proving the “destruction” of Michael Mann’s “hockey stick” graph by self-styled climate auditor Steve McIntyre.
Previously, I found extensive passages bearing “striking similarity” to a classic text by the distinguished paleoclimatologist (and “hockey stick” co-author) Raymond Bradley in the background sections on tree rings and on ice cores. Subsequently, the background section on social networks was found to contain material apparently drawn without attribution from a variety of sources, including Wikipedia and several textbooks.
This time, I’m looking at section 2.2 (see Wegman Report PDF at p. 15), which gives the background of key statistical concepts, including Principal Component Analysis. Astonishingly, even this section appears to contain a significant amount of unattributed material from other sources, although quite a bit less than the other sections. Again, Wikipedia appears to be a key source, along with a couple of textbooks.
I’ll also introduce some refinements to the text analysis, based largely on John Mashey’s recent innovations. Those refinements allow a better characterization of the relationship between various passages in Wegman et al and their apparent antecedents, as well as permitting a quantitative analysis based on word counts.
Despite my relative success in adducing the antecedents for other background sections, for some time I avoided serious sustained sleuthing on section 2.2, which describes Principal Component Analysis and time series noise models. Surely this background section, at least, was well within the authors’ ambit of expertise and there would be no need to borrow liberally without attribution.
Still, like much of the report, there were no citations at all. So eventually, I did make the attempt, but my initial efforts yielded only one passage of “striking similarity” – the “colors of noise” passage also found in Wikipedia, as discussed in this comment back in April.
Recently, though, inspired by John Mashey’s research into other parts of the report, I tried again, this time searching for smaller blocks of text. In the end, no fewer than nine possible unattributed sources have been identified (see the list at the end of the full textual side-by-side comparison with identified possible antecedents). In general, there were even more slight changes and rearrangements of the various sources than seen in previously analyzed sections, making detection of those possible antecedents more difficult.
In fact, to evaluate just how “strikingly similar” some of the newly discovered passages were, refinement of the analysis techniques became necessary. Borrowing Mashey’s concept of “longest common sub-sequence”, I highlighted exactly identical text in cyan in both source and target text, while taking care to separate the blocks where the text was rearranged or separated by changed text.
Next, trivial changes were highlighted with yellow. These are slight changes of tense, number or voice (e.g. active to passive), as well as substitution of synonyms or similar-sounding words. Finally, changes or additions that introduced issues were underlined; these issues might be a simple error, a change in meaning or even the introduction of distortion or bias.
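For readers who want to experiment with the first step, here is a minimal sketch of word-level longest common subsequence matching. It is not the tool used for the comparisons in this post (those were marked up by hand); the function names `tokenize` and `lcs_blocks` are my own, and the sketch assumes that case and punctuation can be ignored. The sample text is the pair of “colors of noise” sentences discussed below.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def lcs_blocks(source, target):
    """Return the identical word blocks shared by source and target.

    A block is a run of words on the longest common subsequence that is
    contiguous in both texts. Plain LCS keeps word order, so blocks that
    were moved around are NOT detected here (an assumption of this sketch).
    """
    a, b = tokenize(source), tokenize(target)
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover the matched (source, target) positions
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    # Group matches that are adjacent in both texts into blocks
    blocks = []
    prev = None
    for i, j in pairs:
        if prev is not None and i == prev[0] + 1 and j == prev[1] + 1:
            blocks[-1].append(b[j])
        else:
            blocks.append([b[j]])
        prev = (i, j)
    return [" ".join(block) for block in blocks]

wikipedia = ("There are many forms of noise with various frequency "
             "characteristics that are classified by “color”.")
wegman = ("There are many types of noise with varying frequencies "
          "each classified by a color.")

for block in lcs_blocks(wikipedia, wegman):
    print(block)
```

On this pair the sketch finds four identical blocks. Classifying the words in between as trivial changes or as changes that introduce issues still requires human judgement, as does spotting rearranged text.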
Now let’s look at some examples (each of these can be found in the side-by-side comparison or Cmp for short, on the page noted). I’ll start with the above-mentioned “colors of noise” passage.
Here is the first sentence from the Wikipedia article Colors of Noise (from April 12, 2006):
There are many forms of noise with various frequency characteristics that are classified by “color”.
The corresponding sentence in Wegman et al (at p. 15, Cmp p. 2) is:
There are many types of noise with varying frequencies each classified by a color.
The errors introduced by the changes are perhaps not serious, but they do bespeak a possible lack of understanding by the responsible author. (“Varying frequencies” implies that each type of noise would have a single dominant distinguishing frequency, while the change from color in quotes to “a color” obfuscates the conceptual nature of the classification).
The changes in the next two sentences are mainly removal of text, shown with strikeout (so we’ll show the original Wikipedia version only):
The color names for these different types of sounds are derived from an analogy between the spectrum of frequencies of sound wave present in the sound (as shown in the blue diagrams) and the equivalent spectrum of light wave frequencies. That is, if the sound wave pattern of “blue noise“ were translated into light waves, the resulting light would be blue, and so on
Clearly, the above three sentences in Wegman et al, taken together, bear a striking similarity to the Wikipedia passage. Even the reference to “sounds” has been left as is, instead of the obvious change to a more general term, such as “signal”. All the same, the identical text has been broken up into no fewer than eight separate sub-blocks.
Things get even more interesting in our next example (Wegman p. 17, Cmp p. 4). Here is the passage from Wegman et al. describing “long memory” processes:
Random (or stochastic) processes whose autocorrelation function, decaying as a power law, sums to infinity are known as long range correlations or long range dependent processes. Because the decay is slow, as opposed to exponential decay, these processes are said to have long memory. Applications exhibiting long-range dependence include Ethernet traffic, financial time series, geophysical time series such as variation in temperature, and amplitude and frequency variation in EEG signals.
The apparent (but unattributed) antecedent is from the introduction to Processes with long-range correlations: theory and applications, edited by Govindan Rangarajan and Mingzhou Ding:
Processes with long range correlations (also called long range dependent processes) occur ubiquitously in nature. They are defined as random stochastic processes whose autocorrelation function, decaying as a power law in the lag variable for large lag values, sums to infinity. Because of this slow decay (as opposed to an exponential decay), these processes are also said to have long memory. … A partial list of problems involving long range dependence include: Anomalous diffusion, potential energy fluctuations in small atomic clusters, Ethernet traffic, geophysical time series such as variation in temperature and rainfall records, financial time series, electronic device noises in field effect and bipolar transistors, and amplitude and frequency variation in music, EEG signals etc. …
The similarity is obvious from the sheer amount of highlighted text. But the degree of rearrangement of text is staggering, so much so that the passage may well have been edited several times. This short passage contains no fewer than 18 rearranged separate blocks of identical text, some consisting of only a single word or two. For example, “Because of this slow decay” becomes “Because the decay is slow”, with all three common words – “because”, “decay” and “slow” – rearranged and separated by trivially changed words.
As before, a couple of surprising errors have been introduced. For one thing, the processes discussed are not themselves “long range correlations”; rather they are processes with long range correlations.
The next example (a Wikipedia article on Self-similarity) is included more for amusement than anything else.
A self-similar object is exactly or approximately similar to a part of itself. … Many objects in the real world, such as coastlines, are statistically self-similar: parts of them show the same statistical properties at many scales. Self-similarity is a typical property of fractals.
Wegman et al’s version is very similar, but not quite the same (Wegman et al, p. 17, Cmp p. 5):
An object with self-similarity is exactly or approximately similar to a part of itself. For example, many coastlines in the real world are self-similar since parts of them show the same properties at many scales. Self-similarity is a common property of many fractals …
Our final example actually comes from the beginning of the section (Wegman et al, p. 15; PDF, p. 1).
Principal component analysis tries to reduce the dimensionality of this data set while also trying to explain the variation present as much as possible. To achieve this, the original set of variables is transformed into a new set of variables, called the principal components (PC) that are uncorrelated and arranged in the order of decreasing “explained variance.” It is hoped that the first several PCs explain most of the variation that was present in the many original variables.
Some readers might recognize the similar text from the Introduction of Ian Jolliffe’s classic Principal Component Analysis.
The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated and which are ordered so that the first few retain most of the variation present in all of the original variables. [p. 2] … but it is hoped, in general, that most of the variation in x will be accounted for by m PCs, where m << p.
At first glance, this example might be considered more of a paraphrased definition, and perhaps not as questionable as the other examples. On the other hand, consider the juxtaposition of the following identical key phrases, present in both versions:
- “Principal component analysis”
- “to reduce the dimensionality of”
- “the variation present”
- “a new set of variables”
- “as much as possible”
- “are uncorrelated”
- “it is hoped”
A Google search on this set of phrases returns only a handful of hits, including the Jolliffe text itself, Wegman et al and a smattering of others attributing the passage to Jolliffe. It seems implausible, then, that this passage was not directly inspired by Jolliffe. As such, it definitely should have been attributed and probably the original should have been block-quoted.
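The phrase overlap itself is easy to check mechanically. The following is a small sketch, assuming case-insensitive matching with whitespace normalized; the helper name `shared_phrases` is hypothetical, and the two passages are copied from the quotations above.

```python
KEY_PHRASES = [
    "Principal component analysis",
    "to reduce the dimensionality of",
    "the variation present",
    "a new set of variables",
    "as much as possible",
    "are uncorrelated",
    "it is hoped",
]

WEGMAN = (
    "Principal component analysis tries to reduce the dimensionality of this "
    "data set while also trying to explain the variation present as much as "
    "possible. To achieve this, the original set of variables is transformed "
    "into a new set of variables, called the principal components (PC) that "
    "are uncorrelated and arranged in the order of decreasing “explained "
    "variance.” It is hoped that the first several PCs explain most of the "
    "variation that was present in the many original variables."
)

JOLLIFFE = (
    "The central idea of principal component analysis (PCA) is to reduce the "
    "dimensionality of a data set consisting of a large number of "
    "interrelated variables, while retaining as much as possible of the "
    "variation present in the data set. This is achieved by transforming to "
    "a new set of variables, the principal components (PCs), which are "
    "uncorrelated and which are ordered so that the first few retain most of "
    "the variation present in all of the original variables. [p. 2] … but it "
    "is hoped, in general, that most of the variation in x will be accounted "
    "for by m PCs, where m << p."
)

def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def shared_phrases(phrases, text_a, text_b):
    """Return the phrases that occur verbatim in both texts."""
    a, b = normalize(text_a), normalize(text_b)
    return [p for p in phrases if normalize(p) in a and normalize(p) in b]

found = shared_phrases(KEY_PHRASES, WEGMAN, JOLLIFFE)
print(f"{len(found)} of {len(KEY_PHRASES)} key phrases appear in both passages")
```

A check like this only confirms that the phrases are shared; how rarely they co-occur elsewhere is what the web search addresses.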
Of course, PCA is at the heart of the McIntyre critique of the work of Mann, Bradley and Hughes and therefore of Wegman et al. The short description above refers to the possibility that the first few principal components (PCs) might “account for” or “explain” most of the variation in the original larger data set. This implies that enough PCs must be retained to accomplish this. Normally at least enough PCs to account for most of the original data set’s variance should be retained, and typically other conditions (such as convergence upon retention of successive PCs) would be imposed.
[Update, July 31: In fact, the 2002 edition of Jolliffe's text (which remains the foremost reference on PCA) contains a vastly expanded chapter on the topic of retention of PCs. It describes a number of rules, some "ad hoc" (but plausible and highly useful) and some statistically based. There's even a section on the retention rules used in atmospheric sciences in which Preisendorfer and Mobley's Principal Component Analysis in Meteorology and Oceanography figures prominently. Variations of "Preisendorfer's rule N" are discussed.]
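To make the retention question concrete, here is a rough sketch of the simplest kind of rule: keep PCs until a chosen share of the total variance is accounted for. This is only an illustration of a cumulative-variance threshold, not Preisendorfer's rule N and not the criterion used in any of the papers discussed. The eigenvalues and the 80% threshold are made-up numbers, and the function names are hypothetical; in practice the eigenvalues would come from the covariance matrix of the data.

```python
def explained_variance_ratios(eigenvalues):
    """Share of total variance carried by each PC, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [value / total for value in ordered]

def pcs_to_retain(eigenvalues, threshold):
    """Smallest number of leading PCs whose cumulative share of the
    variance reaches the threshold (a fraction between 0 and 1)."""
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratios(eigenvalues), start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return count
    return len(eigenvalues)

# Hypothetical eigenvalues of a covariance matrix (illustration only)
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

print(explained_variance_ratios(eigenvalues))
print(pcs_to_retain(eigenvalues, threshold=0.8))
```

With these invented values the first PC alone carries half the variance, so a rule that stopped at one PC would discard the other half. That is why the choice of retention criterion matters so much.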
Tellingly, Wegman et al never once discuss this crucial aspect of PCA, even though a thorough examination of the issue of PC retention criteria was a key element in the most extensive peer-reviewed critique of McIntyre and McKitrick’s work, namely that found in Wahl and Ammann’s Robustness of the Mann, Bradley, Hughes reconstruction of Northern Hemisphere surface temperatures (Climatic Change 2007).
But that is a discussion for another time. For now, I’ll conclude by presenting overall metrics that show some interesting contrasts between the various sections of chapter 2. These metrics are based on word counts (WC).
The percentage of “strikingly similar” (SS) is simply based on the word count of passages with identified antecedents, relative to the overall word count. The next two columns show the percentage of combined identical and trivially changed text (ID+TC), and the percentage of identical text alone. The average identical text block length (BL) is calculated by ascertaining the total word count of identical text and dividing by the number of separate blocks of identical (ID) text in the target text. Thus it is an indicator of the amount of change and rearrangement that may have occurred.
Finally, the Src column shows the number of apparent antecedent sources and the Issues column shows the number of issues, both major (in bold) and minor (in normal typeface). As previously noted, these typically involve changes in meaning and errors.
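As a rough sketch of how the word-count metrics fit together, the calculation can be written out as below. The counts in the example are invented purely for illustration and are not taken from the table. The sketch also assumes that the ID+TC and ID percentages are expressed relative to the overall word count, like SS; the names `SectionCounts` and `section_metrics` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SectionCounts:
    total_words: int  # WC: overall word count of the section
    ss_words: int     # words in passages with identified antecedents
    id_words: int     # words of identical text
    tc_words: int     # words of trivially changed text
    id_blocks: int    # number of separate blocks of identical text

def section_metrics(counts):
    """Compute SS, ID+TC and ID percentages and the average block length."""
    def percent(words):
        return 100.0 * words / counts.total_words
    return {
        "SS%": percent(counts.ss_words),
        "ID+TC%": percent(counts.id_words + counts.tc_words),
        "ID%": percent(counts.id_words),
        # BL: identical words divided by number of identical blocks
        "BL": counts.id_words / counts.id_blocks if counts.id_blocks else 0.0,
    }

# Invented counts, for illustration only
example = SectionCounts(total_words=1000, ss_words=400,
                        id_words=240, tc_words=60, id_blocks=40)
for name, value in section_metrics(example).items():
    print(f"{name}: {value:.1f}")
```

A low BL alongside a high ID percentage would point to text that is largely identical but heavily chopped up and rearranged.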
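[Table: word-count metrics for each section of chapter 2, with columns WC, SS, ID+TC, ID, BL, Src and Issues. Row for this post: 2.2 PCA and stats [Cmp]]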
As can be seen, all these metrics have also been generated for the previously analyzed background sections, along with links to the original discussions and updated colour-coded side-by-side comparisons.
The tree ring section is notable for the relatively large amount of added and changed material, both within and outside of the “strikingly similar” passages. This, along with the large number of major issues, could reflect the special attention paid to this section and the apparent introduction of key distortions that undermine the original source, Raymond Bradley’s classic Paleoclimatology: Reconstructing Climates of the Quaternary. (By the way, Chapter 10 is available once more online as a PDF at Keith Briffa’s web page).
In contrast, the section on ice cores and corals introduced many trivial changes, but with little change in meaning.
The social networks section (2.3) clearly has the most “strikingly similar” material. Moreover, the “strikingly similar” passages contain little new material, consisting almost entirely of identical or trivially changed text. The relatively large block size (almost 9 words) would appear to suggest that this section underwent less extensive editing, although numerous trivial changes are scattered throughout.
Finally, the PCA and noise model section discussed above clearly contains the least “strikingly similar” material. But the surprise here is that there is any at all. Not only that, but changes made by Wegman et al have apparently introduced errors. Moreover, the sheer number of apparent sources and the relative brevity of the antecedent passages mean that additional antecedents cannot be ruled out.
Nevertheless, this likely brings to a close my examination of the unattributed sources in the background chapter of the Wegman report, now that I have covered each of the three sections from that perspective. But that does not mean we are done with the report as a whole, or even the background chapter – far from it.
Future posts will cover such topics as Wegman’s “trick to hide the deletion” of Wahl and Ammann’s critique of McIntyre and McKitrick, as well as a discussion of the supposed “peer review” of the Wegman report, which Wegman claimed was “similar” to that of the National Research Council (which produced a competing report from a distinguished team led by Gerald North). I’ll also revisit the tree-ring section, but this time focusing on the serious issues raised by Wegman et al’s changes and omissions, relative to Bradley’s original.
And, very soon now, John Mashey will present his exhaustive investigation of other aspects of “strange” scholarship in the Wegman Report, including a jaw-dropping analysis of the “Summaries of Important Papers” and a complete breakdown of all references and citations. So stay tuned; there’s plenty more on the way.
1. Edward J. Wegman, David W. Scott and Yasmin H. Said, Ad Hoc Committee Report on the “Hockey Stick” Reconstruction: A Report to Chairman Barton, House Committee on Energy and Commerce and to Chairman Whitfield, House Subcommittee on Oversight and Investigations, 2006. [PDF]
2. Ian T. Jolliffe, Principal Component Analysis (Springer, 2nd ed. 2002)
3. Wikipedia article – Colors of Noise (April 12, 2006 version) – Available online at: http://en.wikipedia.org/w/index.php?title=Colors_of_noise&oldid=48074859
4. Govindan Rangarajan, Mingzhou Ding (eds.), Processes with long-range correlations: theory and applications (Springer, 2003)
5. Wikipedia article – Self-similarity (Mar. 20, 2006 version) – Available online at: http://en.wikipedia.org/w/index.php?title=Self-similarity&oldid=44580086
Detailed comparisons of Wegman Report section 2.2 to these apparent antecedents, as well as other Wikipedia articles, are found in:
A comparison of Ad Hoc Committee Report (Wegman, Scott, Said) section 2.2, p.15-17 and Various unattributed sources on statistics and noise models [PDF]