I examine the opening chapter by Edward Wegman and Jeffrey Solka in the 2005 Handbook of Statistics: Data Mining and Data Visualization (C Rao, E Wegman and J Solka, editors). Sections 3 (The Computer Science Roots of Data Mining), 5 (Databases), 6.2 (Clustering) and 6.3 (Artificial Neural Networks) appear to be largely derived from unattributed antecedents; these include online tutorials and presentations on data mining, SQL and artificial neural networks, as well as Brian Everitt’s classic Cluster Analysis. All the identified passages, tables and figures were adapted from “copy-paste” material in earlier course lectures by Wegman. The introduction to Chapter 13 (on genetic algorithms) by Yasmin Said also appears to contain lightly edited material from unattributed sources, including an online FAQ on evolutionary computing and a John Holland Scientific American piece. Several errors introduced by editing and rearrangement of the material are identified, demonstrating the authors’ lack of familiarity with these particular subject areas. This extends a pattern of problematic scholarship previously noted in the work of Wegman and Said.
The pending retraction of Said, Wegman et al 2008 in Computational Statistics and Data Analysis (see part 1 and part 2) has led to a renewed focus on the problematic scholarship of George Mason University professors Edward Wegman and Yasmin Said. The unattributed appropriation of social network analysis background from two textbooks, in both the CSDA article and the earlier Wegman et al congressional report, has garnered most of the attention. But the same pattern of “copy-and-paste” scholarship and lack of competence or domain knowledge can be seen in the other background sections of the Wegman report, including section 2.1 on tree-ring and other paleoclimatology proxies, and section 2.2 on principal component analysis and noise models. I also showed a similar pattern in Wegman and Said’s 2010 overview article on Colour Theory and Design in Wiley Interdisciplinary Reviews: Computational Statistics (see parts 1 and 2).
Today I examine two chapters in the Handbook of Statistics: Data Mining and Data Visualization (Elsevier, 2005), edited by eminent statistician C R Rao along with Wegman and Jeffrey Solka. Solka obtained his PhD in 1995 under Wegman and has been working at the Naval Surface Warfare Center and teaching at GMU.
Most of my attention will be on the opening overview chapter, Statistical Data Mining, by Wegman and Solka, although I’ll also take a quick look at chapter 13, On Genetic Algorithms and their Application, by Wegman protege (and “hockey stick” report co-author) Yasmin Said.
As in the case of the WIREs colour overview article, certain sections of Statistical Data Mining rely heavily on lightly edited portions of lectures from Wegman’s statistical data mining course at GMU. In turn, those lectures contain “copy-and-paste” material from a variety of sources, some partially attributed and some not at all.
The following table shows the main identified antecedents and the GMU course lecture that provided the “flow through” source for the particular chapter sections.
[UPDATE: As noted in a comment by “harvey” at least some of the antecedent slides in the Bajcsy presentation appear to be based on passages in Data Mining Techniques by Michael J. A. Berry and Gordon Linoff (1st ed., 1997, Wiley). ]
|Antecedent|GMU course lecture|Chapter section|
|---|---|---|
|Bajcsy PPT (Berry & Linoff)|SDM3-2003|3. Computer science roots of data mining|
|Everitt (1993)|SDM8-2003|6.2 Clustering|
|StatSoft Neural Networks|SDM11-2003|6.3 Neural networks|
Before moving on to the details, the acknowledgments section of the chapter should be noted:
The work of E.J.W. was supported by the Defense Advanced Research Projects Agency via Agreement 8905-48174 with The Johns Hopkins University. This contract was administered by the Air Force Office of Scientific Research. The work of JLS was supported by the Office of Naval Research under “In-House Laboratory Independent Research.” Figures 21 through 24 were prepared by Professor Karen Kafadar who spent time visiting E.J.W. During her visit, she was support by a Critical Infrastructure Protection Fellows Program funded at George Mason University by the Air Force Office of Scientific Research. Much of this chapter summarizes work done with a vast array of collaborators of both of us and we gratefully acknowledge their contributions in the form of ideas and inspiration.
Thus, as in the CSDA case, this work relied on federal funding. The acknowledgment of collaborators is interesting as well, insofar as these received ample citations, while the sections detailed below were free of any citations at all.
Section 3 – Computer science roots of data mining
The main antecedent in this section appears to be a course presentation, Introduction to Data Mining, by Peter Bajcsy of the University of Illinois. I’ll draw on examples from the 2002 version of the course; although the course appears to date back to 2000, earlier versions are unavailable. It should also be noted that Bajcsy acknowledges the contributions of other colleagues, notably Jiawei Han, also of Illinois, and that the course acknowledges two underlying references.
- Data Mining – Concepts and Techniques by J. Han and M. Kamber (Morgan Kaufmann, 2001)
- Pattern Classification by R. Duda, P. Hart and D. Stork (Wiley, 2001, 2nd ed.)
So this particular version of the Bajcsy course may not be the actual source for Wegman in all cases.
[UPDATE: As previously noted, at least some of the material identified in Bajcsy has itself an antecedent in Berry and Linoff’s 1997 Data Mining Techniques. For now I will note the antecedents where it is possible to confirm them via Amazon search. And I still hope to have a complete chain of provenance in part 2.]
But whatever the actual chain of provenance, it is clear that little of this section is original, as seen in the following examples.
Our first example from Wegman and Solka is definitional:
Data mining itself can be defined as a step in the knowledge discovery process consisting of particular algorithms (methods) that under some acceptable objective, produces a particular enumeration of patterns (models) over the data. The knowledge discovery process can be defined as the process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
The steps in the data mining process are usually described as follows. First, an understanding of the application domain must be obtained including relevant prior domain knowledge, problem objectives, success criteria, current solutions, inventory resources, constants, terminology cost and benefits. The next step focuses on the creation of a target dataset. This step might involve an initial dataset collection, producing an adequate description of the data, verifying the data quality, and focusing on a subset of possible measured variables. …
And so on. Virtually the same wording is found at p. 7 and p. 22 of the Bajcsy presentation, albeit in point form. First the exact same definitions:
A step in the knowledge discovery process consisting of particular algorithms (methods) that under some acceptable objective, produces a particular enumeration of patterns (models) over the data.
Knowledge Discovery Process
The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
The description of data mining steps is almost identical as well:
Develop an understanding of the application domain
Relevant prior domain knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits.
Create target dataset.
Collect initial dataset, describe, verify data quality, focus on a subset of variables.
The second step, “create target dataset”, is more or less unscathed, although the Wegman and Solka version has changed the order and added unnecessary verbiage. But the list of “application domain” activities shows small and baffling changes. A missing comma yields the nonsensical “terminology cost and benefits”, whatever that is. A more critical and telling problem is the change of “constraints” into “constants”. Is this a mistranscription or the result of a failure of automatic character recognition? Whatever the reason, it’s a fairly shocking error.
The core of section 3 is a description of “market basket analysis”, along with very specific examples of the technique at work. None of this material appears to be original.
The motivation for market basket analysis is stated by Wegman and Solka as a series of questions:
Where should detergents be placed in the store in order to maximize their sales? Are window cleaning products purchased when detergents and orange juice are bought together? Is soda purchased with bananas? Does the brand of the soda make a difference? How are the demographics of the neighborhood affecting what customers are buying?
Bajcsy asks the same questions at p. 40 of the presentation.
And the exact same image from Bajcsy can be found at p. 37 of Wegman’s SDM 3 lecture.
[Berry and Linoff have a similar figure at p. 125 of the earlier Data Mining Techniques, which is the ultimate antecedent, as stated by “harvey”. However, it still seems to me that the Bajcsy version was the proximate source; here is what appears to be the Berry and Linoff version (as reproduced in course notes from 2006 in a 23 MB PDF).
What a tangled web! ]
Another derivative figure is Wegman and Solka’s Figure 2:
This taxonomy table is found at p.50 of Bajcsy:
Not to mention that, once again, the identical figure is at p. 48 of Wegman’s SDM lecture 3. [The Berry and Linoff 1997 antecedent version is figure 8.4 at p. 135, although once again it appears not to be an identical figure.]
Another example is a table of “co-occurrence” of products in a small instructive example involving five customers and five products. The Wegman and Solka version:
[This example also occurs in Berry and Linoff, at p. 135, again as noted by "harvey".]
In this case, the Wegman version is actually correct, since the co-occurrence of orange juice with itself is necessarily higher than its co-occurrence with window cleaner. Either Wegman corrected his source, or there is a common antecedent for this table.
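For readers who have not met co-occurrence tables, the point about the diagonal can be checked with a minimal sketch. The five baskets below are hypothetical and are not the data from the Handbook, Bajcsy or Berry and Linoff; the function name `cooccurrence` is likewise my own. Each cell counts the baskets containing both products, so the diagonal cell for a product is simply the number of baskets that contain it.

```python
from itertools import combinations_with_replacement

# Hypothetical baskets: five customers, five products (not the published table).
baskets = [
    {"bananas", "soda"},
    {"detergent", "orange juice", "window cleaner"},
    {"orange juice", "soda"},
    {"detergent", "orange juice"},
    {"bananas", "orange juice", "soda"},
]


def cooccurrence(baskets):
    """Count, for every ordered pair of products, the baskets containing both."""
    items = sorted(set().union(*baskets))
    counts = {(a, b): 0 for a in items for b in items}
    for basket in baskets:
        for a, b in combinations_with_replacement(sorted(basket), 2):
            counts[(a, b)] += 1
            if a != b:
                counts[(b, a)] += 1  # keep the table symmetric
    return items, counts


items, counts = cooccurrence(baskets)

# A product can never co-occur with another product more often than it occurs at all,
# so each diagonal entry is at least as large as every other entry in its row.
diagonal_dominates = all(
    counts[(a, a)] >= counts[(a, b)] for a in items for b in items
)
```

In this made-up data, orange juice appears in four baskets but alongside window cleaner in only one, which is the kind of relationship any correct co-occurrence table must respect.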
Finally, we have the conclusion on market basket analysis.
Some of the strengths of market basket analysis are that it produces easy to understand results, it supports undirected data mining, it works on variable length data, and rules are relatively easy to compute. … Clearly if all possible association rules are considered, the number grows exponentially with n. Some of the other weaknesses of market basket analysis are that it is difficult to determine the optimal number of items, it discounts rare items, it is limited on the support that it provides.
The identical points are made in Bajcsy (p. 54-55) and found with the same exact wording in the Wegman lecture (typo and all):
Strengths of Market Basket Analysis
Weaknesses of Market Basket Analysis
Section 5 – Databases
Much of this section comes from an online tutorial that can be found in many places; I’ll use the version at SqlCourse.com. Here is the definition in Wegman and Solka (with asides for statisticians), followed by an example table:
A relational database system contains one or more objects called tables. These tables store the information in the database. Tables are uniquely identified by their names and are comprised of rows and columns. The columns in the table contain the column name, the data type and any other attribute for the columns. We, statisticians, would refer to the columns as the variable identifiers. Rows contain the records of the database. Statisticians would refer to the rows as cases. An example database table is given in Table 15.
Here is the identical definition (without the asides, of course) and table from SQLCourse.com.
A relational database system contains one or more objects called tables. The data or information for the database are stored in these tables. Tables are uniquely identified by their names and are comprised of columns and rows. Columns contain the column name, data type, and any other attributes for the column. Rows contain the records or data for the columns. Here is a sample table called “weather”.
[C]ity, state, high, and low are the columns. The rows contain the data for this table:
The rest of the section follows the tutorial quite closely, except for some fatuous statements about databases that I’ll return to in part 2.
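The tutorial's definition is easy to illustrate. The following is a minimal sketch using Python's built-in `sqlite3` module, assuming a table named “weather” with the four columns mentioned above; the column types and the row values are hypothetical placeholders, not the tutorial's data.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Columns carry a name and a data type (the "variables" in the statisticians' aside).
conn.execute(
    "CREATE TABLE weather (city TEXT, state TEXT, high INTEGER, low INTEGER)"
)

# Rows are the records (the "cases"). These values are invented for illustration.
rows = [
    ("Alphaville", "AA", 95, 70),
    ("Betatown", "BB", 72, 50),
    ("Gammaburg", "CC", 88, 61),
]
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", rows)

# Column names can be read back from the cursor description.
columns = [d[0] for d in conn.execute("SELECT * FROM weather").description]

# A simple query: cities whose high exceeded 80.
hot_cities = conn.execute(
    "SELECT city FROM weather WHERE high > ? ORDER BY city", (80,)
).fetchall()
```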
Section 6.2 – Clustering
This section is derived entirely from Brian Everitt’s Cluster Analysis (3rd edition, 1993). A later edition of the book is given as a reference for clustering techniques, but Everitt is not credited as the source of Wegman and Solka’s exposition.
Compare Wegman’s version of the clustering problem with Everitt’s original. The Handbook version:
The basic clustering problem is, then, given a collection of n objects each of which is described by a set of d characteristics, derive a useful division into a number of classes. Both the number of classes and the properties of the classes are to be determined.
The problem which these techniques address may be stated broadly as follows:
Given a collection of n objects … each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes. Both the number of classes and the properties of the classes are to be determined. [Section 1.3, p. 4]
As for motivation, Wegman and Solka state:
We are interested in doing this for several reasons, including organizing the data, determining the internal structure of the dataset, predication, and discovery of causes.
This is a ham-handed summary of Everitt:
In the widest sense, a classification scheme may represent simply a convenient method for organizing a large set of data so that the retrieval of information may be made more efficiently. …
In many applications however, a classification to serve more fundamental purposes may be sought. … To understand and treat disease it has to be classified and the classification will have two main aims. The first will be prediction – separating diseases that require different treatments; the second will be to provide a basis for research into aetiology – the causes of different types of disease. [Section 1.2, p. 2-3]
So the specific needs of disease classification have somehow morphed into a supposed general purpose of “predication” (instead of prediction) and aetiology. Lost too is Everitt’s general distinction between simple data organization for convenience or efficiency, and more fundamental motivations for clustering.
Cluster analysis generally relies on defining similarity (or dissimilarity) between various members of the data collection being analyzed. This can be done by employing one of various available measures of distance between data points. Here is Wegman and Solka’s list of common distance measures used to compute dissimilarity:
As in previous sections, Wegman and Solka draw heavily on their main source’s specific examples. A worked out example of single linkage agglomerative hierarchical clustering starts from a distance matrix for five data points and works through intermediate steps to arrive at a dendrogram (a chart representing the final hierarchical clustering solution):
Everitt has exactly the same example, albeit without the loss of pertinent distance information in the final diagram. Here is the distance matrix:
And following a series of agglomerative steps, fusing points and smaller clusters into larger ones, here is the final dendrogram:
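The procedure itself can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the five-point distance matrix `D` is hypothetical and is not claimed to reproduce Everitt's example, and `single_linkage` is a naive implementation of my own rather than anyone's published code. The recorded fusion distances are exactly the heights a dendrogram should display.

```python
def single_linkage(dist):
    """Naive agglomerative single-linkage clustering.

    dist is a symmetric matrix of pairwise distances.
    Returns a list of merges as (cluster_a, cluster_b, fusion_distance).
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: the distance between two clusters is the
                # smallest distance between any pair of their members.
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical distances between five points, labelled 0 to 4.
D = [
    [0, 3, 7, 11, 10],
    [3, 0, 6, 10, 9],
    [7, 6, 0, 5, 6],
    [11, 10, 5, 0, 4],
    [10, 9, 6, 4, 0],
]

merges = single_linkage(D)
fusion_distances = [d for _, _, d in merges]
```

With these made-up distances, points 0 and 1 fuse first, then 3 and 4, then point 2 joins the latter pair, and finally the two remaining clusters fuse. A dendrogram drawn without the fusion distances loses precisely the information in `fusion_distances`.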
The following section contains an egregious error, where Wegman and Solka purport to discuss the “number of groups” problem.
Section 6.2.2. The number of groups problem
A significant question is how do we decide on the number of groups. One approach is to maximize or minimize some criteria.
But in fact, what follows is something completely different – a summary of optimization methods of cluster analysis (as opposed to the hierarchical clustering techniques previously discussed). The optimization methods divide the n data points into g groups according to maximization or minimization of a particular criterion. But these methods assume a predetermined number of groups (clusters) to be formed; the described criteria (which are the same as found in Everitt’s chapter 5) are ways to identify a specific optimal clustering into g groups.
Now Everitt does discuss the “number of groups” problem for both hierarchical clustering (i.e. “cutting off” before reaching the individual level) and optimization clustering methods. But none of this is covered (or copied) in the Handbook chapter.
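The distinction can be made concrete with a small sketch. I assume here a within-group sum of squares criterion, which is one common criterion of this type; whether it corresponds exactly to any of the criteria listed in the chapter is an assumption on my part, and the one-dimensional data and function names are hypothetical. Note that the number of groups `g` is an input to the search, not something the search discovers.

```python
from itertools import product


def within_group_ss(points, labels, g):
    """Sum of squared deviations of each point from its own group mean."""
    total = 0.0
    for k in range(g):
        members = [x for x, lab in zip(points, labels) if lab == k]
        if members:
            mean = sum(members) / len(members)
            total += sum((x - mean) ** 2 for x in members)
    return total


def best_partition(points, g):
    """Exhaustively search every assignment of the points to exactly g groups."""
    best = None
    for labels in product(range(g), repeat=len(points)):
        if len(set(labels)) != g:
            continue  # skip assignments that leave a group empty
        score = within_group_ss(points, labels, g)
        if best is None or score < best[0]:
            best = (score, labels)
    return best


points = [1.0, 2.0, 6.0, 7.0, 15.0]  # hypothetical one-dimensional data

# Best achievable criterion value for each predetermined number of groups.
scores = {g: best_partition(points, g)[0] for g in range(1, 6)}
```

The minimized criterion only shrinks as `g` grows, reaching zero when every point is its own group. Minimizing such a criterion therefore cannot, by itself, answer the “number of groups” question; it presupposes the answer.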
How did such a huge error occur? A likely scenario is found by examining the underlying Wegman lecture. At p. 28 there is one slide on determining the number of relevant groups or clusters in hierarchical clustering, which would otherwise produce an unwieldy and overly detailed taxonomy of limited utility, right down to the individual level. This is followed by several slides on “optimization methods”. Thus it would appear that when these slides were reviewed to produce the passage in question, the slides on optimization clustering methods were mistakenly presumed to be a continuation of the “number of groups” topic.
Section 6.3: Artificial Neural Networks
This section is derived from a StatSoft online statistics guide. The underlying Wegman lecture reproduces much of the chapter on artificial neural networks almost verbatim. However, some editing and rephrasing was attempted for the Handbook version:
An artificial neuron has a number of receptors that receive inputs either from data or from the outputs of other neurons. Each input comes by way of a connection with a weight (or strength) in analogy to the synaptic efficiency of the biological neuron. Each artificial neuron has a threshold value from which the weighted sum of the inputs minus the threshold is formed. The artificial neuron is not binary as is the biological neuron. Instead, the weighted sum of inputs minus threshold is passed through a transfer function (also called an activation function), which produces the output of the neuron. Although it is possible to use a step-activation function in order to produce a binary output, step activation functions are rarely used.
This is not so much clearly wrong as somewhat incoherent. The original is much clearer and has hyperlinks to define key terms.
To capture the essence of biological neural systems, an artificial neuron is defined as follows:
- It receives a number of inputs (either from original data, or from the output of other neurons in the neural network). Each input comes via a connection that has a strength (or weight); these weights correspond to synaptic efficacy in a biological neuron. Each neuron also has a single threshold value. The weighted sum of the inputs is formed, and the threshold subtracted, to compose the activation of the neuron (also known as the post-synaptic potential, or PSP, of the neuron).
- The activation signal is passed through an activation function (also known as a transfer function) to produce the output of the neuron.
If the step activation function is used (i.e., the neuron’s output is 0 if the input is less than zero, and 1 if the input is greater than or equal to 0) then the neuron acts just like the biological neuron described earlier (subtracting the threshold from the weighted sum and comparing with zero is equivalent to comparing the weighted sum to the threshold). Actually, the step function is rarely used in artificial neural networks, as will be discussed.
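The definition in the original guide translates directly into code. Here is a minimal sketch, assuming a logistic function as the example of a smooth activation function (the quoted passage does not name one) and using hypothetical inputs, weights and thresholds; the function names are my own.

```python
import math


def step(a):
    """Step activation: 0 if the activation is below zero, 1 otherwise."""
    return 1.0 if a >= 0 else 0.0


def logistic(a):
    """A smooth activation function (an assumed example, not from the quoted text)."""
    return 1.0 / (1.0 + math.exp(-a))


def neuron_output(inputs, weights, threshold, activation):
    """Weighted sum of inputs, minus the threshold, passed through the activation."""
    post_synaptic_potential = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return activation(post_synaptic_potential)


# Hypothetical values for illustration only.
inputs = [1.0, 0.5]
weights = [0.4, 0.6]

fires = neuron_output(inputs, weights, threshold=0.5, activation=step)
silent = neuron_output(inputs, weights, threshold=1.0, activation=step)
graded = neuron_output(inputs, weights, threshold=0.5, activation=logistic)
```

With the step function the neuron is binary: it fires when the weighted sum reaches the threshold and stays silent otherwise. With a smooth activation function the same weighted sum yields a graded output between 0 and 1.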
This section follows the original quite closely. Although the attempt to rephrase and summarize tends to obfuscate the text, there appear to be fewer of the outright errors apparent in some of the other sections.
Chapter 13 (Yasmin Said): On Genetic Algorithms and their Application
The opening of this chapter appears to draw heavily from three sources:
- Evolutionary Computing FAQ (from “An Overview of Evolutionary Computation”, ECML: European Conference on Machine Learning [ECML93], 442-459).
- Melanie Mitchell, An introduction to genetic algorithms, MIT Press, 1998.
- John H. Holland, “Genetic Algorithms”, Scientific American, July 1992.
I have produced a detailed side-by-side comparison of Said’s text to the above originals for those interested. However, here I will show just one example paragraph. Here is John Holland, the inventor of genetic algorithms, describing how genetic algorithms mimic biological reproduction.
Biological chromosomes cross over one another when two gametes meet to form a zygote, and so the process of crossover in genetic algorithms does in fact closely mimic its biological model. The offspring do not replace the parent strings; instead they replace low-fitness strings, which are discarded at each generation so that the total population remains the same size.
Now here is Said’s rendering of that same paragraph.
Biological chromosomes perform the function of crossover when zygotes and gametes meet and so the process of crossover in genetic algorithms is designed to mimic its biological nominative. Successive offspring do not replace the parent strings; rather they replace low fitness ones, which are discarded information at each generation in order that the population size is maintained.
We have seen this style of summarization before, where Said manages to add verbiage, while subtracting information. Low-fitness strings are not merely “discarded”; rather, they are “discarded information”. And GA’s “biological model” is now a “biological nominative”, whatever that is. But the real howler is Said’s description of crossover (now the “function of crossover”) as “when zygotes and gametes meet”. Uh, no. As Holland clearly stated, two gametes “meet to form a zygote”.
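What Holland actually describes is simple enough to sketch. The following is a toy illustration only: the bit-count fitness function, the choice of the two fittest strings as parents and the fixed crossover point are all simplifying assumptions of mine, not details from Holland's article. It shows single-point crossover and the replacement rule in which offspring displace low-fitness strings rather than their parents, so the population size stays constant.

```python
import random


def crossover(parent_a, parent_b, point):
    """Single-point crossover: the two strings exchange tails at the given point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def fitness(bits):
    """Hypothetical fitness for illustration: the number of 1 bits."""
    return bits.count("1")


def next_generation(population, rng):
    """Breed the two fittest strings; the children replace the two least fit."""
    ranked = sorted(population, key=fitness, reverse=True)
    parent_a, parent_b = ranked[0], ranked[1]
    point = rng.randrange(1, len(parent_a))
    children = crossover(parent_a, parent_b, point)
    # The parents survive; the two lowest-fitness strings are discarded.
    return ranked[:-2] + list(children)


population = ["11110000", "00001111", "10100000", "00000001"]
new_population = next_generation(population, random.Random(0))
```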
There had been some progress since Said’s PhD dissertation the year before, however. At least this time, she managed to interweave strikingly similar material from three different sources, instead of just copying one.
This examination has provided an important new piece in the documentation of the downward spiral of the scholarship of the Wegman group. I have shown large-scale use of unattributed material, along with attendant errors, in several sections of an advanced statistical textbook on data mining. This not only establishes a pattern of problematic scholarship involving Edward Wegman and Yasmin Said in the period just before the Wegman report; it is doubly shocking because it involves material supposedly within the authors’ sphere of expertise, and even taught by Wegman in courses.
I should also mention that it is highly improbable that Jeffrey Solka bears direct responsibility for the problems documented above, as the passages in question were derived from Wegman’s course lectures.
I’ll return to the Handbook for a more detailed analysis, including colour-coded side-by-side analysis of Chapter 1 sections and identification of further unattributed antecedents. I’ll also examine Wegman’s peculiar personal approach to literature review, whereby the contributions of major figures (like Usama Fayyad) are given short shrift, while his own proteges are given disproportionate attention. And I’ll take a look at the evolution of the Statistical Data Mining course itself, an evolution which yields important clues about the descent of the Wegman group.
I’ll also be posting a summary page of all the problematic scholarship identified by myself and others so far, something I’ve been meaning to do for some time. If nothing else, it will provide a convenient resource for George Mason University when they finally get around to launching a full-blown investigation of the ever-growing mountain of evidence of misconduct. But before I do that, there is one more shocking piece in the Wegman and Said saga to report, so stay tuned.