Addressing the Data Analysis in Francesca Gino’s Data Colada Lawsuit
I am a JD-PhD candidate in financial economics, not a barred attorney. This post is about data analysis, not law, and nothing in this post should be construed as legal advice. As of this writing, I have no personal relationship to Francesca Gino, Harvard University, or the authors of Data Colada.
The following post reviews, line-by-line, the data science-based allegations made by behavioral scientist Francesca Gino in her legal complaint against the authors of the quant blog Data Colada (DC) relating to four blog posts concerning alleged fabrications in her academic work. These allegations are interspersed within paragraphs (232)-(282) of the complaint. Coverage of the suit is available here.
In this analysis, I focus exclusively on Gino’s factual counterarguments that deal with DC’s data-based rationales for why data in each paper is allegedly fabricated. Nothing in this writing relates to Gino’s claims related to her former employer (Harvard University), DC outside of the specific data-based inquiry advanced in each blog post (such as DC’s descriptions of its interactions with Harvard’s investigators), or the way that any alleged facts relate to any legal standards.
My conclusion, on which I elaborate below, is that Gino’s claims about DC’s investigation (1) do not establish that DC erred or substantially overreached in its data analyses and (2) do not acknowledge entire lines of DC’s data analyses.
Allegations related to Blog Post 1: Clusterfake (discussing “Signing at the beginning makes ethics salient and decreases dishonest self-reports in comparison to signing at the end”)
I summarize DC’s argument in Clusterfake as follows: DC looks at the data, which is available on the Open Science Framework (OSF), a data-sharing website. There are 101 observations in the dataset, and they are almost (but not quite) sorted by: first, the condition assignment (0 = control, 1 = sign-at-the-top, and 2 = sign-at-the-bottom, in ascending order), and second (within each condition assignment), a participant identification variable called “P#”. Eight of the participant IDs are either duplicated or out-of-sequence in a suspicious manner (in addition to one other out-of-sequence observation over which DC does not raise concerns). DC states that “there is no way, to our knowledge, to sort the data to achieve this order,” and that “this means that these rows of data were either moved around by hand, or that the P#s were altered by hand. We will see that it is the former.”
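For readers who want to see what this kind of check looks like mechanically, here is a minimal sketch of how one might flag duplicated or out-of-sequence participant IDs in a dataset sorted this way. The file name and the column names (“Cond”, “P#”) are placeholders I am assuming for illustration; this is not DC’s code.

```python
import pandas as pd

# Placeholder file and column names ("Cond", "P#") for illustration only.
df = pd.read_excel("study1_data.xlsx")

# Participant IDs that appear more than once.
duplicated_ids = df.loc[df["P#"].duplicated(keep=False), ["Cond", "P#"]]

# Within each condition, a row is out-of-sequence if its P# is smaller than
# some P# that appears above it -- i.e., it breaks the ascending sort.
def out_of_sequence(group):
    prior_max = group["P#"].cummax().shift(fill_value=-1)
    return group[group["P#"] < prior_max]

suspects = df.groupby("Cond", group_keys=False).apply(out_of_sequence)

print(duplicated_ids)
print(suspects[["Cond", "P#"]])
```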
DC notes that the OSF data also include an Excel file of the same data that contains formulas. One subsidiary file that the Excel file uses to produce the spreadsheet is calcChain.xml. CalcChain “retains the order in which formulas were initially entered into the spreadsheet” regardless of where cells are ultimately moved. Using CalcChain, DC shows that 6 observations that appear on top of each other in the dataset and are out-of-sequence, all of which are marked as being in either condition 1 (sign-at-the-top) or condition 2 (sign-at-the-bottom), used to be higher up in the spreadsheet in between rows 3 and 10, which would have put them in condition 0 (the control condition). Additionally, the P#s of the rows surrounding the places where CalcChain identifies the initial positions skip the exact position that would have been moved, further bolstering the idea that the observations were moved.
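Because an .xlsx workbook is just a ZIP archive, calcChain.xml can be read directly. The short sketch below, which assumes the workbook contains the standard xl/calcChain.xml part, lists the cell references in the order their formulas were first entered; again, this illustrates the technique rather than reproducing DC’s code.

```python
import zipfile
import xml.etree.ElementTree as ET

# An .xlsx file is a ZIP archive; xl/calcChain.xml (when present) records
# formula cells in the order they were first entered, regardless of where
# the cells were later moved. The file name is a placeholder.
with zipfile.ZipFile("study1_data.xlsx") as z:
    root = ET.fromstring(z.read("xl/calcChain.xml"))

# Each <c> element's "r" attribute is a cell reference (e.g. "J7"); document
# order therefore reflects original entry order.
entry_order = [c.attrib["r"] for c in root]

print(entry_order[:20])
```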
The 8 observations (the 6 out-of-sequence rows and the two duplicates), DC shows, are critical to the paper’s result, because they “are all among the most extreme observations within their condition, and all of them in the predicted direction.” DC concludes by saying that “all of this strongly suggests that row 70 [and the other out-of-sequence rows] was moved from the control condition (Condition 0) to the sign-at-the-bottom condition (Condition 2) [or, depending on the specific observation, Condition 1].”
Having recapped DC’s post, I now move to Gino’s counterargument. Gino says that the study at issue was conducted “on paper” and that “accordingly, Data Colada’s review of a spreadsheet years later could not have accurately reflected the results of the study” (236). Specifically, Gino says that “when collecting data on paper, data is usually entered into an Excel database from stacks of paper [emphasis in original], and that data entered this way is not necessarily sorted” (241), and that with respect to the duplicate observation, “it was equally likely that the same index card was used for participants’ IDs or the research assistant who conducted the study entered the ID twice — an honest error” (242).
First, with respect to the explanation for the duplicate observation: I note that in addition to P#49 appearing twice (which DC highlights as the duplicates), P#13 also appears twice and there is no one identified as P#98. Neither DC nor Gino discusses this in any detail. The two P#49 observations are not identical across all variables (although many of the fields are identical). It is of course possible in the abstract that an RA entered an ID twice or that the same index card was used twice, but it is very difficult to offer meaningful evidence for or against this.
Second, and more importantly, with respect to the explanation for the sorting: Gino repeatedly emphasizes that the data was collected on paper and entered into Excel afterwards, and that data entered this way is not necessarily sorted. This is true, but it does not address the problem DC identifies: that after the initial entry of data into the Excel spreadsheet, observations were moved within the spreadsheet itself. That the data was initially collected on paper and entered into Excel later is irrelevant to the fact that the observations moved after the data were put into Excel are the highly suspicious ones. Gino focuses on addressing DC’s claim that there is no way within Excel to generate the ordering of observations via a sort, which is an auxiliary contextual claim. Gino also puts forward no explanation for why the observations that were moved after the spreadsheet was created are also the ones that are critical to establishing her headline result.
Allegations related to Blog Post 2: My Class Year is Harvard (discussing “The Moral Virtue of Authenticity: How Inauthenticity Produces Feelings of Immorality and Impurity”)
I summarize DC’s argument in My Class Year Is Harvard as follows: in a survey whose observations purportedly come from 491 Harvard students, one of the questions asked was “Year in School: _______”. 19 observations in the dataset (~4%) did not contain answers like “junior” or “2013,” but instead contained the answer “Harvard,” and one observation contained the answer “harvard,” which are intuitively improper answers to the question. The study’s core finding is that making survey respondents write an essay that goes against their stated beliefs makes them subsequently rate cleansing products as more desirable. The 19 observations containing the answer “Harvard”, which were not roughly evenly spaced out across the dataset but clustered close together, were “especially likely to confirm the authors’ hypothesis” because they all took on very strong desirability ratings for the cleansing products in the direction the paper argues. The “harvard” answer, on the other hand, contradicts the thrust of the paper.
DC concludes by saying that “this strongly suggests that these ‘Harvard’ observations were altered to produce the desired effect. And if these observations were altered, then it is reasonable to suspect that other observations were altered as well.”
Gino’s reply to this argument, in (250), is: “Data Colada, as experienced behavioral scientists, knew that participants frequently respond to a survey to obtain payment due for their participation (as study participants) and may rush through questions, sometimes more than once to get paid, and use extreme values as their answers. It is widely known in behavioral science that participants in online studies at times provide poor-quality data by answering surveys without the attention they require.”
Gino’s points above are all correct, but do not contradict Data Colada’s argument.
Survey participants may of course pay less than full attention and use extreme values in their answers, but Data Colada’s argument is that this “Harvard” subset of answers (1) is much more extreme than the other answers and (2) is extreme always in a way that favors the study’s conclusion. If the least careful respondents had all, by chance, landed on the exact same mistake of writing the word “Harvard” (always spelled in this exact manner), then we should expect their answers to the actual survey question to be more or less random, not to follow a pattern of extreme answers that all push in the same direction. Gino also does not explain why all of these people would have happened upon writing the exact same conceptually incorrect answer to the class year question, when inattention by survey respondents would predict a variety of different errors.
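To make this expectation concrete, the kind of comparison at stake looks roughly like the sketch below. The column names (“class_year”, “desirability”) are placeholders I am assuming rather than the paper’s actual variable names.

```python
import pandas as pd

# Placeholder file and column names for illustration only.
df = pd.read_csv("authenticity_study.csv")

is_harvard = df["class_year"] == "Harvard"

# If the "Harvard" rows were just inattentive respondents, their desirability
# ratings should look roughly like everyone else's; DC's claim is that they
# are instead unusually extreme and uniformly in the hypothesized direction.
print(df.loc[is_harvard, "desirability"].describe())
print(df.loc[~is_harvard, "desirability"].describe())
```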
Allegations related to Blog Post 3: The Cheaters Are All Out of Order (discussing “Evil genius? How dishonesty can lead to greater creativity”)
DC’s argument here is: in an MTurk study with 178 participants, participants first perform a task in which they have an incentive to cheat, and some do cheat. Then, the participants perform a “uses” task in which they are supposed to “generate as many creative uses for a newspaper as possible within 1 minute.” The paper’s result is that participants who cheat come up with more uses for the newspaper on average. In the dataset, the results are almost perfectly sorted by, first, whether the participant cheated or not, and then, within that, how many uses for the newspaper the participant generated, in ascending order. But while this sort perfectly describes the data of the participants who purportedly cheated, among the participants who purportedly did not cheat, the “Numberofresponses” column (the number of uses) usually follows ascending order but occasionally jumps to a much higher number before falling back to the next integer that a strictly ascending sort would have put next. (This happens for 13 observations.)
DC says that it is not possible to conduct a sort of the data that would produce this ordering, and so the data were either originally altered (“which is implausible, since the data originate in a Qualtrics file that defaults to sorting by time”), or manually altered. DC then creates a new variable that imputes the low and high values of what Numberofresponses would have been for the out-of-order observations if their suspicions are correct (the low and high values are either the same number or are only off by one). DC runs the same regression that the paper runs on its version of the data and finds that the statistical significance of the main result vanishes.
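Under assumptions about file and column names (“cheated”, “Numberofresponses”), a minimal sketch of this detection-plus-imputation step might look like the following; the OLS shown is a plain stand-in for the paper’s actual specification, not a reproduction of it or of DC’s code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder file and column names ("cheated", "Numberofresponses").
df = pd.read_csv("evil_genius_data.csv")

# Within the purported non-cheaters (whose block should be sorted ascending),
# a row is suspect if its value exceeds the next row's value -- the "jumps"
# DC describes before the sequence settles back down.
honest = df[df["cheated"] == 0]
vals = honest["Numberofresponses"].reset_index(drop=True)
suspect_pos = [i for i in range(len(vals) - 1) if vals[i] > vals[i + 1]]

# Impute lower- and upper-bound replacements implied by the ascending sort
# (the neighboring rows pin each value down to within one), then re-run a
# simple cheater-vs-non-cheater comparison on the corrected data.
for bound in ("low", "high"):
    adj = df.copy()
    for pos in suspect_pos:
        lo = vals[pos - 1] if pos > 0 else vals[pos + 1]
        hi = vals[pos + 1]
        adj.loc[honest.index[pos], "Numberofresponses"] = lo if bound == "low" else hi
    fit = smf.ols("Numberofresponses ~ cheated", data=adj).fit()
    print(bound, round(fit.params["cheated"], 3), round(fit.pvalues["cheated"], 4))
```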
Additionally, DC finds that in the dataset for the paper, it is not just that the average number of newspaper uses is higher for the cheaters, but that the entire distribution of the number of newspaper uses is shifted outward for cheaters relative to non-cheaters. DC says that if there is truly no difference between cheaters and non-cheaters and there was data tampering, then after correcting for the tampering the two distributions should be nearly identical, and finds that this is the case with their version of Numberofresponses. They also run a simulation, making the same 13 changes to random observations 1 million times, and find that such similar distributions are virtually impossible to obtain by chance.
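Continuing from the previous sketch (reusing its df, honest, suspect_pos, and adjusted frame adj), a simulation in this spirit might apply a correction of the same size to randomly chosen non-cheater rows many times and ask how often the cheater and non-cheater distributions end up as similar as they do under the targeted correction. The similarity measure used here (a two-sample Kolmogorov-Smirnov statistic) is my choice for illustration; DC’s implementation may differ.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def distribution_gap(frame):
    """Distance between cheater and non-cheater distributions of response counts."""
    a = frame.loc[frame["cheated"] == 1, "Numberofresponses"]
    b = frame.loc[frame["cheated"] == 0, "Numberofresponses"]
    return ks_2samp(a, b).statistic

# Gap after the targeted correction of the suspect rows ("adj" from the sketch above).
target_gap = distribution_gap(adj)

# Apply a same-sized correction to randomly chosen non-cheater rows instead, many
# times, and count how often the two distributions become as similar as target_gap.
# (DC reports one million draws; fewer are used here for speed.)
corrected_vals = adj.loc[honest.index[suspect_pos], "Numberofresponses"].to_numpy()
hits, n_sims = 0, 10_000
for _ in range(n_sims):
    sim = df.copy()
    rows = rng.choice(honest.index.to_numpy(), size=len(suspect_pos), replace=False)
    sim.loc[rows, "Numberofresponses"] = corrected_vals
    hits += distribution_gap(sim) <= target_gap

print(hits / n_sims)
```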
Gino’s response to Blog Post 3 (paragraphs (254)-(260) of the complaint) barely addresses any of the above analysis. In (255), Gino states that, “as experienced behavioral scientists who presumably read the study at issue, Data Colada knew that the variable called ‘NumberOfResponses’ needed to be coded by a research assistant. And they also knew that it was very likely that the data in this case was sorted before, not after, the variable ‘NumverOfResponses’ [sic] was coded by the research assistant.”
Gino’s reply here does not make sense. The issue DC raises with respect to the sorting of the dataset is that no sort of the data can produce that ordering. Gino is claiming that DC knew it was “very likely” that the exact order of events of this research project was: (1) the file of results was generated, but without the “Numberofresponses” column; (2) the file of results was then sorted (and would not be sorted again); and (3) the research assistant then manually coded the “Numberofresponses” column (and never sorted the dataset again before it was made available).
There is no intuitive reason for such a process; it would be simpler to generate the file of results all at once. Gino’s explanation also requires the reader to believe that this supposed RA, after coding the “NumberofResponses” column, happened to do so in a manner that produces a near-perfect sort, but not quite a perfect one.
Gino never responds to DC’s version of the regression or simulations of the dataset’s distribution.
Allegations related to Blog Post 4: Forgetting the Words (discussing “Why connect? Moral consequences of working with a promotion or prevention focus”)
DC’s argument here is: in an MTurk study with 599 participants, participants were randomly asked to write about a hope or aspiration (the “promotion” condition), a duty or obligation (the “prevention” condition), or their usual evening activities (the “control” condition). The prediction is that people would feel more morally impure about networking when in a prevention-focused mindset than a promotion-focused mindset. After the writing was complete, participants were asked to imagine being at a networking event where they made professional contacts, and then were asked to rate on a 7-point scale the extent to which they felt “dirty,” “tainted,” “inauthentic,” “ashamed,” “wrong,” “unnatural,” or “impure” (1 meant “not at all,” 7 meant “very much”). The average of these seven variables was called “moral impurity” and the study found that participants in the prevention condition felt on average more moral impurity than in the promotion condition. Participants were also asked to list 5–6 words describing their feelings about the networking event.
DC finds that in the control condition, many (92) participants had a moral impurity score of 1.0, which mathematically requires them to have given a 1 out of 7 on every scale. They find this result perfectly plausible. However, they find that in the prevention condition, many (64) participants had a moral impurity score of 2.0 exactly and several (18) participants had a moral impurity score of 3.0 exactly, which they find odd because, again, the moral impurity score is an average of seven scales. In the promotion condition, a great many (118) participants had a moral impurity score of 1.0 and virtually no participants express high moral impurity scores. DC expresses “suspicions” that some scores in the prevention condition were changed to 2s and 3s across the board and that some scores in the promotion condition were changed to 1s across the board.
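One way to see why the exact 2.0s and 3.0s stand out: the moral impurity score is the mean of seven integer responses, so landing exactly on a whole number requires the seven responses to sum to an exact multiple of seven, which happens most naturally when every one of the seven items was given the same answer.

```latex
\text{moral impurity} = \frac{1}{7}\sum_{i=1}^{7} r_i, \quad r_i \in \{1,\dots,7\},
\qquad
\text{moral impurity} = 2.0 \iff \sum_{i=1}^{7} r_i = 14 \ \ (\text{e.g., all } r_i = 2).
```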
DC tests their “suspicions” by employing the words used to describe feelings about the networking event, on the theory that a data tamperer could have changed the ratings scores while neglecting to change the words because the words are not used in the paper’s results generally. DC has three online workers, “blind to condition and hypothesis,” independently rate the overall positivity/negativity of each participant’s word combination on a 7-point scale. DC averages the three workers’ ratings to create a sentiment measure of each participant’s words.
DC finds that observations in the prevention condition where the observation contains ‘all 2s’ or ‘all 3s’ have, per their metric, words that are “way [emphasis in original] too positive” when comparing them to the other observations in the dataset and that are “as positive as the ‘all 1s’ in the rest of the dataset.” When looking at the promotion condition, by contrast, DC finds that some participants who (1) gave a ‘1’ to all moral impurity-related scales and (2) gave maximally positive responses about engaging in future networking actually wrote negative things about networking. [Updated for clarity: for the observations that DC does not consider to be suspicious, they write that their sentiment metric correlates highly with the moral impurity score]. Lastly, when DC re-runs the paper’s analysis using their metric of word sentiment analysis, the effect that the paper estimates disappears.
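A rough sketch of this kind of check is below; the column names (rater1, rater2, rater3 for the three workers’ ratings, plus impurity and condition) are my assumptions rather than DC’s variables, and the code is an illustration of the approach, not DC’s analysis.

```python
import pandas as pd

# Placeholder file and column names for illustration only.
df = pd.read_csv("networking_study.csv")

# Average the three workers' 7-point sentiment ratings of each participant's words.
df["sentiment"] = df[["rater1", "rater2", "rater3"]].mean(axis=1)

# Rows DC flags as suspicious: exact 2.0 or 3.0 impurity scores in the
# prevention condition (consistent with "all 2s" or "all 3s" entries).
suspect = (df["condition"] == "prevention") & df["impurity"].isin([2.0, 3.0])

# If those scores were genuine, the suspect rows' own words should read more
# negatively than the "all 1s" rows; DC reports they instead read about as
# positively. For non-suspect rows, sentiment and impurity should track closely.
print(df.loc[suspect, "sentiment"].mean(), df.loc[~suspect, "sentiment"].mean())
print(df.loc[~suspect, "sentiment"].corr(df.loc[~suspect, "impurity"]))
```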
Gino’s response to Blog Post 4, as it relates to the specific analysis conducted, is as follows. In (264), Gino states, “to the extent that Data Colada read the paper, they knew that how ‘impure’ a person feels about networking is not equivalent to how positive or negative that same person feels about Networking, as the study in the paper showed.” Then, in (265), Gino states, “Data Colada also knew that coding the ‘words’ they had their coders rate for positivity or negativity not only had nothing to do with the hypothesis that was being tested in Professor Gino’s study, but further, would such a subjective exercise as to be useless [sic].”
Of the four responses to the blog posts, I find Gino’s response here the most convincing in relative terms, though still not a persuasive rebuttal. It is true that a scale based on “positivity vs. negativity” is not identical to a scale based on “moral impurity”; however, this is a dispute of degree. The words used in the construction of the “moral impurity” scale (“dirty,” “tainted,” “inauthentic,” “ashamed,” “wrong,” “unnatural,” and “impure”) obviously have valence relevant to a negative-positive scale. Reasonable researchers can disagree about the degree to which the DC-generated negative-vs.-positive metric captures what the moral impurity measure is meant to capture, but there is a world of difference between the claim in (264) that the two are “not equivalent” and the claim in (265) that they have “nothing” to do with each other.
Additionally, Gino says that the study “showed” that how “impure” a person feels about networking is not the same as how “positive or negative” that person feels about networking. A direct citation is needed for this claim, as I have read the study twice now and cannot find what part of the study this statement is supposed to be referencing.
Lastly, Gino does not respond to DC’s statements about the dubiousness of so many moral impurity scores in the prevention and promotion conditions being exact whole numbers, consistent with identical scores having been entered across all seven subcomponents.
Conclusion
In this post, I cover the parts of Francesca Gino’s legal complaint against Harvard University and Data Colada that specifically deal with DC’s analysis of the data and results of her research. Without commenting on any other aspect of the lawsuit, Gino’s rebuttal of DC’s empirical work does not persuade me that DC made mistakes in its data work or over-hyped what it was able to demonstrate with regard to direct data work alone. Gino also does not respond at all to many of the empirical arguments made, particularly with regard to Blog Post 3.
Additionally, as readers have no doubt noticed, the wording of this post is very specific and careful, as it concerns active litigation. I would like to state my support for, as a general matter, verification of the work of academics and for journal policies mandating data/code transparency that enable that work. I would also like to say that if such work carries personal litigation risk, people will be less willing to attach their names to it, and those efforts will be more likely to be distributed via anonymous platforms and whisper networks. The social scientific project will be worse off if that comes to pass.