A key question in evaluating a given study is, “How do I establish trust in the validity of a finding?” Good RDM practices can help to address this question.

Measuring Bias

Good RDM practices help address issues related to identifying and measuring bias in research.

In an undergraduate context, trust is usually a by-product of some proxy measure: peer review, the authors’ credentials, the journal, authority (validated by a course or instructor), and so on. As students move through their undergraduate degree, they may also begin conducting more in-depth analyses of articles: Are the reported methods appropriate? Do the points in the discussion reflect the reported findings? In some cases, they may be asked to comment on the analytical approaches used, the level of reporting, or whether a data availability statement is present. In general, however, these explorations don’t venture beyond the published record.

As researchers advance through an academic career, trust shifts to confirming, in much greater detail, that stated findings follow from the data and the analytical approach used. Two key questions need to be addressed in this context:

  • How do I measure or evaluate for bias in the study?
  • Can I reproduce the study’s findings?

These, in turn, raise two more interconnected questions:

  • What do I need to be able to answer these questions?
  • What does it mean to reproduce the findings of a study?

Types of Bias

Any given study is subject to bias. Biases can be broadly grouped into three categories, though these categories are strongly interconnected.

Structural biases impact who gets to do research in the first place. Built into the systems in which we live and work, these manifest in things like who gets published, who gets invited to share their research, and who gets promotion and tenure. This is a deep and complex source of bias.

Cognitive biases, known and unknown to the researcher, impact how a study is planned and implemented; together with structural biases, they contribute to the specific paradigm in which the study is run. These may impact not just the kinds of questions asked, but how those questions are asked and the approaches used to attempt to answer them.

Systemic bias, or error, is a by-product of studying complex systems and the reality that studies ask questions about only a subset of the components of these systems; we can’t simultaneously measure everything, so we must recognize that unmeasured forces impact our object of study, alongside issues such as measurement error.

All of these sources of bias can impact the validity of a study’s findings. One way of measuring this validity is to attempt to reproduce a given study. Such replication is, in part, the backbone of the idea that science can be self-correcting: as more data are collected and more studies are conducted, the error introduced by all of this bias becomes less significant. This is perhaps most true of systemic bias; structural and cognitive biases are arguably much more challenging to address or correct for.

Reproducibility and Replicability

Reproducibility and replicability exist on a continuum from evaluating study internal validity to evaluating study generalizability.

Addressing study validity through replication allows us to home in on the specific biases introduced and, importantly, where they are introduced. This also allows us to articulate what is required for successful replication and what a definition of successful replication might be.

Reproducibility and replicability are commonly articulated as two categories, the former referring to reproducing the results reported in a paper and the latter to re-running the same study. However, in the context of RDM, and given the importance of how the record of research activities is maintained, it is valuable to unpack this a bit. Here, we will break replication efforts into four broad categories:

  • Computational
  • Study results
  • Methods
  • Generalization

For the purposes of this workshop, we will be focusing on aspects related to computational reproducibility, but we’ll explore the others here.

Computational Reproducibility

This is the most basic form of reproducibility, and it validates only the procedural elements of a study’s findings. And yet, even this is very challenging to achieve.

Computational reproducibility involves being able to take the same data, same analysis tool, and same analysis pipeline to derive the same results. Using open data types (such as csv) and scripted instruction sets (like R) can enhance computational reproducibility.

Validating the computational reproducibility of a study consequently requires access to the study’s data (generally the cleaned data), the requisite documentation to understand this data (readmes, data dictionaries), and the analysis protocols (software, software versions, tests run, and the parameters selected for those tests).
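
To make this concrete, here is a minimal sketch of what a computationally reproducible analysis might look like as an R script. The file name, variable names, and model are hypothetical; the point is that every step, from import through analysis to recording the environment, lives in code rather than in undocumented clicks.

    library(readr)                             # reads open, plain-text formats such as csv

    cleaned <- read_csv("cleaned_data.csv")    # the deposited, cleaned data (hypothetical file name)

    # The reported analysis, with its parameters visible in the script
    # (outcome, treatment, and age are placeholder variable names)
    model <- lm(outcome ~ treatment + age, data = cleaned)
    summary(model)

    sessionInfo()                              # records the R version and package versions used for this run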

A lack of computational reproducibility may simply be frustrating; it may also result in a failure to publish, or in retraction if it is found to sufficiently invalidate the study.

Frustrations

A good example of the frustrations introduced by graphical user interface tools, which generally limit reproducibility, is the use of Excel for genomics data: a significant number of published research articles that stored their data in Excel were found to contain errors, in large part because Excel silently converts some gene symbols into dates. Reporting on this, along with access to the published studies, can be found on Retraction Watch.
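
As an illustration of how scripting the import step can avoid this class of problem, the sketch below reads a hypothetical genomics file with explicit column types, so that identifiers such as SEPT2 are kept as text rather than silently reinterpreted:

    library(readr)

    # Declare column types up front so nothing is silently converted
    genes <- read_csv(
      "gene_expression.csv",                   # hypothetical file name
      col_types = cols(
        gene_symbol = col_character(),         # keeps symbols like "SEPT2" as plain text
        expression  = col_double()
      )
    )

    head(genes)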

Addressing the Issue

It is becoming increasingly common for publishers to ask researchers to include a statement about code and data availability to help address these issues and to help ensure that computational reproducibility can be measured. For example, PLOS asks for a data availability statement as part of its submission requirements. However, simply providing data and code, or a statement about them, does not guarantee computational reproducibility in any meaningful way.

Taking Things a Step Further

Some disciplines and journals are addressing this issue by hiring data editors – people responsible for ensuring the computational reproducibility of studies submitted for publication. The journals of the Econometric Society are a good example, requiring a full reproducibility package to be submitted at the time an article is submitted for initial consideration. The full requirements can be found on their Data Editor website.

Study Results Reproducibility

Study results reproducibility helps to address analytical choices made in the study: it works with the same data as the original study, ideally the data as collected, but runs its own analytical pipelines in pursuit of the same research question. Borderline p-values and small effect sizes are often contradicted or invalidated in these replications.

Validating study results requires access to the researchers’ data (ideally as collected), the requisite documentation to understand this data, the hypothesis being tested, and possibly the analytical methods used.

A lack of results reproducibility can call a given study into question and raise concerns about researcher bias, particularly in relation to researcher degrees of freedom.

An early example of testing different analytical approaches against the same data set was published in 2018: Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results, available at https://doi.org/10.1177/2515245917747646.

This has since been followed by similar investigations in other disciplines.

Researcher degrees of freedom is a coined term referring to all of the decisions a researcher is at liberty to make with their data: defining an outlier, rounding a variable, grouping variables (choosing a bin size for age ranges, for example), deciding to collect more data after having looked at the data, and so on. For further discussion, see False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, available at https://doi.org/10.1177/0956797611417632.
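
A small, entirely hypothetical sketch in R can make the point: the same data, two defensible outlier rules, and potentially two different answers.

    set.seed(1)
    dat <- data.frame(group = rep(c("a", "b"), each = 50),
                      score = c(rnorm(50, mean = 10, sd = 3),
                                rnorm(50, mean = 11, sd = 3)))

    # Choice 1: keep every observation
    t.test(score ~ group, data = dat)$p.value

    # Choice 2: drop "outliers" more than 2 standard deviations from the overall mean
    keep <- abs(dat$score - mean(dat$score)) < 2 * sd(dat$score)
    t.test(score ~ group, data = dat[keep, ])$p.value

Neither choice is obviously wrong; what matters for evaluating the study is whether the rule was specified in advance or chosen after seeing the data, and whether that choice was documented.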

Methods Replicability

Methods replicability attempts to replicate a study as designed. It involves using the same methods, but with new data and a new sample from the same population. It helps to address choices related to study implementation including measurement error and sampling bias. This kind of replicability helps to address systemic bias.

Validating a study’s methods requires access to detailed methods protocols that identify the study’s:

  • Research question
  • Hypothesis
  • Data collection plan
  • Data analysis plan

The more robust the documentation, the more closely matched the study design, the greater the homogeneity between studies, and the more utility these studies provide to systematic reviews and meta-analyses as tools to evaluate the state of evidence on a particular research question. The further utility of methods replication studies may be contingent on access to their data and analytical pipelines.

This is also a critical set of documentation in allowing the researcher and their peers to differentiate between exploratory and confirmatory research.

Exploratory research is hypothesis generating, while confirmatory research is hypothesis testing. It is rare for confirmatory research to be conducted fully independently of exploratory research, as confirmatory work often suggests further paths of inquiry. However, it is critical to clearly differentiate between the two and to document each appropriately: exploratory research is generally documented while the exploration is under way, whereas confirmatory research is documented in advance.
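
One simple way of keeping that distinction visible is to label the two modes explicitly in the analysis record itself. The sketch below is hypothetical (the question, variables, and file name are placeholders), but it shows the confirmatory plan written down as if specified in advance, with exploratory work clearly flagged as hypothesis generating:

    ## Confirmatory analysis (documented in advance) -----------------------
    # Research question:    does the intervention change the outcome?
    # Hypothesis:           mean outcome differs between the two arms
    # Data collection plan: n = 200 participants, randomised to two arms
    # Data analysis plan:   two-sided t-test, alpha = 0.05
    trial <- read.csv("trial_data.csv")
    t.test(outcome ~ arm, data = trial)

    ## Exploratory analysis (documented while exploring) -------------------
    # A possible age interaction noticed during exploration; hypothesis
    # generating only, to be tested in a future confirmatory study.
    summary(lm(outcome ~ arm * age, data = trial))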

The goal is to iteratively attempt to validate the significance testing and to grow the body of evidence and available data addressing a specific question. A failure of methods replication raises concerns about researcher bias (often suspected, but sometimes difficult to pinpoint precisely because of a lack of documentation). This may invalidate the original study or, worse, result in wasted research dollars and researcher time.

It was this kind of replication that first emerged in response to Ioannidis’ 2005 article Why Most Published Research Findings Are False, available at https://doi.org/10.1371/journal.pmed.1004085.

These kinds of replication studies were started in psychology (see Estimating the Reproducibility of Psychological Science available at https://doi.org/10.1126/science.aac4716, with full documentation at https://osf.io/ezcuj/), but have been explored in other areas as well, such as cancer research: see Reproducibility Project: Cancer Biology at https://www.cos.io/rpcb.

Study Generalizability

This is the most robust category of validation, and involves addressing the same research question, but from a novel perspective. It involves re-considering choices related to study design including how to measure and how to define the population. It may also involve re-evaluating choices related to how the question will be approached, impacting the choice of overall study design.

Generalization attempts to extend the validity of a study, or to address issues related to structural or cognitive bias in the research process.

Reproducibility & RDM

Each type of reproducibility requires increasingly robust implementation of RDM best practices. Computational reproducibility for the quantitative aspects of a research study should arguably be a baseline part of the peer review process, and requires cleaned data and scripts as well as a reproducible environment – something we’ll look at tomorrow.
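
As a brief preview, two common ways of recording the computational environment in R are sketched below; neither is prescribed here, and renv is only one of several options.

    sessionInfo()        # base R: prints the R version, OS, and loaded package versions

    # The renv package goes further, writing exact package versions to a lockfile
    # (renv.lock) that travels with the project and can be restored by others.
    # install.packages("renv")                # assumes renv is available
    renv::init()         # create a project-local package library
    renv::snapshot()     # record the current package versions in renv.lock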

Moving up the line to validate findings, we then need access to the data as collected, to detailed protocols and methods, and to version-controlled records that let us understand when and how things changed. In addition, clearly articulated positionality statements help to demonstrate how much thought researchers have put into the structural and cognitive biases that might influence their work.

Capturing this as a summary, we might suggest something like the following:

  • Who are you? What world view, paradigm, framework, etc. is guiding your thought and decision-making processes?
  • What’s your plan? What are your research question, testable hypothesis, and data collection and analysis plans? Did these change as the project unfolded, and if so, how and why?
  • What was your final analysis plan? How did you go from raw to processed data, and what decisions did you make about your data once it was in hand?
  • Can I follow your final recipe? How clearly documented or reproducible is your final analysis? Was further error introduced at this stage?

Evaluation to Practice

Just as we need this level of transparency to evaluate a given study, we should strive to provide the same level of transparency for our peers and to enhance our own research practices. This process begins with a protocol that details why you’re doing what you’re doing and how you intend to do it – a data management plan is a key piece of this, and it is where this workshop begins. It then involves updating this protocol and DMP to reflect how the process actually unfolded (we can never account for every eventuality); once data are in hand, documenting your exploratory approaches (using R and RMarkdown, for example); then scripting your analysis; and finally depositing these ‘artefacts’ of the final output somewhere that allows for validation.
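
As a closing, hypothetical sketch, the ‘artefacts’ of such a project might be organised something like the layout below, created here from R; the folder names are illustrative, not prescriptive.

    # Create an illustrative project skeleton
    dirs <- c("data/raw",        # data as collected, never edited by hand
              "data/processed",  # cleaned data produced by scripts
              "docs",            # DMP, protocol, README, data dictionary
              "analysis",        # RMarkdown notebooks for exploratory work
              "scripts")         # final, scripted analysis pipeline

    invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))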