Inside the OSF Reproducibility Challenge: A Data Scientist’s Perspective


I Thought Open Science Data Would Be Clean 🙂 

When I first started working with research reproducibility data, I had a very clear idea of what my days would look like. 

I imagined myself: 
– contributing to elegant models in our lab, 
– running clever network analyses, 
– discovering deep, satisfying patterns hidden in the data. 

In my head, the data would arrive… ready. 

Reality, however, had other ideas. 

Very early on, I learned a universal truth of data work: 
You don’t analyse data. 
You negotiate with it first. 

And sometimes that negotiation takes most of your time. 

My Optimism, Meet OSF 

When I started working with Open Science Framework (OSF) data for the OSF Reproducibility Challenge (Round 1 and Round 2), I was genuinely excited. 

“Open science,” I thought, “means everything is transparent, documented, and perfectly organised.” 

Surely this would be the cleanest data I’d ever touch. 

I downloaded my first OSF dataset. 

A file appeared. 
With many columns. 
With names that did not explain themselves. 

Promising. 

Then I looked closer. 

– Entry IDs were duplicated—but somehow referred to different studies 
– The same original study had been replicated by multiple teams, raising deep questions like: what exactly does one Effect ID represent? 
– Rows looked identical until they very much weren’t 

That was the moment I realised something important: 
OSF data isn’t messy because people are careless. 
It’s messy because science is human. 
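
To make that concrete: below is the kind of minimal pandas check that surfaces these problems early. The file name and the columns (entry_id, study_title) are placeholders I invented for illustration; the real OSF export has its own schema.

```python
import pandas as pd

# Placeholder file and column names; the real OSF export has its own schema.
df = pd.read_csv("osf_replications.csv")

# Entry IDs that appear more than once
dupes = df[df.duplicated("entry_id", keep=False)]

# Among those, IDs attached to more than one distinct study:
# the "same ID, different study" cases
conflicts = (
    dupes.groupby("entry_id")["study_title"]
    .nunique()
    .loc[lambda n: n > 1]
)
print(conflicts)

# Rows that are identical across every column
print(df.duplicated(keep=False).sum(), "fully identical rows")
```

None of this fixes anything, of course. It just tells you where the negotiation needs to start. 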

Cleaning OSF Data Is Like Archaeology 

Working with OSF data doesn’t feel like data science. 
It feels like archaeology. 

You’re constantly asking: 
– Which file came first? 
– Which decision came later? 
– What do these values actually mean? 
– Is this a duplicate—or a separate entry pretending to be one? 
– Was this exploratory… or confirmatory? 
– Is this variable new, or just renamed (again)? 

You’re not just cleaning numbers. 
You’re reconstructing the story of a project. 

And, like most real stories, it’s rarely linear. 

“Fine,” We Said. “Let’s Go Even Further Back.” 

At some point, as a team, we realised we had to go even further back. 

Not just OSF files. 
Not just preregistrations. 

We needed the original research papers that were replicated, along with their records from Web of Science and Scopus. 

Surely this would bring clarity. 
(It brought context. And more work.) 

The Gold Standard That Wasn’t 

I had worked closely with Web of Science and Scopus before. In my mind, they were: 
– curated, 
– unified, 
– authoritative. 

Basically… perfect. 

They were not. 

I opened the reports and immediately found: 
– papers with no publication year, 
– very old books mixed with modern journal articles, 
– the same paper listed differently across databases, 
– author names spelled in impressively creative ways, 
– conference papers pretending to be journal articles. 

At that point, all I could do was laugh. 

Because the lesson had returned, louder than before: 
Prestigious does not mean consistent. 
Curated does not mean complete. 
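
Reconciling “the same paper, listed differently” mostly comes down to normalising before comparing. Here is a rough sketch using only the standard library; the threshold and the example titles are invented, and real matching should also lean on year and DOI where they exist.

```python
import re
from difflib import SequenceMatcher

def normalise(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())

def same_paper(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy match on normalised titles. 0.9 is an arbitrary starting
    point, and anything near the threshold still deserves a human look."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold

# Two renderings of the same (made-up) record, one per database
print(same_paper(
    "The Effect of Deadlines on Data Quality: A Registered Report.",
    "EFFECT OF DEADLINES ON DATA QUALITY - A REGISTERED REPORT",
))  # True
```

Sticking to the standard library keeps the sketch dependency-free; at scale, a dedicated library such as rapidfuzz does the same job faster. 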

Automation to the Rescue (Sort Of) 

To survive this phase, I automated part of the data extraction. 

I pulled key metadata directly from the Web of Science and Scopus records, such as: 
– year of publication, 
– abstract word counts, 
– and other descriptive fields. 

And then I did what every cautious data scientist eventually does: 
I randomly selected records and checked them manually. 
Once. 
Twice. 
Again. 

Automation was fast. 
Trust required human eyes. 
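
A minimal sketch of what that extraction-plus-spot-check loop can look like is below. I am using the standard Web of Science field tags (PY for year, TI for title, AB for abstract) as column names, but treat every name here, including the file, as a placeholder for whatever your own export contains.

```python
import pandas as pd

# Assumes a tab-delimited Web of Science export saved locally.
records = pd.read_csv("wos_export.txt", sep="\t", dtype=str)

meta = pd.DataFrame({
    "title": records["TI"],
    "year": pd.to_numeric(records["PY"], errors="coerce"),  # missing years become NaN
    "abstract_words": records["AB"].fillna("").str.split().str.len(),
})

# Everything suspicious goes on the manual-review pile
print(meta[meta["year"].isna()])

# The spot check: pull a random handful and compare them against the PDFs by hand
print(meta.sample(n=10, random_state=42))
```

The fixed random_state is deliberate: a teammate can re-run the script and check exactly the same ten records. 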

The Hidden Cost: Time, Pressure, and Mood 

This phase alone lasted two to three months. 

Months of switching between OSF, papers, Web of Science, and Scopus. 
Some days everything finally aligned. 
Other days, a single inconsistency could undo hours of work. 

All of this happened under real time pressure, because we were working toward the OSF Reproducibility Challenge deadline. 

The data didn’t just need to be clean. 
It needed to be ready. 
Ready to run models. 
Ready to generate results. 
Ready on time. 

That combination—messy data, high stakes, and a ticking clock—had a real impact on my mood and wellbeing. 
There were definite ups and downs. 
It was challenging, sometimes exhausting, and occasionally overwhelming. 

But it was also meaningful. 

Why This Always Takes So Long 

By now, I’ve accepted that 70–80% of my work is not analysis. 

It’s: 
– checking, 
– reconciling, 
– standardising, 
– asking, what is this, really? 

Why? 
Because real data comes from real people. 
Because standards change over time. 
Because different systems store the same thing differently. 
Because science evolves faster than metadata. 

And no algorithm can fix that for you. 

What I Do Now (After Being Burned Many Times) 

These days, I don’t rush. 

When I get a new dataset, I: 
– read everything before analysing anything, 
– map the structure myself (a quick profile like the sketch after this list), 
– decide what I trust—and why, 
– write down assumptions explicitly, 
– accept ambiguity as part of the data. 
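
The “map the structure” step usually starts with a quick profile and an assumptions file that outlives my memory. A minimal sketch, where the file name and the example assumptions are invented:

```python
import pandas as pd

df = pd.read_csv("new_dataset.csv")  # placeholder name

# Map the structure before analysing anything
print(df.shape)
print(df.dtypes)
print(df.isna().sum())              # where the holes are
print(df.nunique().sort_values())   # spot ID-like and near-constant columns

# Assumptions live in a file, not in my head
assumptions = [
    "entry_id is not unique; expect one row per (study, team) pair",  # hypothetical
    "a missing year usually means a book chapter",                    # hypothetical
]
with open("ASSUMPTIONS.md", "a") as f:
    f.writelines(f"- {a}\n" for a in assumptions)
```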

Only then do I start the fun part. 

Ironically, once the data is clean, everything else becomes easier. 

The Big Lesson 

I used to think data cleaning was the boring part you had to survive before doing real work. 

Now I think: 
Data cleaning is the real work. 

It’s where you: 
– learn how science actually happens, 
– see the gaps between intention and execution, 
– understand the limits of what your analysis can honestly claim. 

OSF taught me transparency. 
Web of Science and Scopus taught me humility. 
Messy data taught me patience. 
Deadlines taught me prioritisation. 

And all of them taught me this: 
If you don’t understand your data deeply, 
your results don’t mean much—no matter how fancy the model. 

Final Thoughts 

Yes, I still spend most of my time cleaning data. 

But now I don’t see it as wasted time. 

I see it as: 
– preparing the canvas, 
– sharpening the tools, 
– making sure the story I tell is honest. 

Because in the end, beautiful analysis on dirty data is just decoration. 

And I’d rather do the slow work properly than get fast results I don’t trust. 

