Rethinking the Data Flow Diagram
Since the beginning of the Research Methods Support project, we have been using the term “Data Flow” to encompass our view on how data should be considered. Data is vital to a research project - it’s the raw material that you collect and explore to discover the information needed to fulfil your research objectives. Yet, we regularly encounter projects where the researchers don’t really think about data until they’re about to go and collect some.
What do we mean by "Data Flow"?
When we say “Data Flow”, we don’t just mean the flow of actual data through the stages of your project, we mean the thinking about data throughout the project. This is an important distinction, because it means “Data Flow” starts at the very beginning of the project, instead of part-way through after most of the planning is complete.
That’s one of the key reasons we adopted the term - to get people thinking more broadly about their data. It’s not just “data collection” or “data management”, it’s all the issues surrounding the data. “Who owns the data?” might not seem like a vital question when you’re rushing to prepare your data collection forms, but it’s just as important to the overall project as including robust data quality checks.
In the early days of RMS, we needed a way of emphasising these different aspects of data-related thinking and how it flows through the entire project life-cycle. We organised our thinking into the following stages, mapped to the stages of a research project.
Improving the Diagram
This diagram is starting to show its age, and we’re working on a revised version to take into account some modern thinking (for example, collecting data digitally means you can skip the ‘data entry’ stage). Some of the stages need to be re-worded, and more emphasis placed on the planning phases. When it comes to digital systems, the more you can set up and properly test before moving to data collection, the easier the activity and reporting stages should be.
I think a new version of this diagram will have a different structure - hopefully one that takes into account the non-linear links between these topics. As it said on the old Statistical Services Centre website:
“In reality the process of data flow is continuous and the boundaries between the steps shown are artificial. We have divided the process to provide a framework for description and planning.”
Throughout the lifetime of RMS, we have stuck to these boundaries and they’ve been, on the whole, pretty useful. Importantly, they’ve helped groups we’ve worked with take more notice of the boxes in the planning and reporting steps. This is a huge win, as those steps are often overlooked but are vital to determining the long-term impact of a research project.
Now, as we enter the third phase of RMS and approach the 10-year mark, it’s a good time to review these fundamental concepts. The practicalities of data flow are intrinsically wrapped up in the technologies we use to collect, manage and manipulate our data. The past few years have seen major shifts in the landscape - not just towards digital data collection but also towards cloud-based systems that are accessible from anywhere and can run constantly in the background, helping you as a data manager to keep your data organised, clean and properly backed up.
These shifts require changes to the ‘doing’ of data flow - we now encourage projects to explore tools like Open Data Kit (ODK) for data collection, to think about how to store data in the cloud, and to consider how best to access that data for analysis.
They also require changes to the ‘thinking’ of data flow - but I’m not yet convinced that the core of our thinking needs to change in a drastic way. Yes, I think the diagram needs updating. The boxes need to be shuffled around a bit; maybe the emphasis shifted a bit more towards planning and testing. Yet, the principles at the centre of every box remain important, regardless of your toolkit. Are you collecting data digitally or on paper? You still need a way of ensuring your IDs are unique and consistently formatted. Are you storing your data in Excel files on your laptop, or in a JSON database on a remote server? You still need to take backups, decide who can access your data and how, and plan what you’re going to do with it at the end of the project. The methods change, but a lot of the principles remain.
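To make that ID point concrete, here is a minimal sketch of the kind of check that applies whether your data arrives from paper forms or a digital collection tool. The `HH-` prefix and four-digit pattern are hypothetical - substitute your own project’s ID convention:

```python
import re

def check_ids(ids, pattern=r"^HH-\d{4}$"):
    """Report IDs that are duplicated or don't match the expected format.

    The 'HH-0000' style pattern here is a made-up example of an ID
    convention; real projects should define and document their own.
    """
    seen = set()
    duplicates = []
    for record_id in ids:
        if record_id in seen:
            duplicates.append(record_id)
        seen.add(record_id)
    badly_formatted = [i for i in ids if not re.fullmatch(pattern, i)]
    return duplicates, badly_formatted

dups, bad = check_ids(["HH-0001", "HH-0002", "HH-0001", "hh-03"])
# dups == ["HH-0001"], bad == ["hh-03"]
```

The principle, not the script, is the point: however the check is implemented, it should run early and often, not only once the data reaches the analysis stage.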
Of course, the new technologies open new possibilities. Many things are now much easier - data quality checks at the point of collection, automated backups - but many things are also more complicated - like choosing where to put your data. So maybe the principles of Data Flow haven’t changed, but our presentation and interpretation of them must. And that, I think, is how we can start to update our model. What are the principles in that diagram above? How do they relate to modern methods? What parts require more care and attention by the people actually using these ideas in practice?
I’m not hoping to answer any of these questions here - not yet. This is more of a statement of intent. Our goal is to try and answer some of these questions over the next couple of years, update some of our old resources and create some new ones around the idea of Data Flow. I think the journey will be an interesting one, and I hope to be able to document some of it here.
Author: Dave Mills
Dave developed an IT & data infrastructure that allows us to close information loops and deliver tailored information to diverse users, through data-collecting mobile apps. He is also responsible for the development of our eLearning portfolio and Open Educational Resources.