How to get data out of ODK by writing code
Data Engineering is, unsurprisingly, all about building data systems. Our goal is to take data - whether that's primary data (fresh, new data collected for the project) or secondary data (existing data that we can re-purpose) - and get it into a sensible place and format that's most useful for the project team.
This post is about handling primary data; specifically, data collected through a tool called the Open Data Kit (ODK). Most of our current partners use ODK for data collection, and it’s one we often recommend. For us engineers, this is great news, as anything we build to support ODK is potentially useful for lots of different projects.
In the past few years, we have developed lots of tools that can link to an ODK server and automatically process the collected data in different ways. These tools all share a common component: the way they link to ODK servers to automatically pull data. In this post, we explain how this link works and give a few examples of what it enables.
The ODK Aggregate Server
ODK allows you to have lots of people collecting data in different places - on mobile apps, through the web, even through SMS - and then have all that data aggregated together onto a server. This server could be one set up specifically for your project, or provided to you by a third-party service like KoBoToolbox or Ona.io.
Figure 1: An Infrastructure for Digital Data Collection, taken from our video on the topic
Once the data is aggregated, you can download the combined datasets directly from ODK, usually through a web page: you log into the application in your browser and export data from your forms.
This manual access is often enough, but it does come with limitations:
· You only get data when you specifically ask for it
· The format you receive is based on the data collection forms, and a good form structure is often not a good structure for storage or analysis
· If you have multiple forms, the data you get from each form will not be linked in any way
This all means that you will likely need to do some data cleaning on whatever you export from ODK. Then, if you want to import that data into another system - like a database for long-term storage, or a reporting tool like a dashboard - you will need to do that manually.
Get data automatically using a REST API
KoBoToolbox and Ona.io are both great options for the 'server' component in an ODK system (SurveyCTO and the official ODK Central are also good options, although we have less experience with those tools). One great thing about them is that they offer a REST API, which is engineering speak for "we can interact with this web service by writing and running code".
Want to know more about what a REST API actually is? This medium post by Taika Björklund, Senior Product Owner at Basware, explains REST APIs in non-technical terms.
Great – but what code?
As a simple example: imagine we have an R script that does some basic data cleaning. You run this script each time you download data from ODK. It has the following steps:
1. Import data from the downloaded Excel file;
data <- read.xlsx('/path/to/downloaded-data.xlsx', 1)  # requires the openxlsx (or xlsx) package
2. Run data cleaning (different for each dataset)
3. Export the cleaned data back into Excel
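The three steps above can be sketched as a short R script. The file paths are placeholders, and the `clean_data` function here is purely hypothetical - your own cleaning rules would go in its place:

```r
# Hypothetical cleaning step: trim whitespace in text columns
# and drop incomplete rows - real cleaning rules differ per dataset
clean_data <- function(data) {
  data[] <- lapply(data, function(x) if (is.character(x)) trimws(x) else x)
  data[complete.cases(data), ]
}

# library(openxlsx)  # assumed package providing read.xlsx / write.xlsx
# 1. Import data from the downloaded Excel file (path is a placeholder)
# data <- read.xlsx("/path/to/downloaded-data.xlsx", 1)
# 2. Run data cleaning
# cleaned <- clean_data(data)
# 3. Export the cleaned data back into Excel
# write.xlsx(cleaned, "/path/to/cleaned-data.xlsx")
```

The import and export lines are commented out because they depend on a real file on disk; the point is that the cleaning logic lives in a reusable function between the two.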
This is much more efficient than doing the cleaning manually each time. But it can be more efficient! Instead of manually downloading the data, you could send a request to the ODK server directly in your script:
1. Connect to the ODK server and download your data;
2. Run data cleaning (different for each dataset)
3. Export the cleaned data into Excel
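As a rough sketch of step 1, here is how the request might look using the httr package. The server URL, form ID, and credentials are placeholders, and the endpoint layout is an assumption based on the KoBoToolbox v2 API - check your own server's API documentation:

```r
# Placeholder account details - replace with your own
kobo_server <- "https://kf.kobotoolbox.org"
asset_id    <- "YOUR_FORM_ID"

# Build the data-export URL for one form (endpoint layout is an assumption
# based on the KoBoToolbox v2 API)
data_url <- paste0(kobo_server, "/api/v2/assets/", asset_id, "/data/?format=json")

# 1. Connect to the server and download the data (not run here)
# resp <- httr::GET(data_url, httr::authenticate("username", "password"))
# data <- httr::content(resp)$results
# 2. Run data cleaning (as before)
# 3. Export the cleaned data into Excel
```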
In the script, you would replace the placeholder values with details of your own KoBoToolbox account – including the server URL, username and password, and the ID of the form you want to download from.
Now, you don’t have to manually get the data from those forms. Whenever you want the latest cleaned data, just run the script, and R will connect to KoBoToolbox, download the data and give you a cleaned dataset in Excel.
Real World Examples
The above example, although simple, is still a real-world way in which someone could use the KoBoToolbox API in a script. That structure could be used, for example, to monitor the progress of a survey by downloading data to your local machine and running some simple data quality checks.
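Such quality checks can be very simple. For illustration, here is a hypothetical check on a toy submissions table (the column names and values are invented - real checks would run on the data frame downloaded from the API):

```r
# Toy submissions table standing in for freshly downloaded ODK data;
# enumerator names and household sizes are invented for illustration
submissions <- data.frame(
  enumerator = c("amina", "amina", "joseph", "joseph", "joseph"),
  hh_size    = c(4, NA, 6, 3, 5)
)

# Simple quality checks: submissions per enumerator,
# and a count of missing values in a key variable
per_enumerator <- table(submissions$enumerator)
n_missing      <- sum(is.na(submissions$hh_size))
```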
But the ability to pull data from ODK automatically, through a script, also enables a lot of very powerful things. Here are two examples from recent projects:
Case 1: Soils Data Platform
We maintain and support a data platform for the Soils cross-cutting project of the Collaborative Crop Research Program (CCRP). This project works with other CCRP grantees to help integrate soil health into their research and work with smallholder farmers. They have a series of ODK forms that are used by researchers to collect data about soil samples. There are currently 7 different analyses that researchers might conduct on a soil sample, and each analysis has a different ODK form.
The process of exporting data from so many forms and merging them together into a usable dataset would be very time consuming if done manually. Our data platform has scripts to pull the data automatically from all the forms and feed them into an SQL database. Researchers can then access their own data through the web platform to export a single, merged dataset that’s ready for analysis.
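The merging idea can be illustrated in miniature. The table and column names below are hypothetical stand-ins for two analysis forms; the real platform does this inside an SQL database, but the join logic is the same:

```r
# Two toy tables standing in for exports from two different analysis forms;
# the real forms share a sample identifier that makes them joinable
ph_results     <- data.frame(sample_id = c("S1", "S2"), ph  = c(6.2, 5.8))
carbon_results <- data.frame(sample_id = c("S1", "S2"), soc = c(1.4, 2.1))

# Merge the per-analysis tables into one analysis-ready dataset,
# keyed on the shared sample ID
merged <- merge(ph_results, carbon_results, by = "sample_id", all = TRUE)
```

With 7 analysis forms this becomes a chain of such merges, which is exactly the kind of repetitive work worth automating.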
Case 2: Survey Monitoring Dashboard
As part of our support to the MESH programme, we are providing survey design and management support to the Danwadaag Durable Solutions Consortium. Earlier this year, they ran their annual Local Re-integration Assessment survey, which aims to understand how displaced and recently returned groups of people cope when settling into their new (or old) homes. The survey was conducted through KoBoToolbox, and we monitored the survey using a dashboard built in R Shiny.
Figure 2: Screenshot of the LORA data collection monitoring dashboard, written in R Shiny
The dashboard we built uses the API to pull data, and the user can trigger this – ensuring that they are seeing the most recent submissions.
We could just monitor directly on KoBoToolbox, but this separate dashboard has a clear advantage:
- We can generate any summary statistics, filtered any way we like, using R scripts
The ability to automatically pull data from an ODK server into other places opens up a lot of interesting possibilities for projects that use ODK. Automating tasks like pulling data can save a lot of time in large projects, and can also reduce the errors that creep in when things are done manually. It does require some technical skills to write the code. But if you're already doing analysis in R, you'll have the skills required to query an API and get the data straight from your ODK server, instead of downloading to Excel first!
Finally, from the perspective of an engineer managing many projects, the fact that pulling data from the API is similar across many different ODK services is a great boon. It means we can use similar scripts in different projects, and what we learn for one project lets us build better tools for the next project. We have learnt a huge amount in this area in the past few years, and I hope we can continue this learning long into the future!
If you’re interested in these more technical blog posts, let us know! And be sure to follow us on Github to see the code projects we share.
This post gives a very basic example of pulling data from KoBoToolbox in R. For more details on how to do that, or to pull from Ona.io instead, see this guide:
To learn about the core infrastructure that ODK uses:
Learn about how to use ODK in our multi-video series:
ODK Services and server applications with APIs:
KoBoToolbox API: https://support.kobotoolbox.org/api.html?highlight=api
Ona.io API: https://ona.io/static/docs/index.html
ODK Central: https://odkcentral.docs.apiary.io/#
Co-author: Dave Mills
Dave developed an IT & data infrastructure that allows us to close information loops and deliver tailored information to diverse users, through data collecting mobile apps. He is also responsible for the development of our eLearning portfolio and Open Educational Resources.
Co-author: Lucia Falcinelli
Lucia joined the team in October 2018 as a Data Engineer Intern. She has a Master's in Computer Engineering and a keen interest in problem-solving and taking on new challenges. Lucia intends to use her time at Stats4SD to improve her skills in areas like software development and supporting clients with apps, and hopes to find elegant and functional solutions for every client.