Over the past five years, the ResearchPro team has been responsible for building and managing the data pipeline for a large-scale clinical trial focused on the potential use of ivermectin in the fight against malaria in Africa. This endeavour led us into the intricate world of ODK survey form development for data collection. Our mission has been not only to collect data on tens of thousands of individuals, but also to ensure that the data collected is both accurate and of high quality. This, we learned, requires careful planning and execution.
In this post, we'll unveil some of our insights and best practices for optimizing data accuracy and quality through simple form design and harnessing ODK's built-in functions.
What is data accuracy?
Data accuracy refers to how well the data collected reflects the real-world information it is supposed to represent. Accurate data is free from errors, omissions, and inconsistencies. When data is accurate, it can be trusted to provide a true representation of the phenomena it describes.
What is data quality?
Data quality is a broader concept that encompasses various aspects of data, including accuracy. It also involves ensuring that all necessary data elements are present and that there are no missing values, that data is consistent within itself and across different datasets or sources, and that the data conforms to the expected formats and values.
Constraints & Warning Messages
Using constraints in ODK forms is one of the simplest and most straightforward strategies to ensure data accuracy. Constraints allow you to define rules and conditions that data collectors must adhere to when entering information. By setting constraints, you can prevent the entry of incorrect or out-of-range data, minimizing errors and enhancing the overall accuracy of the data entered. For example, you can specify that a certain field must only accept numerical values within a specific range or that a date field should fall within a certain timeframe.
In the XLSForm snippet below, you can see that we've set a constraint on the Household ID variable hhid_manual so that it only accepts 5 numeric digits, using the constraint formula regex(.,"^\d{5}$"). We set the constraint message to "Format must be 5 numeric digits", which appears if any other format is entered.
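The same pattern can be tried out against sample IDs before the form is deployed. The following is a minimal Python sketch, not part of the original form: the function name is_valid_hhid is hypothetical, and restricting \d to ASCII digits is an assumption about how household IDs are written.

```python
import re

# Mirrors the XLSForm constraint regex(., "^\d{5}$"): exactly five digits.
# re.ASCII limits \d to 0-9 (an assumption about how household IDs are written).
HHID_PATTERN = re.compile(r"^\d{5}$", re.ASCII)


def is_valid_hhid(value: str) -> bool:
    """Return True when value consists of exactly five numeric digits."""
    return HHID_PATTERN.fullmatch(value) is not None
```

Calling is_valid_hhid on a handful of good and bad IDs is a quick way to confirm the pattern rejects short, long, and non-numeric entries as intended.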
Setting acceptable ranges can also be very helpful in ensuring accurate data. In one of our early forms used for demographic data collection, we had not set any constraints on the participant weight field. This resulted in fieldworkers entering obviously erroneous values like 1kg and 200kg. We were able to detect and fix these values in the dataset once the data had been collected, but it resulted in hours of extra work (fieldworkers having to return to households to re-weigh participants, data managers manually correcting the database, etc.). In subsequent ODK forms, we added constraints for acceptable weight ranges, which helped reduce erroneous data entry.
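A range rule like this can be checked against known bad values before it goes into a form. The sketch below is illustrative only: the bounds of 2kg and 150kg are hypothetical, since the limits we actually used are not given here, and should be adapted to the study population. In an XLSForm constraint column, a rule of this kind would typically be written along the lines of . >= 2 and . <= 150.

```python
# Hypothetical plausibility bounds in kilograms. These are placeholders for
# illustration, not the limits used in the study.
MIN_WEIGHT_KG = 2.0
MAX_WEIGHT_KG = 150.0


def weight_within_range(weight_kg: float) -> bool:
    """Return True when the weight falls inside the accepted range."""
    return MIN_WEIGHT_KG <= weight_kg <= MAX_WEIGHT_KG
```

With these placeholder bounds, both of the erroneous entries mentioned above (1kg and 200kg) would be rejected at the point of entry.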
In cases where constraints might be too restrictive, adding warning notes that appear when a seemingly anomalous value is entered can be effective. This can be done by adding a note type field that is shown conditionally, based on a certain response or combination of responses (see example here).
Implementing an in-form anomaly detection system
Ensuring the correct weight was entered for participants was crucial in the study mentioned above because it determined the dosage of the study drug that individuals received. Aside from adding constraints to prevent obviously erroneous values from being entered, we also added a few other checks to detect potentially incorrect values within the ODK form.
First, we compared the weight entered with the participant's age. If a participant was under 12 years old with a weight over 50kg, or was 12 or older with a weight under 30kg, a warning message was displayed to flag the likely anomaly. This prompted the data collector to double-check the value entered.
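The age and weight rule can be expressed as a small function, which is handy for testing the thresholds against real or simulated records. This is a sketch under stated assumptions: the function name and the message wording are hypothetical, and only the thresholds (12 years, 50kg, 30kg) come from the rule described above.

```python
from typing import Optional


def age_weight_warning(age_years: float, weight_kg: float) -> Optional[str]:
    """Return a warning message for an unlikely age/weight pair, else None."""
    if age_years < 12 and weight_kg > 50:
        # Hypothetical message text, for illustration only.
        return "Weight seems high for a child under 12. Please double-check."
    if age_years >= 12 and weight_kg < 30:
        return "Weight seems low for someone aged 12 or older. Please double-check."
    return None
```

Returning None for plausible pairs mirrors how a conditional note in the form stays hidden unless its condition is met.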
We also added a calculation for BMI. If, based on the height and weight entered, a person's BMI was under 16 or over 35 (outside normal ranges), a warning message was displayed to alert the data collector to verify both the height and weight entered.
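For reference, BMI is weight in kilograms divided by the square of height in metres. The sketch below shows the check in Python, assuming height is recorded in centimetres (an assumption; the unit is not stated here) and using hypothetical function names.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Compute BMI as weight (kg) divided by height (m) squared."""
    if height_cm <= 0:
        raise ValueError("height_cm must be positive")
    height_m = height_cm / 100.0  # assumes height was entered in centimetres
    return weight_kg / (height_m ** 2)


def bmi_needs_review(weight_kg: float, height_cm: float) -> bool:
    """Return True when BMI is under 16 or over 35, the thresholds described above."""
    value = bmi(weight_kg, height_cm)
    return value < 16 or value > 35
```

Because an implausible BMI can come from either measurement, the flag is best paired with a message asking the data collector to re-check both height and weight.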
Designing an in-form anomaly detection system can help prevent erroneous data from being entered and save a lot of time that would otherwise be spent cleaning the data after collection.
Using text formatting and optimizing the user interface
You'll notice in the code snippet above that the warning messages are wrapped in an HTML span with inline styling:
<span style="color:#3371FF">...</span>
This prints the messages in bright blue so that they stand out on the screen. We can't overstate the usefulness of stylistic devices - coloured text, bold text, and other page formatting tools - as a way to support data collectors in entering accurate information.
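If form labels are assembled in a script rather than typed by hand (an assumption; we are not describing our own workflow here), a small helper keeps the styling consistent across messages. The function name highlight is hypothetical, and the default colour is the blue used in the span above.

```python
def highlight(message: str, color: str = "#3371FF") -> str:
    """Wrap a message in a span with an inline colour style for use in a label."""
    return f'<span style="color:{color}">{message}</span>'
```

For example, highlight("Please verify the weight entered") produces a label string that can be pasted into the label column of a note field.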
Thoughtful design of the screen layout and question presentation is another key element in data collection efficiency and accuracy. Overloading a single screen with too many questions can overwhelm data collectors, potentially leading to missed notes or rushed responses. Conversely, not using any "field-lists" and showing just one question per screen can create the perception of a lengthy and disjointed questionnaire. Striking the right balance involves grouping logically related questions, making sure important notes or instructions are not overlooked at the bottom of the screen, and ensuring that the number of questions on each screen doesn't appear daunting. Designing ODK forms with the user experience (i.e., that of the data collector) in mind optimizes data collection by promoting clarity, engagement, and accurate responses.
Data verification screens
An effective way to let data collectors verify the information they have entered is to include a recap or verification screen that displays it. Below is a screenshot of a screen with a (fake) participant's basic demographic data for the fieldworker to review before proceeding. Notice the use of coloured text to help the important information stand out.
Using images (and other media) to verify responses
Using ODK's ability to include images in choice lists is another useful way to validate responses and enhance data accuracy and reliability. Visual aids can provide concrete reference points and context, ensuring that data collectors interpret and record information accurately. In our study, we used images of malaria rapid diagnostic tests (RDTs) to enable fieldworkers to confirm their responses based on visual evidence. Rather than simply selecting Valid or Invalid, the data collectors were shown an example image of a valid and an invalid test, so they could match the physical RDT result with the image in the questionnaire.
In addition to images, videos and audio files can also be added as choices (learn how to include media in your choices here).
The importance of collecting accurate and quality data cannot be overstated. Using the simple methods discussed above, our team has not only minimized errors but also liberated valuable time that would have otherwise been spent wrestling with data cleaning.
In the following post, we will talk about other helpful operational tools and strategies to enhance data accuracy and quality.