Introduction
In our 3+ years of partnership with the BOHEMIA project, a clinical trial studying the efficacy of the antiparasitic drug ivermectin in fighting malaria in Africa, we have managed hundreds of ODK survey forms (and growing!).
Our team used AWS to store all survey forms as CSV files and perform downstream data transformations for analysis. As the number of files grew rapidly, we noticed that it was difficult for new users to navigate the bucket or find the data they needed. Thus, before our data lake turned into a data swamp, we needed to add more structure to our files.
This is where AWS Glue & Athena come in!
Athena is a serverless query engine that sits on top of your files in S3, which means you can run SQL queries directly against them. AWS Glue is an ETL and data catalog service that helps you define schemas on top of your S3 files and builds the components required to enable SQL-like queries across all of them.
So, what are the advantages of using this infrastructure?
File-like documents as the base of our tables
Clear data schemas, catalogs, and governance
No heavy change management needed (no pivotal system updates, no migration to a new tech stack, etc.)
Serverless: you are only charged per query, which makes this much cheaper than provisioning a new relational database
Technical Details
Fetch data from ODK to S3
We currently use ODK's Python library, pyodk, as it lets you fetch submissions directly into your environment as data frames. Once the data is fetched, convert the data frame into a Parquet file.
Parquet is a serialized, columnar format that stores the data type of each column. This is not the case for comma-separated formats like CSV, which store everything as plain text.
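As a quick illustration (a minimal sketch with throwaway example data), round-tripping the same data frame through both formats shows the difference:
import pandas as pd

# Throwaway example data (hypothetical household survey rows)
df = pd.DataFrame({
    'household_id': [101, 102],
    'visit_date': pd.to_datetime(['2023-01-15', '2023-02-20']),
    'weight_kg': [62.5, 70.1],
})

# CSV round-trip: the datetime column comes back as a plain string
df.to_csv('demo.csv', index=False)
print(pd.read_csv('demo.csv').dtypes)          # visit_date -> object

# Parquet round-trip: all dtypes are preserved (uses pyarrow under the hood)
df.to_parquet('demo.parquet', index=False)
print(pd.read_parquet('demo.parquet').dtypes)  # visit_date -> datetime64[ns]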
Here is a Python snippet to convert ODK submissions to a Parquet file and upload it to S3:
# Install Python dependencies
pip install pandas pyarrow boto3 pyodk
# Python libraries
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import boto3
# pyodk client for ODK Central
from pyodk.client import Client
# ODK variables
PROJECT_ID = 'YOUR_PROJECT_ID'
FORM_ID = 'YOUR_FORM_ID'
# AWS variables
BUCKET_NAME = 'databrew-odk-aws-test-bucket'
LOCAL_FILE_OUTPUT = 'data.parquet'
S3_KEY = 'data/data.parquet'  # add a folder prefix to create a folder-like structure
# Instantiate the boto3 S3 client
s3 = boto3.client('s3')
# Fetch the form submissions from ODK Central using pyodk
with Client() as client:
    submissions = client.submissions.get_table(
        project_id=PROJECT_ID,
        form_id=FORM_ID)

# Flatten the nested submission JSON into a data frame
df = pd.json_normalize(data=submissions['value'], sep='/')

# Write the data frame to a local Parquet file
table = pa.Table.from_pandas(df)
pq.write_table(table, LOCAL_FILE_OUTPUT)

# Upload the Parquet file to S3
with open(LOCAL_FILE_OUTPUT, 'rb') as data:
    s3.upload_fileobj(data, BUCKET_NAME, S3_KEY)
Running the script results in a Parquet file generated in S3:
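If you want to verify the upload from code rather than the console, a quick check with boto3 works too (a minimal sketch; head_object raises a ClientError if the key does not exist):
import boto3

s3 = boto3.client('s3')
# Confirm the object landed in the bucket and check its size
response = s3.head_object(Bucket='databrew-odk-aws-test-bucket', Key='data/data.parquet')
print(f"Uploaded {response['ContentLength']} bytes")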
Configure AWS Glue in the AWS console
Now let’s create the ODK database in Glue
Set up a Glue database: Go to the AWS Glue console and create a new database. This will be used to store the metadata of your tables.
Create a Glue crawler: In the Glue console, navigate to "Crawlers" and create a new crawler. Specify the data store (S3 in this case) and the location of your data within the bucket. Since the data is stored under s3://databrew-odk-aws-test-bucket/data/, you can point the S3 data source at s3://databrew-odk-aws-test-bucket/ so that the crawler also picks up new files added to the bucket later. (If you prefer doing this in code, see the boto3 sketch after this list.)
Specify crawler settings: Configure the crawler settings such as the crawl frequency, the IAM role and permissions, and the database that will store the metadata.
-> Common gotcha: since we are structuring tables as folders, set Table Level = 2 in the advanced configuration options
Run the crawler: Once configured, run the crawler. It will scan the specified S3 location, infer the schema of your data, and create metadata tables in the Glue Data Catalog. Once it succeeds, you will see the Parquet files registered as tables in the Glue database. Below is a picture of the Glue database: here you can see the location of your Parquet file, the table name, and the database it belongs to.
Check the Data Catalog: With the table created, you will be able to see all the information about it, including the schema's data types, the last update time, and more.
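If you prefer setting this up in code, here is a rough boto3 equivalent of the console steps above (a sketch, not a drop-in script: the crawler name, database name, and IAM role ARN are placeholders you would replace with your own):
import json
import boto3

glue = boto3.client('glue')

# Create the Glue database that will hold the table metadata
glue.create_database(DatabaseInput={'Name': 'odk_database'})

# Create a crawler pointed at the bucket root; the Configuration JSON is
# where the "Table Level = 2" setting lives, so each first-level folder
# (e.g. data/) becomes its own table
glue.create_crawler(
    Name='odk-crawler',
    Role='arn:aws:iam::123456789012:role/YourGlueServiceRole',  # placeholder IAM role
    DatabaseName='odk_database',
    Targets={'S3Targets': [{'Path': 's3://databrew-odk-aws-test-bucket/'}]},
    Configuration=json.dumps({
        'Version': 1.0,
        'Grouping': {'TableLevelConfiguration': 2},
    }),
)

# Run the crawler on demand
glue.start_crawler(Name='odk-crawler')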
Now, on to how you can query this table via SQL
Configure AWS Athena in the AWS console
Go to AWS Athena using the console search bar
Click on Launch Query Editor
Set an output S3 bucket: Athena stores the results of every SQL query you run in an S3 bucket, so you'll need to choose a new or existing bucket for the output results
Voila, you are now able to run SQL queries against the files you created from ODK. In this UI you can see all the tables created by the crawler, their column names and data types from the Glue Data Catalog (left side), and the results of your query (bottom)
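The query editor is not the only way in; you can also run the same queries programmatically. Below is a minimal boto3 sketch (the database, table, and output bucket names match the examples above; a real script would want more robust error handling than this naive polling loop):
import time
import boto3

athena = boto3.client('athena')

# Start the query; Athena writes the results to the output bucket you configured
execution = athena.start_query_execution(
    QueryString='SELECT * FROM data LIMIT 10',  # table name as inferred by the crawler
    QueryExecutionContext={'Database': 'odk_database'},
    ResultConfiguration={'OutputLocation': 's3://databrew-odk-aws-test-bucket/athena-results/'},
)
query_id = execution['QueryExecutionId']

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Print the first page of results
if state == 'SUCCEEDED':
    for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])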
Closing Remarks
This is a quick tutorial on how to set up a small serverless database in AWS for your ODK forms (or any other files in S3). This infrastructure does not require a heavy backend setup, and it can quickly bootstrap your team with good data practices and get your project going.
We at Databrew provide these modern data integrations and can help you through your data journey. Please check out our website to learn more about our services!