Automate data copying from S3 to MongoDB…

Jyothi Lakshmi Krishnakumar
2 min read · Jul 8, 2022

This post explains how to have a Mongo database that is dynamically updated every time a new file arrives on S3.

Introduction

When we deal with a continuous flow of data from clients to run advanced analytics applications in real time, it is often difficult to isolate the operational workload. Setting up automatic, continuous replication of the data into the database you are working with can be immensely helpful for this reason.

Prerequisites

  1. MongoDB installed on your machine
  2. An AWS account
  3. AWS CLI installed and configured

Coding

Required Packages

Among all the required libraries (refer to requirements.txt), two are essential:

Boto3: the AWS SDK for Python, used to connect to AWS services (Documentation page)

PyMongo: the Python driver used to connect to and work with MongoDB (Documentation page)

pip install boto3
pip install pymongo

Set up initial variables

Connect to the AWS account using boto3 and specify the service (S3) that you want to access. Specify the name of the bucket you want to fetch from.

Connect to the Mongo database using a MongoClient instance from the PyMongo library. Specify the names of the database and collection where you want to store the data.

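import boto3
from pymongo import MongoClient

s3_client = boto3.client('s3')  # uses the credentials configured via the AWS CLI
bucket_name = 'name-of-the-bucket'
db_name = 'test-env'
coll_name = 'test-coll'
client = MongoClient('localhost', 27017)  # local MongoDB on the default port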

For DB updates when executed (not continuous)

The script for this step updates the Mongo database each time it is executed.

The program fetches the latest file from S3 and compares its contents with the data already stored. If the file contains new values, the database is updated with them; otherwise, no update occurs. This keeps the database up to date on the fly without storing redundant data.
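The update logic described above might be sketched as follows. This is not the original script, only a minimal illustration under a few assumptions: the S3 files are JSON arrays of objects, each record carries a unique "id" field (a hypothetical name), and "new" means "no document with that id exists yet". The function names latest_key and sync_latest_file are also hypothetical. The S3 client and the collection are passed in as arguments, so the same code works with the variables set up earlier.

```python
import json


def latest_key(s3_client, bucket_name):
    """Return the key of the most recently modified object, or None if the bucket is empty."""
    paginator = s3_client.get_paginator("list_objects_v2")
    newest = None
    for page in paginator.paginate(Bucket=bucket_name):
        # "Contents" is absent from the response when the bucket has no objects
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest["LastModified"]:
                newest = obj
    return newest["Key"] if newest else None


def sync_latest_file(s3_client, collection, bucket_name):
    """Insert records from the newest S3 file that are not yet in the collection.

    Returns the number of newly inserted records.
    """
    key = latest_key(s3_client, bucket_name)
    if key is None:
        return 0

    body = s3_client.get_object(Bucket=bucket_name, Key=key)["Body"].read()
    records = json.loads(body)  # assumption: the file is a JSON array of objects

    inserted = 0
    for record in records:
        # Assumption: "id" uniquely identifies a record. The upsert writes the
        # record only when no document with the same id exists, so re-running
        # the script on an unchanged file inserts nothing.
        result = collection.update_one(
            {"id": record["id"]},
            {"$setOnInsert": record},
            upsert=True,
        )
        if result.upserted_id is not None:
            inserted += 1
    return inserted
```

With the variables from the setup step, a single run would look like sync_latest_file(s3_client, client[db_name][coll_name], bucket_name). Running it a second time on the same file should report zero new records.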

For continuous DB updates…

To update the database continuously without any external trigger, for example every 5 or 10 minutes, the program must fetch the latest file from S3 at each interval and apply the same update mechanism described above.

For this purpose, we will use a Dash application to run the program at regular intervals.

Dash is a Python framework for building interactive web applications (an intro guide).
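One way to run the update on a timer with Dash is its dcc.Interval component, which fires a callback at a fixed interval. The sketch below is an assumption about how this could be wired up, not the original code: run_sync, build_app, and the component ids "sync-status" and "sync-timer" are hypothetical names, and sync_fn stands for any zero-argument function that performs one update pass and returns the number of inserted records.

```python
from datetime import datetime


def run_sync(sync_fn):
    """Run one update pass and return a short status line for display."""
    try:
        inserted = sync_fn()
        return f"{datetime.now():%H:%M:%S} inserted {inserted} new record(s)"
    except Exception as exc:
        # Report the failure instead of raising, so the next tick still runs
        return f"{datetime.now():%H:%M:%S} sync failed: {exc}"


def build_app(sync_fn, minutes=5):
    """Build a Dash app whose timer calls sync_fn every `minutes` minutes."""
    # Imported here so run_sync stays usable without Dash installed
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        html.Div(id="sync-status"),
        # dcc.Interval expects its interval in milliseconds
        dcc.Interval(id="sync-timer", interval=minutes * 60 * 1000, n_intervals=0),
    ])

    @app.callback(Output("sync-status", "children"),
                  Input("sync-timer", "n_intervals"))
    def on_tick(n_intervals):
        # Fires once on page load and then on every timer tick
        return run_sync(sync_fn)

    return app
```

To start it, build the app with your own update function and call app.run() (app.run_server() on older Dash versions). Note that dcc.Interval is driven by the browser, so the timer only ticks while the page is open in a tab.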

EndNotes

I hope this article helps you in your work. Thanks for reading. Feel free to leave any comments.
