Introduction
Omnia processes information about millions of retailers' products each day to deliver the best prices for their needs. To take advantage of machine learning techniques, Omnia began a new project to back up and utilize its data, with the goal of improving its business analytics and exploring machine learning. The project has two main objectives:
- Back up data (about 38 GB per day, with expected growth in the coming months);
- Ability to create multiple ETL (Extract, Transform, Load) processes to make data available for diverse purposes.
a. This project's ETL process transforms unstructured data into business report data.
Alongside these, we added the following technical requirements:
- Traceability and monitoring of the backup and ETL progress;
- Ease of recovery if something goes wrong in the process;
- Ability to scale processing speed linearly with the amount of data;
- Flexibility to create multiple distinct ETL processes.
As for constraints:
- Internet speed from our datacenter to Azure (about 50 MB/s);
- The Couchbase version (3.0.0), cluster size, and specs of the source cluster from which data is retrieved for backup.
Development
The first step was to choose the technology stack, and since Omnia is primarily a .NET shop we investigated Azure. This seemed reasonable given the current offerings and easy team adoption.
To satisfy the backup requirement, Azure Data Lake Store (ADLS) seemed the appropriate choice, due to the following:
- Unlimited storage capacity;
- Redundant storage of backups out of the box;
- Built on HDFS, with support for big files (hundreds of gigabytes).
Omnia's infrastructure is currently designed around Hangfire, which allows easy scheduling and monitoring of background jobs, so the backup to ADLS was created as another Hangfire job. We used the Dataflow library to achieve a good degree of parallelism when inserting documents into ADLS. We decided to zip the documents (with a 4 MB size cap per archive) because text documents compress well, it uses less bandwidth between our datacenter and Azure, and file size has an impact on Azure Function processing time (a real constraint, since Azure Functions have short lifespans).
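The batching-with-a-cap idea can be sketched as follows. This is a minimal illustration in Python, not the actual .NET Dataflow implementation, and all names are hypothetical:

```python
import io
import json
import zipfile

MAX_ZIP_BYTES = 4 * 1024 * 1024  # 4 MB cap per archive, as in the text

def batch_into_zips(documents, cap=MAX_ZIP_BYTES):
    """Group (name, dict) documents into zip archives, starting a new
    archive once the current one passes the size cap."""
    buffer = io.BytesIO()
    archive = zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED)
    count = 0
    for name, doc in documents:
        archive.writestr(f"{name}.json", json.dumps(doc))
        count += 1
        if buffer.tell() >= cap:  # cap reached: flush this archive
            archive.close()
            yield buffer.getvalue()
            buffer = io.BytesIO()
            archive = zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED)
            count = 0
    archive.close()
    if count:  # flush the final, partially filled archive
        yield buffer.getvalue()
```

In the real pipeline each yielded archive would be uploaded to ADLS; here the generator simply returns the zip bytes so the caller decides what to do with them.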
For the ability to create multiple ETL processes with ease, Azure Functions seemed the best fit due to their serverless nature. Azure Functions are an implementation of the serverless architecture, which lets one execute a business-logic function without thinking about the underlying infrastructure: the functions run in stateless, event-triggered compute containers. Moreover, Azure Functions scale on demand, provide automatic failover, and allow business logic to be written in C#, which promotes flexibility for business requirements. To trigger the functions, we selected Azure Storage Queue, as it provides a reliable way to transmit information between applications. These characteristics met the non-functional requirements of scalability and availability needed to transform hundreds of gigabytes of Omnia data.
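The event-triggered, stateless model can be illustrated with a minimal sketch. This is plain Python with an in-memory queue standing in for Azure Storage Queue; all names and paths are hypothetical, not the production code:

```python
import queue

# In-memory stand-in for an Azure Storage Queue; in production each
# message carries the ADLS path of a file to process.
unzip_queue = queue.Queue()

def unzip_handler(path):
    """Stateless handler: takes a zip file's path and returns the path
    of the extracted JSON for the next stage. No state survives calls."""
    assert path.endswith(".zip")
    # e.g. Backups/2018/05/01/guid_1_3.zip -> Etl/Unzip/2018/05/01/guid_1_3.json
    return "Etl/Unzip/" + path.split("/", 1)[1].replace(".zip", ".json")

def drain(q, handler):
    """Minimal trigger loop: invoke the handler once per queued message."""
    out = []
    while not q.empty():
        out.append(handler(q.get()))
    return out

unzip_queue.put("Backups/2018/05/01/abc_1_3.zip")
next_messages = drain(unzip_queue, unzip_handler)
```

In the real system the trigger loop is Azure's responsibility: the runtime invokes a fresh, stateless function instance per message, which is what allows the platform to scale out on demand.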
As for the business reports requirement, Azure SQL Data Warehouse (SQL DW) seemed a possible fit due to:
- Ability to manage large volumes (terabytes) of information;
- It distributes data across 60 databases (by default) and enables more partitions;
- Columnstore indexes, enabling fast queries over individual columns;
- T-SQL aggregation functions to summarize data;
- Ability to connect with Azure Analysis services;
- Compatibility with other Azure components is easy to establish; SQL DW also has a tool called PolyBase to connect to ADLS. PolyBase parses files from ADLS and maps them into SQL tables.
Complete Process
The process to perform the backup and the ETL process is represented by the sequence diagram in Figure 1.
Figure 1 Backup and ETL (simplified process)
The process starts with a Hangfire job batch that inserts data into ADLS. Every time a file is added to ADLS, a message is inserted into the unzip message queue and the Unzip function is triggered. After the file is extracted into ADLS, another message is inserted into the transform message queue to trigger the Transform function, which splits the file into several CSV files. When all files are converted to CSV, PolyBase is manually triggered to insert the data into SQL DW.
This process has a central information point, ADLS, and good scalability due to the use of Azure Functions. Traceability is achieved through the file paths and names. An important aspect of organizing the ETL's information was the folder structure: each ETL phase has a specific folder, identified by a date-based hierarchy and a unique file name. The folder structure is:
- Backups
- Year/Month/Day/Guid_N_Total.zip
- Unzip
- Etl/Unzip/Year/Month/Day/Guid_N_Total.json
- Final Files
- Etl/Csv/Year/Month/Day/BusinessFolder/Guid_N_Total.csv
- Etl/Csv/Year/Month/Day/BusinessFolder2/Guid_N_Total.csv
- ...
To monitor all these functions, we used Azure's Application Insights and found it easy to tell when something went wrong. However, a single monitoring system still needs to be set up to provide an overall view of the three different resources, along with an automatic verification system to ensure that all backup files were transformed and inserted.