Doing ETL Process with Azure Technologies

in
3 minutes min read

Introduction

Omnia processes millions of retailer’s products information each day to deliver the best prices according to their needs. To gain advantage from machine learning techniques Omnia began a new project to backup and utilize its data to improve its business analytics and explore machine learning. The project has two main objectives:

  1. Backup data (about 38 GB per day, with expected growth in the coming months);
  2. Ability to create multiple ETL (Extract, transform, load) processes to make data available for diverse purposes
    a. This project ETL process is to transform unstructured data into business report data.
     

Alongside that, we add the following technical requirements:

  1. Progress traceability and monitor the backup and the ETL;
  2. Facility to recover if something goes wrong in the process;
  3. Ability to scale the speed linearly to the amount of data;
  4. Flexibility to create multiple distinct ETL processes.

As for constraints:

  1. Internet speed from our datacenter to Azure (about 50 MB/s);
  2. Couchbase version (3.0.0), cluster size and specs – to retrieve that for backup.

Development

The first step was to choose the technology stack and since Omnia is primarily a .NET shop we investigated Azure. This seemed reasonable due to the current offers and easy team adoption.

To satisfy the backup requirement, Azure Data Lake Store (ADLS) seemed the appropriate choice, due to the facts:

  1. Unlimited storage capability;
  2. Backups with redundancy on the fly;
  3. Created based on HDFS and supports big files (talking hundreds of gigs).

 

Omnia currently has its infrastructure designed around Hangfire to allow easy scheduling and monitoring of background jobs, so the backup to ADLS was created as another Hangfire job. We used the Dataflow library to be able to have a good degree of parallelism when inserting documents into ADLS. We decided to zip them (with a 4 MB size cap) because text documents have a good compression rate, we use less bandwidth between our datacenter and azure and file sizes have an impact of Azure Function processing time (this is constraint since Azure functions have short lifespans).

 

For the ability to create multiple ETL processes with ease, Azure Functions seemed the best fit due to their serverless characteristic. Azure Functions are implementations of serverless architecture. The serverless architecture allows one to execute a business logic function without thinking about the underlying infrastructure, i.e., the functions run and live in stateless compute containers that are event-triggered. Moreover, Azure functions scale on demand, have automatic failover and business logic written in C# which promotes flexibility for business requirements. To trigger Azure functions, we selected Azure Storage Queue as it provides a reliable way to transmit information between applications. These characteristics met the non-functional requirements of scaling and availability to allow the transformation of hundreds of gigabytes of Omnia data.

 

As for the business reports requirement, Azure has SQL Data warehouse (SQL DW) seemed a possible fit due to:

  1. Ability to manage large volumes (terabytes) of information;
  2. It distributes data across 60 databases (by default) and enables more partitions;
  3. Has column store indexes – enabling fast returns for every database column;
  4. T-SQL has aggregation function to summarize data;
  5. Ability to connect with Azure Analysis services;
  6. Combability with other Azure components is easy to establish - SQL DW also has a tool to connect ADLS called PolyBase. PolyBase parses files from ADLS and map them into SQL tables.


Complete Process

The process to perform the backup and the ETL process is represented by the sequence diagram in Figure 1.

 

Backup and ETL - simplified process

Figure 1 Backup and ETL (simplified process)

 

The process starts with a Hangfire job batch that inserts data into ADLS. Every time a file is added onto ADLS, a message is inserted into unzip message queue and the Unzip function is triggered. After the file is extracted into ADLS another message is inserted on transform message queue to trigger the Transform function that splits the file into several CSV files. When all files are converted to CSV, PolyBase is then manually triggered to insert data into SQL DW.

 

This process has a central information point, ADLS, and a good scalability due to the use of Azure functions. Traceability is achieved by using the paths as the file name of the file. An important aspect to organize the ETL’s information was folder structure. Each ETL phase has a specific folder identified by a date folder structure and the file unique name. The folder structure is:

 

  1. Backups
    1. Year/Month/Day/Guid_N_Total.zip
  2. Unzip
    1. Etl/Unzip/Year/Month/Day/Guid_N_Total.json
  3. Final Files
    1. Etl/Csv/Year/Month/Day/BusinessFolder/Guid_N_Total.csv
    2. Etl/Csv/Year/Month/Day/BusinessFolder2/Guid_N_Total.csv
    3. ...

To monitor all these functions, we used Azure’s application insights and found it easy to understand if something went wrong. Although, a single monitoring system needs to be set up to enable an overall monitor of the three different resources, and an automatic verification system to assure if all the backup files were transformed and inserted.


Tips from our development process

  1. Don’t build Azure functions with more than one objective. Keep them short and precise, otherwise, they will fail and terminate midway through;
  2. Keep in mind that if the same message might be caught by multiple azure functions and your process must predict that same behaviour;
  3. Think about the process with monitoring in mind and the ability to execute specific parts;
  4. Have a correlation identifier to represent the entire ETL process. That will allow you to aggregate all the data regarding that specific run and monitor it;
  5. Use PolyBase to upload data to SQL DW instead of SISS. PolyBase was made to work with ADLS, and Microsoft as a lot of documentation on how to use it;
  6. When adding data to SQL DW, use CTAS (Create table as select) – it makes the data insertion faster and you can it to merge two tables without duplicates;
  7. To fully achieve the full SQL DW potential you need at least 60 million rows evenly distributed across all the 60 databases that it provides. Be aware of it, when developing your solutions.
  8. When creating SQL DW database study the data carefully; think how you are going to use, how you are going to partition and distributed across the database.


Conclusion

This project is still underway, but so far, we are happy with the results achieved. This process allowed us to backup and transform, 38 GB - about 10 Million Couchbase documents, in 1 hour and 40 minutes. Our prominent bottleneck lies on our on-premise infrastructure where hang fire batch jobs run. However, this runtime suits fits our current need, and it can be scalable by increasing the number of job slots available to perform the batch. The Azure functions occurred in parallel with the backup and that time is encapsulated by our thought-put into ADLS.

PolyBase’s ingestion time to SQL DW is another project itself because it’s related with other factors such as the process units chosen for the database, the data model designed and the amount of information already stored.

The next steps will be achieving a way to monitor all 3 components and assure that all backed-up data was indeed transformed and automate the PolyBase ingestion.

 

Sign up for our newsletter