A structured 3-month data engineering program focused on building practical, job-ready skills through a mix of core concepts, hands-on tools, and real-world projects. The curriculum covers data pipelines, ETL and ELT processes, modern data infrastructure, data modelling, orchestration, data quality, and testing. Learners gain experience with industry-relevant tools and complete an end-to-end capstone project to demonstrate their ability to design, build, and maintain data pipelines.
Duration: 3 months
Month 1: Foundations and Tools
Week 1: Introduction to Data Engineering and Ecosystem
• Day 1: Roles and Responsibilities
o What is a Data Engineer?
▪ Overview of the role within the data ecosystem.
▪ Difference from Data Scientists and Analysts. (Optional)
▪ Key deliverables: pipelines, infrastructure, and scalability.
o Core Responsibilities:
▪ Building and maintaining data pipelines.
▪ Data integration, transformation, and storage.
▪ Supporting downstream analytics and ML workflows.
o Scope in the Real World:
▪ Demand for Data Engineers in the industry.
▪ Career paths and growth opportunities.
• Day 2: Key Concepts
o What is Data?
▪ Types: Structured, Semi-structured, Unstructured.
▪ Formats: JSON, CSV, Parquet, Avro.
o Data Pipelines Overview:
▪ What are data pipelines and their role in the data ecosystem?
o ETL (Extract, Transform, Load):
▪ Why ETL is foundational to data workflows.
▪ Example: Moving data from APIs to databases.
o Hands-On Introduction:
▪ Setting up Airbyte for simple data extraction.
▪ Setting up Minio as a destination.
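Airbyte and Minio are configured through their own UIs and connectors; as a hedged stand-in for that setup, the following minimal Python sketch shows the same extract-and-land pattern. The API URL, bucket name, and credentials are placeholders, not part of the course material: the sketch pulls JSON from an HTTP API with requests and writes it into a Minio bucket through the S3-compatible boto3 client.

import json

import boto3          # S3-compatible client; Minio speaks the S3 API
import requests

# Hypothetical source API and Minio settings -- placeholders, replace with real values.
API_URL = "https://api.example.com/orders"
MINIO_ENDPOINT = "http://localhost:9000"
BUCKET = "raw-zone"

# Extract: pull raw records from the API.
response = requests.get(API_URL, timeout=30)
response.raise_for_status()
records = response.json()

# Load: write the raw JSON into Minio under a date-based key.
s3 = boto3.client(
    "s3",
    endpoint_url=MINIO_ENDPOINT,
    aws_access_key_id="minioadmin",      # default Minio dev credentials
    aws_secret_access_key="minioadmin",
)
s3.put_object(
    Bucket=BUCKET,
    Key="orders/2024-01-01/orders.json",
    Body=json.dumps(records).encode("utf-8"),
)

In the actual exercise, Airbyte performs the extraction and Minio provides the bucket; the sketch only illustrates the flow of data.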
Week 2: ETL vs ELT
• Day 1: ELT and Reverse ETL
o What is ELT?
▪ Difference in process (Transformation post-load).
▪ Use cases: Modern data platforms like Snowflake.
o Reverse ETL:
▪ Bringing transformed data back to operational systems.
▪ Examples: Sending processed data back to Salesforce or other CRMs.
o Key Differences:
▪ Use case comparisons.
▪ Cost and performance implications.
• Day 2: Tools and Implementation
o Hands-On ETL Tools:
▪ Extraction using Airbyte.
▪ Loading to Minio using Iceberg.
▪ Transformation using dbt:
▪ Building simple transformations.
▪ Writing SQL models.
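dbt models themselves are SQL files run over the loaded data. Purely to illustrate the shape of such a transformation in the program's other language, here is a hedged pandas sketch of a staging model plus a simple aggregate; the file paths and column names (orderId, orderTotal, customer_id) are invented for the example.

import pandas as pd

# Hypothetical raw extract landed earlier in the pipeline (path/columns are placeholders).
raw = pd.read_json("raw/orders.json")

# Staging-style cleanup: rename and cast, much like a dbt staging model.
stg_orders = (
    raw.rename(columns={"orderId": "order_id", "orderTotal": "order_total"})
       .assign(order_total=lambda df: df["order_total"].astype(float))
)

# A simple "mart" aggregate: revenue per customer.
fct_revenue = (
    stg_orders.groupby("customer_id", as_index=False)["order_total"].sum()
              .rename(columns={"order_total": "total_revenue"})
)
fct_revenue.to_parquet("marts/fct_revenue.parquet", index=False)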
Week 3: Data Infrastructure
• Day 1: Databases
o Relational Databases:
▪ Core concepts: Tables, indexes, primary keys, foreign keys (see the sketch after this list).
▪ Common tools: MySQL, PostgreSQL.
o NoSQL Databases:
▪ Core concepts: Key-value stores, document stores, graph databases.
▪ Use cases: MongoDB, Cassandra.
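To make the relational concepts above concrete, here is a small self-contained sketch using Python's built-in sqlite3 module (the customers/accounts tables are invented for the example): it creates two tables linked by a foreign key, adds an index, and joins them.

import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs with this pragma

# A parent table with a primary key.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")

# A child table whose foreign key references the parent.
conn.execute("""
    CREATE TABLE accounts (
        account_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        balance     REAL DEFAULT 0
    )
""")

# An index to speed up lookups by customer.
conn.execute("CREATE INDEX idx_accounts_customer ON accounts(customer_id)")

conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("INSERT INTO accounts VALUES (10, 1, 250.0)")
print(conn.execute(
    "SELECT c.name, a.balance FROM customers c JOIN accounts a USING (customer_id)"
).fetchall())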
• Day 2: Data Lakes vs Warehouses
o What is a Data Lake?
▪ Characteristics: Raw data storage, schema-on-read.
▪ Use cases and tools: Hadoop, Iceberg.
o What is a Data Warehouse?
▪ Characteristics: Structured data, schema-on-write.
▪ Use cases and tools: Redshift, Snowflake, BigQuery.
o Key Differences:
▪ Scalability, cost, and performance.
Week 4: Orchestration Tools
• Day 1: Airflow Basics
o What is Orchestration?
▪ Need for scheduling and automation.
▪ Overview of Apache Airflow.
o Core Components:
▪ DAGs (Directed Acyclic Graphs).
▪ Tasks and dependencies.
• Day 2: Hands-On with Airflow
o Setting up Airflow locally.
o Creating a simple DAG:
▪ Tasks for data extraction, transformation, and loading.
▪ Monitoring DAG runs.
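Assuming Airflow 2.x, a minimal version of the Day 2 DAG could look like the sketch below; the task bodies and schedule are placeholders rather than the course's exact exercise.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task callables -- a real pipeline would call Airbyte, dbt, etc.
def extract():
    print("extracting data")

def transform():
    print("transforming data")

def load():
    print("loading data")

with DAG(
    dag_id="simple_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",    # older Airflow versions use schedule_interval instead
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: extract -> transform -> load
    t_extract >> t_transform >> t_load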
Month 2: Data Modelling and Advanced Pipelines
Week 5: Data Modelling Concepts
• Day 1: Introduction to Data Modelling
o Banking Domain Overview:
▪ Types of data: Transactions, accounts, customers.
▪ Business requirements for analytics and reporting.
o Star Schema vs Snowflake Schema:
▪ Differences, advantages, and trade-offs.
▪ Examples for both schemas.
• Day 2: Dimension Modelling
o Fact Tables:
▪ Quantitative data (e.g., sales, transactions).
o Dimension Tables:
▪ Descriptive data (e.g., customer, product).
o ERD Tools:
▪ Creating models using Lucidchart or dbt.
Week 6: Building Data Models
• Day 1: Hands-On Data Modelling
o Building a banking data model.
o Identifying facts and dimensions.
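One hedged way to prototype this split before writing any DDL is with pandas; the raw columns below are invented banking-style fields, not a prescribed model.

import pandas as pd

# Hypothetical raw banking extract (columns are made up for the example).
raw = pd.DataFrame({
    "txn_id":        [1, 2, 3],
    "txn_ts":        ["2024-01-01", "2024-01-02", "2024-01-02"],
    "amount":        [120.0, 35.5, 990.0],
    "customer_id":   ["C1", "C2", "C1"],
    "customer_name": ["Alice", "Bob", "Alice"],
    "branch":        ["North", "South", "North"],
})

# Dimension: one row per customer, descriptive attributes only.
dim_customer = (
    raw[["customer_id", "customer_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Fact: one row per transaction, measures plus keys into the dimensions.
fact_transactions = raw[["txn_id", "txn_ts", "customer_id", "branch", "amount"]]

print(dim_customer)
print(fact_transactions)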
• Day 2: Validating Models
o Verifying relationships between tables.
o Optimizing schema for performance.
Week 7: Advanced Data Pipelines
• Day 1: Deep Dive into Extraction and Loading
o Advanced Airbyte usage:
▪ Extracting data from multiple sources (APIs, files).
o Loading to Minio with Iceberg:
▪ Creating partitions and file optimization.
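Writing true Iceberg tables is normally handled by the pipeline tooling itself; as a simpler, hedged stand-in for the partitioning idea, the sketch below writes a date-partitioned Parquet dataset with pyarrow (the path and columns are placeholders, and the same layout can be pointed at a Minio bucket via an s3:// path).

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical cleaned events (columns are placeholders).
df = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id":    [1, 2, 1],
    "amount":     [10.0, 20.0, 5.0],
})

table = pa.Table.from_pandas(df)

# Hive-style partitioning: one sub-directory per event_date value.
pq.write_to_dataset(
    table,
    root_path="lake/events",        # could be an s3:// path backed by Minio
    partition_cols=["event_date"],
)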
• Day 2: Transformation
o Using dbt for advanced SQL-based transformations.
o Using Pandas and PySpark for programmatic transformations.
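A hedged PySpark counterpart to the SQL-based transformations might look like this (the input path and column names are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform_example").getOrCreate()

# Hypothetical raw transactions landed earlier in the pipeline.
raw = spark.read.json("lake/raw/transactions/")

# Programmatic transformation: filter, derive a column, aggregate.
daily_totals = (
    raw.filter(F.col("amount") > 0)
       .withColumn("txn_date", F.to_date("txn_ts"))
       .groupBy("txn_date", "customer_id")
       .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet("lake/marts/daily_totals/")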
Week 8: Reverse ETL
• Day 1: Concepts and Tools
o Reverse ETL process and tools overview.
• Day 2: Hands-On
o Example: Syncing processed data back to a CRM system.
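Dedicated reverse ETL tools handle the sync declaratively; as a hedged illustration of the underlying idea, the sketch below pushes aggregated rows to a hypothetical CRM REST endpoint. The URL, token, and payload fields are invented for the example.

import pandas as pd
import requests

# Hypothetical mart produced by the transformation layer.
scores = pd.read_parquet("marts/customer_scores.parquet")

CRM_URL = "https://crm.example.com/api/contacts"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}      # placeholder credentials

# Reverse ETL: sync each processed record back into the operational system.
for record in scores.to_dict(orient="records"):
    resp = requests.post(CRM_URL, json=record, headers=HEADERS, timeout=30)
    resp.raise_for_status()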
Month 3: Data Quality, Testing, and Projects
Week 9: Data Quality and Testing
• Day 1: Importance of Data Quality
o Common issues: Missing data, duplicates, inconsistencies.
o Tools for quality checks.
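A hedged example of the kind of lightweight checks Day 1 refers to, using plain pandas (the file path and columns are assumptions):

import pandas as pd

df = pd.read_parquet("marts/fct_revenue.parquet")   # placeholder path

# Missing data: count nulls per column.
print(df.isna().sum())

# Duplicates: rows repeated on the business key.
print(df.duplicated(subset=["customer_id"]).sum())

# Inconsistencies: values outside an expected range.
assert (df["total_revenue"] >= 0).all(), "negative revenue found"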
• Day 2: Unit Testing for ETL
o Writing tests for pipeline steps (e.g., transformation validation).
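Day 2 can be sketched with pytest: keep the transformation in a plain function, then assert on a tiny in-memory input. The function and columns below are invented for the example; run the file with pytest.

import pandas as pd

def total_revenue_per_customer(orders: pd.DataFrame) -> pd.DataFrame:
    """Transformation under test: sum order totals per customer."""
    return (
        orders.groupby("customer_id", as_index=False)["order_total"].sum()
              .rename(columns={"order_total": "total_revenue"})
    )

def test_total_revenue_per_customer():
    orders = pd.DataFrame({
        "customer_id": ["C1", "C1", "C2"],
        "order_total": [10.0, 5.0, 7.5],
    })
    result = total_revenue_per_customer(orders)
    assert set(result["customer_id"]) == {"C1", "C2"}
    assert result.loc[result["customer_id"] == "C1", "total_revenue"].iloc[0] == 15.0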
Week 10: Capstone Project Introduction
• Day 1: Project Briefing
o Overview of telecom and banking projects.
o Generating mock data.
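Mock data can be generated with nothing more than the Python standard library; the sketch below writes a small CSV of fake banking transactions (the field names are placeholders, not a required schema).

import csv
import random
import uuid
from datetime import date, timedelta

random.seed(42)  # reproducible mock data

with open("mock_transactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["txn_id", "customer_id", "txn_date", "amount", "channel"])
    for _ in range(1000):
        writer.writerow([
            uuid.uuid4().hex,
            f"C{random.randint(1, 100):04d}",
            (date(2024, 1, 1) + timedelta(days=random.randint(0, 89))).isoformat(),
            round(random.uniform(5, 2000), 2),
            random.choice(["atm", "online", "branch", "mobile"]),
        ])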
• Day 2: Setting Up Pipelines
o Starting ETL pipelines for the chosen project.
Week 11: Project Development
• Day 1: Intermediate Steps
o Refining transformations and models.
• Day 2: Integration and Orchestration
o Setting up final Airflow DAGs.
Week 12: Project Completion and Presentation
• Day 1: Finalizing and Testing
o Data quality checks and pipeline validation.
• Day 2: Presentations
o Students present their projects.
o Feedback and suggestions for improvement.
Core Concepts Across Modules
1. ETL/ELT: Focus on real-world implementation and tool usage.
2. Data Infrastructure: Understanding databases, data lakes, and warehouses.
3. Data Modelling: Real-world schema design.
4. Orchestration: Automation with Airflow.
5. Testing and Quality: Building robust and reliable pipelines.
This detailed schedule ensures a balance between theory, hands-on practice, and project-based
learning to build job-ready skills.