The following article is part of a series on Python for data engineering aimed at helping data engineers, data scientists, data analysts, machine learning engineers, and others who are new to Python master the basics. To date, this beginner's guide consists of:
- Part 1: Python Packages: a Primer for Data People (part 1 of 2), explored the basics of Python modules, Python packages and how to import modules into your own projects.
- Part 2: Python Packages: a Primer for Data People (part 2 of 2), covered dependency management and virtual environments.
- Part 3: Best Practices in Structuring Python Projects, covered 9 best practices and examples on structuring your projects.
- Part 4: From Python Projects to Dagster Pipelines, we explore setting up a Dagster project, and the key concept of Data Assets.
- Part 5: Environment Variables in Python, we cover the importance of environment variables and how to use them.
- Part 6: Type Hinting, or how type hints reduce errors.
- Part 7: Factory Patterns, or learning design patterns, which are reusable solutions to common problems in software design.
- Part 8: Write-Audit-Publish in data pipelines, a design pattern frequently used in ETL to ensure data quality and reliability.
- Part 9: CI/CD and Data Pipeline Automation (with Git), learn to automate data pipelines and deployments with Git.
- Part 10: High-performance Python for Data Engineering, learn how to code data pipelines in Python for performance.
- Part 11: Breaking Packages in Python, in which we explore the sharp edges of Python’s system of imports, modules, and packages.
Sign up for our newsletter to get all the updates! And if you enjoyed this guide, check out our data engineering glossary, complete with Python code examples.
In our blog post series, we’ve tackled fundamental Python coding concepts and practices. From diving deep into project best practices to the intricacies of virtual environments, we've recognized the significance of structured development. We've explored environment variables and mastered the art of the factory pattern, ensuring our projects are both adaptable and scalable.
As we look towards developing data pipelines and pushing them to production, there are still several questions left unanswered. How do we ensure our evolving codebase integrates seamlessly? And since we will be doing this frequently, often as part of a larger team, how do we automate the process?
If you are new to this process, you might wonder how to keep track of changes, collaborate with teammates, or automate repetitive tasks. Data engineers use the concept of CI/CD to automate the testing, integration, and deployment of data pipelines. Many use tools like Git, GitHub Actions, Bitbucket Pipelines, and Buildkite Pipelines to streamline tasks, reduce human errors, and ensure data pipeline reliability.
What is CI/CD? A definition in data engineering
Continuous Integration and Continuous Deployment (CI/CD) ensure that changes and updates to data pipelines are automatically tested, integrated, and deployed to production, facilitating consistent and reliable data processing and delivery. In data engineering, this involves automating the testing of new ETL code, validating data schemas, monitoring data quality, detecting anomalies, deploying updated data models to production, and ensuring that databases or data warehouses are correctly configured.
In this part of the series, we’ll share how you can implement CI/CD through tools such as Git, a GitHub repo, and GitHub Actions.
Table of contents
- What is CI/CD?
- How data engineers use Git
- CI/CD, Git, and data engineering operations
- Git best practices
What is CI/CD?
A CI/CD pipeline is a concept central to software development. It spans a whole field of processes, testing methods, and tooling, all facilitated by the Git code versioning process.
Since the terms “CI/CD pipeline” and “data pipeline” are easily confused, we will simply refer to the CI/CD process here.
Imagine you're building a toy train track (your data pipeline). Every time you add a new piece (a code change), you want to ensure it fits perfectly and doesn't derail the train (break the pipeline).
Continuous integration: Every time you add a new track piece, you immediately test it by running the toy train (data) through it. This ensures that your new addition didn't introduce any problems. If there's an issue, you know instantly and can fix it before it becomes a bigger problem.
Continuous deployment: Once you've confirmed that your new piece fits and the train runs smoothly, you don't wait to show it off. You immediately let everyone see and use the updated track. In other words, as soon as your changes are verified, they're made live and functional in the main track (production environment).
In technical terms, CI/CD automates this process:
- CI checks and tests every new piece of code (or data transformation logic) you add to your data pipeline.
- CD ensures that once tested and approved, this code gets added to the live system without manual intervention.
For a data engineer, this means faster, more reliable updates to data processes, ensuring high-quality data is delivered consistently. And if there's ever an issue, it's caught and fixed swiftly.
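To make the CI half of this concrete, here is a minimal sketch of the kind of unit test a CI job might run on every push. The file name, the `clean_orders` transformation, and its column names are hypothetical; the test would be picked up by a runner such as pytest.

```python
# test_transformations.py -- a hypothetical test module that a CI job could run with pytest
import pandas as pd


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing order IDs and cast the amount column to float."""
    df = df.dropna(subset=["order_id"])
    df["amount"] = df["amount"].astype(float)
    return df


def test_clean_orders_drops_missing_ids_and_casts_amounts():
    raw = pd.DataFrame({"order_id": [1, None, 3], "amount": ["10", "20", "30"]})
    cleaned = clean_orders(raw)
    assert cleaned["order_id"].notna().all()
    assert cleaned["amount"].dtype == float
```

In a real project, the transformation would be imported from the pipeline package rather than defined inside the test file; the point is simply that every push gets the same automated check.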
CI/CD in data pipelines
CI/CD, in the context of data pipeline deployment, focuses on automating data operations and transformations.
This merges development, testing, and operational workflows into a unified, automated process, ensuring that data assets are consistently high quality and that data infrastructure evolves smoothly, even at scale.
Using CI/CD for data pipeline automation has become increasingly critical to maintaining development velocity across processes such as training machine learning models, supporting data science teams, running large-scale data analysis, business intelligence, and data visualization, and keeping up with the growth of unstructured data collection. For example, as organizations adopt a data mesh approach, a more structured and trackable deployment process becomes even more important.
Continuous integration and continuous deployment both have a set of characteristics that we need to understand to design an effective process:
Continuous Integration (CI) in data pipelines
- Automated Testing: Automated tests check the integrity and quality of data transformations, ensuring that data is processed as expected and any error is spotted early.
- Version Control: Data pipeline code (e.g., SQL scripts, Python transformations) is stored in repositories like Git, allowing tracking and managing changes.
- Consistent Environment: CI tools can run tests in environments that mirror production, ensuring that differences in configuration or dependencies don't introduce errors.
- Data Quality Checks: These might include checks for null values, data range violations, data type mismatches, or other custom quality rules.
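To illustrate the last point, here is a minimal sketch of what such quality checks could look like in plain Python. The column names and file path are made up, and in practice many teams use dedicated frameworks (for example Great Expectations or dbt tests) for the same purpose.

```python
# data_quality_checks.py -- a hypothetical script a CI job could run against sample data
import pandas as pd


def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    if df["customer_id"].isna().any():
        failures.append("customer_id contains null values")
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")
    if not pd.api.types.is_datetime64_any_dtype(df["order_date"]):
        failures.append("order_date is not a datetime column")
    return failures


if __name__ == "__main__":
    # Hypothetical sample file produced by the pipeline under test
    sample = pd.read_csv("transformed_orders_sample.csv", parse_dates=["order_date"])
    problems = run_quality_checks(sample)
    if problems:
        raise SystemExit("Data quality checks failed: " + "; ".join(problems))
```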
Continuous Deployment (CD) in data pipelines
- Automated Deployment: Once code changes pass all CI checks, CD tools can automate their deployment to production, ensuring seamless data flow.
- Monitoring and Alerts: Once deployed, monitoring tools keep track of the data pipeline's performance, data quality, and any potential issues. Automated alerts can notify the team of discrepancies.
- Rollbacks: If an issue is identified post-deployment, CD processes allow for quick rollbacks to a previously stable state of the data pipeline (see the sketch after this list).
- Infrastructure as Code (IaC): Many CD tools support IaC. For example, cloud resources such as storage or compute can be provisioned automatically as part of the deployment process.
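As one hedged example of the monitoring and rollback points above, a CD step could run a small post-deployment smoke check and fail loudly, letting the pipeline halt the rollout or revert to the previous version. The environment variable, table name, and threshold below are hypothetical.

```python
# smoke_check.py -- a hypothetical post-deployment check run by a CD step
import os

from sqlalchemy import create_engine, text

MIN_EXPECTED_ROWS = 1_000  # arbitrary threshold for this sketch


def main() -> None:
    # WAREHOUSE_URL would typically be injected as a CI/CD secret
    engine = create_engine(os.environ["WAREHOUSE_URL"])
    with engine.connect() as conn:
        row_count = conn.execute(text("SELECT COUNT(*) FROM daily_orders")).scalar()
    if row_count is None or row_count < MIN_EXPECTED_ROWS:
        # A non-zero exit code signals the CD tool to stop or roll back
        raise SystemExit(f"Smoke check failed: daily_orders has {row_count} rows")
    print(f"Smoke check passed: daily_orders has {row_count} rows")


if __name__ == "__main__":
    main()
```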
While CI/CD is a concept, there are various tools and frameworks developed to implement and support CI/CD practices, such as Jenkins, GitLab CI/CD, Travis CI, CircleCI, and many others.
How data engineers use Git
Now that we understand the value of CI/CD in the context of data pipeline deployments, let’s take a look at how to use Git to facilitate this process.
When most people think of Git, they think of version control—a way to track code changes, collaborate with others, and merge different code branches.
But pushing to Git can mean a lot more than just saving the latest version of a script. It can be synonymous with deployment, especially when integrated with tools like GitHub Actions.
What is Git?
Git is a distributed version control system that facilitates collaborative software development by tracking changes across multiple contributors. For data engineers, it's indispensable for managing complex data projects, offering the ability to maintain multiple versions of data processing scripts and pipelines. Git can be paired with data orchestration tools and integrated into CI/CD workflows, providing the benefits of streamlined deployment and consistency in data engineering tasks.
ETL pipelines
ETL (Extract, Transform, Load) pipelines are at the heart of data engineering. They're the processes that pull data from sources (databases, APIs, etc.), transform it into a usable format, and then load it into a destination such as a database or data warehouse (see the sketch after this list). When you push an ETL script to Git, you're not just saving the code—you could be triggering a series of events:
- Testing: Automated tests are first run to ensure the new code doesn't break anything.
- Deployment: Once tests pass, the ETL process can be automatically deployed to a staging or production environment.
- Notifications: If any part of the process fails, or if it's successfully completed, notifications can be sent out.
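To ground this, here is a deliberately small sketch of the kind of ETL script that might live in such a repository. The API endpoint, environment variable, and table name are placeholders.

```python
# etl_orders.py -- a hypothetical, minimal ETL script versioned in Git
import os

import pandas as pd
import requests
from sqlalchemy import create_engine


def extract() -> pd.DataFrame:
    # Pull raw records from a (hypothetical) source API
    response = requests.get("https://api.example.com/orders", timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop incomplete rows and normalize the date column
    df = df.dropna(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"])
    return df


def load(df: pd.DataFrame) -> None:
    # Append the cleaned data to a warehouse table
    engine = create_engine(os.environ["WAREHOUSE_URL"])
    df.to_sql("orders", engine, if_exists="append", index=False)


if __name__ == "__main__":
    load(transform(extract()))
```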
Data pipeline deployment
While Git's primary function is version control, its integration with CI/CD solutions like GitHub Actions makes it a powerful deployment tool. By setting up specific "actions" or "workflows", data engineers can automate the deployment of their pipelines. This means that when we push code to a Git repository, it can automatically be deployed to a production environment, provided it passes all the set criteria.
This approach brings production-level engineering to data operations. It ensures that data pipelines are robust, reliable, and continuously monitored. It also means that data engineers can focus on writing and optimizing their code, knowing that the deployment process is automated and in safe hands.
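What the deployment step actually does depends on your platform. As one hedged illustration, a workflow could call a small script like the following after tests pass to publish the pipeline code to object storage; the bucket name and file list are hypothetical, and many teams use their orchestrator's own deployment tooling instead.

```python
# deploy_to_s3.py -- a hypothetical deployment step a CI/CD workflow could invoke
import boto3

BUCKET = "my-data-pipelines"                    # hypothetical bucket
FILES = ["etl_orders.py", "requirements.txt"]   # hypothetical artifacts


def main() -> None:
    s3 = boto3.client("s3")
    for path in FILES:
        # Upload each artifact under a "latest release" prefix
        s3.upload_file(path, BUCKET, f"releases/latest/{path}")
        print(f"Uploaded {path} to s3://{BUCKET}/releases/latest/{path}")


if __name__ == "__main__":
    main()
```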
CI/CD, Git, and data engineering operations
Data professionals integrate Git and CI/CD into their workflows to automate repetitive tasks, ensure data quality, and focus on optimizing data pipelines. Here are some common workflows that you may have encountered:
Data validation
Every data engineer knows about “garbage in, garbage out.” Incoming data can always introduce anomalies or inconsistencies. CI/CD tools can execute scripts that validate the integrity and quality of new data before it enters the pipeline. This proactive approach minimizes the risk of downstream issues and maintains the trustworthiness of the data ecosystem.
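As a minimal sketch, a validation step might compare incoming files against an expected schema before they are allowed into the pipeline. The schema, column names, and path below are hypothetical.

```python
# validate_incoming.py -- a hypothetical schema check a CI/CD job could run on new data
import pandas as pd

EXPECTED_SCHEMA = {
    "order_id": "int64",
    "customer_id": "int64",
    "amount": "float64",
}


def validate(df: pd.DataFrame) -> None:
    """Raise a ValueError if columns are missing or have unexpected dtypes."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if str(df[column].dtype) != expected_dtype:
            raise ValueError(f"{column}: expected {expected_dtype}, got {df[column].dtype}")


if __name__ == "__main__":
    validate(pd.read_csv("incoming/orders.csv"))  # hypothetical path
    print("Incoming data matches the expected schema")
```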
Scheduled data jobs
Certain analytical tasks, like aggregating metrics or updating summary tables, don't need to be executed on-the-fly. Instead, they can be scheduled to run at specific intervals, optimizing resource usage. With the scheduling features of CI/CD tools, data engineers can seamlessly integrate these tasks into their Git repositories.
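For example, a cron-style trigger in the CI/CD tool could run a small aggregation script each night. The sketch below assumes a hypothetical WAREHOUSE_URL environment variable and table names.

```python
# aggregate_daily_metrics.py -- a hypothetical nightly job triggered by a CI/CD scheduler
import os

import pandas as pd
from sqlalchemy import create_engine


def main() -> None:
    engine = create_engine(os.environ["WAREHOUSE_URL"])
    orders = pd.read_sql("SELECT order_date, amount FROM orders", engine)
    orders["order_date"] = pd.to_datetime(orders["order_date"])

    # Sum order amounts per calendar day
    daily = (
        orders.groupby(orders["order_date"].dt.date)["amount"]
        .sum()
        .reset_index(name="total_amount")
    )
    daily.to_sql("daily_order_totals", engine, if_exists="replace", index=False)


if __name__ == "__main__":
    main()
```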
Catching anomalies and failures
The complexity of data pipelines means that even with the best precautions, things can go awry. Using CI/CD tools, data engineers can set up workflows that automatically send notifications to platforms like Slack or email whenever specific events occur. This integration ensures that any disruptions in the data flow are quickly communicated, allowing teams to address issues swiftly and maintain the integrity of the data pipeline.
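As a minimal sketch, a failure step in the workflow could call a script like the following, which posts a message to a Slack incoming webhook; the webhook URL would typically be stored as a CI/CD secret.

```python
# notify_failure.py -- a hypothetical notification step run when a pipeline job fails
import os
import sys

import requests


def notify(message: str) -> None:
    # SLACK_WEBHOOK_URL would be injected by the CI/CD tool as a secret
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    response = requests.post(webhook_url, json={"text": message}, timeout=10)
    response.raise_for_status()


if __name__ == "__main__":
    notify(sys.argv[1] if len(sys.argv) > 1 else "Data pipeline run failed")
```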
A novel workflow
In traditional software development, the use of branch deployments has long been a staple to ensure that new features, bug fixes, or code refactors are developed and tested in isolation before merging them into the main codebase.
By contrast, data engineering has typically involved separate stages of development, testing, and deployment, often with manual interventions or handoffs between stages.
However, traditional software development practices don't always neatly translate to data workflows. Enigma has discussed how the conventional 'dev-stage-prod' pattern may not be the optimal approach for data pipelines.
Instead, a branch deployment can be seen as a branch of the data platform itself: you can preview, validate, and iterate on changes without impacting the production environment or overwriting existing testing setups.
This shift in perspective highlights the need for tools and practices tailored specifically for data engineering. Enter modern CI/CD tools such as the ones mentioned earlier.
Instead of rigidly adhering to the 'dev-stage-prod' paradigm, data engineers can leverage these CI/CD solutions to create dynamic, ephemeral environments on demand, ensuring that each data transformation or pipeline change is tested in an environment that closely mirrors production.
But how does this actually work in practice?
CI/CD and ephemeral environments
When an engineer creates a new feature branch in Git, CI/CD tools can be set up to listen for this specific event (i.e., the creation of the branch). Through defined workflows, they can communicate with the APIs or SDKs of platforms like AWS, Azure, or GCP.
This means that if your data engineering workflow requires resources such as an Amazon Redshift cluster, a Microsoft Azure Data Lake, or a Google Cloud Dataflow job, these CI/CD tools can automate their provisioning.
Upon detecting the branch creation event, the CI/CD system can trigger a predefined workflow that automates the process of setting up an ephemeral environment.
What are ephemeral environments?
Ephemeral dev/test environments are temporary environments that are spun up for development, testing, or experimentation and torn down afterwards. They are used in data engineering workflows to test data pipelines, transformations, and integrations in a safe, isolated manner before deploying to production.
The entire process, from setting up the necessary configurations and seeding data, to ensuring the right permissions, can be automated, ensuring that the environment is ready for testing in a matter of minutes.
“Ephemeral” means short-lived, and once testing is completed, it’s crucial to tear down the resources to avoid incurring costs or leaving unused resources running. CI/CD can be set up to automatically de-provision resources too.
Thus, CI/CD acts as the automation bridge between the act of branching in Git and the provisioning of resources required for the ephemeral environment.
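How this looks in practice depends heavily on your stack. As one hedged sketch, the workflow could call a small script that creates (and later drops) a per-branch schema in a Postgres-compatible warehouse, deriving the schema name from the branch name the CI/CD tool exposes.

```python
# branch_environment.py -- a hypothetical per-branch schema provisioner for ephemeral environments
import os
import re

from sqlalchemy import create_engine, text


def schema_name_for(branch: str) -> str:
    # Normalize the branch name into a safe schema name, e.g. feature/foo -> branch_feature_foo
    return "branch_" + re.sub(r"[^a-z0-9_]", "_", branch.lower())


def provision(branch: str) -> None:
    engine = create_engine(os.environ["WAREHOUSE_URL"])  # hypothetical connection secret
    with engine.begin() as conn:
        conn.execute(text(f"CREATE SCHEMA IF NOT EXISTS {schema_name_for(branch)}"))


def teardown(branch: str) -> None:
    engine = create_engine(os.environ["WAREHOUSE_URL"])
    with engine.begin() as conn:
        conn.execute(text(f"DROP SCHEMA IF EXISTS {schema_name_for(branch)} CASCADE"))
```

The same idea extends to heavier resources such as clusters, buckets, or jobs, usually via IaC tooling rather than hand-written scripts.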
Git best practices
For data engineers, adopting best practices in Git not only ensures the integrity of our data pipelines but also fosters collaboration and efficiency. Here are six Git habits that every data engineer should adopt:
- Handling large data files: While some teams historically used Git Large File Storage (LFS) to manage and version large datasets, there's a growing consensus that data should be kept separate from code repositories. Modern practices often recommend versioning cloud storage buckets or using dedicated data versioning tools.
- Pull requests: The heart of collaboration. Use pull requests to propose changes, solicit feedback, and ensure code quality before merging.
- Code reviews: Foster a culture of reviewing code. It's not just about catching errors but also about sharing knowledge and ensuring consistent coding standards.
- Commit often: It's easier to merge smaller, frequent changes than large, infrequent ones. Aim for atomic commits, where each commit represents a single logical change.
- Commit with clear messages: Write clear, concise messages that explain the "why" behind your changes, not just the "what".
- Branch deployments: These automatically create staging or temporary environments based on the code in a specific branch of a Git repository. This allows data professionals (like data scientists) to test, preview, and validate the changes made in that branch in an isolated environment before merging them into the main or production branch.
Conclusion
Pushing to Git in the realm of data engineering isn't solely about preserving code—it's about steering the data processes of a company, ensuring quality, and delivering the benefits of trustworthy data to end-users.
Modern CI/CD tooling has been instrumental in advancing and automating data engineering operations, particularly in the context of branch deployments. It infuses the automation, rigor, and best practices of contemporary software development into the data sphere, ensuring that data pipelines are as resilient, adaptable, and reliable as their software counterparts.