Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
A/B Testing
A statistical hypothesis test for a randomized experiment with two variants, A and B, used to compare two models or strategies and determine which performs better.
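As a rough sketch, a two-sided two-proportion z-test on hypothetical conversion counts for the two variants might look like this (the counts and sample sizes below are invented for illustration):

```python
from math import sqrt, erf

# Hypothetical conversion counts for variants A and B.
conv_a, n_a = 120, 2400   # variant A: 120 conversions out of 2400 users
conv_b, n_b = 150, 2400   # variant B: 150 conversions out of 2400 users

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z={z:.2f}, p={p_value:.3f}")
```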
ACID Properties
The set of properties of database transactions intended to guarantee validity even in the event of errors or failures, encompassing Atomicity, Consistency, Isolation, and Durability.
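A minimal sketch of these guarantees in practice, using Python's built-in sqlite3 module (the accounts table and transfer amounts are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomicity: both updates succeed together or not at all.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()       # Durability: the committed transfer persists
except sqlite3.Error:
    conn.rollback()     # Consistency: on failure, no half-applied transfer remains

print(conn.execute("SELECT * FROM accounts").fetchall())
```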
API (Application Programming Interface)
A set of rules and definitions that allow different software entities to communicate with each other.
AWS Step Functions
A service that lets you coordinate AWS components, applications, and microservices using visual workflows.
Agile Methodology
An iterative approach to software development and project management that prioritizes flexibility and customer satisfaction, often used by data engineering teams to manage projects.
Alation
A machine learning data catalog that helps people find, understand, and trust their data.
Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Amazon DynamoDB
A managed NoSQL database service provided by Amazon Web Services.
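A sketch of basic reads and writes with the boto3 SDK, assuming AWS credentials are configured and a hypothetical users table with partition key user_id already exists:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("users")  # hypothetical table with partition key "user_id"

# Write a single item, then read it back by key.
table.put_item(Item={"user_id": "42", "name": "Ada", "plan": "pro"})
item = table.get_item(Key={"user_id": "42"}).get("Item")
print(item)
```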
Amazon Kinesis
A platform provided by Amazon Web Services (AWS) for collecting, processing, and analyzing real-time streaming data at scale.
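A minimal producer sketch with boto3, assuming credentials are configured and a hypothetical stream named clickstream exists:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Send one record into the stream; the partition key controls shard assignment.
kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream name
    Data=json.dumps({"user": "42", "page": "/home"}).encode("utf-8"),
    PartitionKey="user-42",
)
```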
Amazon Redshift
A fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.
Amazon Web Services (AWS)
Offers a broad set of global cloud-based products including compute, storage, databases, analytics, networking, mobile, developer tools, and more.
Annotation
The process of adding metadata or explanatory notes to data, often used in machine learning to create labeled data for training models.
Anomaly Detection
The identification of data points, events, or observations that deviate significantly from an expected pattern or from other items in a dataset; crucial in fraud detection, network security, and fault detection.
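A toy z-score check against a window of historical values, just to illustrate the idea (the latency numbers are invented):

```python
import statistics

def is_anomalous(value, history, threshold=3.0):
    """Flag a new observation whose z-score against historical values exceeds the threshold."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) / stdev > threshold

history = [102, 98, 101, 99, 103, 97, 100, 104, 96, 101]  # typical request latencies (ms)
print(is_anomalous(105, history))  # False -- within normal variation
print(is_anomalous(950, history))  # True  -- deviates sharply from the expected pattern
```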
Apache Airflow
A platform to programmatically author, schedule, and monitor workflows of tasks.
Apache Arrow
Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
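A small sketch with the pyarrow bindings showing an in-memory columnar table:

```python
import pyarrow as pa

# Build an in-memory, columnar table from plain Python lists.
table = pa.table({"city": ["NYC", "SF", "LDN"], "temp_c": [21.5, 17.0, 14.2]})

print(table.schema)                      # language-independent columnar schema
print(table.column("temp_c").to_pylist())
```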
Apache Atlas
A scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop.
Apache Camel
An open-source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
Apache Flink
A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
Apache Hadoop
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Apache Kafka
A distributed streaming platform capable of handling trillions of events a day.
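A minimal producer sketch using the kafka-python client, assuming a broker is reachable at localhost:9092 and an events topic exists (both are assumptions):

```python
from kafka import KafkaProducer

# Connect to a broker and publish one message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")   # assumed broker address
producer.send("events", b'{"user": "42", "action": "login"}')  # assumed topic name
producer.flush()  # block until the message is actually sent
```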
Apache NiFi
A tool designed to automate the flow of data between software systems.
Apache Pulsar
A highly scalable, low-latency messaging platform running on commodity hardware.
Apache Samza
A stream processing framework for running applications that process data as it is created.
Apache Spark
A fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python, and R.
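A short PySpark sketch, assuming the pyspark package is installed (a local Spark session is started implicitly):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glossary-example").getOrCreate()

# Build a small DataFrame and run a distributed aggregation on it.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```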
Apache Storm
A free and open-source distributed real-time computation system.
Append
The operation of adding new records or data items to the end of an existing dataset, database table, file, or list without altering the existing contents.
Archive
The practice of moving rarely accessed data to low-cost, long-term storage, both to reduce storage costs and to retain data for compliance.
Argo
An open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes.
Association Rule Mining
A machine learning method aimed at identifying interesting relations between variables (items or events) in large databases, frequently used for market basket analysis.
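A toy market-basket example computing support and confidence for item pairs in plain Python (the transactions are invented):

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "diapers"},
    {"milk", "bread", "diapers"},
    {"bread"},
]
n = len(transactions)

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

for (a, b), count in pair_counts.items():
    support = count / n                     # how often a and b appear together
    confidence = count / item_counts[a]     # confidence of the rule a -> b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```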
Asyncio
A Python library for asynchronous I/O. It is built around Python coroutines and provides tools to manage them and perform I/O efficiently.
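A minimal sketch of running several simulated I/O calls concurrently with asyncio:

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # Simulate a non-blocking I/O call (e.g., an HTTP request or database query).
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> None:
    # Run the three "requests" concurrently instead of sequentially.
    results = await asyncio.gather(fetch("a", 1), fetch("b", 1), fetch("c", 1))
    print(results)

asyncio.run(main())
```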
Augment
The technique of increasing the diversity of your training dataset by modifying the existing data points, often used in training deep learning models to improve model generalization.
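A toy sketch of image-style augmentation with NumPy, flipping an array and adding noise (the "image" here is just random data standing in for a real training example):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> list[np.ndarray]:
    # Produce extra training examples from one image via simple transforms.
    flipped = np.fliplr(image)                         # horizontal flip
    noisy = image + rng.normal(0, 0.05, image.shape)   # add Gaussian noise
    return [flipped, noisy]

sample = rng.random((28, 28))   # a fake grayscale image
augmented = augment(sample)
print(len(augmented), augmented[0].shape)
```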
Augmented Data Management
The use of AI and ML technologies to optimize and enhance data management tasks, improving data quality and metadata development.
Auto-materialize
The automatic execution of computations and the persistence of their results.
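In Dagster terms, this can be expressed by attaching an auto-materialize policy to an asset. A minimal sketch, assuming a Dagster release that supports AutoMaterializePolicy (newer releases may expose the same behavior through different APIs):

```python
from dagster import AutoMaterializePolicy, Definitions, asset

@asset
def raw_orders():
    # Pretend this reads fresh data from an upstream source.
    return [1, 2, 3]

# Eagerly re-materialize this asset whenever its upstream data changes.
@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def order_count(raw_orders):
    return len(raw_orders)

defs = Definitions(assets=[raw_orders, order_count])
```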
Automated Machine Learning (AutoML)
The automation of the end-to-end process of applying machine learning to real-world problems, enabling both experts and non-experts to develop ML models.
Avro
A serialization format developed within the Apache Hadoop project. It uses JSON to define data types and protocols and serializes records in a compact, fast binary format well suited to large volumes of data.
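A small sketch using the fastavro package: the schema is declared as JSON-like Python dicts and records are written in Avro's binary format:

```python
from fastavro import parse_schema, reader, writer

# The schema is plain JSON (expressed here as a Python dict).
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Serialize to a compact binary file, then read the records back.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```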
BSON (Binary JSON)
A binary-encoded serialization of JSON-like documents used to store documents and make remote procedure calls in MongoDB. BSON supports embedded documents and arrays, offering additional data types not supported by JSON.
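A quick round-trip sketch using the bson module that ships with PyMongo:

```python
import datetime

import bson  # bundled with the pymongo package

# BSON supports types plain JSON does not, such as datetimes and raw bytes.
doc = {"name": "Ada", "joined": datetime.datetime(2024, 1, 1), "avatar": b"\x89PNG"}

data = bson.encode(doc)     # bytes in BSON's binary format
print(bson.decode(data))    # back to a Python dict
```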
Backend-as-a-Service (BaaS)
A cloud computing service model that provides the middleware developers use to connect their web and mobile applications to cloud services via application programming interfaces (APIs) and software development kits (SDKs).
Backpressure
A mechanism to handle situations where data is produced faster than it can be consumed.
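A small asyncio sketch in which a bounded queue forces a fast producer to wait for a slow consumer:

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    for i in range(10):
        # put() waits when the queue is full, slowing the producer
        # down to the consumer's pace -- that pause is the backpressure.
        await queue.put(i)
        print(f"produced {i}")

async def consumer(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()
        await asyncio.sleep(0.1)  # simulate slow downstream processing
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded buffer
    consumer_task = asyncio.create_task(consumer(queue))
    await producer(queue)
    await queue.join()
    consumer_task.cancel()

asyncio.run(main())
```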
Batch Processing
The processing of data in groups, or batches, where the entire batch must be processed before any individual item in it is considered processed.
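A simple sketch of grouping records into fixed-size batches and processing each one as a unit:

```python
def batches(records, batch_size):
    """Yield successive fixed-size batches from an iterable of records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Process 10 records in batches of 4: nothing is "done" until its whole batch is.
for batch in batches(range(10), 4):
    total = sum(batch)
    print(f"processed batch {batch} -> sum={total}")
```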
Big Data
Refers to extremely large datasets that can be analyzed for patterns, trends, and associations, typically involving varied and complex structures. What constitutes 'big' is debated, but a rule of thumb is a volume of data that cannot be analyzed on a single machine.
Big Data Processing
The processing of large volumes of data in parallel, distributed computing environments to improve performance and throughput.
Big O Notation
A mathematical notation used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, primarily used to classify algorithms by how they respond to changes in input size.
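A small illustration: membership checks with a linear scan grow with input size, O(n), while hash-based lookups are roughly constant, O(1) on average:

```python
# Two ways to test membership: linear scan is O(n), set lookup is O(1) on average.
def contains_linear(items: list, target) -> bool:
    for item in items:          # visits up to n elements
        if item == target:
            return True
    return False

def contains_hashed(items: set, target) -> bool:
    return target in items      # constant-time hash lookup on average

data = list(range(1_000_000))
data_set = set(data)
print(contains_linear(data, 999_999))     # time grows with len(data)
print(contains_hashed(data_set, 999_999)) # roughly constant time
```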