Dagster Data Engineering Glossary:
Data Engineering Terms Explained
Terms and Definitions You Need to Know as a Data Engineer
A/B Testing
A statistical hypothesis test for a randomized experiment with two variants, A and B, used to compare two models or strategies and determine which performs better.
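As a rough sketch, a two-sided two-proportion z-test on hypothetical conversion counts for the two variants might look like this (the counts and sample sizes below are invented for illustration):

```python
from math import sqrt, erf

# Hypothetical conversion counts for variants A and B.
conv_a, n_a = 120, 2400   # variant A: 120 conversions out of 2400 users
conv_b, n_b = 150, 2400   # variant B: 150 conversions out of 2400 users

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z={z:.2f}, p={p_value:.3f}")
```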
ACID Properties
The set of properties of database transactions intended to guarantee validity even in the event of errors or failures, encompassing Atomicity, Consistency, Isolation, and Durability.
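A minimal sketch of these guarantees in practice, using Python's built-in sqlite3 module (the accounts table and transfer amounts are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Atomicity: both updates succeed together or not at all.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()       # Durability: the committed transfer persists
except sqlite3.Error:
    conn.rollback()     # Consistency: on failure, no half-applied transfer remains

print(conn.execute("SELECT * FROM accounts").fetchall())
```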
API (Application Programming Interface)
A set of rules and definitions that allow different software entities to communicate with each other.
AWS Step Functions
A service that lets you coordinate AWS components, applications, and microservices using visual workflows.
Agile Methodology
An iterative approach to software development and project management that prioritizes flexibility and customer satisfaction, often used by data engineering teams to manage projects.
Alation
A machine learning data catalog that helps people find, understand, and trust their data.
Align
Aligning data can mean one of three things: aligning datasets, meeting business rules, or arranging data elements in memory.
Amazon DynamoDB
A managed NoSQL database service provided by Amazon Web Services.
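A sketch of basic reads and writes with the boto3 SDK, assuming AWS credentials are configured and a hypothetical users table with partition key user_id already exists:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("users")  # hypothetical table with partition key "user_id"

# Write a single item, then read it back by key.
table.put_item(Item={"user_id": "42", "name": "Ada", "plan": "pro"})
item = table.get_item(Key={"user_id": "42"}).get("Item")
print(item)
```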
Amazon Kinesis
A platform provided by Amazon Web Services (AWS) for collecting, processing, and analyzing real-time streaming data at scale.
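A minimal producer sketch with boto3, assuming credentials are configured and a hypothetical stream named clickstream exists:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Send one record into the stream; the partition key controls shard assignment.
kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream name
    Data=json.dumps({"user": "42", "page": "/home"}).encode("utf-8"),
    PartitionKey="user-42",
)
```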
Amazon Redshift
A fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.
Amazon Web Services (AWS)
Offers a broad set of global cloud-based products including compute, storage, databases, analytics, networking, mobile, developer tools, and more.
Annotation
The process of adding metadata or explanatory notes to data, often used in machine learning to create labeled data for training models.
Anomaly Detection
The identification of data points, events, or observations that deviate significantly from an expected pattern or from other items in a dataset; crucial in fraud detection, network security, and fault detection.
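A toy z-score check against a window of historical values, just to illustrate the idea (the latency numbers are invented):

```python
import statistics

def is_anomalous(value, history, threshold=3.0):
    """Flag a new observation whose z-score against historical values exceeds the threshold."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) / stdev > threshold

history = [102, 98, 101, 99, 103, 97, 100, 104, 96, 101]  # typical request latencies (ms)
print(is_anomalous(105, history))  # False -- within normal variation
print(is_anomalous(950, history))  # True  -- deviates sharply from the expected pattern
```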
Apache Airflow
A platform to programmatically author, schedule, and monitor workflows of tasks.
Apache Arrow
Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
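A small sketch with the pyarrow bindings showing an in-memory columnar table:

```python
import pyarrow as pa

# Build an in-memory, columnar table from plain Python lists.
table = pa.table({"city": ["NYC", "SF", "LDN"], "temp_c": [21.5, 17.0, 14.2]})

print(table.schema)                      # language-independent columnar schema
print(table.column("temp_c").to_pylist())
```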
Apache Atlas
A scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop.
Apache Camel
An open-source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
Apache Flink
A framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
Apache Hadoop
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Apache Kafka
A distributed streaming platform capable of handling trillions of events a day.
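A minimal producer sketch using the kafka-python client, assuming a broker is reachable at localhost:9092 and an events topic exists (both are assumptions):

```python
from kafka import KafkaProducer

# Connect to a broker and publish one message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")   # assumed broker address
producer.send("events", b'{"user": "42", "action": "login"}')  # assumed topic name
producer.flush()  # block until the message is actually sent
```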
Apache NiFi
A tool designed to automate the flow of data between software systems.
Apache Pulsar
A highly scalable, low-latency messaging platform running on commodity hardware.
Apache Samza
A stream processing framework for running applications that process data as it is created.
Apache Spark
A fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python, and R.
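A short PySpark sketch, assuming the pyspark package is installed (a local Spark session is started implicitly):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glossary-example").getOrCreate()

# Build a small DataFrame and run a distributed aggregation on it.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```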
Apache Storm
A free and open-source distributed real-time computation system.
Append
The operation of adding new records or data items to the end of an existing dataset, database table, file, or list without altering the existing contents.
Archive
The practice of moving rarely accessed data to low-cost, long-term storage, both to reduce storage costs and to retain data for compliance.
Argo
An open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes.
Association Rule Mining
A machine learning method aimed at identifying interesting relations between variables (items or events) in large databases, frequently used for market basket analysis.
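A toy market-basket example computing support and confidence for item pairs in plain Python (the transactions are invented):

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "diapers"},
    {"milk", "bread", "diapers"},
    {"bread"},
]
n = len(transactions)

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

for (a, b), count in pair_counts.items():
    support = count / n                     # how often a and b appear together
    confidence = count / item_counts[a]     # confidence of the rule a -> b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```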
Asyncio
A Python library for asynchronous I/O. It is built around Python coroutines and provides tools to manage them and perform I/O efficiently.
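A minimal sketch of running several simulated I/O calls concurrently with asyncio:

```python
import asyncio

async def fetch(name: str, delay: float) -> str:
    # Simulate a non-blocking I/O call (e.g., an HTTP request or database query).
    await asyncio.sleep(delay)
    return f"{name} done"

async def main() -> None:
    # Run the three "requests" concurrently instead of sequentially.
    results = await asyncio.gather(fetch("a", 1), fetch("b", 1), fetch("c", 1))
    print(results)

asyncio.run(main())
```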
Augment
The technique of increasing the diversity of your training dataset by modifying the existing data points, often used in training deep learning models to improve model generalization.
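A toy sketch of image-style augmentation with NumPy, flipping an array and adding noise (the "image" here is just random data standing in for a real training example):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> list[np.ndarray]:
    # Produce extra training examples from one image via simple transforms.
    flipped = np.fliplr(image)                         # horizontal flip
    noisy = image + rng.normal(0, 0.05, image.shape)   # add Gaussian noise
    return [flipped, noisy]

sample = rng.random((28, 28))   # a fake grayscale image
augmented = augment(sample)
print(len(augmented), augmented[0].shape)
```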
Augmented Data Management
The use of AI and ML technologies to optimize and enhance data management tasks, improving data quality and metadata development.
Auto-materialize
The automatic execution of computations and the persistence of their results.
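In Dagster terms, this can be expressed by attaching an auto-materialize policy to an asset. A minimal sketch, assuming a Dagster release that supports AutoMaterializePolicy (newer releases may expose the same behavior through different APIs):

```python
from dagster import AutoMaterializePolicy, Definitions, asset

@asset
def raw_orders():
    # Pretend this reads fresh data from an upstream source.
    return [1, 2, 3]

# Eagerly re-materialize this asset whenever its upstream data changes.
@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def order_count(raw_orders):
    return len(raw_orders)

defs = Definitions(assets=[raw_orders, order_count])
```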
Automated Machine Learning (AutoML)
The automation of the end-to-end process of applying machine learning to real-world problems, enabling both experts and non-experts to develop ML models.
Avro
A serialization format developed within the Apache Hadoop project. It uses JSON to define data types and protocols and serializes records in a compact, fast binary format well suited to large volumes of data.
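A small sketch using the fastavro package: the schema is declared as JSON-like Python dicts and records are written in Avro's binary format:

```python
from fastavro import parse_schema, reader, writer

# The schema is plain JSON (expressed here as a Python dict).
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Serialize to a compact binary file, then read the records back.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```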
BSON (Binary JSON)
A binary-encoded serialization of JSON-like documents used to store documents and make remote procedure calls in MongoDB. BSON supports embedded documents and arrays, offering additional data types not supported by JSON.
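A quick round-trip sketch using the bson module that ships with PyMongo:

```python
import datetime

import bson  # bundled with the pymongo package

# BSON supports types plain JSON does not, such as datetimes and raw bytes.
doc = {"name": "Ada", "joined": datetime.datetime(2024, 1, 1), "avatar": b"\x89PNG"}

data = bson.encode(doc)     # bytes in BSON's binary format
print(bson.decode(data))    # back to a Python dict
```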
Backend-as-a-Service (BaaS)
A cloud computing service model that provides the middleware developers use to connect their web and mobile applications to cloud services via application programming interfaces (APIs) and software development kits (SDKs).
Backpressure
A mechanism to handle situations where data is produced faster than it can be consumed.
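A small asyncio sketch in which a bounded queue forces a fast producer to wait for a slow consumer:

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    for i in range(10):
        # put() waits when the queue is full, slowing the producer
        # down to the consumer's pace -- that pause is the backpressure.
        await queue.put(i)
        print(f"produced {i}")

async def consumer(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()
        await asyncio.sleep(0.1)  # simulate slow downstream processing
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded buffer
    consumer_task = asyncio.create_task(consumer(queue))
    await producer(queue)
    await queue.join()
    consumer_task.cancel()

asyncio.run(main())
```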
Batch Processing
The processing of data in groups, or batches, where the entire batch must be processed before any individual item in it is considered processed.
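A simple sketch of grouping records into fixed-size batches and processing each one as a unit:

```python
def batches(records, batch_size):
    """Yield successive fixed-size batches from an iterable of records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Process 10 records in batches of 4: nothing is "done" until its whole batch is.
for batch in batches(range(10), 4):
    total = sum(batch)
    print(f"processed batch {batch} -> sum={total}")
```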
Big Data
Refers to extremely large datasets that can be analyzed for patterns, trends, and associations, typically involving varied and complex structures. What constitutes 'big' is debated, but a rule of thumb is a volume of data that cannot be analyzed on a single machine.
Big Data Processing
The processing of large volumes of data in parallel, distributed computing environments to improve performance and throughput.
Big O Notation
A mathematical notation used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, primarily used to classify algorithms by how they respond to changes in input size.
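A small illustration: membership checks with a linear scan grow with input size, O(n), while hash-based lookups are roughly constant, O(1) on average:

```python
# Two ways to test membership: linear scan is O(n), set lookup is O(1) on average.
def contains_linear(items: list, target) -> bool:
    for item in items:          # visits up to n elements
        if item == target:
            return True
    return False

def contains_hashed(items: set, target) -> bool:
    return target in items      # constant-time hash lookup on average

data = list(range(1_000_000))
data_set = set(data)
print(contains_linear(data, 999_999))     # time grows with len(data)
print(contains_hashed(data_set, 999_999)) # roughly constant time
```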