Understanding Data Pipelines: Concepts, Templates, and Best Practices
By Andy Bolk | Trace3 Data Solutions Architect
Data pipelines are the circulatory system of modern digital ecosystems, moving information from its raw state to actionable insights. By orchestrating the movement, transformation, and storage of data across various stages, data pipelines turn complex business logic into repeatable, automated workflows.
Data Pipeline Concepts
A Data Pipeline run executes a workflow or a series of workflows. The activities within those workflows most commonly handle data in one of two patterns: Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT). A pipeline can also manage control flow by using loops or conditional logic.
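To make the distinction concrete, here is a minimal PySpark sketch of the two patterns, assuming a hypothetical CSV landing file and placeholder table names; the point is only where the transformation step sits relative to the load.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw source file
raw = spark.read.option("header", True).csv("Files/landing/orders.csv")

# ETL: transform in flight, then load only the shaped result
shaped = (raw
          .withColumn("order_date", F.to_date("order_date"))
          .filter(F.col("status") == "complete"))
shaped.write.mode("overwrite").saveAsTable("sales_orders")

# ELT: load the raw data as-is first, then transform inside the destination
raw.write.mode("overwrite").saveAsTable("staging_orders")
(spark.table("staging_orders")
      .withColumn("order_date", F.to_date("order_date"))
      .filter(F.col("status") == "complete")
      .write.mode("overwrite").saveAsTable("sales_orders"))
```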
Data Engineers build the pipelines to collect data from various batch and streaming sources. They then transform the data to align with internal business logic so that other team members, such as Data Scientists, Analytics Engineers, or Data Analysts, can consume and build upon it.
A popular Data Pipeline toolset comes from Microsoft. Data Factory is a cloud-based data integration service that enables the creation, orchestration, and management of Data Pipelines for ingesting, transforming, and processing data from various sources, providing a robust foundation for building complex data workflows and managing them efficiently at scale.
Microsoft Fabric includes Data Factory and provides a broader, unified platform for building and managing scalable and reliable analytics solutions. This includes capabilities beyond data processing, such as Data Science, Event Processing, Visualizations, and Copilot.
A few main concepts with respect to Data Pipelines should be understood before you build. The terms below are specific to Microsoft’s toolset yet are also fairly common across all data management platforms.
Activities
Activities are the executed tasks within a pipeline. They fall into two main categories: Data Transformation Activities and Control Flow Activities.
Data Transformation Activities
These activities transfer or transform the data.
- Copy Data – This copies data from a source location (e.g., a storage account) to a destination location (e.g., a Lakehouse).
- Notebook – This calls a Spark notebook created in Fabric (a minimal sketch of such a notebook follows this list).
- Stored Procedure – This calls a stored procedure within an Azure SQL Database or a Fabric Data Warehouse.
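As a concrete example of what a Notebook activity might call, here is a minimal sketch of a Fabric Spark notebook cell. Table and column names are placeholders, and the sketch assumes the ready-made spark session that Fabric notebooks provide.

```python
from pyspark.sql import functions as F

# Read a raw table from the Lakehouse attached to the notebook (placeholder name).
raw_orders = spark.read.table("staging_orders")

# Apply business logic: de-duplicate, derive a total, and drop bad rows.
curated = (raw_orders
           .dropDuplicates(["order_id"])
           .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
           .filter(F.col("order_total") > 0))

# Write the curated result back for downstream consumers.
curated.write.mode("overwrite").saveAsTable("curated_orders")
```

A pipeline would invoke a notebook like this through a Notebook activity, optionally passing parameter values into it.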
Control Flow Activities
These activities help orchestrate the workflow within a pipeline.
- For Each – This iterates over a collection. It could be used to loop through a directory of files (the sketch after this list shows the same pattern in plain Python for illustration).
- If Condition – This uses conditional logic and will execute one set of activities if the condition is true and another set if the condition is false.
- Filter – This excludes data, based on a condition, from subsequent activities.
- Set Variable – This sets a pipeline variable. An example would be to set the value of a file name variable as you loop through a directory of files.
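Purely for illustration, the plain-Python sketch below mirrors the For Each, If Condition, and Set Variable pattern described above. In a real pipeline these would be activities configured in the designer, and the directory path here is hypothetical.

```python
from pathlib import Path

landing_dir = Path("/landing/2024")                   # hypothetical directory of incoming files

for file_path in landing_dir.glob("*.csv"):           # For Each: iterate over a collection
    current_file = file_path.name                     # Set Variable: capture the current file name
    if file_path.stat().st_size > 0:                  # If Condition: branch on a check
        print(f"Processing {current_file}")           # true branch: downstream activities
    else:
        print(f"Skipping empty file {current_file}")  # false branch
```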
Parameters
Parameters are values used within a pipeline that can be passed in at run time. A common use for parameters is loading files to or from a file path that contains a date; expression logic configures the parameter with the necessary date value for each run.
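For example, a parameter might hold a source folder path built from the run date. In Data Factory itself this is typically done with pipeline expressions (functions such as utcnow and formatDateTime); the Python sketch below shows the same logic, with a path layout and parameter name that are assumptions.

```python
from datetime import datetime, timezone

# Build a date-based folder path to pass to a pipeline parameter (hypothetical name: SourceFolder).
run_date = datetime.now(timezone.utc)
source_folder = f"landing/orders/{run_date:%Y/%m/%d}/"

print(source_folder)  # e.g. "landing/orders/2025/01/31/"
```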
Pipeline Runs
A run is a single execution of a pipeline. Pipelines can run on demand using the Run button, or they can be scheduled by clicking the Schedule button, shown in the image below.
By clicking the View Run History button, you can view the execution history for a pipeline, whether it was on-demand or scheduled.
Viewing the pipeline run history is a great way to analyze historical run times and troubleshoot potential bottlenecks within your pipelines. You can view each activity’s execution and see details such as how long it ran and what parameter values were passed in.
Dataflows
Along with Data Pipelines, Dataflows are a part of the overall Microsoft Fabric experience. Whereas Data Pipelines can handle the 3Vs of data integration (volume, variety, velocity), Dataflows are great at quickly handling smaller datasets.
Data engineers typically perform data transformation steps by using a Spark notebook or a SQL stored procedure. Dataflows offer a low-code option to transform data from one or many sources and load that data to a destination. Many of us familiar with Power Query from Excel or Power BI have effectively already been using the engine behind Dataflows, making this a viable tool for Business Users and Data Analysts to prep their data. (Or, if you are comfortable with M, you can code your transformations in the Advanced Editor.)
A key recent enhancement to Dataflows allows for the writing of output to a Fabric Warehouse or a Fabric Lakehouse. Dataflows can be refreshed independently, and the resulting dataset/semantic model can be used by data analysts and report developers.
There are more than 150 data sources that can be used to ingest data into a Dataflow. A subset of those sources is shown in the screenshot below.
Transformation Types
Within Dataflows, there are a number of different transformations that can be applied to your data.
- Manage Columns – Specify which columns you want to carry forward, similar to the column list of a SQL SELECT statement.
- Reduce Rows – Select or exclude the top n or bottom n rows, remove duplicates, or apply filters to your queries.
- Sort – Order rows in ascending or descending order by one or more columns.
- Combine – Merge (join) or append (union) your queries.
- Transform Table – Group by, transpose, or count rows.
- Transform Column – Apply string functions, aggregates, and mathematical functions.
- Add Column – Add a new custom column or add new columns using string functions or mathematical functions.
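For readers who think in code, here is roughly how several of these transformations map to PySpark. This is only a comparison sketch with placeholder table and column names (in a Dataflow you would use the low-code editor or M), and it assumes a ready-made spark session such as a Fabric notebook provides.

```python
from pyspark.sql import functions as F

orders = spark.read.table("orders")          # placeholder tables
customers = spark.read.table("customers")

summary = (orders
           .select("order_id", "customer_id", "amount", "status")  # Manage Columns
           .filter(F.col("status") == "complete")                  # Reduce Rows: filter
           .dropDuplicates(["order_id"])                           # Reduce Rows: remove duplicates
           .join(customers, on="customer_id", how="inner")         # Combine: merge (join)
           .withColumn("amount_with_tax", F.col("amount") * 1.08)  # Add Column
           .groupBy("customer_id")                                 # Transform Table: group by
           .agg(F.sum("amount_with_tax").alias("total_spend"))
           .orderBy(F.col("total_spend").desc()))                  # Sort
```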
Pipeline Templates
If you are just beginning to develop pipelines or want a quick solution to a common pipeline pattern, there may be a template that you can use to get started. These allow you to see how Microsoft recommends performing common tasks. The current list of templates is shown in the image below.
Data Pipelines serve as indispensable tools for modern data management, enabling organizations to optimize their data processes, enhance productivity, and unlock valuable insights from their data assets. By embracing the principles, concepts, and best practices outlined in this blog, businesses can navigate the complexities of data management with confidence and efficiency.
To learn more about Data Pipelines, reference the Microsoft Fabric Data Factory documentation located here.
To learn more about Dataflows, you can reference the documentation here.
To learn more about Trace3’s approach to managing data and extracting business insights, please visit our website at trace3.com/data-and-analytics.