ETL (Extract, Transform, and Load) is one of the foundations of automated data-driven approaches. It is generally understood to define the process for extracting information from one or more source databases, transforming it into a different, usually application-specific form, and loading it into a target database. It is a staple of data warehousing workflows and is generally used as shorthand for any process that takes “data” as an input and creates “information” outputs for analysis.
Data Fusion is sometimes described as “ETL on steroids”, but that definition can be misleading. It is often used to paint a vision of huge torrents of data from a myriad of sources sweeping into an organization and being magically collated, aligned, and visualized, all in a way that fits the organization’s specific analytic and visualization needs. In contrast to previous eras, in which fusion fed only the organization’s own business processes, the new data fusion paradigm lets organizations create new information and find new markets for the data they produce, which other organizations can then fuse for their own purposes.
What data fusion brings to the table is the idea that end users, whether human or machine, become collaborators in the data processing loop. By iteratively combining multiple data streams in new and interesting ways, driven by the changing needs of those users, data fusion yields a wide variety of ways to aggregate and repurpose those streams.
The technical aspect of “fusion” may use traditional ETL workflows, but data fusion is much more than ETL; it represents a new way of thinking about data. These days, thanks to advances in technology and institutional willingness to release data through open APIs, the amount of data that can be fused with one’s internal data is staggering. As data is produced in ever greater volume through new platforms such as the IoT (Internet of Things) and new spins on existing platforms, the variety of schemas, or structures, in which that data is delivered to consumers has exploded as well. The same data, mapped to a different schema or structure, can serve a different set of consumers.
Data fusion is actually the business process of selecting which of the myriad available data sets – where a data set is both the content and the schema – should be combined and presented to follow-on services and users. The ETL that supports data fusion can be mapped to three function types: transformations, enrichments, and augmentations, or TEA functions.
Transformations are functions that change the form of existing input data. A simple example is combining the first name, middle name, and last name fields in source data to create a full name field that merges the three sub-fields.
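To make that concrete, here is a minimal sketch in Python; the dict-based record format and the field names are assumptions made for illustration, not a prescribed interface.

```python
# A minimal transformation sketch. The record layout and field names
# ("first_name", "middle_name", "last_name") are illustrative assumptions.

def full_name_transform(record: dict) -> dict:
    """Combine name sub-fields into a single 'full_name' field."""
    parts = [record.get("first_name"),
             record.get("middle_name"),
             record.get("last_name")]
    transformed = dict(record)  # copy so the source record is untouched
    transformed["full_name"] = " ".join(p for p in parts if p)
    return transformed

print(full_name_transform(
    {"first_name": "Ada", "middle_name": "King", "last_name": "Lovelace"}
))
# -> {..., 'full_name': 'Ada King Lovelace'}
```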
Enrichments are functions that combine existing input data with additional data sources to create new information that could not be gleaned from either source independently. For example, one could take two different lists of individuals and use pattern matching to surface relationships that are not apparent from either list on its own.
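A sketch of that kind of enrichment, using a deliberately simple matching rule (a shared email domain) as a stand-in for whatever pattern matching the business actually requires:

```python
# An illustrative enrichment: infer possible relationships between people
# on two separate lists by matching email domains. Both the lists and the
# matching rule are assumptions for the sake of the example.

from collections import defaultdict

list_a = [{"name": "R. Chen", "email": "rchen@acme.example"},
          {"name": "P. Okafor", "email": "pokafor@globex.example"}]
list_b = [{"name": "M. Silva", "email": "msilva@acme.example"}]

def enrich_relationships(a, b):
    """Return (person_a, person_b, reason) tuples visible in neither list alone."""
    by_domain = defaultdict(list)
    for person in b:
        by_domain[person["email"].split("@")[1]].append(person)
    relationships = []
    for person in a:
        domain = person["email"].split("@")[1]
        for match in by_domain.get(domain, []):
            relationships.append((person["name"], match["name"], "shared-domain"))
    return relationships

print(enrich_relationships(list_a, list_b))
# -> [('R. Chen', 'M. Silva', 'shared-domain')]
```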
Augmentations are functions that attach additional, useful data to the input data, producing a more complete set of information drawn from multiple sources. For example, a set of business entities gleaned from a conference attendee list, combined with Dun & Bradstreet profiles for those entities, yields a more complete picture of each business entity.
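A sketch of that augmentation, with an in-memory dictionary standing in for a real profile provider such as Dun & Bradstreet:

```python
# An illustrative augmentation: attach firmographic profiles to business
# entities gleaned from an attendee list. PROFILE_SERVICE is a stand-in;
# in practice this would be a lookup against an external data provider.

PROFILE_SERVICE = {
    "Acme Corp": {"employees": 1200, "industry": "Manufacturing"},
}

def augment_entities(entities: list) -> list:
    augmented = []
    for entity in entities:
        record = dict(entity)
        # None flags entities the provider knows nothing about.
        record["profile"] = PROFILE_SERVICE.get(entity["company"])
        augmented.append(record)
    return augmented

attendees = [{"name": "R. Chen", "company": "Acme Corp"},
             {"name": "P. Okafor", "company": "Globex"}]
print(augment_entities(attendees))
```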
There is a great deal of overlap among the function types, but at a high level, each represents a slightly different way of thinking about data fusion. The actual work of data fusion can involve a number of TEA functions applied in different ways, sometimes even conditionally, based on the data being fused and the business needs of the organizations, internal and external, that consume information produced along a fusion pipeline.
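As a rough illustration of such a chain, a pipeline could be modeled as a list of (predicate, TEA function) steps, where each function is applied only to the records that satisfy its condition; the predicates and step functions below are invented for the example:

```python
# A conditional TEA pipeline sketch: each step pairs a predicate with a
# TEA function, and the function is applied only where the predicate holds.

def run_pipeline(records, steps):
    """Apply each (predicate, tea_function) step to the matching records."""
    for predicate, tea_function in steps:
        records = [tea_function(r) if predicate(r) else r for r in records]
    return records

steps = [
    # Transformation: normalize names to upper case.
    (lambda r: "name" in r, lambda r: {**r, "name": r["name"].upper()}),
    # Augmentation: explicitly flag records that lack a company profile.
    (lambda r: "profile" not in r, lambda r: {**r, "profile": None}),
]

records = [{"name": "Ada Lovelace"}, {"profile": {"industry": "Analytics"}}]
print(run_pipeline(records, steps))
# -> [{'name': 'ADA LOVELACE', 'profile': None},
#     {'profile': {'industry': 'Analytics'}}]
```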
When to apply a chain of TEA functions, in which order, and at what cost, are business decisions that the architects of data fusion must understand and be able to communicate to decision-makers. It’s no longer about simply identifying the technical ETL requirements. The data architect needs to be able to look at the big picture, especially as data pipelines extend outside of the organization, and identify the cost and risk of fusing data. The rewards are potentially immense.