What is Data Mapping and Transformation?
In most enterprise applications, data mapping and transformation is one of the most complex and most important fields at the same time.
Very often, data from one system must be re-organized, enriched, validated and then transformed or mapped to another structure in order to be able to be passed to another system. Pipelines are optimized for such tasks in order to simplify it as much as possible.
PIPEFORCE offers a huge set of tools to do mappings and transformation of data structures. The most important ones are:
The
transform.*
commandsThe
data.*
commandsThe Pipeline Expression Language (PEL)
The Pipeline Functions like
@data
or@convert
You should get familiar with all of the toolings listed here in order to make the right choice to solve your data transformation task most effectively.
Transformer Commands
A transformer command in PIPEFORCE is a command which transforms / converts data from one structure into another. A transformer is usually used to transform from an "external" data format (like XML for example) into the "internal" data format which is typically JSON. There are out-of-the box transformers to convert from CSV to JSON, from Word to PDF, from PDF to PNG and many more.
Additionally you can write a custom transformation rule using a template and the transform.ftl
command for example.
See the commands reference for transformers.*
to find all transformers commands available.
Data Commands
A data command in PIPEFORCE is a command which can apply some rules on an "internal data structure" (which is mostly JSON). So usually you would load a JSON document from the property store or transform it from some external format using a transformer command to JSON first, and then you can change the JSON structure using the data commands.
See the commands reference for data.*
to find all data commands available.
PEL
The PEL (Pipeline Expression Language) can be used inside the parameters of nearly any command. So it is very important, that you have a good understanding of PEL in case you would like to do data transformation in PIPEFORCE.
There are a lot of built-in language constructs of PEL which help you reading, writing and transforming data the easiest way.
Especially these topics are worth a read in this context:
See the reference documentation for details about the PEL syntax.
PEL Utils
Additionally to the Pipeline Expression core syntax, there are Pipeline Utils available which also can help you to simplify your data transformation tasks. For data transformation these utils could be of special interest:
@calc - For number crunching.
@convert - For convertion tasks (for example from decimal to int).
@data - For data information and alter tasks.
@date - Formatting date and time data.
@list - Read and edit lists.
@text - Text utilities in order to change and test text data.
See the reference documentation for a full list of the available Pipeline Utils.
Transformation Patterns
There are many different ways of data transformation. In order to have a common understanding of the different approaches, below you can find the patterns of most of them listed and named.
Most of them are also mentioned as part of the well-known enterprise integration patterns which can be seen as a "defacto-standard" in the data and message integration world.
Splitter / Iterator
A splitter splits a given data object into multiple data objects. Each data object can then processed separately.
For example you have a data object order which contains a list of order items and you would like to "extract" these order items from the order and process each order item separately:
This is a common pattern also mentioned by the enterprise integration pattern collection.
This approach is sometimes also called Iterator. Looping over a given set of data objects is also called iterating over the items.
Iterate with command data.list.iterate
In PIPEFORCE you can use the data.list.iterate
command in order to iterate over a list of data and apply transformation patterns at the same time.
NOTE
This command is optimized for huge data iteration cycles and it doesn't add command execution counts for each cycle. So you should prefer this approach whenever possible.
Here is an example:
pipeline: - data.list.iterate: listA: [{"name": "Max", "allowed": false}, {"name": "Hennah", "allowed": false}] listB: [{"name": "Max", "age": 12}, {"name": "Hennah", "age": 23}] where: "itemA.name == itemB.name and itemB.age > 18" do: "itemA.allowed = true"
As you can see, in this example there are two lists: listA
and listB
. For every item in listA
, the listB
is also iterated. In the where
parameter you can define a PEL expression. In case this expression returns true
, the expression in do
is executed. In this example this means for every entry in listA
it is checked whether there is the same name
entry in listB
and if so, the age
is checked. If this value is > 18
, the origin listA
will be changed and the value of allowed
set to true
. The result will look like this:
[ { "name": "Max", "allowed": false }, { "name": "Hennah", "allowed": true } ]
It is also possible to define multiple do-expressions to be executed on each iteration cycle. See this example, where additionally a new attribute approved
with the current timestamp will be added on each "where-matching" entry:
pipeline: - data.list.iterate: listA: [{"name": "Max", "allowed": false}, {"name": "Hennah", "allowed": false}] listB: [{"name": "Max", "age": 12}, {"name": "Hennah", "age": 23}] where: "itemA.name == itemB.name and itemB.age > 18" do: | itemA.allowed = true; itemA.approved = @date.timestamp();
As you can see, multiple do-expressions will be separated by a semicolon ;
. You can write them in one single line, or in multiple lines using the pipe symbol |
. The output will look like this:
[ { "name": "Max", "allowed": false }, { "name": "Hennah", "allowed": true, "approved": 1659266178365 } ]
You can also iterate only a single listA
without any where
condition, like this example shows:
pipeline: - data.list.iterate: listA: [{"name": "Max", "allowed": false}, {"name": "Hennah", "allowed": false}] do: "itemA.allowed = true"
If the where
parameter is missing, the do
expression will be executed on any iteration item. In this example the result would be:
[ { "name": "Max", "allowed": true }, { "name": "Hennah", "allowed": true } ]
If-Then-Else conditions inside a do
expression can be implemented using the ternary operator (condition ? whenTrueAction : elseAction
). Let's rewrite the example from above and replace the where
parameter by a ternary operator inside the do
parameter:
pipeline: - data.list.iterate: listA: [{"name": "Max", "allowed": false}, {"name": "Hennah", "allowed": false}] listB: [{"name": "Max", "age": 12}, {"name": "Hennah", "age": 23}] do: "(itemA.name == itemB.name and itemB.age > 18) ? itemA.allowed = true : ''"
In case no elseAction
is required in the ternary operator, use an empty string ''
in order to indicate this.
In case no listA
parameter is given, the list is expected in the body or as optional parameter input
, all input commands have in common.
Since the parameters where
and do
can only contain PEL expressions, you can write them optionally without ${
and }
for better readability as shown in these examples.
Iterate with command foreach
Iterate with PEL
In some situations it is also handy to use directly the PEL selection or PEL projection features of the Pipeline Expression Language (PEL) on a given list in order to iterate it.
Iterate with custom function
For very complex data iteration tasks, you could also use the function.run
command and write a serverless function which iterates over the data. Since this approach requires knowledge about the scripting language and is usually not the best performing option, you should choose it only if there is no other option available to solve your iteration task.
Iterate with custom microservice
And if a script (serverless function / lambda) is also not working for you, you can write a custom microservice and run it inside PIEPFORCE. But this approach is outside of the scope of this data transformation section. See section Microservices for more details.
PIPEFORCE TOOLINGS
data.list.iterate
commandPEL projection command
PEL selection command
Aggregator / Merger
An aggregator combines multiple data objects into a single data object. Sometimes it is also called a Merger since it "merges" data objects into a single data object.
For example you have multiple Inventory Items and you would like to aggregate them together into one Inventory Order data object:
This is a common pattern also mentioned by the enterprise integration pattern collection.
Enricher
An enricher adds additional information to a given data object.
The enrich data typically comes from a different data source like a database or similar.
This is a common pattern also mentioned by the enterprise integration pattern collection.
For example you have an address data object with just the zip code in it:
{ "street": "Lincoln Blvd", "zipCode": "90001" }
You could then have an enricher which resolves the zip code and adds the city name belonging to this zip code to the address data object:
{ "street": "Lincoln Blvd", "zipCode": "90001", "city": "Los Angeles" }
In PIPEFORCE there are multiple ways to enrich a data object. You can use for example the data.enrich
command in order to enrich data at a certain point. See this example for this:
pipeline: - data.enrich: input: { "street": "Lincoln Blvd", "zipCode": "90001" } do: "input.city = 'Los Angeles'"
In the set
parameter you can also refer to any pipeline or PEL Util in order to load data from external. For example:
pipeline: - data.enrich: input: { "street": "Lincoln Blvd", "zipCode": "90001" } do: ${ input.city = @command.call('http.get', {'url': 'http://city.lookup?zipCode=' + input.zipCode}) }
As you can see, you can access the input data in the do
expression using the variable input
. Also the variables vars
, headers
and body
will be provided here.
Another possibility is to use the data.list.iterate
command in order to enrich the items of a list while iterating them.
PIPEFORCE TOOLINGS
data.enrich
commanddata.list.iterate
commandset
command
Deduplicator
A deduplicator is a special form of a filter. It removes data duplicates from a given input.
PIPEFORCE TOOLINGS
data.list.filter
command
Filter
A filter removes a selected set of data from a bigger set of data. So only a subset of the origin data will pass to the target.
This is a common pattern also mentioned by the enterprise integration pattern collection.
PIPEFORCE TOOLINGS
data.list.filter
command
Limiter
A limiter limits a given data list to a maximum size. It can be seen as a special form of a filter.
PIPEFORCE TOOLINGS
data.list.limit
command
Mapper
A mapper maps a given data structure into another data structure, so business logic is not required to handle this.
This is a common pattern also mentioned by the enterprise integration pattern collection.
PIPEFORCE TOOLINGS
data.mapping
commanddata.list.iterate
command
Mapping with data.mapping
The command data.mapping
can be used to apply simple data mappings inline in a pipeline. Optionally also the Pipeline Expression Language can be used for additional data transformations.
Let's see an example first:
body: { "firstName": "Max ", "lastName": "smith", "age": 48, "birthDate": "01/12/1977", "hobbies": ["hiking", "biking"], "type": "customer" } pipeline: - data.mapping: rules: | body.firstName -> person.firstName, body.lastName -> person.surname, body.age -> person.age, body.birthDate -> person.dateOfBirth, body.hobbies -> person.hobbies, body.type -> person.type
This example sets a JSON document in the body, then it applies the given mapping rules and writes by default the result as a new JSON in the body (replacing the initial JSON).
As you can see, every mapping rule is placed in a separate line, each ending with a comma (except the last one).
The left part of the mapping rule (left side of the arrow) is the input path (where to read the data from). The right part of the mapping rule (right side of the arrow) is the output path (where to write the data to):
inputPath -> outputPath
All mapping rules in inputPath
are relative to the context given by input
parameter. By default, this value is the current pipeline message.
The output
parameter points to the location, where the mapping results should be written to. This is by default the the pipeline message, so body, vars and headers can be accesses easily.
The final mapping result in the body will look like this:
{ "person": { "firstName": "Max ", "surname": "smith", "age": 48, "dateOfBirth": "01/12/1977", "hobbies": [ "hiking", "biking" ], "type": "customer" } }
As you can see, the applied mapping rules resulted in these changes:
The input field
firstName
was nested inside the new elementperson
. The field namefirstName
was not changed.The input field
lastName
was also mapped to the nested elementperson
. Additionally it was renamed fromfirstName
tosurname
.The field
age
was nested insideperson
without any change.And the input field
birthDate
was nested insideperson
and renamed todateOfBirth
.The field
type
was only nested insideperson
.
Now lets assume we would like to change the values in parallel to the mapping. You can do so by applying Pipeline Expressions on the input path. For example:
body: { "firstName": "Max ", "lastName": "smith", "age": 48, "birthDate": "01/12/1977", "hobbies": ["hiking", "biking"], "type": "customer" } pipeline: - data.mapping: rules: | @text.trim(body.firstName) -> person.firstName, @text.firstCharUpper(body.lastName) -> person.surname, body.age -> person.age, body.birthDate -> person.dateOfBirth, body.hobbies[0] -> person.primaryHobby, @text.upperCase(body.type) -> person.type, body.age > 18 -> person.adult, "male" -> person.gender, @data.emptyList() -> person.myList, @data.emptyObject() -> person.myObject
The result JSON of this pipeline after execution will look like this:
{ "person": { "firstName": "Max", "surname": "Smith", "age": 48, "dateOfBirth": "01/12/1977", "primaryHobby": "hiking", "type": "CUSTOMER", "adult": true, "gender": "male", "myList": [], "myObject": {} } }
As you can see, the nested mapping below person
was kept. Additionally:
the field
firstName
was trimmed from whitespacesthe field
surname
contain now the first char upper casethe first item of the array
hobbies
was selected and set to new elementperson.primaryHobby
.the field
type
was converted to upper case and a new fieldperson.adult
was added with the result of the expressionage > 18
the constant string
male
was set to the new fieldperson.gender
a new, empty list was added in new field
person.myList
a new, empty object was added in new field
person.myObject
.
By default the mapping result gets written to the body. If you would like write to a variable instead, you can use the output
parameter:
vars: mappingResult: null pipeline: - data.mapping: rules: | ... output: ${vars.mappingResult}
Make sure that the output target was created before since this helps for better readability.
Mapping with command data.list.iterate
You can also use the command data.list.iterate
for data mapping. Examples see above.
Sorter
A sorter sorts a given data list based on some condition. This is also known as the Resequencer pattern.
This is a common pattern also mentioned by the enterprise integration pattern collection.
0 Comments