In most enterprise applications, data mapping and transformation are both complex and vital.

Data from one system often requires reorganization, enrichment, validation, cleansing or mapping before it can be passed to another system. Pipelines in PIPEFORCE are optimized for exactly such tasks in order to make data integration as efficient as possible.

PIPEFORCE offers a large set of tools for mapping and transforming data structures. The most important ones are:

...

The transform.* commands

...

The data.* commands

...

The Pipeline Expression Language (PEL)

...

You should get familiar with all of the toolings listed here in order to choose the right one and solve your data integration task most effectively.

Transformer Commands

A transformer command in PIPEFORCE is a command which transforms / converts data from one structure into another. A transformer is usually used to transform from an "external" data format (XML, for example) into the "internal" data format, which is typically JSON. There are out-of-the-box transformers to convert from CSV to JSON, from Word to PDF, from PDF to PNG and many more.

Additionally, you can write a custom transformation rule using a template, for example with the transform.ftl command.

See the commands reference for transform.* to find all transformer commands available.

Data Commands

A data command in PIPEFORCE is a command which can apply some rules on an "internal data structure" (which is mostly JSON). So usually you would load a JSON document from the property store or transform it from some external format using a transformer command to JSON first, and then you can change the JSON structure using the data commands.

See the commands reference for data.* to find all data commands available.

Transformation Patterns

There are many different ways of data transformation. In order to have a common understanding of the different approaches, below you can find the patterns of most of them listed and named.

Most of them are also part of the well-known enterprise integration patterns, which can be seen as a de facto standard in the data and message integration world.

Splitter / Iterator

A splitter splits a given data object into multiple data objects. Each data object can then be processed separately.

For example you have a data object order which contains a list of order items and you would like to "extract" these order items from the order and process each order item separately:

This is a common pattern also mentioned by the enterprise integration pattern collection.

This approach is sometimes also called Iterator. Looping over a given set of data objects is also called iterating over the items.

Iterate with command data.list.iterate

In PIPEFORCE you can use the data.list.iterate command in order to iterate over a list of data and apply transformation patterns at the same time.

NOTE

This command is optimized for huge data iteration cycles and it doesn't add command execution counts for each cycle. So you should prefer this approach whenever possible.

Here is an example:

Code Block
languageyaml
pipeline:
    - data.list.iterate:
        listA: [{"name": "Max", "allowed": false}, {"name": "Hennah", "allowed": false}]
        listB:  [{"name": "Max", "age": 12}, {"name": "Hennah", "age": 23}]
        where: "itemA.name == itemB.name and itemB.age > 18"
        do: "itemA.allowed = true"

As you can see, this example uses two lists: listA and listB. For every item in listA, listB is iterated as well. The where parameter takes a PEL expression. If this expression returns true, the expression in do is executed. Here this means that for every entry in listA it is checked whether listB contains an entry with the same name and, if so, whether its age is > 18. If it is, the original listA is changed and the value of allowed is set to true. The result will look like this:

Code Block
languagejson
[
    {
        "name": "Max",
        "allowed": false
    },
    {
        "name": "Hennah",
        "allowed": true
    }
]
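
To make the where / do semantics more tangible, the nested iteration can be mimicked in plain Python. This is only an illustrative sketch, not how PIPEFORCE implements the command: the function iterate and the lambda arguments are hypothetical names, and unlike the command the sketch works on a copy of the list.

```python
import copy

def iterate(list_a, list_b, where, do):
    """Run `do` for every (itemA, itemB) pair for which `where` is true."""
    result = copy.deepcopy(list_a)  # work on a copy so the sample input stays reusable
    for item_a in result:
        for item_b in list_b:
            if where(item_a, item_b):
                do(item_a, item_b)
    return result

list_a = [{"name": "Max", "allowed": False}, {"name": "Hennah", "allowed": False}]
list_b = [{"name": "Max", "age": 12}, {"name": "Hennah", "age": 23}]

result = iterate(
    list_a,
    list_b,
    where=lambda a, b: a["name"] == b["name"] and b["age"] > 18,
    do=lambda a, b: a.update(allowed=True),
)
print(result)
```

Only the entry for Hennah matches both parts of the condition, so only this entry is changed.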

It is also possible to define multiple do-expressions to be executed on each iteration cycle. See this example, where additionally a new attribute approved with the current timestamp will be added on each "where-matching" entry:

Code Block
languageyaml
pipeline:
    - data.list.iterate:
        listA: [{"name": "Max", "allowed": false}, {"name": "Hennah", "allowed": false}]
        listB:  [{"name": "Max", "age": 12}, {"name": "Hennah", "age": 23}]
        where: "itemA.name == itemB.name and itemB.age > 18"
        do: |
          itemA.allowed = true;
          itemA.approved = @date.timestamp();

As you can see, multiple do-expressions are separated by a semicolon ;. You can write them on one single line, or on multiple lines using the pipe symbol |. The output will look like this:

Code Block
languagejson
[
    {
        "name": "Max",
        "allowed": false
    },
    {
        "name": "Hennah",
        "allowed": true,
        "approved":  1659266178365
    }
]

You can also iterate only a single listA without any where condition, like this example shows:

Code Block
languageyaml
pipeline:
    - data.list.iterate:
        listA: [{"name": "Max", "allowed": false}, {"name": "Hennah", "allowed": false}]
        do: "itemA.allowed = true"

If the where parameter is missing, the do expression is executed for every iteration item. In this example the result would be:

Code Block
languagejson
[
    {
        "name": "Max",
        "allowed": true
    },
    {
        "name": "Hennah",
        "allowed": true
    }
]

If-then-else conditions inside a do expression can be implemented using the ternary operator (condition ? whenTrueAction : elseAction). For example, the where parameter from the first example above can be replaced by a ternary operator inside the do parameter, such as (itemA.name == itemB.name and itemB.age > 18) ? itemA.allowed = true : ''.

Below, there is an introduction to all data mapping and transformation toolings you can use in PIPEFORCE.

Data Size Classification

Before you select a data mapping and transformation tool, you should always think about the expected input data first. Depending on its size, some tools are better suited than others. Here is a frequently used classification of data sizes:

Class | Size | Description
Small | < 10 MB | Can be handled easily in memory (on a multi-user system). Effort and cost of implementation is usually low.
Medium | < 100 MB | Can be handled on a single server node, but needs persistence in most cases because it is too big to be processed in memory (on a multi-user system). Effort and cost of implementation is usually low to medium, but depends also on overall data complexity.
Large | <= Gigabytes | Requires special data management techniques and systems. Must be distributed across systems. Effort and cost of implementation is usually expensive but this depends on overall data complexity.
Very Large (Big Data) | >= Terabytes | Also known as "Big Data", these datasets encompass volumes of data so large that they require special processing techniques on multiple highly scalable nodes. They usually range from terabytes to petabytes or more. Effort and cost of implementation is usually very expensive and depends highly on overall data complexity.

Info

Note that the boundaries between these classes are sometimes fuzzy, and it is not always obvious at first which class really applies. So make sure you investigate enough to be clear about this before you start the implementation.

For example, ask the user or the customer upfront about the expected amount of data and define this as a non-functional requirement for the implementation, because the duration and cost of implementation can grow exponentially with the data size and its complexity.
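
As a rough aid for such an upfront estimate, the size classes can be written down as a small helper. This is a minimal sketch under assumptions: the function name classify_data_size is hypothetical, decimal units are used, and 1 TB is taken as the boundary between "Large" and "Very Large", which the classification itself leaves fuzzy.

```python
MB = 10**6   # assumption: decimal megabytes
TB = 10**12  # assumption: 1 TB marks the start of "Very Large"

def classify_data_size(num_bytes: int) -> str:
    """Map an expected input size in bytes to one of the size classes."""
    if num_bytes < 0:
        raise ValueError("size must not be negative")
    if num_bytes < 10 * MB:
        return "Small"
    if num_bytes < 100 * MB:
        return "Medium"
    if num_bytes < TB:
        return "Large"
    return "Very Large"

for size in (2 * MB, 50 * MB, 5 * 10**9, 3 * TB):
    print(size, classify_data_size(size))
```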

Transformer Commands

A transformer command in PIPEFORCE is a command which transforms / converts data from one structure into another. For example:

  • HTML to Word

  • JSON to XML, XML to JSON

  • PDF to PNG, PNG to PDF

  • Word to PDF

Furthermore, a transformer can also transform data based on a given template. Examples are:

See the commands reference for transform.* to find all transformer commands available.

Also see the pdf.* commands reference.

Data Commands

A data command in PIPEFORCE is a command which can apply rules on given JSON data. Usually you would load a JSON document from the property store or from an external location and then you can change the JSON structure by applying the data commands. Here is a list of important concepts in this field:

  • Enrich - Add more information to a given JSON.

  • Filter - Remove data from a given JSON at a given location.

  • Limit - Limit a list of JSON data depending on a given condition.

  • Encrypt - Encrypt JSON data (and decrypt).

  • Sorter - Sorts a list using a given sort condition.

  • Projection - Extract a single field value from each object in a list of JSON objects that matches the given criteria.

  • Selection - Extract one or more objects from a list of JSON objects that match the given criteria.

  • And more…

See the commands reference for data.* to find all data commands available.

Mapping Commands

A mapping command in PIPEFORCE is a command which maps from one JSON data structure into another by applying mapping rules.

For more details on data mapping see this section: /wiki/spaces/DEV/pages/2594668566.

PEL

The PEL (Pipeline Expression Language) is an important tooling when it comes to data mapping and transformation. It can be used inside the parameters of nearly any command. It is therefore very important to have a good understanding of PEL if you would like to do data transformation in PIPEFORCE.

PEL comes with a lot of built-in language constructs which help you read, write and transform data as easily as possible.

Especially these topics are worth a read in this context:

See the reference documentation for details about the PEL syntax.

PEL Utils

In addition to the Pipeline Expression core syntax, there are Pipeline Utils available which can also help you to simplify your data transformation tasks. For data transformation these utils could be of special interest:

  • @calc - For number crunching.

  • @convert - For conversion tasks (for example from decimal to int).

  • @data - For data information and alter tasks.

  • @date - Formatting date and time data.

  • @list - Read and edit lists.

  • @text - Text utilities in order to change and test text data.

See the reference documentation for a full list of the available Pipeline Utils.

Querying JSON Data

One of the best performing ways of selecting and filtering JSON data is to apply a query directly on the database (property store). Because the query is evaluated in the database layer and only the matching data is returned, this should be the preferred way for medium and large sized JSON documents.

For more details on this see: JSON Property Querying.

You can also use one of the data.filter.* commands for this. However, they all work on a JSON document in memory. They are fast and effective for small JSON documents.

Integration Patterns Overview

There are many different ways of data integration. In order to have a common understanding of the different approaches, below you can find the patterns of most of them listed and named.

Most of them are also part of the well-known enterprise integration patterns, which can be seen as a de facto standard in the data and message integration world.

Splitter / Iterator

A splitter splits a given data object into multiple data objects. Each data object can then be processed separately.

For example you have a data object order which contains a list of order items and you would like to "extract" these order items from the order and process each order item separately:

This is a common pattern also mentioned by the enterprise integration pattern collection.

This approach is sometimes also called Iterator. Looping over a given set of data objects is also called iterating over the items.
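
The idea of the pattern can be sketched in plain Python, independent of any PIPEFORCE command. The field names (orderId, items, sku, quantity) and the function split_order are assumptions made for this illustration only.

```python
def split_order(order):
    """Yield one stand-alone data object per order item."""
    for position, item in enumerate(order["items"], start=1):
        # Carry the order id along so that each item can be processed on its own.
        yield {"orderId": order["orderId"], "position": position, **item}

order = {
    "orderId": "A-1001",
    "items": [
        {"sku": "pen", "quantity": 3},
        {"sku": "notebook", "quantity": 1},
    ],
}

order_items = list(split_order(order))
for order_item in order_items:
    print(order_item)
```

Each yielded object carries enough context (here the order id) to be handled separately from the other items.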

Iterate with command data.mapping

You can use the command data.mapping with the parameter iterate set to true in order to iterate over a given list and apply calculations and / or mappings to each iteration item.

For more information on how to do this, see: JSON Data Mapping.

Iterate with command foreach

The foreach command can also be used for iterations: For every item in the list, a given set of commands will be executed until all items are processed.

For more information how to do this, see: https://logabit.atlassian.net/wiki/spaces/PA/pages/2543714420/Controlling+Pipeline+Flow#Foreach-(Iterator)%E2%80%8B.

Note

You should never use the command foreach to iterate over a huge set of list items.

As a simple rule: If your list contains potentially more than 20 items, you probably should rethink your data design.

Depending on the system load it could be that foreach calls will automatically be throttled. Therefore, your data processing could become very slow if you process too many items using this approach.

Iterate with PEL

In some situations it is also handy to apply the selection or projection features of the Pipeline Expression Language (PEL) directly to a given list in order to iterate it.

For more information how to do this, see: https://logabit.atlassian.net/wiki/spaces/PA/pages/2543026496/Pipeline+Expression+Language+PEL#Filtering%E2%80%8B

Iterate with custom function

For very complex data iteration tasks, you could also use the function.run command and write a serverless function which iterates over the data. Since this approach requires knowledge about the scripting language and is usually not the best performing option, you should choose it only if there is no other option available to solve your iteration task.

For more information how to do this, see: Python Functions .

Iterate with custom script

You can also use an embedded script to iterate.

For more information how to do this, see: /wiki/spaces/PA/pages/2603319334

Iterate with custom microservice

And if none of the approaches mentioned before work for you, you can write a custom microservice and run it inside PIPEFORCE. But this approach is outside of the scope of this data transformation section.

For more information how to do this, see: Microservices Framework.

Info

PIPEFORCE TOOLINGS

These are some suggested PIPEFORCE toolings in order to implement this pattern you can select from to fit your specific needs:

Aggregator / Merger

An aggregator combines multiple data objects into a single data object. Sometimes it is also called a Merger since it "merges" data objects into a single data object.

For example you have multiple Inventory Items and you would like to aggregate them together into one Inventory Order data object:

This is a common pattern mentioned by the enterprise integration pattern collection.
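
As a plain-Python illustration of the pattern (not a PIPEFORCE command), the following sketch merges several inventory items into one inventory order and sums the quantities per article. The field names and the function aggregate_inventory are hypothetical.

```python
def aggregate_inventory(items):
    """Merge inventory items into one order, summing quantities per SKU."""
    quantities = {}
    for item in items:
        quantities[item["sku"]] = quantities.get(item["sku"], 0) + item["quantity"]
    return {
        "type": "inventoryOrder",
        "lines": [{"sku": sku, "quantity": qty} for sku, qty in quantities.items()],
    }

inventory_items = [
    {"sku": "pen", "quantity": 3},
    {"sku": "notebook", "quantity": 1},
    {"sku": "pen", "quantity": 2},
]

inventory_order = aggregate_inventory(inventory_items)
print(inventory_order)
```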

Enricher

An enricher adds additional information to a given data object.

The enrich data typically comes from a different data source like a database or similar.

This is a common pattern also mentioned by the enterprise integration pattern collection.

For example you have an address data object with just the zip code in it:

Code Block
languagejson
{
    "street": "Lincoln Blvd",
    "zipCode": "90001"
}

You could then have an enricher which resolves the zip code and adds the city name belonging to this zip code to the address data object:

Code Block
languagejson
{
    "street": "Lincoln Blvd",
    "zipCode": "90001",
    "city": "Los Angeles"
}
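
Conceptually, this enrichment is a lookup followed by a merge. A minimal sketch in plain Python, assuming a hypothetical in-memory table ZIP_TO_CITY as a stand-in for the database or service that would normally provide the data:

```python
ZIP_TO_CITY = {"90001": "Los Angeles"}  # hypothetical stand-in for an external data source

def enrich_address(address, lookup=ZIP_TO_CITY):
    """Return a copy of the address with the city added, if the zip code is known."""
    enriched = dict(address)
    city = lookup.get(address.get("zipCode"))
    if city is not None:
        enriched["city"] = city
    return enriched

address = {"street": "Lincoln Blvd", "zipCode": "90001"}
enriched = enrich_address(address)
print(enriched)
```

Returning a copy instead of changing the input is a design choice of this sketch; it keeps the original data object available for comparison.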

In PIPEFORCE there are multiple ways to enrich a data object. For example, you can use the data.enrich command to enrich data at a certain point:

Code Block
languageyaml
pipeline:
  - data.enrich:
      input: { "street": "Lincoln Blvd", "zipCode": "90001" }
      do: "input.city = 'Los Angeles'"

In the do parameter you can also refer to any pipeline or PEL Util in order to load data from an external source. For example:

Code Block
languageyaml
pipeline:
  - data.enrich:
      input: { "street": "Lincoln Blvd", "zipCode": "90001" }
      do: "${ input.city = @command.call('http.get', {'url': 'http://city.lookup?zipCode=' + input.zipCode}) }"

As you can see, you can access the input data in the do expression using the variable input. The variables vars, headers and body are provided here as well.

Another possibility is to use the data.list.iterate command to enrich the items of a list while iterating them.

Info

The following notes apply to the data.list.iterate command:

In case no elseAction is required in the ternary operator (condition ? whenTrueAction : elseAction), use an empty string '' in order to indicate this.

In case no listA parameter is given, the list is expected in the body or in the optional parameter input, which all input commands have in common.

Since the parameters where and do can only contain PEL expressions, you can optionally write them without ${ and } for better readability.

Info

PIPEFORCE TOOLINGS

  • data.enrich command

  • data.list.iterate command

  • set command

Deduplicator

A deduplicator is a special form of a filter. It removes data duplicates from a given input.

Info

PIPEFORCE TOOLINGS

  • data.list.filter command
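
The pattern itself can be sketched in plain Python as follows. This is an illustration only and does not show the parameters of data.list.filter; the function deduplicate is a hypothetical name.

```python
import json

def deduplicate(items):
    """Drop repeated entries, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for item in items:
        # Serialize with sorted keys so that equal objects yield the same key.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

names = [{"name": "Max"}, {"name": "Hennah"}, {"name": "Max"}]
unique_names = deduplicate(names)
print(unique_names)
```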

Filter

A filter removes a selected set of data from a bigger set of data, so only a subset of the original data is passed on to the target.

This is a common pattern also mentioned by the enterprise integration pattern collection.

Info

PIPEFORCE TOOLINGS

  • data.list.filter command
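
In plain Python, the pattern boils down to keeping only the items that satisfy a condition. The function filter_items and the sample data are assumptions for illustration and say nothing about the actual parameters of data.list.filter.

```python
def filter_items(items, keep):
    """Return only those items for which the predicate `keep` is true."""
    return [item for item in items if keep(item)]

people = [{"name": "Max", "age": 12}, {"name": "Hennah", "age": 23}]
adults = filter_items(people, lambda person: person["age"] > 18)
print(adults)
```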

Limiter

A limiter limits a given data list to a maximum size. It can be seen as a special form of a filter.

Info

PIPEFORCE TOOLINGS

  • data.list.limit command
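
Seen as plain Python, limiting is a filter that only looks at the position of an item. The helper limit below is a hypothetical name and only illustrates the pattern, not the data.list.limit command itself.

```python
def limit(items, max_size):
    """Return at most `max_size` items from the start of the list."""
    if max_size < 0:
        raise ValueError("max_size must not be negative")
    return items[:max_size]

entries = ["a", "b", "c", "d"]
limited = limit(entries, 2)
print(limited)
```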

Mapper

A mapper maps a given data structure into another data structure, so business logic is not required to handle this.

This is a common pattern also mentioned by the enterprise integration pattern collection.

Info

PIPEFORCE TOOLINGS

  • data.mapping command

  • data.list.iterate command

Mapping with data.mapping

The command data.mapping can be used to apply simple data mappings inline in a pipeline. Optionally also the Pipeline Expression Language can be used for additional data transformations.

Let's see an example first:

Code Block
languageyaml
body: {
        "firstName": "Max  ",
        "lastName": "smith",
        "age": 48,
        "birthDate": "01/12/1977",
        "hobbies": ["hiking", "biking"],
        "type": "customer"
    }

pipeline:
    - data.mapping:
        rules: |
            body.firstName   -> person.firstName,
            body.lastName    -> person.surname,
            body.age         -> person.age,
            body.birthDate   -> person.dateOfBirth,
            body.hobbies     -> person.hobbies,
            body.type        -> person.type

This example sets a JSON document in the body, then it applies the given mapping rules and writes by default the result as a new JSON in the body (replacing the initial JSON).

As you can see, every mapping rule is placed in a separate line, each ending with a comma (except the last one).

The left part of the mapping rule (left side of the arrow) is the input path (where to read the data from). The right part of the mapping rule (right side of the arrow) is the output path (where to write the data to):

Code Block
inputPath -> outputPath

All mapping rules in inputPath are relative to the context given by input parameter. By default, this value is the current pipeline body.

The output parameter points to the location, where the mapping results should be written to. This is by default the body of the pipeline message.

The final mapping result in the body will look like this:

Code Block
languagejson
{
    "person": {
        "firstName": "Max  ",
        "surname": "smith",
        "age": 48,
        "dateOfBirth": "01/12/1977",
        "hobbies": [
            "hiking",
            "biking"
        ],
        "type": "customer"
    }
}
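
To illustrate what such inputPath -> outputPath rules do, here is a small plain-Python sketch. It is not the PIPEFORCE implementation: the helper names are hypothetical, and it supports only plain dotted paths without expressions.

```python
def get_path(data, path):
    """Read a value from nested dicts using a dotted path."""
    for key in path.split("."):
        data = data[key]
    return data

def set_path(target, path, value):
    """Write a value into nested dicts, creating parent objects as needed."""
    *parents, leaf = path.split(".")
    for key in parents:
        target = target.setdefault(key, {})
    target[leaf] = value

def apply_mapping(rules, context):
    """Apply comma-separated 'inputPath -> outputPath' rules to the context."""
    result = {}
    for rule in rules.split(","):
        if not rule.strip():
            continue
        input_path, output_path = (part.strip() for part in rule.split("->"))
        set_path(result, output_path, get_path(context, input_path))
    return result

rules = """
    body.firstName -> person.firstName,
    body.lastName  -> person.surname,
    body.age       -> person.age
"""
context = {"body": {"firstName": "Max  ", "lastName": "smith", "age": 48}}
mapped = apply_mapping(rules, context)
print(mapped)
```

The values are copied unchanged; only their location and field names differ in the output.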

As you can see, the applied mapping rules resulted in these changes:

  • The input field firstName was nested inside the new element person. The field name firstName was not changed.

  • The input field lastName was also mapped to the nested element person. Additionally, it was renamed from lastName to surname.

  • The field age was nested inside person without any change.

  • And the input field birthDate was nested inside person and renamed to dateOfBirth.

  • The field type was only nested inside person.

Now let's assume we would like to change the values while mapping them. You can do so by applying Pipeline Expressions on the input path. For example:

Code Block
languageyaml
body: {
        "firstName": "Max  ",
        "lastName": "smith",
        "age": 48,
        "birthDate": "01/12/1977",
        "hobbies": ["hiking", "biking"],
        "type": "customer"
    }

pipeline:
    - data.mapping:
        rules: |
            @text.trim(body.firstName)           -> person.firstName,
            @text.firstCharUpper(body.lastName)  -> person.surname,
            body.age                             -> person.age,
            body.birthDate                       -> person.dateOfBirth,
            body.hobbies[0]                      -> person.primaryHobby,
            @text.upperCase(body.type)           -> person.type,
            body.age > 18                        -> person.adult,
            "male"                               -> person.gender,
            @data.emptyList()                    -> person.myList,
            @data.emptyObject()                  -> person.myObject

The result JSON of this pipeline after execution will look like this:

Code Block
languagejson
{
    "person": {
        "firstName": "Max",
        "surname": "Smith",
        "age": 48,
        "dateOfBirth": "01/12/1977",
        "primaryHobby": "hiking",
        "type": "CUSTOMER",
        "adult": true,
        "gender": "male",
        "myList": [],
        "myObject": {}
    }
}

As you can see, the nested mapping below person was kept. Additionally:

  • the field firstName was trimmed of surrounding whitespace

  • the field surname now starts with an upper-case character

  • the first item of the array hobbies was selected and set to the new element person.primaryHobby

  • the field type was converted to upper case and a new field person.adult was added with the result of the expression body.age > 18

  • the constant string male was set to the new field person.gender

  • a new, empty list was added in the new field person.myList

  • a new, empty object was added in the new field person.myObject

By default, the mapping result gets written to the body. If you would like to write it to a variable instead, you can use the output parameter:

Code Block
vars:
    mappingResult: null
pipeline:
    - data.mapping:
        rules: |
            ...
        output: "${vars.mappingResult}"

...

Mapper

A mapper maps a given data structure into another data structure, so business logic is not required to handle this.

This is a common pattern also mentioned by the enterprise integration pattern collection.

Info

PIPEFORCE TOOLINGS

These are the suggested PIPEFORCE toolings in order to implement this pattern you can select from to fit your specific requirements best:

  • data.mapping command (see JSON Data Mapping)

  • data.list.iterate command

  • data.filter.jmespath command and @data.jmespath util

Mapping with data.mapping

See here for more details how to do JSON data mapping using the command data.mapping: JSON Data Mapping .

Mapping with command data.list.iterate

...

This is a common pattern also mentioned by the enterprise integration pattern collection.

Info

PIPEFORCE TOOLINGS

These are the suggested PIPEFORCE toolings in order to implement this pattern you can select from to fit your specific requirements best: