Content References (Files)

What is a Content Reference?

Most of the time you will work with JSON documents that are already loaded into your pipeline.

Sometimes, however, you first need to load such JSON documents from resources like the cloud, remote endpoints, files or streams before you can use them. Or you have to load other data types such as images or PDFs.

A pointer to such a resource is called a content reference.

Before this data can be used, it must be read from the location the content reference is pointing to. Then it can be parsed to JSON, copied or otherwise used.

Each content reference contains meta information about the content to be loaded (for example the content type or the filename) plus the "instruction" for how the content can be loaded.

So, in order to instruct your pipeline to load content from an external source before it is used, place a content reference into the body or the parameter where you need the final resource.

Outbound Content Reference

Such a content reference is a simple JSON document with a fixed structure. Here is an example of such a content reference JSON pointing to a file which can be downloaded via HTTP:

    {
      "name": "contract.pdf",
      "created": null,
      "contentType": "application/pdf",
      "contentLength": 23000,
      "contentEncoding": "outbound-url",
      "content": "$uri:https://somehost/contract.pdf"
    }

As you can see, besides some other information, the content reference contains the name of the file to be downloaded and the $uri pointing to the location where the file can be downloaded from. This is also called an outbound reference, since the data is stored at another location and is lazily loaded when it is required. In other words: It is a pointer (=reference) to the file content stored externally.

In most cases you do not need to declare the content reference JSON manually. Instead, it is generated for you by commands like drive.read. The lazy loading of the resource when it is required is also done automatically most of the time.

Here is an example to load a file from the drive service into the body scope and then access the attributes of the content reference from there:

    pipeline:
      # Load document from drive and set it as content reference in the body
      - drive.read:
          path: "invoice.pdf"
      # Access the attributes of the content reference in the body
      - log:
          message: "Filename: ${body.name}, Size: ${body.contentLength}"

You can write such a content reference easily back to any supported target sink:

    pipeline:
      - drive.read:
          path: "invoice.pdf"
      - drive.save:
          path: "invoice-copy.pdf"

As you can see in this example, you do not need to declare and load the content reference manually. The drive.read and drive.save commands do this for you automatically.

Inbound Content Reference

An inbound reference, in contrast, embeds the data into the content reference itself as a base64 encoded byte array.

In other words: This is a pointer (=reference) to the file content embedded inside the JSON.

Whether to use an outbound or an inbound reference depends on the command and the use case that deal with the content reference.
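
To make the difference between the two kinds of reference more concrete, here is a small Python sketch that builds an inbound reference by hand and decodes it again. It is an illustration only and not part of any pipeline API: the helper names make_inbound_reference and is_inbound are hypothetical, and the file name and text are made up for this example.

```python
import base64
import json


def make_inbound_reference(name, content_type, data):
    """Embed raw bytes into a content reference as base64 (inbound)."""
    return {
        "name": name,
        "contentType": content_type,
        "contentLength": len(data),
        "contentEncoding": "base64",
        "content": base64.b64encode(data).decode("ascii"),
    }


def is_inbound(reference):
    """True if the data is embedded, False if it is stored elsewhere."""
    return reference.get("contentEncoding") == "base64"


# Inbound: the bytes travel inside the JSON document itself.
inbound = make_inbound_reference("hello.txt", "text/plain", b"Hello World")

# Outbound: only a pointer to the externally stored data is kept.
outbound = {
    "name": "contract.pdf",
    "contentType": "application/pdf",
    "contentEncoding": "outbound-url",
    "content": "$uri:https://somehost/contract.pdf",
}

print(json.dumps(inbound, indent=2))

# Decoding the content field gives back the original bytes.
decoded = base64.b64decode(inbound["content"])
```

Note that the inbound variant grows with the size of the data, because base64 needs roughly four characters for every three bytes, while the outbound variant stays small regardless of the file size.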

Content Reference Schema

The structure of a content reference is defined by the content reference schema.

The public schema can be found at the following URL, so you can include it in your custom JSON Schema if required:
https://resource.pipeforce.io/latest/schema/json/content-reference.json

Nested Content Reference (Folder)

In case a content reference is considered to be a folder, the contentType must be set to application/x-directory and the children array contains the nested references (0..n).

For example, a nested content reference can represent a folder structure like this:

  • myRootFolder:

    • contract.pdf

    • nestedFolder:

      • anotherContract.pdf
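
As a rough sketch, the folder structure above could be written as a nested content reference and walked recursively as shown in the following Python example. The attribute names come from this page, but the URLs of the two PDF files and the helper list_paths are assumptions made for illustration only.

```python
FOLDER_TYPE = "application/x-directory"

# Folder reference: content is null, the entries live in "children".
root = {
    "name": "myRootFolder",
    "contentType": FOLDER_TYPE,
    "content": None,
    "children": [
        {
            "name": "contract.pdf",
            "contentType": "application/pdf",
            "contentEncoding": "outbound-url",
            "content": "$uri:https://somehost/contract.pdf",
        },
        {
            "name": "nestedFolder",
            "contentType": FOLDER_TYPE,
            "content": None,
            "children": [
                {
                    "name": "anotherContract.pdf",
                    "contentType": "application/pdf",
                    "contentEncoding": "outbound-url",
                    # Hypothetical location, not taken from the docs.
                    "content": "$uri:https://somehost/anotherContract.pdf",
                },
            ],
        },
    ],
}


def list_paths(reference, prefix=""):
    """Return the slash-separated path of every entry in the tree."""
    path = prefix + reference["name"]
    paths = [path]
    if reference.get("contentType") == FOLDER_TYPE:
        for child in reference.get("children") or []:
            paths.extend(list_paths(child, path + "/"))
    return paths


paths = list_paths(root)
for p in paths:
    print(p)
```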

Attributes

Below you can find the attributes of a content reference and their meanings:

| Attribute | Type | Description |
|---|---|---|
| name | string | Optional. The name of the resource. |
| created | long | Optional. The unix timestamp in millis when this resource was created. |
| lastUpdated | long | Optional. The unix timestamp in millis when this resource was last modified. |
| contentType | string | The content type of the resource. If null or the attribute doesn't exist, it is assumed to be text/plain by default. For a list of official content types see: https://www.iana.org/assignments/media-types/media-types.xhtml . If this content reference is a folder, this must be set to application/x-directory. |
| contentLength | long | Optional. The length of the resource in bytes, or -1 or null in case the length cannot be determined. |
| contentEncoding | string | Optional. The encoding used to encode the content field. Can be base64 or outbound-url. |
| content | object | Required. The content (data) of the resource. Which format the data has depends on its content type and encoding. For example, if contentType is application/json, then the data object returns a JSON document which can be encoded as string, node or base64. In case this content reference is a folder, this must be null. |
| checksum | string | Optional. The checksum of the content (before encoding). |
| children | Array of Content References | Optional. Contains an array of all child content references which are contained in this "folder". If this field contains a value, then contentType must also be set to application/x-directory. |
| parent | | Deprecated. |
| path | | Deprecated. |
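
The rules from the attribute descriptions can also be expressed as a small check. The following Python sketch is one possible reading of these rules and not the official validation: the function names check_reference and effective_content_type are hypothetical, and the assumption that a folder does not need an explicit content attribute is an interpretation. For a binding definition, use the public JSON schema instead.

```python
FOLDER_TYPE = "application/x-directory"
ENCODINGS = {"base64", "outbound-url"}


def effective_content_type(reference):
    """A missing or null contentType falls back to text/plain."""
    return reference.get("contentType") or "text/plain"


def check_reference(reference):
    """Return a list of rule violations (empty if none were found)."""
    problems = []
    is_folder = effective_content_type(reference) == FOLDER_TYPE
    encoding = reference.get("contentEncoding")
    if encoding is not None and encoding not in ENCODINGS:
        problems.append("contentEncoding must be base64 or outbound-url")
    if reference.get("children") and not is_folder:
        problems.append("children requires contentType " + FOLDER_TYPE)
    if is_folder and reference.get("content") is not None:
        problems.append("content must be null for a folder")
    # Assumption: only non-folder references need a content attribute.
    if not is_folder and "content" not in reference:
        problems.append("content is required")
    # Nested references must follow the same rules.
    for child in reference.get("children") or []:
        problems.extend(check_reference(child))
    return problems


good = {
    "name": "hello.txt",
    "contentEncoding": "base64",
    "content": "SGVsbG8gV29ybGQ=",
}
# Has children but is not declared as a folder.
bad = {
    "name": "broken",
    "contentType": "application/pdf",
    "content": "x",
    "children": [good],
}

print(check_reference(good))
print(check_reference(bad))
```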

Loading content

In order to work with data from a content reference, you have to load (read) such data first.

Creating a content reference does not mean that the data it refers to has already been loaded. Consider the outbound content reference example above again: it points to a PDF which is located on a remote server. We already have information about this PDF, but its data has not been loaded yet.

In order to load the data of a content reference, you have multiple options:

  • Load it into memory using utilities such as @convert or @json.

  • Use a command which can load content from content references out of the box (check the docs of the command to see whether it supports this).

Some commands also support a streamed content reference. In this case the data is processed in byte chunks instead of as a whole, so that big data can be processed as well. Whether this is possible depends on the implementation of the content object in the backend.

Do not load big data into memory! A content reference can also be seen as a "gatekeeper" which makes sure that big data is only loaded when required, and then by default in a streamed way instead of as a whole.
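
The idea of chunked processing can be sketched in plain Python as follows. This is not how the backend is implemented; it only illustrates the principle. The function names open_content and copy_in_chunks are hypothetical, and the sketch assumes that the "$uri:" prefix of an outbound reference is followed by a plain URL that can be opened directly.

```python
import base64
import io
import urllib.request


def open_content(reference):
    """Return a binary stream for the data behind a content reference."""
    encoding = reference.get("contentEncoding")
    content = reference["content"]
    if encoding == "base64":
        # Inbound: the data is embedded, so decode it from the JSON itself.
        return io.BytesIO(base64.b64decode(content))
    if encoding == "outbound-url":
        # Outbound: assumes "$uri:" is followed by a plain URL.
        return urllib.request.urlopen(content.removeprefix("$uri:"))
    raise ValueError(f"unsupported contentEncoding: {encoding!r}")


def copy_in_chunks(reference, sink, chunk_size=64 * 1024):
    """Copy the content to sink chunk by chunk; return the bytes written."""
    total = 0
    with open_content(reference) as source:
        # Only one chunk is held in memory at any time.
        while chunk := source.read(chunk_size):
            sink.write(chunk)
            total += len(chunk)
    return total


reference = {
    "name": "hello.txt",
    "contentEncoding": "base64",
    "content": "SGVsbG8gV29ybGQ=",
}
sink = io.BytesIO()
written = copy_in_chunks(reference, sink, chunk_size=4)
print(written, sink.getvalue())
```

With an outbound reference, the same loop reads from the remote location piece by piece, so memory usage is bounded by the chunk size and not by the size of the file.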

Here is an example in order to load a small, well known document:

Unless configured otherwise, the data of a content reference in the body of a pipeline is automatically streamed to the client at the end of the pipeline. This means that if you place a content reference in the body of a pipeline without loading it, the data is loaded and returned to the client automatically at the end of the request.

Writing content

To write content, use a command such as drive.save which can load the data from a given content reference and write it to a given target sink.

Another option is to create your own content reference.

Some examples:

Create a content reference from a JSON document in the body and write this to drive:

Since drive.save converts any input automatically to a content object internally, we could also do something like this:

And in case there is a file embedded in a JSON document as base64, for example, we can write this file to drive as this example shows:

This base64 encoded document is automatically converted to the target format and then stored in drive. The filename is read from the embedded content reference JSON. Therefore, there is no need to specify the path attribute here. If we would like to store the document in a specific folder, we could additionally set this using the path attribute:

This would store the document at /my/folder/hello.txt in Drive. Any non-existing folder will be auto-created.