How to use Onesait lookup processor

EN | ES

Description

The Onesait processor stage of Dataflow manages the receieved records and joins data from other ontologies.

How to use it

Importing to the pipeline

To add it to the pipeline, search for Onesait in the stages palette like so.

Drag and drop it to the panel or just click on it.

Once imported, it can be configured to achieve different beheaviours depending on the case. 

Available configuration tabs

General

This tab shows general information and an option to show the event generator point on the stage.

Checking the "Produce Events" option the stage shows another point to make usage of the events.

The events triggered are:

  • Document found
    • Which sends the found records in a list (records-found) for each bulk
  • Document not found
    • Which sends the not found records in a list (records-not-found) for each bulk

For example, records-found:

Source

This tab shows a form to fill with the Digital Broker information of the Onesait platform.


Additionally, it can be configured the maximum characters to make SQL query as it uses the http GET method.

  • Max GET characters: by default it uses 10000 characters to split the queries in chunks, but if the server support less or more it should be set applying the server configuration. It may be useful also if the server returns timeouts doing complex queries by splitting them in simpler ones.

Configuration

This tab shows the main configuration of the stage. 

  • Bulk: Numbers of records per bulk to process.
  • ThreadPool: Number of threads to create.
  • Column selector: The data to retreive from the ontolgy, as SQL sintax. Eg. user.first_name AS firstName
  • Matching fields: Fields to use to match the document in the ontology.
    • Record field: The field of the record to get the value from.
    • Column name: The property path in the data source.
    • Optional path: If the returned document from the ontology lacks of root paths (this coud be caused when using the column selector), the path to search the value may be different from the column name field and the match won't be done. This path can be set on this field separated by dots (.), otherwise leave blank.
  • Multiple value behaviour. If several documents are found by the matching fields.
    • Use the first record: It just uses the first document fetched from the data source.
    • Create a list in the record: Creates a list in the record with all the documents fetched from the ontology.
    • Generate multiple records: Generates new records as the number of the documents found in the data source.
  • Set data on path: Sets the data on the specified path in streamsets format. Eg. /userData
  • No value policy: If no data is found with the matching data pass on, an action is triggered.
    • Send to error: The node generates a record error.
    • Left the record unchanged: The record is left untouched and passed on. This triggers the records-not-found event, but it not produces any error.
  • Retry on missing values: If this option is unchecked, next time the same data is passed on, it will not fetch it from the ontology, preventing then unnecessary queries and performing the action selected on the "No value policy" option.


Performance tip for fetching data from the ontolgy

Use just one unique index (matching field in the node configuration) if possible in ontologies (eg. unique id). This way the lookup will group better the queries (using WHERE IN) to the ontology and use less resources, otherwise, it will group the queries too, but less performant (using AND and OR).

Cache

This tabs specifies the cache settings. Enabled by default to limit queries to the ontology and thus, speeding up the process and reducing the load on the server.

  • Use local cache: Enables or disables the local cache to store data retreived from the data source to limit the queries. Enabled by default.
  • Limit cache entries: Cache evition by number of entries. Disabled by default.
  • Max.  entries: Limits the number of entries in the memory cache if the option is set.
  • Cache expiration: Cache evition by time. Disabled by default.
  • Expiration time: Time value
  • Time unit: Time unit from nanoseconds to days.
  • Expiration policy: How entries expires in time.
    • After last access to the entry.
    • After last write of the entry.

Example 1

Let's say we are receiving records with an unique id refering to some data stored in an ontolgy in the Onesait platform. In this example, we are getting an object (user) with a property id.

Firstly we have to configure the source tab using the configuration of our server. Then we would configure the node with the options showed below.

This configuration will

  • Group the records in bulks of 20 and generate 5 threads.
  • Get all the possible information from the ontolgy, as we have not configured any column selector, and thus sending an * in the select query.
  • Match the information of the record getting the value from the path /user/id and and looking for it in the ontology in the path user.id
  • Use the first coincidence if more are found, due to the multiple value behavior option selected.
  • Set the data in the path /userData
  • If no document is found send to error
  • Keep retrying the query to the ontology if no document is found

and produce the following output.

Example 2

If we just need some info we could select them using the column selector.

But in this case, the data returned by the query won't contain the root node "user" becase we have selected just two properties from the document, 


Returned document
{
	"name" : "Bogart",
	"surname" : "Moyer",
	"id" : 801
}


And it will be necessary to set the optional path to id. Otherwise the data won't be matched and the returned data will be


The data returned by this configuration if the optional path is set 




Example 3 - airportsdata

Using the ontology airportsdata we can create a good example on how to use the lookup.

The final goal is to get the latitude and longitude from an airport using its name. So let's get into it.

Stages involved

  • Onesait Origin - To get the data from the airports
  • Expression evaluator - To move the name property to the root of the record
  • Field remover - To remove the rest of the data
  • Lookup - To retrieve the selected data from the ontology using the name property
  • Trash - Final result

Origin

The sql to send in the request

select * from airportsdata as c offset ${offset} limit ${batch}

Note: We could have just selected the name property, but this is to recreate a possible scenario.

Expression evaluator

Move the property to the root of the record.

Field expression: ${record:value('/airportsdata/name')}

Field remover

Keep the name property in the root of the Record.


Lookup

Connection details.



Query details


The query generated will be 

SELECT airportsdata.name,airportsdata.longitude,airportsdata.latitude /*  SELECT  Column name */
FROM airportsdata AS c
WHERE airportsdata.name /*  WHERE  Column name */ IN ('A 511 Airport','A Coru_a Airport','Aachen-Merzbr_ck Airport','Aalborg Airport','Aalen-Heidenheim/Elchingen Airport','Aarhus Airport','Aasiaat Airport','Aba Tenna Dejazmach Yilma International Airport','Abaco I Walker C Airport','Abadan Airport')

Which will return

[{"name":"A 511 Airport","longitude":"127.03099822998047","latitude":"36.96220016479492"},{"name":"A Coru_a Airport","longitude":"-8.377260208129883","latitude":"43.302101135253906"},{"name":"Aachen-Merzbr_ck Airport","longitude":"6.186388969421387","latitude":"50.823055267333984"},{"name":"Aalborg Airport","longitude":"9.84924316406","latitude":"57.0927589138"},{"name":"Aalen-Heidenheim/Elchingen Airport","longitude":"10.264721870422363","latitude":"48.77777862548828"},{"name":"Aarhus Airport","longitude":"10.619000434899998","latitude":"56.2999992371"},{"name":"Aasiaat Airport","longitude":"-52.7846984863","latitude":"68.7218017578"},{"name":"Aba Tenna Dejazmach Yilma International Airport","longitude":"41.85419845581055","latitude":"9.624699592590332"},{"name":"Abaco I Walker C Airport","longitude":"-78.39969635009766","latitude":"27.266700744628906"},{"name":"Abadan Airport","longitude":"48.2282981873","latitude":"30.371099472"}]


Notice why the Optional path is used in this example. The column name is airportsdata.name but the data returned lacks of root node airportsdata so to be able to match the document, the optional path is set to look into that json path (name) on the root.


The result of the lookup adding the data to the record.



You can get the example in the link below, but you need to edit the clients information.