Connectors

The platform's DataFlow is built on the open-source software StreamSets Data Collector, so the platform supports every component that ships with StreamSets, in addition to the platform's own components integrated into the DataFlow.

This post lists the most important ones; for more detail, see the reference on the StreamSets site (https://streamsets.com/connectors).

These connectors are divided into four types:

Origins: represent the source of a flow; there can be only one per pipeline (link: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Origins/Origins_overview.html)
Processors: process the data in a flow (link: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Processors/Processors_title.html)
Destinations: represent the outputs of a pipeline; there can be one or more destinations (link: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Destinations/Destinations_overview.html)
Executors: trigger a task when they receive an event (link: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Executors/Executors-overview.html)
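
The constraints in this list (a single origin, any number of processors, at least one destination, and executors that react to events) can be sketched as a small data model. This is only an illustration and not StreamSets code: the `Pipeline` class and its fields are hypothetical, and the stage names are used purely as labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pipeline:
    """Hypothetical model of a pipeline's stage layout (not a StreamSets API)."""
    origin: str                                           # exactly one source per pipeline
    destinations: List[str]                               # one or more outputs
    processors: List[str] = field(default_factory=list)  # optional processing steps
    executors: List[str] = field(default_factory=list)   # optional event-triggered tasks

    def __post_init__(self) -> None:
        if not self.origin:
            raise ValueError("a pipeline needs exactly one origin")
        if not self.destinations:
            raise ValueError("a pipeline needs at least one destination")

    def stages(self) -> List[str]:
        """Return the stages in data-flow order: origin, processors, destinations."""
        return [self.origin, *self.processors, *self.destinations]


pipeline = Pipeline(
    origin="Directory",
    processors=["Field Remover"],
    destinations=["Kafka Producer", "Local FS"],
    executors=["Pipeline Finisher"],
)
print(pipeline.stages())
```

Executors are kept out of `stages()` on purpose in this sketch, since they act on events rather than on the records flowing through the pipeline.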

The platform includes four components:

onesait Platform Origin: connects to the platform as an origin (How to use Onesait origin stage)
onesait Platform Lookup: merges data queried from the platform with data from another origin (How to use Onesait lookup processor)
Two onesait Platform Destinations: write data to the platform with an INSERT or an UPDATE (How to use Onesait insert destination & How to use Onesait update destination)
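
To make the lookup and the two write modes concrete, here is a minimal sketch of the idea using a plain in-memory dictionary. It does not use the real onesait Platform components: the `store`, the `lookup`, `insert` and `update` functions, the `id` key, and the rule that incoming fields win on a name clash are all assumptions made for illustration.

```python
from typing import Dict, Optional

Record = Dict[str, object]

# Hypothetical in-memory stand-in for data stored in the platform, keyed by id.
store: Dict[str, Record] = {"dev-1": {"id": "dev-1", "location": "Madrid"}}


def lookup(record: Record, key_field: str = "id") -> Record:
    """Enrich a record with the fields found by a keyed query (lookup-style merge)."""
    match: Optional[Record] = store.get(str(record[key_field]))
    # Assumption: fields of the incoming record win over looked-up fields.
    return {**(match or {}), **record}


def insert(record: Record, key_field: str = "id") -> None:
    """INSERT-style write: add a new entry, refusing to overwrite an existing one."""
    key = str(record[key_field])
    if key in store:
        raise KeyError(f"{key} already exists")
    store[key] = dict(record)


def update(record: Record, key_field: str = "id") -> None:
    """UPDATE-style write: change fields of an entry that must already exist."""
    key = str(record[key_field])
    if key not in store:
        raise KeyError(f"{key} does not exist")
    store[key].update(record)


enriched = lookup({"id": "dev-1", "temperature": 21.5})
update(enriched)                                   # dev-1 exists, so UPDATE applies
insert({"id": "dev-2", "temperature": 19.0})       # dev-2 is new, so INSERT applies
print(store)
```

The split into two write functions mirrors the two destinations: which one fits depends on whether the target entry is expected to exist already.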

In addition to these, the following components are available, organized into four groups: ORIGINS, PROCESSORS, DESTINATIONS, and EXECUTORS.

ORIGINS

Standalone Pipelines

In standalone pipelines, you can use the following origins:

Amazon S3 - Reads objects from Amazon S3. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Amazon SQS Consumer - Reads data from queues in Amazon Simple Queue Services (SQS). Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Azure IoT/Event Hub Consumer - Reads data from Microsoft Azure Event Hub. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
CoAP Server - Listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Directory - Reads fully-written files from a directory. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Elasticsearch - Reads data from an Elasticsearch cluster. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
File Tail - Reads lines of data from an active file after reading related archived files in the directory.
Google BigQuery - Executes a query job and reads the result from Google BigQuery.
Google Cloud Storage - Reads fully written objects from Google Cloud Storage.
Google Pub/Sub Subscriber - Consumes messages from a Google Pub/Sub subscription. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Hadoop FS Standalone - Reads fully-written files from HDFS, Azure Data Lake Storage, or Azure HDInsight. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
HTTP Client - Reads data from a streaming HTTP resource URL.
HTTP Server - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST and PUT requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
HTTP to Kafka (Deprecated) - Listens on an HTTP endpoint and writes the contents of all authorized HTTP POST requests directly to Kafka.
JDBC Multitable Consumer - Reads database data from multiple tables through a JDBC connection. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
JDBC Query Consumer - Reads database data using a user-defined SQL query through a JDBC connection.
JMS Consumer - Reads messages from JMS.
Kafka Consumer - Reads messages from a single Kafka topic.
Kafka Multitopic Consumer - Reads messages from multiple Kafka topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Kinesis Consumer - Reads data from Kinesis Streams. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR DB CDC - Reads changed MapR DB data that has been written to MapR Streams. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR DB JSON - Reads JSON documents from MapR DB JSON tables.
MapR FS - Reads files from MapR FS.
MapR FS Standalone - Reads fully-written files from MapR FS. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR Multitopic Streams Consumer - Reads messages from multiple MapR Streams topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR Streams Consumer - Reads messages from MapR Streams.
MongoDB - Reads documents from MongoDB.
MongoDB Oplog - Reads entries from a MongoDB Oplog.
MQTT Subscriber - Subscribes to a topic on an MQTT broker to read messages from the broker.
MySQL Binary Log - Reads MySQL binary logs to generate change data capture records.
Omniture - Reads web usage reports from the Omniture reporting API.
OPC UA Client - Reads data from an OPC UA server.
Oracle CDC Client - Reads LogMiner redo logs to generate change data capture records.
PostgreSQL CDC Client - Reads PostgreSQL WAL data to generate change data capture records.
Pulsar Consumer - Reads messages from Apache Pulsar topics.
RabbitMQ Consumer - Reads messages from RabbitMQ.
Redis Consumer - Reads messages from Redis.
REST Service - Listens on an HTTP endpoint, parses the contents of all authorized requests, and sends responses back to the originating REST API. Creates multiple threads to enable parallel processing in a multithreaded pipeline. Use only in microservice pipelines.
Salesforce - Reads data from Salesforce.
SDC RPC - Reads data from an SDC RPC destination in an SDC RPC pipeline.
SDC RPC to Kafka (Deprecated) - Reads data from an SDC RPC destination in an SDC RPC pipeline and writes it to Kafka.
SFTP/FTP Client - Reads files from an SFTP or FTP server.
SQL Server CDC Client - Reads data from Microsoft SQL Server CDC tables. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
SQL Server Change Tracking - Reads data from Microsoft SQL Server change tracking tables and generates the latest version of each record. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
TCP Server - Listens at the specified ports and processes incoming data over TCP/IP connections. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Teradata Consumer - Reads data from Teradata Database tables through a JDBC connection. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
UDP Multithreaded Source - Reads messages from one or more UDP ports. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
UDP Source - Reads messages from one or more UDP ports.
UDP to Kafka (Deprecated) - Reads messages from one or more UDP ports and writes the data to Kafka.
WebSocket Client - Reads data from a WebSocket server endpoint. Can send responses back to the origin system as part of a microservice pipeline.
WebSocket Server - Listens on a WebSocket endpoint and processes the contents of all authorized WebSocket client requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline. Can send responses back to the origin system as part of a microservice pipeline.
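
Many of the origins above state that they create multiple threads to enable parallel processing. Conceptually, the work is split into independent units and each thread reads its own batch. The sketch below shows that idea with Python's standard `ThreadPoolExecutor`; it is not how Data Collector is implemented, and `SOURCES`, `read_batch` and `run_origin` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical work units, e.g. files in a directory or tables in a database.
SOURCES: Dict[str, List[str]] = {
    "a.log": ["a1", "a2"],
    "b.log": ["b1"],
    "c.log": ["c1", "c2", "c3"],
}


def read_batch(source: str) -> List[dict]:
    """Read one work unit and turn each line into a record (runs in a worker thread)."""
    return [{"source": source, "value": line} for line in SOURCES[source]]


def run_origin(num_threads: int) -> List[dict]:
    """Read all work units in parallel, with at most num_threads units in flight."""
    records: List[dict] = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in input order, even though batches run concurrently.
        for batch in pool.map(read_batch, SOURCES):
            records.extend(batch)
    return records


records = run_origin(num_threads=2)
print(len(records))
```

Results come back in input order here only because `map()` preserves it; how a real multithreaded pipeline orders records across threads is not something this sketch tries to model.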

Cluster Pipelines

In cluster pipelines, you can use the following origins:

Hadoop FS - Reads data from HDFS, Amazon S3, or other file systems using the Hadoop FileSystem interface.
Kafka Consumer - Reads messages from Kafka. Use the cluster version of the origin.
MapR FS - Reads data from MapR FS.
MapR Streams Consumer - Reads messages from MapR Streams.

Edge Pipelines

In edge pipelines, you can use the following origins:

Directory - Reads fully-written files from a directory.
File Tail - Reads lines of data from an active file after reading related archived files in the directory.
gRPC Client - Reads data from a gRPC server.
HTTP Client - Reads data from a streaming HTTP resource URL.
HTTP Server - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST and PUT requests.
MQTT Subscriber - Subscribes to a topic on an MQTT broker to read messages from the broker.
System Metrics - Reads system metrics from the edge device where SDC Edge is installed.
WebSocket Client - Reads data from a WebSocket server endpoint.
Windows Event Log - Reads data from a Microsoft Windows event log located on a Windows machine.

Development Origins

To help create or test pipelines, you can use the following development origins:

Dev Data Generator
Dev Random Source
Dev Raw Data Source
Dev SDC RPC with Buffering
Dev Snapshot Replaying
Sensor Reader

DESTINATIONS

Standalone Pipelines Only

In standalone pipelines, you can use the following destinations:

RabbitMQ Producer - Writes data to RabbitMQ.
Send Response to Origin - Sends records with the specified response to the microservice origin in the pipeline. Use only in a microservice pipeline.

Standalone or Cluster Pipelines

In standalone or cluster pipelines, you can use the following destinations:

Aerospike - Writes data to Aerospike.
Amazon S3 - Writes data to Amazon S3.
Azure Data Lake Storage - Writes data to Azure Data Lake Storage.
Azure Event Hub Producer - Writes data to Azure Event Hub.
Azure IoT Hub Producer - Writes data to Microsoft Azure IoT Hub.
Cassandra - Writes data to a Cassandra cluster.
CoAP Client - Writes data to a CoAP endpoint.
Couchbase - Writes data to a Couchbase database.
Elasticsearch - Writes data to an Elasticsearch cluster.
Einstein Analytics - Writes data to Salesforce Einstein Analytics.
Flume - Writes data to a Flume source.
Google BigQuery - Streams data into Google BigQuery.
Google Bigtable - Writes data to Google Cloud Bigtable.
Google Cloud Storage - Writes data to Google Cloud Storage.
Google Pub/Sub Publisher - Publishes messages to Google Pub/Sub.
Hadoop FS - Writes data to HDFS, Azure Data Lake Storage, or Azure HDInsight.
HBase - Writes data to an HBase cluster.
Hive Metastore - Creates and updates Hive tables as needed.
Hive Streaming - Writes data to Hive.
HTTP Client - Writes data to an HTTP endpoint. Can send responses to a microservice origin in a microservice pipeline.
InfluxDB - Writes data to InfluxDB.
JDBC Producer - Writes data to JDBC.
JMS Producer - Writes data to JMS.
Kafka Producer - Writes data to a Kafka cluster. Can send responses to a microservice origin in a microservice pipeline.
Kinesis Firehose - Writes data to a Kinesis Firehose delivery stream.
Kinesis Producer - Writes data to Kinesis Streams. Can send responses to a microservice origin in a microservice pipeline.
KineticaDB - Writes data to a table in a Kinetica cluster.
Kudu - Writes data to Kudu.
Local FS - Writes data to a local file system.
MapR DB - Writes data as text, binary data, or JSON strings to MapR DB binary tables.
MapR DB JSON - Writes data as JSON documents to MapR DB JSON tables.
MapR FS - Writes data to MapR FS.
MapR Streams Producer - Writes data to MapR Streams.
MemSQL Fast Loader - Writes data to MemSQL or MySQL.
MongoDB - Writes data to MongoDB.
MQTT Publisher - Publishes messages to a topic on an MQTT broker.
Named Pipe - Writes data to a named pipe.
Pulsar Producer - Writes data to Apache Pulsar topics.
Redis - Writes data to Redis.
Salesforce - Writes data to Salesforce.
SDC RPC - Passes data to an SDC RPC origin in an SDC RPC pipeline.
Snowflake - Writes data to tables in a Snowflake database.
Solr - Writes data to a Solr node or cluster.
Splunk - Writes data to Splunk.
Syslog - Writes data to a Syslog server.
To Error - Passes records to the pipeline for error handling.
Trash - Removes records from the pipeline.
WebSocket Client - Writes data to a WebSocket endpoint.
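
Two entries in this list, To Error and Trash, do not write to an external system: one hands records to the pipeline's error handling and the other discards them. The sketch below contrasts the three possible outcomes for a record. The routing rule (a `debug` flag and a numeric `value` field) and the `written` and `errors` lists are hypothetical stand-ins, not StreamSets behavior.

```python
from typing import Dict, List

Record = Dict[str, object]

written: List[Record] = []   # stand-in for a regular destination
errors: List[Record] = []    # stand-in for the pipeline's error handling


def route(record: Record) -> str:
    """Send a record to one of three outcomes and return the outcome's name."""
    if record.get("debug"):
        return "trash"                       # Trash: the record is simply dropped
    if not isinstance(record.get("value"), (int, float)):
        errors.append(record)                # To Error: hand over to error handling
        return "error"
    written.append(record)                   # regular destination
    return "written"


outcomes = [route(r) for r in (
    {"value": 3},
    {"value": "n/a"},
    {"value": 7, "debug": True},
)]
print(outcomes)
```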

Edge Pipelines

In edge pipelines, you can use the following destinations:

CoAP Client - Writes data to a CoAP endpoint.
HTTP Client - Writes data to an HTTP endpoint.
Kafka Producer - Writes data to a Kafka cluster.
Kinesis Firehose - Writes data to a Kinesis Firehose delivery stream.
Kinesis Producer - Writes data to Kinesis Streams.
MQTT Publisher - Publishes messages to a topic on an MQTT broker.
To Error - Passes records to the pipeline for error handling.
WebSocket Client - Writes data to a WebSocket endpoint.

Development Destination

To help create or test pipelines, you can use the following development destination:

To Event

EXECUTORS

Amazon S3 - Creates new Amazon S3 objects for the specified content, copies objects within a bucket, or adds tags to existing Amazon S3 objects.
Databricks - Starts the specified Databricks job upon receiving an event record.
Email - Sends custom email to the configured recipients upon receiving an event.
HDFS File Metadata - Changes file metadata, creates an empty file, or removes a file or directory in HDFS or a local file system upon receiving an event record.
Hive Query - Runs user-defined Hive or Impala queries upon receiving an event record.
JDBC Query - Runs a user-defined SQL query upon receiving an event record.
MapR FS File Metadata - Changes file metadata, creates an empty file, or removes a file or directory in MapR FS upon receiving an event record.
MapReduce - Starts the specified MapReduce job upon receiving an event record.
Pipeline Finisher - Stops and transitions the pipeline to a Finished state upon receiving an event record.
Shell - Executes a shell script upon receiving an event record.
Spark - Starts the specified Spark application upon receiving an event record.
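
All of these executors follow the same pattern: they stay idle until an event record arrives and then run a task. A minimal sketch of that pattern follows. The `PipelineRun` class, the two executor functions, and the event type strings are hypothetical, and the email executor only records what it would send.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


class PipelineRun:
    """Hypothetical pipeline run that forwards event records to its executors."""

    def __init__(self) -> None:
        self.state = "RUNNING"
        self.log: List[str] = []
        self._executors: List[Callable[["PipelineRun", Event], None]] = []

    def add_executor(self, executor: Callable[["PipelineRun", Event], None]) -> None:
        self._executors.append(executor)

    def emit(self, event: Event) -> None:
        """Hand an event record to every executor; each one runs its task."""
        for executor in self._executors:
            executor(self, event)


def email_executor(run: PipelineRun, event: Event) -> None:
    # Stand-in for sending an email: only records what would be sent.
    run.log.append(f"email: {event['type']}")


def pipeline_finisher(run: PipelineRun, event: Event) -> None:
    # Assumption: finish only on a "no-more-data" event, ignore other events.
    if event["type"] == "no-more-data":
        run.state = "FINISHED"


run = PipelineRun()
run.add_executor(email_executor)
run.add_executor(pipeline_finisher)
run.emit({"type": "new-file"})
run.emit({"type": "no-more-data"})
print(run.state, run.log)
```

Filtering on the event type inside `pipeline_finisher` is a choice made for this sketch; the description above only says that the executor stops the pipeline upon receiving an event record.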