TimescaleDB Entities: TimescaleDB as a TimeSeries Engine on Ontologies
What is TimescaleDB?
TimescaleDB is an open-source database for time-series storage and analysis with the power and advantages of SQL. It is built on PostgreSQL, which means it can leverage the tools of the PostgreSQL ecosystem.
At the ingest level, TimescaleDB is prepared for large point ingests through automatic time-space partitioning. This automatic partitioning optimizes ingest time while, at the same time, users see all their data as a single hypertable.
Thanks to the hypertable abstraction over the time-series data, the user works with this structure as if it were a normal SQL table, which allows creating tables, indexes, etc. The hypertable uses native compression and can scale to the TB level.
TimescaleDB includes:
Specific data management capabilities, such as data retention, downsampling, native compression, data lifecycle management and aggregation policies.
Time-series analytics-oriented functions, such as windowing, gap filling and LOCF queries.
And, being built on PostgreSQL, the ability to store your business data in the same database, allowing JOINs.
It also offers a tool to migrate from InfluxDB, one of the most widely used time-series databases nowadays: https://www.outfluxdata.com
TimescaleDB is also offered as a managed service in the three main clouds.
Hypertables
TimescaleDB is in fact installed as a PostgreSQL extension:
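As an illustrative sketch, enabling the extension on a PostgreSQL database is a single statement:

-- Enable the TimescaleDB extension on the current database
CREATE EXTENSION IF NOT EXISTS timescaledb;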
Being an extension, we can use the same PostgreSQL driver (JDBC in our case).
As we said, the main concept of TimescaleDB is the hypertable.
Once we have the extension, to create a hypertable we first create the table normally:
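For example, a sketch with a simple table of weather readings (the conditions table and its columns are illustrative, and will be reused in the examples below):

-- A regular PostgreSQL table; not yet a hypertable
CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    location    TEXT             NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION
);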
And then we convert it to a hypertable, indicating the field we partition by:
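With the illustrative conditions table above, the conversion partitions by the time column:

-- Convert the regular table into a hypertable partitioned by time
SELECT create_hypertable('conditions', 'time');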
Once this is done, we can perform SQL operations as usual:
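For instance (the sample values are made up):

-- Regular SQL works transparently on the hypertable
INSERT INTO conditions VALUES ('2021-05-01 00:00:00', 'office', 21.5, 45);
SELECT * FROM conditions WHERE location = 'office' ORDER BY time DESC;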
When creating the hypertable we can configure a set of options, such as the chunk time interval (which defaults to 7 days if not set). We can change it like this:
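For example, with TimescaleDB's set_chunk_time_interval() function (the 1-day value is illustrative):

-- Change the time interval used for new chunks of the hypertable
SELECT set_chunk_time_interval('conditions', INTERVAL '1 day');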
We can also configure space partitions, as sketched below.
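A sketch, assuming we create the illustrative conditions hypertable with 4 space partitions on the location column:

-- Time partitioning on "time" plus 4 space partitions on "location"
SELECT create_hypertable('conditions', 'time', 'location', 4);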
You can read more about the best practices for hypertables here: https://docs.timescale.com/v1.3/using-timescaledb/hypertables#best-practices
Hypertable indexes
When a hypertable is created, an index on the time field is created by default, like this:
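On the illustrative conditions table, that default index is equivalent to:

-- Default time index created together with the hypertable
CREATE INDEX conditions_time_idx ON conditions (time DESC);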
If, when creating the hypertable, we specify a space partition, an index of this type will also be created:
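Again on the illustrative table, with location as the space partition column:

-- Composite index on the space partition column plus time
CREATE INDEX conditions_location_time_idx ON conditions (location, time DESC);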
JSON support in Hypertable
TimescaleDB can work with the same data types as PostgreSQL, including JSON and JSONB, which is a very interesting option for many scenarios.
It is recommended to:
Use the JSONB (binary JSON) data type, which is more efficient in storage and search.
Follow a semi-structured approach where fields like time, user_id or device_id are kept outside the JSONB structure, since access to them is more efficient.
TimescaleDB also supports creating indexes on an entire JSONB structure (GIN indexes) or even on a single field inside the JSONB, as sketched below.
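A minimal sketch of this semi-structured approach (the metrics table and its fields are illustrative):

-- time and device_id stay outside the JSONB; the rest of the payload goes inside
CREATE TABLE metrics (
    time      TIMESTAMPTZ NOT NULL,
    device_id TEXT        NOT NULL,
    data      JSONB
);
SELECT create_hypertable('metrics', 'time');
-- GIN index on the whole JSONB structure
CREATE INDEX metrics_data_idx ON metrics USING GIN (data);
-- Index on a single field inside the JSONB
CREATE INDEX metrics_cpu_idx ON metrics ((data->>'cpu'));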
Multiple inserts and upserts
In addition to normal INSERTs, TimescaleDB supports inserting multiple rows into a hypertable with a single INSERT:
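For example, with the illustrative conditions table:

-- Several rows in a single INSERT statement
INSERT INTO conditions VALUES
    ('2021-05-01 00:00:00', 'office', 21.5, 45),
    ('2021-05-01 00:00:00', 'garage', 18.2, 60),
    ('2021-05-01 00:05:00', 'office', 21.7, 44);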
We also have the UPSERT concept, which allows us to manage conflicts in insertions:
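A sketch, assuming the illustrative conditions table has a unique constraint on (time, location):

-- Update the existing row instead of failing on the duplicate key
INSERT INTO conditions VALUES ('2021-05-01 00:00:00', 'office', 22.0, 43)
ON CONFLICT (time, location)
DO UPDATE SET temperature = excluded.temperature, humidity = excluded.humidity;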
Queries
TimescaleDB and its hypertables behave just like a PostgreSQL table, so we can make JOINs between normal tables and hypertables:
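For example, joining the illustrative conditions hypertable with a hypothetical locations business table:

-- JOIN between a hypertable and a regular PostgreSQL table
SELECT l.name, avg(c.temperature) AS avg_temp
FROM conditions c
JOIN locations l ON l.code = c.location
GROUP BY l.name;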
In addition, it supports complex queries and analytical queries such as mean, median, percentiles, sums, histograms, gap filling, etc.:
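As an illustrative sketch, an hourly aggregation with gap filling and LOCF over the conditions table:

-- Hourly buckets; locf() carries the last known value forward over gaps
SELECT time_bucket_gapfill('1 hour', time) AS hour,
       locf(avg(temperature)) AS temperature
FROM conditions
WHERE time > now() - INTERVAL '1 day'
GROUP BY hour
ORDER BY hour;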
Platform Integration
Next, let's see how TimescaleDB integration works in Platform and how we can define a TimeSeries Ontology/Entity on top of TimescaleDB.
General information
When creating the timeseries, we will add a new option in the database engine selection. The rest of the fields in the section remain the same.
Properties
As far as TAGs and Signals are concerned, everything will remain the same as for MongoDB timeseries.
This selection will compose the table structure, where there will always be a "timestamp" field. The above example would translate to the following query:
CREATE TABLE timeserieExample (
    timestamp TIMESTAMP NOT NULL,
    tag1      TEXT      NOT NULL,
    tag2      TEXT      NOT NULL,
    signal1   NUMERIC   NULL,
    signal2   NUMERIC   NULL,
    UNIQUE (timestamp, tag1, tag2)
);
By default, any TAG and the timestamp cannot be null, since they are part of the key that composes a unique record. In the case of events (see below), no unique fields will be generated.
The set of unique fields defines the key on which we will allow UPSERTs. Currently, for MongoDB timeseries, an insert for a timestamp is actually an UPSERT of the specified value at the timestamp+TAGs/key level. For TimescaleDB, defining these unique fields is a requirement in order to perform UPSERTs. For each record to be inserted, the platform will take the key fields to compose the "conflict" criteria.
Let's assume the above example and insert the following data:
{"Timeserie":{"timestamp":"2021-05-01T00:00:00Z","tag1":"valor1",tag2":"valor2","signal1":1,"signal2":2}}
This is translated into a SQL statement:
insert into timeserieExample values (TIMESTAMP '2021-05-01 00:00:00', 'valor1', 'valor2', 1, 2);
If we then insert, for example, a new value for the first signal, with its aggregation function set to "SUM" (see below):
{"Timeserie":{"timestamp":"2021-05-01T00:00:00Z","tag1":"valor1","tag2":"valor2","signal1":2}}
The platform will detect the key and generate an UPSERT as follows:
insert into timeserieExample (timestamp, tag1, tag2, signal1) values (TIMESTAMP '2021-05-01 00:00:00', 'valor1', 'valor2', 2) on conflict (timestamp, tag1, tag2) do update set signal1 = timeserieExample.signal1 + excluded.signal1;
Timeseries properties
For TimescaleDB timeseries, the following properties can be configured:
Chunk time interval: This parameter defines the time window that is taken into account to create the chunks. It is recommended that chunks contain between 1 and 10 million records (depending on record size).
Window frequency: In this case, we define whether we want the timeseries to have discrete intervals when storing the measurements or not. The possibilities are:
No aggregation (EVENTS): There will be no window as such, i.e. the data will be stored as it is inserted. This follows a data pattern like that of events, where there is no frequency. In this case, no aggregation function will be established, since there will be no key defined.
The rest of the values will indicate how we aggregate the data in a selected frequency. In this case, the timestamps will be "rounded" to the period/frequency they belong to (just like we currently do for MongoDB in timeseries).
Aggregation functions: In the case of selecting a frequency, it will be possible to select between different aggregation functions (LAST, SUM) to know how to aggregate the data of each signal if more than one record is received for a set of TAGs and a particular frequency/timestamp.
In the case of MongoDB, you are allowed to create different aggregation windows. Because of how TimescaleDB stores data, this does not make sense there: for TimescaleDB, only one temporal window per timeseries can be defined.
Once the window is defined, we will generate the hypertable creation query according to the previous selections. The user can change whatever they deem necessary (partitions, properties, etc.), and ultimately this is what will define the hypertable in TimescaleDB.
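For the running timeserieExample, the generated conversion could look like this sketch (the interval value is illustrative):

-- Convert the generated table into a hypertable partitioned by timestamp
SELECT create_hypertable('timeserieExample', 'timestamp', chunk_time_interval => INTERVAL '7 days');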
Continuous aggregates
For TimescaleDB we will add the ability to create continuous aggregates. This is TimescaleDB's own functionality, which creates elements similar to materialized views over hypertables.
With them, we will be able to define views that contain the data of our timeseries aggregated at a temporal level. For example, if we have a timeseries whose data arrive every 15 minutes, we can define a continuous aggregate at the hourly level, so that queries that retrieve information at that level go against the new view, which is the most optimal option.
Based on the previous examples, we are going to create an hourly aggregate on the timeseries "timeserieExample", taking the average, maximum, minimum, last and first value of the signal "signal1".
CREATE MATERIALIZED VIEW timeserieExample_hourly
WITH (timescaledb.continuous) AS
SELECT tag1,
       tag2,
       time_bucket(INTERVAL '1 hour', timestamp) AS timestamp,
       AVG(signal1)              AS signal1Avg,
       MAX(signal1)              AS signal1Max,
       MIN(signal1)              AS signal1Min,
       LAST(signal1, timestamp)  AS signal1Last,
       FIRST(signal1, timestamp) AS signal1First
FROM timeserieExample
GROUP BY tag1, tag2, time_bucket(INTERVAL '1 hour', timestamp);
To do this, we will access the editing of the previously created entity.
We will select a name for the view, which will be concatenated to the name of the timeseries ontology with the character '_'. The "bucket Aggregation" field indicates the data frequency we aggregate to; in our case, we want one value per hour.
With the initial selection, a template will be created where the user will be able to modify the fields they want to add to the aggregation (the SELECT part). It is mandatory to keep the TAGs and the bucket inside the SELECT and the GROUP BY so that the query key stays the same. It is possible to use all the temporal functions that TimescaleDB provides, such as FIRST and LAST, which operate on the defined time interval.
The view can be refreshed manually but, in our case, we will define an update policy that will refresh the data as we specify:
Schedule Policy: How often the aggregate is recalculated.
Start Offset and End Offset: define the time interval, relative to the current (execution) time, that the process will cover. In the example we will recalculate only the data of the last month.
This planning will end up launching the following command to TimescaleDB:
SELECT add_continuous_aggregate_policy('timeserieExample_hourly',
start_offset => INTERVAL '1 month',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 minute');
In process: PUSH from another signal.
Chunk management, compression and deletion policies
In process: Switching to Cold Data
As for the cold database, pending further research, the same psql tools can be used to do a COPY to CSV, for example:
psql -d old_db -c "\COPY (SELECT * FROM conditions) TO data.csv DELIMITER ',' CSV"
On GitHub there are proposals for an export_chunks() function, but they have been rejected because, according to the maintainers, the same can be achieved with user-defined actions; this could imply exporting with PostgreSQL tools and then running TimescaleDB's drop_chunks().
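As an illustrative sketch of that flow, after exporting the data, the old chunks could be dropped (the retention interval is illustrative):

-- Drop the hypertable chunks containing only data older than three months
SELECT drop_chunks('conditions', older_than => INTERVAL '3 months');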