Data Governance in Onesait Platform

Need for Data Governance

Companies today are detecting numerous problems associated with data and its management, which result in operational issues and also prevent data from being used as a strategic asset.

These common shortcomings in data management affect projects in different dimensions:

  • Unavailability:
    • Heterogeneous and dispersed information sources hinder obtaining information.
    • Huge volumes of information that cannot be processed with the current technology.
    • Lack of understanding between the Business and Systems areas.
  • Lack of credibility:
    • Deficiencies in the quality of the information being handled.
    • Duplicated and inconsistent information.
    • Need to make assumptions when reconstructing historical information.
    • Highly manual preparation of the information.
  • No single view:
    • Differing criteria between the information handled by the different business units.
    • Information domains with restricted access.
    • Shortage of cross-cutting initiatives to add value.

Principles of Data Governance

Data Governance refers to the management of the availability, integrity, usability and security of the data used in an organisation or system; in our case, the Platform.

  • Data quality is determined by the quality of the data at the moment it is captured at the source of the information. It is therefore essential to respect a series of principles to ensure it.
  • To ensure a homogeneous and coherent understanding of the data, a data dictionary must be defined to keep the information and its traceability up to date.
  • A coherent data architecture must be implemented, from operation to exploitation. It is important that the data are validated and documented, and that their traceability is accurately known.
  • For optimal data exploitation, aggregation criteria must be defined, sufficient data granularity must be ensured and processes must be automated.
  • To guarantee data security, appropriate information access profiles and options for data encryption, ciphering and anonymisation must be defined, granted and managed correctly.
  • It is essential to define a series of policies and regulations to be respected in order to ensure the correct operation of the Data Governance Model.

Principles of Data Governance: QUALITY

  • QUALITY PREMISES:
    • Information completeness: Assurance that the standardized data, as a whole, make sense for their exploitation.
    • Global vision of business / infrastructure: Establishing the business definition of the data, so that they can be identified with an end-to-end vision, as well as with an infrastructure vision that makes the physical location of the data known.
    • Data responsibility: Assignment of responsibilities and roles for information management, data reliability, integrity, provisioning and exploitation, through a governance model that eases the monitoring of its evolution.
    • Efficient data: Compliance with the principles of data non-duplication, integrity and consistency.
  • KEY ELEMENTS IN QUALITY ASSURANCE:
    • Standards definition: Generation of minimum information requirements for the data to be considered correct (length checks, data types, formats).
    • Data validation: Establishment of validation mechanisms that allow the data to be integrated into the storage infrastructure according to the established standards, minimizing the number of incidents.
    • Cross vision of the data life cycle: Identification of the complete data life cycle process, enabling its global management through the traceability and mapping of information.
    • Implementation of a methodology: Implementation of perfectly defined and established procedures that allow for the constant execution of best practices.

Principles of Data Governance: HOMOGENEITY

  • Data traceability: Data must be perfectly identifiable with respect to their traceability, mapping and follow-up throughout the processes; this gives continuity to the cohesion and coherence of the information.
  • Fitting the business definition (data accuracy): Only the data required by the business need must be considered, avoiding duplicated information and improving the accuracy and sizing of the databases.
  • Related information: Maintenance of the data under logical rules and criteria that allow relationships to be established between the different repositories, facilitating aggregation based on the different typologies.
  • Standardization: Standards must be defined for the data across the supply, storage and exploitation processes, establishing consistent naming rules and clear methodologies to define procedures.
  • Support in the infrastructure: Use of tools that facilitate the functional and technical traceability of each piece of data: database administration, design, creation and maintenance of the database system.
  • Processes aligned to the organization / regulation: Procedures must be carried out according to the vision of the entity, as well as of the regulatory agents, limiting the margin for error in the information.

Principles of Data Governance: DATA ARCHITECTURE

A solid architecture must be based on key principles that support its structure. Our vision is as follows:

  • Robustness and flexibility: Generating a scalable architecture that allows the agile incorporation of new structural parts and components.
  • Unique sources: Avoiding duplication in information storage, instead generating synergies and avoiding duplicate work through a detailed information analysis.
  • End-to-end vision: Maintaining data traceability at all times to identify the information flow and transformation processes.

These principles must cover:

  • Procurement:
    • Alignment and standardization of input processes.
    • Clear identification of sources / databases / files / applications, as well as the delimitation of the typology of information: channels, web, risks.
    • Definition of input data validation processes.
  • Processes and storage:
    • Definition of the infrastructure and support technology with optimal performance and historical storage capacity.
    • Generate centralized repositories aligned with corporate strategy, with validated information.
    • Define calculation and data transformation processes according to business needs.
    • Data homogeneity and cohesion.
  • Exploitation:
    • Identification of the exploitation / distribution tools.
    • Definition of metrics and standards for reporting definition.
    • Align means of delivery and areas in charge of them.
    • Inventory control of outputs to avoid duplication in construction / development.
    • Generate synergies based on existing developments.
  • Monitoring and Control:
    • Documentation: Carry out proper follow-up of all the processes, actions and decisions taken to generate the data architecture, as well as of the data life cycle: supply, extraction, loading, exploitation.
    • Aligned user areas: Alignment of the user areas for accountability according to their participation in the process.
    • Global vision: One that allows aligning the life cycle with the corporate structure and procedures.

Principles of Data Governance: AGGREGATION

  • Definition of aggregation: The data architecture and technological infrastructure must support a definition that contains the criteria for grouping and dimensioning the data, in order to achieve aggregation at the minimum level of detail; this definition must also be documented.
  • Accuracy and completeness of the information: For the data to be properly aggregated, they must have a defined level of accuracy and completeness, so that risk calculations can be generated accurately and reliably to meet operational needs under both normal and stress conditions.
  • Automation: The aim is to eliminate manual work in the supply of information, implementing automatic processes that take the information and run the defined calculations, allowing proper drill-down through the different levels of information.
  • Analysis by information axis: Proper information aggregation requires analysing the cross-references that the data may have, through any field, with other similar data, providing a transversal vision of the information across its different axes and dimensions, and in turn allowing aggregation and drill-down to the minimum level of detail.
  • Clear definition of the logical structure: The data model must be based on a structure defined with an exploitation vision at different levels and views, supported by the consistency and cohesion between the different modules or components.

Principles of Data Governance: SECURITY

  • Data availability: Keeping the data available for whoever needs to access it, presented in a timely manner and in due form, whether for provisioning, users, applications or processes, meeting the service levels.
  • Information segmentation by type: Defining criteria to classify data into groups (critical, sensitive, etc.) so that criteria can be assigned by type, facilitating the management of profiles.
  • Combination of technological and business vision: Implementing specific physical and logical security rules for data protection, in accordance with the level of risk established by the entity.
  • Profile management: Managing access according to the standards allowed for each user type: User, with a limited view according to the defined profile; and Administrator, the access manager with a global view of the information.
  • Logs: Records that allow a global view of the information flow and of the people in charge of its treatment, whether data manipulation, extraction, or access to databases, applications, files, etc.

Principles of Data Governance: POLICY AND REGULATIONS


Data Governance Support in Onesait Platform


Having described the principles of a Data Governance model, let's see how the Platform supports each of them:

QUALITY:

  • QUALITY PREMISES:
    • Completeness of the information: to ensure that the data make sense for their exploitation, the Platform adopts a Data-Centric Architecture where any work with the platform starts with the definition of the ontology. More info here: /wiki/spaces/PT/pages/295022

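Conceptually, the ontology acts as the single schema on which every other element (APIs, dashboards, flows) is built. As a rough, platform-independent illustration of that idea (the ontology attributes are hypothetical and this standalone check does not use any platform API), a JSON-Schema-style definition and the validation of one record could look like this:

# Illustrative only: the attributes are hypothetical and no Onesait Platform
# API is used; the goal is just to show "the ontology as the single schema".
from jsonschema import validate, ValidationError

helsinki_population_schema = {
    "type": "object",
    "properties": {
        "year": {"type": "integer", "minimum": 1900},
        "district": {"type": "string", "maxLength": 64},
        "population": {"type": "integer", "minimum": 0},
    },
    "required": ["year", "district", "population"],
    "additionalProperties": False,
}

record = {"year": 2021, "district": "Kallio", "population": 25000}

try:
    validate(instance=record, schema=helsinki_population_schema)
    print("Record conforms to the ontology definition")
except ValidationError as err:
    print(f"Record rejected: {err.message}")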
    • Global vision of the business / infrastructure: regarding the business definition of the data with an end-to-end vision, and given the Data-Centric architecture, all data processing starts from the ontology, so you can create a REST API on specific queries of an ontology, generate a dashboard to visualize the data of an ontology, associate a process to an ontology,...

    • Data responsibility: regarding responsibilities and roles on the information management, starting from the ontology concept, the ontology owner can assign permissions to other users, so that they can consult, or even manage, that ontology:

Through the concept of Project, you can assign these permissions to certain roles:

    • Efficient data: Regarding compliance with the principles of non-duplication, integrity and consistency of data, the platform allows you to associate a Notebook in which statistical calculations are made on the content of an ontology, such as averages, the number of null values in attributes,...

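As an illustration of the kind of checks such a Notebook typically runs (the column names are hypothetical; on the platform this would be a Notebook paragraph executed over the ontology's data):

# Hypothetical data-quality profiling, similar to what a Notebook paragraph
# could compute over the content of an ontology.
import pandas as pd

df = pd.DataFrame(
    {
        "device_id": ["S-001", "S-002", "S-002", "S-002"],
        "temperature": [21.5, None, 19.8, 19.8],
    }
)

report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),         # non-duplication
    "nulls_per_attribute": df.isna().sum().to_dict(),      # completeness
    "mean_temperature": float(df["temperature"].mean()),   # basic statistics
}
print(report)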
  • KEY ELEMENTS IN QUALITY ASSURANCE
    • Standards definition: when an ingest is made, the platform allows you to define the requirements for a record to be considered correct (length, data type, format) and, based on that, the data is either accepted or sent to an error-handling process. This processing is typically done in an ingest DataFlow. In the example, a replacement and a conversion are applied to adapt the incorrect data before finally inserting it into the platform:


    • Data validation: this same component (DataFlow) makes it possible to define validation mechanisms that integrate the data according to the established standards:

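As a rough, platform-independent sketch of what such an ingest step does (the field names, rules and conversions are hypothetical; in the platform this logic would be built visually as a DataFlow pipeline rather than written as code):

# Hypothetical example: field names, rules and conversions are illustrative;
# on the platform this logic would live in a DataFlow pipeline.
from datetime import datetime
from typing import Optional

def standardize(record: dict) -> Optional[dict]:
    """Replacement / conversion step: try to adapt an incorrect record."""
    fixed = dict(record)
    # Conversion: accept temperatures sent as strings ("21.5") and cast them.
    if isinstance(fixed.get("temperature"), str):
        try:
            fixed["temperature"] = float(fixed["temperature"])
        except ValueError:
            return None
    # Replacement: normalize the timestamp format to ISO 8601.
    if isinstance(fixed.get("timestamp"), str):
        try:
            fixed["timestamp"] = datetime.strptime(
                fixed["timestamp"], "%d/%m/%Y %H:%M"
            ).isoformat()
        except ValueError:
            pass  # it may already be ISO 8601; is_valid() decides below
    return fixed

def is_valid(record: dict) -> bool:
    """Minimum requirements for a record to be considered correct."""
    return (
        isinstance(record.get("device_id"), str)
        and len(record["device_id"]) <= 32                        # length
        and isinstance(record.get("temperature"), (int, float))   # data type
        and isinstance(record.get("timestamp"), str)
        and "T" in record["timestamp"]                             # format
    )

accepted, errors = [], []
for raw in [
    {"device_id": "S-001", "temperature": "21.5", "timestamp": "03/02/2024 10:15"},
    {"device_id": "S-002", "temperature": None, "timestamp": "2024-02-03T10:20:00"},
]:
    fixed = standardize(raw)
    (accepted if fixed is not None and is_valid(fixed) else errors).append(raw)

print(f"{len(accepted)} record(s) accepted, {len(errors)} sent to error processing")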
    • Cross vision of the data life cycle: the platform allows you to identify the complete data life cycle process through an ontology's graph of relations, which shows which processes are carried out on it. In the example, you can see the relationships of the HelsinkiPopulation ontology with its API:


HOMOGENEITY

  • Data traceability: the data can be identified, mapped and tracked throughout the processes. Beyond the graph that relates them, the platform audits all the operations performed on the data:

In addition, the DataFlow lets you see how the data are treated as they pass through the different steps:

  • Related information: the platform allows the data to be maintained under rules and criteria that enable aggregation across the different typologies. The ontology abstracts away the underlying repository and manages it transparently:

Moreover, ontologies can be linked to relate them to each other (/wiki/spaces/PT/pages/230074).


  • Support in the infrastructure: the platform takes care of managing all the data processing and storage infrastructure, so developers don't have to use a tool to model the database: they simply define the ontology and indicate in which database it should be persisted. Then, when performing the ingest, they simply indicate that they want to ingest into that ontology:

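For instance, an ingest from code only needs to reference the ontology, never the repository behind it. The snippet below is a hypothetical sketch: the host, endpoint path and authentication header are assumptions made for illustration, not the documented platform API.

# Hypothetical sketch: the endpoint path, host and authentication header are
# assumptions for illustration, not the documented Onesait Platform API.
import requests

PLATFORM_URL = "https://my-platform.example.com"   # placeholder host
ONTOLOGY = "HelsinkiPopulation"                     # target ontology (entity)
TOKEN = "<device-or-user-token>"                    # placeholder credential

record = {"year": 2021, "district": "Kallio", "population": 25000}

# The insert targets the ontology by name; the platform decides in which
# repository (MongoDB, relational, etc.) the data actually end up.
response = requests.post(
    f"{PLATFORM_URL}/api/ontologies/{ONTOLOGY}",    # hypothetical path
    json=record,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
response.raise_for_status()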
  • Processes aligned to the organization / regulation: regarding the use of friendly tools for the definition of data structures by non-technical staff, as mentioned, the concept of ontology homogenizes the data regardless of the technological or functional domain, facilitating alignment with standards.

DATA ARCHITECTURE

Onesait Platform provides a solid architecture based on key principles that support its structure, such as:

  • Robustness and Flexibility.
  • Unique sources.
  • End-to-end vision.

These are materialized in:

  • Procurement: the platform offers several components for the provisioning of information, mainly DataFlow and FlowEngine, although data scientists can also use Notebooks for this.
    • Alignment and standardization of input processes: specific pipelines can be created in DataFlow to adapt input messages from different sources:

    • Clear identification of sources / databases / files / applications, as well as the delimitation of the typology of information: in DataFlow you can have the sources identified at a glance:

    • Definition of input data validation processes:

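As a rough sketch of what these provisioning steps amount to (the source names and field mappings are hypothetical; in the platform they would be configured visually as DataFlow pipelines):

# Hypothetical sketch of input standardization: two sources deliver the same
# information with different field names, and both are mapped to a common
# structure tagged with its origin before validation.
FIELD_MAPS = {
    "web": {"id": "device_id", "temp": "temperature", "ts": "timestamp"},
    "channels": {"sensor": "device_id", "value": "temperature", "time": "timestamp"},
}

def standardize_input(source: str, message: dict) -> dict:
    """Rename source-specific fields to the common ontology attributes."""
    mapping = FIELD_MAPS[source]
    normalized = {target: message[origin] for origin, target in mapping.items()}
    normalized["source"] = source   # keep the origin identified
    return normalized

web_msg = {"id": "S-001", "temp": 21.5, "ts": "2024-02-03T10:15:00"}
chan_msg = {"sensor": "S-002", "value": 19.8, "time": "2024-02-03T10:20:00"}

print(standardize_input("web", web_msg))
print(standardize_input("channels", chan_msg))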
  • Processes and storage: as indicated, the concept of ontology isolates the database in which the data are stored, and the platform supports different types of storage.

    • Definition of the infrastructure and support technology with optimal performance and historical storage capacity: the platform integrates with different databases: relational databases such as MySQL, PostgreSQL, Oracle or SQL Server; NoSQL databases such as MongoDB or Elasticsearch; Big Data databases such as Kudu, BigQuery or Hive,... Depending on the use you will make of an ontology, the most appropriate repository is selected, and the platform either provides that repository or connects the ontology to an already connected database.

    • Storage in the most suitable repository for the data life cycle: developers can choose the most suitable repository depending on the type of data and its exploitation, so they may choose MongoDB for semi-structured data, or Elasticsearch for data they want to search.
    • Generate centralized repositories aligned with corporate strategy, with validated information: the platform makes it easy to align technology with strategy, since it lets you use different repositories, process the data in a consistent way,...
    • Define calculation and data transformation processes according to business needs: as we have already seen, the platform offers different tools for this.
    • Data homogeneity and cohesion.
  • Exploitation:
    • Identification of the exploitation / distribution tools: the platform offers different tools to exploit the data. Depending on the need for exploitation, the following can be used:
  • Monitoring and Control: 

    • Centralized control panel that facilitates the management of data, rules, security, access, and backup and recovery processes: the platform offers the Control Panel, a web console from which all processes can be performed. Depending on the role, different tasks can be done: an Administrator can see all the concepts defined in the platform, define security and accesses, schedule backups, perform a recovery...

while a Developer role can manage their own concepts.

    • Definition of users and roles per project, configuring different levels of access to each stored ontology (entity): through the concept of Realm you can create roles, assign users to these roles, configure these roles' access to the different elements,...

    • Exposing subsets of information to specific users through queries:

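As an illustration (the attribute names are hypothetical), a query such as the one below could back an API published only to specific users or roles, so that those consumers only ever see the projected and filtered subset of the ontology:

# Hypothetical example: the ontology attributes are illustrative.
SUBSET_QUERY = """
    SELECT district, population
    FROM HelsinkiPopulation
    WHERE year = 2021
"""

# The same subsetting expressed in plain Python over sample records:
records = [
    {"year": 2020, "district": "Kallio", "population": 24500},
    {"year": 2021, "district": "Kallio", "population": 25000},
]
subset = [
    {"district": r["district"], "population": r["population"]}
    for r in records
    if r["year"] == 2021
]
print(subset)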
    • Monitoring of data intake: the ingest tool (DataFlow) monitors the intake constantly.

AGGREGATION

  • Definition of aggregation: By combining the capabilities of DataFlow, Notebooks and FlowEngine, you can run automatic (triggered by events or scheduled) or manual information aggregation and analysis processes (a minimal sketch of such an aggregation is shown after this list).
  • Accuracy and completeness of the information: as we have seen, the DataFlow allows you to control the integrity of the information so that, accordingly, the data is either completed or sent to an error queue for manual revision:

  • Automation: all the supply processes are launched automatically and can be scheduled,...

  • Clear definition of the logical structure: the concept of ontology, which is decoupled from the underlying repository while letting the other modules work on this abstraction, provides consistency and cohesion to the platform.
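A minimal sketch of such an aggregation, assuming hypothetical attributes (on the platform, the same grouping could run in a Notebook or be materialized by an automatic DataFlow / FlowEngine process):

# Hypothetical aggregation by information axes: attribute names are
# illustrative. The grouping keeps the detail available for drill-down
# to the minimum level (district, year).
import pandas as pd

detail = pd.DataFrame(
    {
        "district": ["Kallio", "Kallio", "Töölö", "Töölö"],
        "year": [2020, 2021, 2020, 2021],
        "population": [24500, 25000, 14200, 14350],
    }
)

# Aggregate along one axis (year) while the detail table remains the
# minimum level of detail for drill-down.
by_year = detail.groupby("year", as_index=False)["population"].sum()
print(by_year)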

SECURITY

  • Data availability: Configuring the import processes and the alerts guarantees the availability of the imported data in their final consolidated state, as well as awareness of any related incident.
  • Combination of technological and business vision: this alignment is achieved through the wizards, visual development and global web environment.
  • Profile management: the web console handles three roles: administrator, analytics developer and user, each with limited permissions (see /wiki/spaces/PT/pages/130318408). In addition, the platform allows you to define other roles through the concept of Realm: /wiki/spaces/PT/pages/3080233.
  • Logs: It is possible to consult the execution statistics of any pipeline, the processed data and the history of the pipeline in real time. There is also a pre-configured information audit system for use in the different projects.

Supporting materials

The following is a presentation that illustrates the concepts discussed in the article: