GxP Data Lake

Commissioned by: NTT DATA
Assignment: Brochure

(Download the brochure here or from the NTT DATA website.)

The NTT DATA GxP Data Lake provides a highly scalable and compliant cloud-based data management platform for pharmaceutical companies that wish to gather various types of data from multiple IT systems and external data sources. It can be applied to a wide variety of business challenges, enabling pharma companies to become more data-driven and embrace Pharma 4.0.

Business Case

With the rise of new technologies, hyperscale cloud-based platforms and predictive analytics, pharmaceutical companies want to extract the full potential of the data stored in their IT ecosystem by integrating all systems into a single data platform.

By doing so, they can achieve significant benefits by integrating data along the complete value chain, from early research to clinical stages, through to manufacturing, distribution and market surveillance. By moving from a chain into a value loop, they turn insights into business value.

Addressing the Challenge

The proliferation of information siloes, common to many large enterprises, makes integrated business intelligence difficult to achieve. In a pharmaceutical company this manifests itself in several ways: a loss of competitive advantage, an inability to gain market insights, wide deviations from expected results, and a higher risk to patient safety, product quality and efficacy.

Enterprises that have tried to break down these siloes using conventional data management solutions have often hit roadblocks: a lack of scalability, data scattered rather than held in a central place for cross-referencing, the burden of error-prone manual data cleansing and harmonization, or difficulty validating solutions built on cloud infrastructure. These companies also face a huge challenge in maintaining data integrity, a key requirement that regulators focus on in audits.

The NTT DATA GxP Data Lake helps pharmaceutical companies overcome these data integration challenges.

NTT DATA GxP Data Lake

The NTT DATA GxP Data Lake preserves all data in its original form and captures changes to data and contextual semantics throughout the data lifecycle. This is particularly relevant in the pharmaceutical sector for ensuring compliance and performing audits: the data lake keeps track of all transformations and updates performed on a given piece of data, ensuring data integrity by design.
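
By way of illustration only (this is a sketch of the principle, not the product's implementation), an append-only record of every transformation is what keeps a data point's full history auditable:

```python
# Illustrative append-only audit trail: each transformation is logged with
# the value it produced and the actor that produced it, so the history of
# a data point can always be reconstructed for an audit.
history = []

def record(value, action, who):
    history.append({"value": value, "action": action, "by": who})
    return value

record(98.4, "captured", "LIMS")        # original measurement, kept as-is
record(98.0, "rounded", "etl_job")      # later transformation, also logged

print([h["action"] for h in history])   # -> ['captured', 'rounded']
```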

The data collected in the lake can then be made available to a wide variety of tools and analytical platforms, according to the desired use case.

Main Components

Data Lake

The Snowflake cloud-based platform provides standard data lake features including unlimited scalability, mixed data types, and support for different languages including SQL, Python and Java. Snowflake can be deployed on Microsoft Azure or AWS, according to customer preference.

Compliance Control Room

This acts as the central nervous system of the solution and can be deployed on top of the data ecosystem components to keep them all in check from a GAMP point of view. At the use case level, the criticality of a deployed use case can be determined, including alerting, authorizations and change control. The control room monitors aspects such as data freshness, schema drift, value deviations and component health, ensuring that GxP-critical actions can be executed while keeping risks under control.
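
To give a flavour of two of these checks, here is a minimal sketch; the function names, fields and thresholds are assumptions for illustration, not the control room's API:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded: datetime, max_age: timedelta) -> bool:
    """Pass if the most recent load is within the allowed age."""
    return datetime.now(timezone.utc) - last_loaded <= max_age

def check_schema_drift(expected: dict, observed: dict) -> list:
    """List columns that were added, removed, or changed type."""
    drift = []
    for col, dtype in observed.items():
        if col not in expected:
            drift.append(f"added: {col}")
        elif expected[col] != dtype:
            drift.append(f"type changed: {col}")
    drift += [f"removed: {col}" for col in expected if col not in observed]
    return drift

expected = {"batch_id": "str", "yield_pct": "float"}
observed = {"batch_id": "str", "yield_pct": "int", "operator": "str"}
print(check_schema_drift(expected, observed))
```

In a real deployment such signals would feed the alerting and change-control workflows described above.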

Graph Database

A graph database uses graph structures for semantic queries and is an important tool for uncovering relationships across diverse data.
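
As a toy illustration of the idea (an in-memory stand-in, not a graph database product), following typed relationships can link a batch record to its material lot and on to the supplier:

```python
# Tiny in-memory graph: (node, relation) -> neighbouring node.
# The node names and relation types here are invented for illustration.
edges = {
    ("Batch-042", "USES"): "Lot-A17",
    ("Lot-A17", "SUPPLIED_BY"): "Acme Chemicals",
}

def follow(node, relation):
    """Traverse one typed edge from a node, or None if absent."""
    return edges.get((node, relation))

lot = follow("Batch-042", "USES")
print(follow(lot, "SUPPLIED_BY"))  # -> Acme Chemicals
```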

Data Integration

Data for the data lake can be ingested directly from any SAP environment with the NTT SAP Connector, or via Microsoft Azure Data Factory, Fivetran or Matillion.

Data Catalog

The data catalog, MS Purview or Atlan, provides an inventory of all the data assets in an organization and ensures that the most appropriate data for a particular application can be quickly located. It is also leveraged to enforce governance across the data and manage its quality.
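
As a toy illustration of catalog lookup (the real catalogs expose far richer search; the asset fields below are invented):

```python
# A miniature catalog: each entry describes one data asset with an owner
# and searchable tags, so the right data can be located quickly.
catalog = [
    {"name": "batch_release", "owner": "QA", "tags": ["GxP", "manufacturing"]},
    {"name": "clinical_visits", "owner": "ClinOps", "tags": ["GxP", "clinical"]},
]

def find_assets(tag):
    """Return the names of all assets carrying the given tag."""
    return [a["name"] for a in catalog if tag in a["tags"]]

print(find_assets("clinical"))  # -> ['clinical_visits']
```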

Data Lineage

Data lineage tools keep track of the source systems from which the data originates, how the data was transformed along the way, and whether it is still trustworthy. They also allow impact analysis to be performed before existing pipelines are updated: lineage can pinpoint which use cases consume data from the same pipelines, so those use cases can be regression-tested before pipeline updates go live.
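
The impact-analysis step can be sketched as follows; the pipeline and use case names are illustrative, not drawn from the product:

```python
# Minimal lineage map: which use cases consume which pipeline. Impact
# analysis before a pipeline update is then a simple lookup of every
# downstream use case that needs regression testing.
lineage = {
    "sap_batch_pipeline": ["yield_dashboard", "deviation_alerts"],
    "labware_pipeline": ["stability_report"],
}

def impacted_use_cases(pipeline: str) -> list:
    """Use cases to regression-test before updating this pipeline."""
    return sorted(lineage.get(pipeline, []))

print(impacted_use_cases("sap_batch_pipeline"))
# -> ['deviation_alerts', 'yield_dashboard']
```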

Key Benefits

  • Ensures data integrity by design
  • Splits data lake into zones for optimal control
  • Automates data cleansing and harmonization activities
  • Enables big data analytics at scale
  • Optimizes integration between data science, engineering, development and validation activities
  • Allows self-service analytics for business users
