Big Data Governance: Metadata Is the Key

A new approach to data governance is needed in the age of big data, when data is scattered throughout the enterprise in many formats and comes from many sources.

As the volume, variety and velocity of available data all continue to grow at astonishing rates, businesses face two urgent challenges: how to uncover actionable insights within this data, and how to protect it. Both of these challenges depend directly on a high level of data governance.

The Hadoop ecosystem can provide that level of governance using a metadata approach, ideally on a single data platform.

A new approach to governance is needed for several reasons. In the age of big data, data is scattered throughout the enterprise in structured, semi-structured, unstructured, and various other formats. Furthermore, the sources of the data are not under the control of the teams that need to manage it.

In this environment, data governance includes three important goals:

  • Maintaining the quality of the data
  • Implementing access control and other data security measures
  • Capturing the metadata of datasets to support security efforts and facilitate end-user data consumption

Solutions within the Hadoop Ecosystem

One way to approach big data governance in a Hadoop environment is through data tagging. In this approach, the metadata that will govern the data's use is embedded with the data as it passes through various enterprise systems. Furthermore, this metadata is enhanced to include information beyond common attributes like file size, permissions, and modification dates. For example, it might include business metadata that would help a data scientist evaluate a dataset's usefulness in a particular predictive model.
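To make the idea concrete, here is a minimal sketch of what an enriched metadata record for one dataset might look like. The field names and values are hypothetical, not part of any standard schema; the point is the split between technical attributes and business context.

```python
# A hypothetical enriched-metadata record for one dataset. The technical
# attributes (path, size, owner, modified) are what the filesystem already
# tracks; the business fields are the kind of context a data scientist
# needs to judge whether the dataset fits a predictive model.
dataset_metadata = {
    # Common technical attributes
    "path": "/data/claims/2023/q4",
    "size_bytes": 52_428_800,
    "owner": "etl_service",
    "modified": "2024-01-05T03:12:00Z",
    # Business metadata (hypothetical field names)
    "business_domain": "insurance-claims",
    "data_steward": "claims-analytics-team",
    "sensitivity": "PII",  # drives access-control decisions downstream
    "refresh_cadence": "daily",
    "known_quality_issues": ["late-arriving records from region EU"],
}
```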

Finally, unlike enterprise data itself, metadata can be centralized on a single platform.

The standard Hadoop Distributed File System (HDFS) has an extended-attributes capability that allows enriched metadata, but it isn't always adequate for big data. Fortunately, an alternative exists. The Apache Atlas metadata management system enables data tagging and can also serve as a centralized metadata store, one that offers "one-stop shopping" for data analysts searching for relevant datasets. Also, users of the popular Hadoop-friendly Hive and Spark SQL query engines can do the tagging themselves.
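A minimal sketch of both mechanisms follows: setting an HDFS extended attribute with the standard `hdfs dfs -setfattr` command, and attaching an Atlas classification (tag) to an already-registered entity through the Atlas v2 REST API. The host, credentials, path, entity GUID, and tag name are all placeholders; verify the endpoint and payload against your Atlas version.

```python
import subprocess

import requests

# 1) HDFS extended attributes: tag a file directly in the filesystem.
#    The 'user.' namespace is the standard one for application metadata.
subprocess.run(
    ["hdfs", "dfs", "-setfattr", "-n", "user.sensitivity", "-v", "PII",
     "/data/claims/2023/q4"],
    check=True,
)

# 2) Apache Atlas: attach a classification to an existing entity via the
#    v2 REST API. Host, credentials, and GUID below are placeholders.
ATLAS = "http://atlas.example.com:21000/api/atlas/v2"
resp = requests.post(
    f"{ATLAS}/entity/bulk/classification",
    json={
        "classification": {"typeName": "PII"},
        "entityGuids": ["<entity-guid>"],
    },
    auth=("admin", "admin"),
)
resp.raise_for_status()
```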

For security, Atlas can be integrated with Apache Ranger, a system that provides role-based access to Hadoop platforms.
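To sketch how that integration pays off: once Atlas tags are synchronized into Ranger, a single tag-based policy can govern every asset carrying a given tag, wherever it lives. The skeleton below posts an illustrative policy to Ranger's public REST API; the service name, group, and field values are placeholders, and the exact policy schema should be checked against your Ranger release.

```python
import requests

RANGER = "http://ranger.example.com:6080"  # placeholder host

# Illustrative tag-based policy: any asset tagged PII in Atlas becomes
# queryable only by the 'pii-readers' group. All names are placeholders.
policy = {
    "service": "tags",          # the tag-based service instance in Ranger
    "name": "pii-read-policy",
    "policyType": 0,            # 0 = access policy
    "resources": {"tag": {"values": ["PII"], "isExclude": False}},
    "policyItems": [{
        "groups": ["pii-readers"],
        "accesses": [{"type": "hive:select", "isAllowed": True}],
    }],
}

resp = requests.post(
    f"{RANGER}/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin"),
)
resp.raise_for_status()
```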

Platform loading challenges

The initial loading of metadata into the Atlas platform and the incremental loading that follows both present significant challenges. For large enterprises, sheer data volume will be the main problem in the initial phase, and it may be necessary to optimize some code to carry out this phase efficiently.
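One common way to keep the initial load tractable is to register entities in batches rather than one REST call at a time. The sketch below posts simplified, hypothetical hive_table entities to Atlas's bulk endpoint in fixed-size chunks; real hive_table entities require more attributes than shown, and the batch size is a tuning knob, not a recommendation.

```python
import requests

ATLAS = "http://atlas.example.com:21000/api/atlas/v2"  # placeholder host
BATCH_SIZE = 100  # tune to what the Atlas server handles comfortably

def register_tables(table_names):
    """Register hive_table entities in batches instead of one call each.

    Simplified sketch: real hive_table entities carry more attributes
    (database reference, columns, and so on) than shown here.
    """
    entities = [
        {
            "typeName": "hive_table",
            "attributes": {
                "qualifiedName": f"warehouse.{name}@prod",
                "name": name,
            },
        }
        for name in table_names
    ]
    for i in range(0, len(entities), BATCH_SIZE):
        batch = entities[i : i + BATCH_SIZE]
        resp = requests.post(
            f"{ATLAS}/entity/bulk",
            json={"entities": batch},
            auth=("admin", "admin"),
        )
        resp.raise_for_status()
```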

Incremental loading is a more complex issue, because tables, indexes, and authorized users change all the time. If these changes aren't quickly reflected in the available metadata, the ultimate result is a reduction in the quality of the data available to end users. To avoid this problem, event listeners should be built into the system so that changes can be captured and processed in near real time. A near-real-time solution not only means better data quality; it also improves developer productivity, because developers don't have to wait for a batch process.
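Atlas itself publishes entity-change notifications to a Kafka topic (ATLAS_ENTITIES in standard deployments), so a small consumer can act as exactly this kind of event listener. A minimal sketch using the kafka-python client; the broker address, consumer group, and handling logic are placeholders, and the notification envelope varies by Atlas version.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Atlas publishes entity create/update/delete notifications to the
# ATLAS_ENTITIES topic; consuming it lets downstream systems react in
# near real time instead of waiting for a batch job.
consumer = KafkaConsumer(
    "ATLAS_ENTITIES",
    bootstrap_servers="kafka.example.com:9092",  # placeholder broker
    group_id="metadata-sync",                    # placeholder group
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Placeholder handling: in practice, route on the operation type
    # (ENTITY_CREATE, ENTITY_UPDATE, ...) and refresh downstream caches
    # or search indexes accordingly.
    print(event)
```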

The foundation of digital transformation

As businesses pursue digital transformation and seek to become more data-driven, senior management needs to recognize that no progress in this direction can be made without quality data, and quality data requires strong data governance. When big data is involved, governance based on enhanced metadata that resides in a central repository is a solution that works.

Source: https://www.informationweek.com