Using Schema Registry
Schema Registry provides a shared repository of schemas that enables applications to flexibly interact with each other.
Applications often need a way to share metadata across three dimensions:
- Data format
- Schema
- Semantics or meaning of the data
Schema Registry provides the following value to the applications that integrate with it:
- Centralized registry – Provides a reusable schema to avoid attaching a schema to every piece of data
- Version management – Defines the relationship between schema versions so that consumers and producers can evolve at different rates
- Schema validation – Enables generic format conversion, generic routing, and data quality
Why Schema Registry?
Because of the decoupled nature of Kafka, producers and consumers don't communicate directly; information is transferred through a Kafka topic. At the same time, the consumer still needs to know what type of data the producer is sending so that it can deserialize it. If the producer starts sending bad data to Kafka, or if the data type changes, downstream consumers start breaking. There must be a common data type that's agreed upon.
That's where Schema Registry comes in. Schema Registry is an application that resides outside of the Kafka cluster and handles the distribution of schemas to producers and consumers by storing a copy of the schema in its local cache.
Use Cases
Schema Registry provides a set of REST APIs (with web interfaces) to make schemas centrally available to client applications. The common use cases are:
- Registering and Querying a Schema for a Kafka Topic
When Kafka is integrated into enterprise deployments, you typically have many different Kafka topics used by different apps and users. A sketch of looking up a registered schema over the REST API follows this list.
- Reading/Deserializing and Writing/Serializing Data from and to a Kafka Topic
In addition to storing schema metadata, another key use case is storing metadata about how data is read and how it's written. Schema Registry supports this use case by providing capabilities to store JAR files for serializers and deserializers and then map the serdes to the schema.
- Dataflow Management with Schema-based Routing
If you're using NiFi to move different types of syslog events to downstream systems, there are data movement requirements where you must parse the syslog event to extract the event type, and route the event to a certain downstream system (different Kafka topics, for example) based on the event type.
- Dynamic schema evolution enables a schema to evolve over time.
Any updates in the schema create a new revision that's communicated to consumers.
- Schema enforcement/validation from the consumer side can be achieved seamlessly for data quality.
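To illustrate the first use case (registering and querying a schema), the following is a minimal sketch that looks up the latest version of a schema over the registry's REST API. The registry host, port, endpoint path, and schema name shown here are assumptions; verify them against your deployment.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: fetch the latest registered version of a schema by name.
// The registry URL and endpoint path below are assumptions; adjust them to
// match your Schema Registry deployment.
public class SchemaLookup {
    public static void main(String[] args) throws Exception {
        String registryUrl = "http://localhost:9090/api/v1/schemaregistry"; // assumed base URL
        String schemaName = "truck_events";                                 // hypothetical schema name

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registryUrl + "/schemas/" + schemaName + "/versions/latest"))
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // The response body contains the schema text and version metadata as JSON.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```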
Starting the Kafka Producer/Consumer Using the Java Client
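A minimal producer sketch, assuming the Hortonworks/Cloudera Schema Registry serdes artifact is on the classpath, might look like the following. The serializer class name, the schema.registry.url property, the broker address, the topic, and the Avro schema are all assumptions to adapt to your environment.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal producer sketch. The Avro serializer class and the
// "schema.registry.url" property are assumptions based on the Schema Registry
// serdes; adjust the broker list, registry URL, topic, and schema to your cluster.
public class AvroProducerExample {
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Assumed serializer from the Schema Registry serdes artifact; it registers
        // the record's schema with the registry and references the schema version
        // in each record instead of attaching the full schema.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "com.hortonworks.registries.schemaregistry.serdes.avro.kafka.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry-host:9090/api/v1"); // assumed URL

        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "alice", user));
            producer.flush();
        }
    }
}
```

A consumer would mirror this configuration with the corresponding Avro deserializer, fetching the schema version referenced in each record from the registry before deserializing.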
Adding Apache Schema Registry to ODH Clusters
You can add Apache Schema Registry to Big Data Service clusters using Ambari.
Schema Registry isn't installed by default on a Big Data Service cluster. Schema Registry supports two storage types for storing schema metadata:
- MySQL: value to be set: mysql
- In-memory: value to be set: inmemory
To set the desired storage type, set the Storage Type parameter under the Advanced Schema-registry section in the Ambari UI.
To use MySQL as the storage type, several tables must be created before starting the server. This is done automatically when the Storage Type parameter is set to mysql.
The MySQL Password parameter is mandatory when installing the Schema Registry service. This password is used for accessing the schema information stored in the MySQL tables. For Big Data Service 3.0.28 or earlier clusters, MySQL Password must be reset manually from the un0 node by running the following command:
sudo update_mysql_password registry <password>
Provide this password during the Schema Registry service installation.