Using Schema Registry

Schema Registry provides a shared repository of schemas that enables applications to flexibly interact with each other.

Applications often need a way to share metadata across three dimensions:

  • Data format

  • Schema

  • Semantics or meaning of the data

Schema Registry provides the following value to the applications that integrate with it:

  • Centralized registry – Provides reusable schemas so that a schema doesn't have to be attached to every piece of data

  • Version management – Defines the relationship between schema versions so that consumers and producers can evolve at different rates

  • Schema validation – Enables generic format conversion, generic routing, and data quality checks

Why Schema Registry?

Because of the decoupled nature of Kafka, producers and consumers don't communicate directly; instead, information is transferred through Kafka topics. At the same time, the consumer still needs to know the type of data the producer is sending in order to deserialize it. If the producer starts sending bad data to Kafka, or if the data type changes, downstream consumers start breaking. There must be a way for producers and consumers to agree on a common data type.

That’s where Schema Registry comes in. Schema Registry is an application that resides outside of the Kafka cluster and handles the distribution of schemas to producers and consumers, storing a copy of each schema in its local cache.

Use Cases

Schema Registry provides a set of REST APIs (with a web interface) to make schemas centrally available to client applications. The common use cases are:

  1. Registering and Querying a Schema for a Kafka Topic

    When Kafka is integrated into enterprise deployments, you typically have many different Kafka topics used by different applications and users; a sketch of registering and querying a topic schema follows this list.

  2. Reading/Deserializing and Writing/Serializing Data from and to a Kafka Topic

    In addition to storing schema metadata, another key use case is storing metadata about how data is read and written. Schema Registry supports this use case by providing the ability to store JAR files for serializers and deserializers and then mapping those serdes to the schema.

  3. Dataflow Management with Schema-based Routing

    If you're using NiFi to move different types of syslog events to downstream systems, there are data movement requirements where you must parse the syslog event to extract the event type, and then route the event to a particular downstream system (different Kafka topics, for example) based on that event type.

  4. Dynamic schema evolution enables schemas to evolve over time.

    Any update to a schema creates a new version that's communicated to consumers.

  5. Schema enforcement and validation from the consumer side can be achieved seamlessly to improve data quality.
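
A minimal sketch of use cases 1 and 4 (registering a schema for a Kafka topic, evolving it, and querying the latest version) using the Java client from the Hortonworks registries library is shown below. The schema name truck-events, the Avro schema text, and the registry URL are placeholders, and exact class and method names can vary between schema-registry-client versions.

    import java.util.Collections;
    import java.util.Map;

    import com.hortonworks.registries.schemaregistry.SchemaCompatibility;
    import com.hortonworks.registries.schemaregistry.SchemaIdVersion;
    import com.hortonworks.registries.schemaregistry.SchemaMetadata;
    import com.hortonworks.registries.schemaregistry.SchemaVersion;
    import com.hortonworks.registries.schemaregistry.SchemaVersionInfo;
    import com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient;

    public class SchemaRegistryExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the Schema Registry REST API (placeholder host).
            Map<String, Object> conf = Collections.singletonMap(
                    "schema.registry.url", "http://<host name>:9093/api/v1");
            try (SchemaRegistryClient client = new SchemaRegistryClient(conf)) {
                // Register schema metadata for a hypothetical truck-events topic.
                SchemaMetadata metadata = new SchemaMetadata.Builder("truck-events")
                        .type("avro")
                        .schemaGroup("kafka")
                        .description("Truck telemetry events")
                        .compatibility(SchemaCompatibility.BACKWARD)
                        .build();
                client.registerSchemaMetadata(metadata);

                // Add a schema version; later updates create new versions (schema evolution).
                String schemaText = "{\"type\":\"record\",\"name\":\"TruckEvent\",\"fields\":["
                        + "{\"name\":\"driverId\",\"type\":\"int\"},"
                        + "{\"name\":\"eventType\",\"type\":\"string\"}]}";
                SchemaIdVersion added = client.addSchemaVersion("truck-events",
                        new SchemaVersion(schemaText, "initial version"));

                // Query the latest registered version for the topic.
                SchemaVersionInfo latest = client.getLatestSchemaVersionInfo("truck-events");
                System.out.println("Registered version " + added.getVersion()
                        + ", latest schema: " + latest.getSchemaText());
            }
        }
    }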

Starting the Kafka Producer/Consumer Using the Java Client

  1. Add the following to the Kafka Producer configuration (a complete producer sketch follows these steps):
    import com.hortonworks.registries.schemaregistry.serdes.avro.kafka.KafkaAvroSerializer;
    ..
    config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    config.put("schema.registry.url", "http://<host name>:9093/api/v1");
    config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());

    Configurations to focus on are:

    • schema.registry.url
    • key.serializer
    • value.serializer
  2. Add the following to the Kafka Consumer configuration:
    import com.hortonworks.registries.schemaregistry.serdes.avro.kafka.KafkaAvroDeserializer;
    ..
    config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    config.put("schema.registry.url", "http://<host name>:9093/api/v1");
    config.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    config.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
  3. After Schema Registry starts, access the web UI at: http://<FQDN>:9093
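
The following is a minimal, self-contained producer sketch that puts the step 1 configuration together. The topic name truck-events, the Avro schema, and the bootstrap servers value are placeholders; the KafkaAvroSerializer resolves the record's schema against Schema Registry when the message is sent. A consumer mirrors this sketch using the deserializer configuration from step 2.

    import java.util.Properties;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import com.hortonworks.registries.schemaregistry.serdes.avro.kafka.KafkaAvroSerializer;

    public class TruckEventProducer {
        public static void main(String[] args) {
            Properties config = new Properties();
            config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<bootstrap servers>");
            config.put("schema.registry.url", "http://<host name>:9093/api/v1");
            config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());

            // Hypothetical Avro schema describing the event payload.
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"TruckEvent\",\"fields\":["
                  + "{\"name\":\"driverId\",\"type\":\"int\"},"
                  + "{\"name\":\"eventType\",\"type\":\"string\"}]}");

            GenericRecord event = new GenericData.Record(schema);
            event.put("driverId", 42);
            event.put("eventType", "NORMAL");

            // The serializer registers or looks up this schema in Schema Registry on send.
            try (KafkaProducer<String, Object> producer = new KafkaProducer<>(config)) {
                producer.send(new ProducerRecord<>("truck-events", "42", event));
                producer.flush();
            }
        }
    }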

Adding Apache Schema Registry to ODH Clusters

You can add Apache Schema Registry to Big Data Service clusters using Ambari.

Schema Registry isn't installed by default on a Big Data Service cluster. Schema Registry supports two storage types for schema metadata:

  • MySQL: Value to be set: mysql
  • In-memory: Value to be set: inmemory

To set the desired storage type, set the Storage Type parameter in the Advanced schema-registry section of the Ambari UI.

To use MySQL as the storage type, several tables must be created before the server starts. These tables are created automatically when the Storage Type parameter is set to mysql.

Note

The Mysql Password parameter is mandatory when installing the Schema Registry service. This password is used to access the schema information stored in the MySQL tables.

For Big Data Service 3.0.28 or earlier clusters, the Mysql Password must be reset manually from the un0 node by running the following command:

sudo update_mysql_password registry <password>

Provide this password during the Schema Registry service installation.

  1. Access Apache Ambari.
  2. Select Add Service from the Services menu.
  3. From the Add Service Wizard, select Schema Registry, and then click Next.
  4. Click Next.
  5. On the Customize Services page, under Advanced schema-registry, update Mysql Password to the desired value.

    This resets the password for the registry MySQL database.

  6. Click Next.
  7. Review the configuration, and then click Deploy.
  8. Wait for the installation to complete, and then click Next.
  9. Click Complete.
  10. Restart all the required services from the Services menu.
  11. To verify a successful installation, access Schema Registry from the Apache Ambari dashboard, select Actions, and then click Run Service Check.