Commit a2f0b35

committed: sample readme updated

1 parent e7c15fe commit a2f0b35

1 file changed: +122 -24 lines changed

kafka-questdb-connector-samples/stocks/readme.md

@@ -51,26 +51,95 @@ Bear in mind the sample starts multiple containers. It's running fine on my mach
13. Don't forget to stop the containers when you're done. The project generates a lot of data and you could run out of disk space.
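
If you used `docker-compose up` to start the sample, a quick way to stop and clean up is shown below; the `-v` flag is only needed if you also want to remove any volumes the stack may have created:

```shell
# stop and remove the sample's containers and networks
docker-compose down

# additionally remove associated volumes to reclaim disk space
docker-compose down -v
```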
## Project Internals
If you liked what you saw and want to learn more about the internals of the project, read on. It's time to demystify the black box. We will discuss these components:

1. Postgres and its schema
2. Java stock price updater
3. Debezium Postgres connector
4. Kafka QuestDB connector
5. QuestDB
6. Grafana

### Postgres
The docker-compose file starts a Postgres container. It uses a container image provided by the Debezium project, since they maintain a Postgres image which comes preconfigured for Debezium.

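For illustration, the Postgres service definition in the compose file looks roughly like the sketch below. The image tag, port mapping and healthcheck details are assumptions rather than an exact copy, but the credentials match the ones used in the connector configuration further down:

```yaml
  postgres:
    image: debezium/postgres               # Debezium-maintained Postgres image (tag assumed)
    ports:
      - "5432:5432"                        # assumed port mapping
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    healthcheck:                           # the producer waits for this via condition: service_healthy
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10
```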
### Java stock price updater

It's a Spring Boot application which, during startup, creates a table in Postgres and populates it with initial data.
You can see the SQL executed in the [schema.sql](src/main/resources/schema.sql) file. The table always has one row per stock symbol.

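For reference, the Postgres `stock` table has the following schema:

```sql
create table if not exists stock (
    id serial primary key,
    symbol varchar(10) unique,
    price float8,
    last_update timestamp
);
```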
Once the application has started, it updates stock prices at regular intervals. The `price` and `last_update` columns are updated every time a new price is received for a stock symbol. It mimics a real-world scenario where you would have a Postgres table with the latest price for each stock symbol. Such a table would typically be used by a transactional system to look up the latest prices. In our case the transactional system is simulated by a [simple Java application](src/main/java/io/questdb/kafka/samples/StockService.java) which randomly updates the price of each stock symbol in the Postgres table. The application generates thousands of updates each second.

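Conceptually, every price tick issued by the updater boils down to an in-place UPDATE of the row for that symbol, along the lines of the sketch below; the literal values are made up, and the actual statement lives in the Java application:

```sql
-- overwrite the latest price and bump the last_update timestamp for one symbol
update stock
   set price = 9.87,
       last_update = now()
 where symbol = 'SNAP';
```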
The application is built and packaged as a container image when you execute `docker-compose build`. Inside the [docker-compose file](docker-compose.yml) you can see a container called `producer`. That's our Java application:
```yaml
  producer:
    image: kafka-questdb-connector-samples-stocks-generator
    build:
      dockerfile: ./Dockerfile
      context: .
    depends_on:
      postgres:
        condition: service_healthy
    links:
      - postgres:postgres
```
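The `./Dockerfile` referenced above is not reproduced here. A typical two-stage build for a Spring Boot application of this kind could look roughly like the sketch below; the base images, Java version and paths are assumptions, not the sample's actual Dockerfile:

```Dockerfile
# build stage: compile and package the Spring Boot app with Maven (assumed project layout)
FROM maven:3.8-eclipse-temurin-17 AS build
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn -q -DskipTests package

# runtime stage: run the packaged jar on a plain JRE
FROM eclipse-temurin:17-jre
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
```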
### Debezium Postgres connector

Debezium is an open-source project which provides connectors for various databases. It is used to capture changes from a database and feed them to a Kafka topic. In other words: whenever there is a change in a database table, Debezium will read the change and feed it to a Kafka topic. This way it translates operations such as INSERT or UPDATE into events which can be consumed by other systems. Debezium supports a wide range of databases. In this sample we use the Postgres connector.

The Debezium Postgres connector is implemented as a Kafka Connect source connector. Inside the [docker-compose file](docker-compose.yml) it's called `connect` and its container image is also built during `docker-compose build`. The [Dockerfile](../../Dockerfile-Samples) uses a Debezium image, which contains the Kafka Connect runtime and the Debezium connectors. Our Dockerfile amends it with the Kafka Connect QuestDB Sink.

What's important: when this container starts, it only connects to the Kafka broker; it does not start any connectors yet. We have to start the connectors ourselves using a `curl` command. This is how we started the Debezium connector:
```shell
curl -X POST -H "Content-Type: application/json" -d '{"name":"debezium_source","config":{"tasks.max":1,"database.hostname":"postgres","database.port":5432,"database.user":"postgres","database.password":"postgres","connector.class":"io.debezium.connector.postgresql.PostgresConnector","database.dbname":"postgres","database.server.name":"dbserver1"}} ' localhost:8083/connectors
```
It uses the Kafka Connect REST interface to start a new connector with a given configuration. Let's have a closer look at the configuration. This is how it looks when formatted for readability:
```json
{
  "name": "debezium_source",
  "config": {
    "tasks.max": 1,
    "database.hostname": "postgres",
    "database.port": 5432,
    "database.user": "postgres",
    "database.password": "postgres",
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.dbname": "postgres",
    "database.server.name": "dbserver1"
  }
}
```
Most of the fields are self-explanatory. The only non-obvious one is `database.server.name`. It's a logical, unique name of the database server, used by Kafka Connect to store offsets, so it must be unique for each database server: if you have multiple Postgres databases, you need a different `database.server.name` for each of them. Debezium also uses it to generate Kafka topic names, in the form `database.server.name`.`schema`.`table`. In our case that's `dbserver1.public.stock`.

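Once the POST succeeds, Kafka Connect keeps the connector running inside the `connect` container. You can check that it is registered and healthy through the same REST interface:

```shell
# list all connectors registered with this Kafka Connect instance
curl localhost:8083/connectors

# show the state of the Debezium source connector and its tasks
curl localhost:8083/connectors/debezium_source/status
```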
### Kafka QuestDB connector

The Kafka QuestDB connector re-uses the same Kafka Connect runtime as the Debezium connector. It's also started with a `curl` command. This is how we started the QuestDB connector:
```shell
curl -X POST -H "Content-Type: application/json" -d '{"name":"questdb-connect","config":{"topics":"dbserver1.public.stock","table":"stock", "connector.class":"io.questdb.kafka.QuestDBSinkConnector","tasks.max":"1","key.converter":"org.apache.kafka.connect.storage.StringConverter","value.converter":"org.apache.kafka.connect.json.JsonConverter","host":"questdb", "transforms":"unwrap", "transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState", "include.key": "false", "symbols": "symbol", "timestamp.field.name": "last_update"}}' localhost:8083/connectors
```
This is the connector JSON configuration nicely formatted:
```json
{
  "name": "questdb-connect",
  "config": {
    "topics": "dbserver1.public.stock",
    "table": "stock",
    "connector.class": "io.questdb.kafka.QuestDBSinkConnector",
    "tasks.max": "1",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "host": "questdb",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "include.key": "false",
    "symbols": "symbol",
    "timestamp.field.name": "last_update"
  }
}
```
Again, most of the fields are obvious. Let's focus on the non-obvious ones.

1. `"symbols": "symbol"` instructs the connector to use the [QuestDB symbol type](https://questdb.io/docs/concept/symbol/) for the column named "symbol". This column has low cardinality, so it's a good candidate for the symbol type.
2. `"timestamp.field.name": "last_update"` instructs the connector to use the `last_update` column as the [designated timestamp](https://questdb.io/docs/concept/designated-timestamp/) column.
3. `"transforms": "unwrap"` and `"transforms.unwrap.type"` instruct the connector to apply Debezium's ExtractNewRecordState transform.

Let's focus on the ExtractNewRecordState transform a bit more. Why is it needed at all? For every change in the Postgres table, Debezium emits a JSON message to a Kafka topic. Messages look like this:
```json
{
  "schema": {
@@ -117,7 +186,7 @@ You can see the `payload` field contains the actual change. Let's zoom it a bit
```
This is the actual change in a table. It's a JSON object which contains the new values for the columns in the Postgres table. Notice how the structure maps to the Postgres table schema described above.

We cannot feed the full change object to the Kafka Connect QuestDB Sink, because the sink would create a column for each field in the change object, including all metadata, for example the `source` part of the JSON:
```json
"source": {
  "version": "1.9.6.Final",
@@ -134,16 +203,45 @@ We cannot feed a full change object to Kafka Connect QuestDB Sink, because the s
  "xmin": null
},
```
We do not want to create columns in QuestDB for all this metadata. We only want to create columns for the actual data. This is where the `ExtractNewRecordState` transform comes to the rescue! It extracts only the actual new data from the overall change object and feeds just this small part to the QuestDB sink. The end result is that each INSERT and UPDATE in Postgres will insert a new row in QuestDB.

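After unwrapping, the sink effectively receives just the row data from the `after` part of the change event. For the `stock` table a record would look roughly like this; the values are illustrative and the exact encoding of `last_update` depends on Debezium's time handling settings:

```json
{
  "id": 42,
  "symbol": "SNAP",
  "price": 9.87,
  "last_update": 1666182524951000
}
```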
### QuestDB

QuestDB is a fast, open-source time-series database. It uses SQL for querying, and it adds a bit of syntactic sugar on top of SQL to make it easier to work with time-series data. It also implements the Postgres wire protocol, so many existing tools can connect to it.

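As a quick example of that syntactic sugar, once data is flowing you can ask for the most recent row per symbol in the QuestDB SQL console using `LATEST ON`. This assumes the sink created the `symbol`, `price` and `timestamp` columns used by the Grafana query below:

```sql
-- most recent price per stock symbol, using the designated timestamp column
SELECT symbol, price, timestamp
FROM stock
LATEST ON timestamp PARTITION BY symbol;
```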
### Grafana

Grafana is a popular open-source tool for visualizing time-series data. It can be used to visualize data from QuestDB. There is no native QuestDB datasource for Grafana, but there is a Postgres datasource, and since QuestDB speaks the Postgres wire protocol we can use it to connect to QuestDB. Grafana is provisioned with a dashboard that visualizes the data from QuestDB in a candlestick chart. The chart is configured to execute this query:
```sql
SELECT
  $__time(timestamp),
  min(price) as low,
  max(price) as high,
  first(price) as open,
  last(price) as close
FROM
  stock
WHERE
  $__timeFilter(timestamp)
  and symbol = '$Symbol'
SAMPLE BY $Interval ALIGN TO CALENDAR;
```
`$__time` is a Grafana macro that converts the timestamp column to the format expected by Grafana. `$__timeFilter` is another Grafana macro; it filters the data based on the time range selected in the dashboard. `$Symbol` and `$Interval` are dashboard variables: `$Symbol` selects the stock symbol and `$Interval` controls the granularity of the data.

Grafana will resolve the macros and execute queries similar to this:
```sql
SELECT
  timestamp AS "time",
  min(price) as low,
  max(price) as high,
  first(price) as open,
  last(price) as close
FROM
  stock
WHERE
  timestamp BETWEEN '2022-10-19T12:23:44.951Z' AND '2022-10-19T12:28:44.951Z'
  and symbol = 'SNAP'
SAMPLE BY 5s ALIGN TO CALENDAR;
```
And this is then used by the candlestick chart to visualize the data.
[Grafana Dashboard](http://localhost:3000/d/stocks/stocks?orgId=1&refresh=5s&viewPanel=2)
