
Commit 91e0142

Merge pull request #6579 from adamnsch/hashgnn-dimension-docs
Clarify the need for higher dimension in HashGNN docs
2 parents d3b3c47 + cd57650


doc/modules/ROOT/pages/machine-learning/node-embeddings/hashgnn.adoc

Lines changed: 7 additions & 0 deletions
@@ -197,6 +197,9 @@ As a loose guideline, one may try to set `embeddingDensity` to 128, 256, 512, or
 The `dimension` parameter determines the number of binary features when feature generation is applied.
 A high dimension increases expressiveness but requires more data in order to be useful, and can lead to the curse of dimensionality for downstream machine learning tasks.
 Additionally, more computational resources will be required.
+However, binary embeddings only have a single bit of information per dimension.
+In contrast, dense `Float` embeddings have 64 bits of information per dimension.
+Consequently, in order to obtain similarly good embeddings with HashGNN as with an algorithm that produces dense embeddings (e.g. FastRP or GraphSAGE), one typically needs a significantly higher dimension.
 Some values to consider trying for `densityLevel` are very low values such as `1` or `2`, increasing as appropriate.

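To make the contrast concrete, here is a minimal, hypothetical Cypher sketch. It assumes the `gds.fastRP.stream` and `gds.beta.hashgnn.stream` procedures and the `generateFeatures` configuration map as documented for GDS around the time of this commit; the graph name `'myGraph'` and all concrete numbers are illustrative choices, not recommendations from the docs.

// Dense embeddings: each of the 256 dimensions is a 64-bit Float,
// so one vector carries 256 x 64 = 16384 bits of information.
// 'myGraph' is a hypothetical projected graph name.
CALL gds.fastRP.stream('myGraph', { embeddingDimension: 256 })
YIELD nodeId, embedding;

// Binary embeddings: one bit per dimension, so a much higher
// dimension (here 4096, chosen for illustration) is typically
// needed before quality is comparable.
CALL gds.beta.hashgnn.stream('myGraph', {
  iterations: 3,
  embeddingDensity: 128,
  generateFeatures: { dimension: 4096, densityLevel: 2 }
})
YIELD nodeId, embedding;

Note that the bit-count comparison is only a loose proxy; the docs do not claim a fixed conversion factor between dense and binary dimensions.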
@@ -208,6 +211,10 @@ Therefore, a higher dimension should also be coupled with higher `embeddingDensity`.
 Higher dimension also leads to longer training times of downstream models and a higher memory footprint.
 Increasing the threshold leads to sparser feature vectors.
 
+However, binary embeddings only have a single bit of information per dimension.
+In contrast, dense `Float` embeddings have 64 bits of information per dimension.
+Consequently, in order to obtain similarly good embeddings with HashGNN as with an algorithm that produces dense embeddings (e.g. FastRP or GraphSAGE), one typically needs a significantly higher dimension.
+
 The default threshold of `0` leads to fairly many features being active for each node.
 Often sparse feature vectors are better, and it may therefore be useful to increase the threshold beyond the default.
 One heuristic for choosing a good threshold is based on the average and standard deviation of the dot products of the hyperplanes with the node feature vectors.
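As a sketch of that threshold heuristic (again hypothetical: the `binarizeFeatures` map and `featureProperties` parameter follow the HashGNN documentation, while the property names, dimensions, and the concrete threshold are illustrative), one could estimate the mean and standard deviation of the hyperplane/feature dot products and set the threshold near mean + stddev:

// threshold = 0 (the default) activates many bits per node; raising
// it toward mean + stddev of the hyperplane dot products sparsifies
// the binarized feature vectors. 1.5 here stands in for that estimate.
CALL gds.beta.hashgnn.stream('myGraph', {
  iterations: 3,
  embeddingDensity: 128,
  featureProperties: ['age', 'score'],
  binarizeFeatures: { dimension: 1024, threshold: 1.5 }
})
YIELD nodeId, embedding;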
