-
Notifications
You must be signed in to change notification settings - Fork 874
Make hash by the value allocated, not by pointer #22129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Conversation
mchacki
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some clarifying questions
arangod/Aql/AqlValue.cpp
Outdated
| return std::hash<void const*>()(x._data.rangeMeta.range); | ||
| using T = AqlValue::AqlValueType; | ||
| auto aqlValueType = x.type(); | ||
| // as this is non owning, we hash by the pointer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the owning make for a difference?
Hashing is a quick look of the content is equal.
In my opinion it should not take into account if the value is owned or not.
arangod/Aql/AqlValue.cpp
Outdated
| if (h == 0) { // fallback to avoid collision with the marker that uses h == | ||
| // 0, very unlikely to happen | ||
| h = 1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hö what is this?
arangod/Aql/AqlValue.cpp
Outdated
| return a._data.slicePointerMeta.pointer == | ||
| b._data.slicePointerMeta.pointer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this pointer comparison?
And not real equality comparison like the default case?
… semantically equal values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is being reviewed by Cursor Bugbot
Details
Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
| cache.emplace(a.data()); | ||
| res->setValue(row, col, a); | ||
| // Transfer ownership to res - guard won't destroy it | ||
| guard.steal(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Shadow row deduplication removed while cache remains populated
The shadow row handling now calls cache.emplace(a.data()) but ignores the return value and always transfers ownership via guard.steal(). The old code used the emplace result to deduplicate: when a pointer was already cached, it would destroy the duplicate value and reuse the cached one. The new code removes this deduplication but still populates the cache, suggesting incomplete refactoring. If multiple shadow row cells share the same underlying data pointer, this could lead to multiple AqlValue objects referencing the same memory, potentially causing memory management issues.
… remove safety handling of collisions of value 0
| v = 0.0; | ||
| } | ||
| return VELOCYPACK_HASH(&v, sizeof(v), seed); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Hash inconsistency for doubles may violate hash/equal contract
The VPACK_INLINE_DOUBLE type uses a custom hash implementation that directly hashes the double bytes, while other numeric types (like VPACK_INLINE_INT64) use slice(t).normalizedHash(seed). The equal_to implementation considers numerically equivalent values across types as equal (e.g., INT64(42) and DOUBLE(42.0) via VelocyPackHelper::equal). If the custom double hash doesn't produce the same result as normalizedHash for equivalent numeric values, this violates the hash/equal contract: equal values must have equal hashes. This could cause hash table lookups to fail for semantically equal values stored with different representations.
Scope & Purpose
In AqlValue.cpp, there's a hashing function and a comparator. An AqlValue object can have different types of values, inline, slice pointers, managed slices and managed strings (and, in the future, supervised slice). Inline values are non-allocated values that are distributed within the AqlValue's 16 bytes of available space, slice pointers are pointers to values not owned by the AqlValue object, and managed slices and strings are dynamically allocated payloads owned by the AqlValue object. Even if multiple aql values point to the same managed slice, still, it's owned by AqlValue, and the last AqlValue that points to it to be destroy has to free this allocated memory. The hashing and comparator are used in AqlItemBlock for deduplication, when inserting the AqlValue on a block, so it has to compare if two aql values are equal.
Now onto the hashing, before, the hash was made with the pointers, not the data pointed to by the pointers. The data that the pointer stores is the address of the payload it points to, hence, if two aql values that own memory were compared using this approach, they would never be considered equal, and they could be, because, even though having memory allocated in different addresses, the payload that each store could be semantically the same. This would lead to the comparison used in AqlItemBlock not working properly, and wrongfully distinguishing values that should be considered the same in practice.
The new approach is to hash by the value stored in the address in cases where there's dynamic allocation.
When the value is inline, it's safe to maintain the original approach.
When the value is of type slice pointer, it means AqlValue doesn't own it, hence, the pointers should be hashed. If they point to the same address, then they could be considered the same.
Checklist
Related Information
(Please reference tickets / specification / other PRs etc)
Note
Switches AqlValue hash/equality to normalized, content-based semantics (incl. ranges and doubles), updates AqlItemBlock deduplication/ownership paths, and adds extensive unit tests.
AqlValue::hash(seed=kDefaultSeed)uses normalized, content-based hashing; handlesRANGEand normalizes-0.0for doubles.std::hash<AqlValue>to use normalized hash andstd::equal_to<AqlValue>to compare by content across storage types; adds assertions and range-aware comparisons.kDefaultSeedconstant.FlatHashMap<AqlValue,size_t>::try_emplacefor content-based deduplication intoVelocyPack.cloneDataAndMoveShadow()by setting stolen values directly and unconditionally stealing the guard; cache only by pointer for managed values within the move.tests/Aql/AqlValueHashAlgorithmCorrectnessTest.cpp,tests/Aql/AqlValueHashEqualTest.cpp, andtests/Aql/AqlValueHashTest.cppvalidating content-based hashing/equality and dedup behavior across storage types and scenarios; updatetests/CMakeLists.txt.Written by Cursor Bugbot for commit f1c6396. This will update automatically on new commits. Configure here.