Skip to content

Conversation

@kangkaisen
Copy link
Contributor

For #1610

Before I do "avoid serialize and deserialize from Scannode to Aggnode" work, I need to unify Bitmap and HyperLogLog behavior and interface firstly. So I encapsulate the HyperLogLog related logic.

if (str != "\\N") {
uint64_t hash = HashUtil::murmur_hash64A(src, len, HashUtil::MURMUR_SEED);
char buf[HllHashFunctions::HLL_INIT_EXPLICT_SET_SIZE];
char buf[10];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use a const or a macro, better not to use a magic number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Use magic number only for simple.
Because I think we could delete this class code in doris 0.12 version.

auto* src_hll = reinterpret_cast<HyperLogLog*>(src_slice->data);
dst_hll->merge(*src_hll);

delete src_hll;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a dangerous to delete input source content.
Better to leave a comment and try to find a better way to handle this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will Fix this issue with object_pool in next PR.

std::unique_ptr<HyperLogLog> hll {new HyperLogLog()};
if (!input.is_null) {
uint64_t hash_value = HashUtil::murmur_hash64A(input.ptr, input.len, HashUtil::MURMUR_SEED);
hll.reset(new HyperLogLog(hash_value));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case we create HyperLogLog two times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will Fix

} else {
buf.resize(HLL_EMPTY_SIZE);
}
hll->serialize((char*)buf.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not serialize(std::string*) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because slice.data type is char* and many HLL methods use char*.

char buf[HLL_EMPTY_SET_SIZE];
buf[0] = HLL_DATA_EMPTY;
result = AnyValUtil::from_buffer_temp(ctx, buf, sizeof(buf));
StringVal HllHashFunctions::hll_hash(FunctionContext* ctx, const StringVal& input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have two implementation? here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I have commented. keep HllHashFunctions only for backward compatibility, we will remove the HllHashFunctions class when doris 0.12 release.

}

void merge(const HyperLogLog& other) {
if (other._type == HLL_DATA_EMPTY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not switch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 4 * 3 switch case is too big.

Use if block one by one, and directly return in every switch case, which like Guard Clauses and seems more readable.


// _type must change
if (_type == HLL_DATA_EMPTY) {
_type = other._type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this->_type is changed, do you want it do following if block? I think there is no need to do following if block, which is a waste of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. So I directly return in every switch case.

update_registers(_registers, hash_value);
}

void merge(const HyperLogLog& other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For large code block, we'd better to write it in cpp/cc files to accelerate our compile time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

return HLL_COLUMN_DEFAULT_LEN;
}

static int serialize_sparse(char *result, const std::map<int, uint8_t>& index_to_value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Contributor

@imay imay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@imay imay merged commit cd5cfea into apache:master Sep 9, 2019
@imay imay mentioned this pull request Sep 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants