Track column lineage#7354
Conversation
|
A few things to be discussed is how can we expose the column lineage details
|
61bdf04 to
c7fb784
Compare
kasiafi
left a comment
There was a problem hiding this comment.
A few comments and questions.
Also, could you explain what is the advantage of this approach vs using originTable and originColumnName?
There was a problem hiding this comment.
Adjust the method name.
BTW, I can see no usage of the field referencedFields in Analysis. If you removed it, you could simplify creating originColumnDetails by avoiding one level of mapping by the source node.
There was a problem hiding this comment.
Could you explain how it is possible to have more than 1 element in this list?
There was a problem hiding this comment.
Like if we have field for a expression like func(func2(col1, col2), col3) then we might need fetch the OriginColumnDetail for col1, col2 and col3
c7fb784 to
2ca4eb0
Compare
|
@kasiafi AC |
This allow us to track column lineage for each column.
2ca4eb0 to
828dbb8
Compare
828dbb8 to
ba5b4ef
Compare
kasiafi
left a comment
There was a problem hiding this comment.
Collecting dependent fields is supported for single-column select expressions.
So, we will get this info for a query like SELECT a, b, c FROM (SELECT ... ).
But we will not get the info for SELECT * FROM (SELECT ...), although they might be equivalent queries. Is this limitation intentional?
I am also concerned about replacing originTable and originColumnName with a list of dependent fields.
In the previous design, we knew that the field is a column reference. Now, we only know that it has a column reference. In some contexts, it might be an important difference.
| analyzer.analyze(expression, scope); | ||
|
|
||
| updateAnalysis(analysis, analyzer, session, accessControl); | ||
| analysis.addReferencedFields(expression, analyzer.getReferencedFields()); |
There was a problem hiding this comment.
Probably referencedFields could be now simplified to Set<Field>. I don't think the mapping by source node is used.
| } | ||
|
|
||
| public void addReferencedFields(Multimap<NodeRef<Node>, Field> references) | ||
| public void addReferencedFields(Expression expression, Multimap<NodeRef<Node>, Field> references) |
| return result.toString(); | ||
| } | ||
|
|
||
| public static class OriginColumnDetail |
There was a problem hiding this comment.
The details are distinct per expression, but if you want to reason about them further, e.g. collect all references from a query, this class will need equals().
Other approach is to maintain both |
This allows us to compute the source column for each field. This is still in a WIP state.