Skip to content

[FEATURE] PPL ClickBench queries should produce results consistent with ClickHouse/Spark #4760

@penghuo

Description

@penghuo

Is your feature request related to a problem?
Check 2 queries in PPLClickBenchIT, it produce in-consistent results compare to spark clickbench results.

Q43

Current

  • PPL
source=hits
| where CounterID = 62 and EventDate >= '2013-07-01 00:00:00' and EventDate <= '2013-07-15 00:00:00' and IsRefresh = 0 and DontCountHits = 0
| eval M = date_format(EventTime, '%Y-%m-%d %H:00:00')
| stats bucket_nullable=false count() as PageViews by M
| sort M
| head 10 from 1000

Expected

  • ppl
source=hits | where CounterID = 62 and EventDate >= '2013-07-01 00:00:00' and EventDate <= '2013-07-15 00:00:00' and IsRefresh = 0 and DontCountHits = 0 | stats bucket_nullable=false count by span(EventTime, 1m) as M | sort M | head 10 from 1000
  • spark sql
SELECT DATE_TRUNC('minute', EventTime) AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_TRUNC('minute', EventTime) ORDER BY DATE_TRUNC('minute', EventTime) LIMIT 10 OFFSET 1000;
  • results
+-------------------+---------+
|                  M|PageViews|
+-------------------+---------+
|2013-07-15 12:40:00|      513|
|2013-07-15 12:41:00|      457|
|2013-07-15 12:42:00|      470|
|2013-07-15 12:43:00|      468|
|2013-07-15 12:44:00|      453|
|2013-07-15 12:45:00|      462|
|2013-07-15 12:46:00|      481|
|2013-07-15 12:47:00|      458|
|2013-07-15 12:48:00|      466|
|2013-07-15 12:49:00|      467|
+-------------------+---------+

Q28

Current

  • ppl
source=hits
| where URL != ''
| stats bucket_nullable=false avg(length(URL)) as l, count() as c by CounterID
| where c > 100000
| sort - l
| head 25
  • results
l,c,CounterID
453.1628285069236,2938865,233773
261.55616482670234,2510103,245438
230.68442591186866,3574007,122612
198.08936562473812,238660,234004
187.9562693941447,323229,1634
180.36009889818135,120528,786
139.11823037965326,216408,114157
123.96623033619909,146907,515
121.86100788770537,858171,256004
116.91921850037136,374306,95427
106.89180332891429,7115413,199550
103.7586784846365,494614,220992
95.01398682515553,163797,196239
92.1134607276574,396093,96948
91.9427013479645,738150,62
90.26380938339348,311998,188878
90.00957407022646,120325,249603
85.8864014117864,8527069,3922
85.03551145479048,124664,191697
82.76263550290444,131178,97467
82.05937991006292,802561,186300
74.73223497987863,169223,230962
74.65631730856313,253961,77639
74.13011535042938,605286,146891
73.96797959706166,507770,38

Expected

  • spark sql
SELECT CounterID, AVG(length(URL)) AS l, COUNT(*) AS c FROM hits WHERE URL <> '' GROUP BY CounterID HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
+---------+------------------+-------+
|CounterID|                 l|      c|
+---------+------------------+-------+
|   233773| 453.1628285069236|2938865|
|   245438|261.55616482670234|2510103|
|   122612|230.68442591186866|3574007|
|   234004|198.08936562473812| 238660|
|     1634| 187.9562693941447| 323229|
|      786|180.36009889818135| 120528|
|   114157|139.11823037965326| 216408|
|      515|123.96623033619909| 146907|
|   256004|121.86100672243644| 858171|
|    95427|116.91921850037136| 374306|
|   199550|106.89180332891429|7115413|
|   220992| 103.7586784846365| 494614|
|   196239| 95.01398682515553| 163797|
|    96948|  92.1134607276574| 396093|
|       62|  91.9427013479645| 738150|
|   188878| 90.26380938339348| 311998|
|   249603| 90.00957407022646| 120325|
|     3922| 85.88640117723921|8527069|
|   191697| 85.03551145479048| 124664|
|    97467| 82.76263550290444| 131178|
+---------+------------------+-------+

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagebugSomething isn't workingenhancementNew feature or request

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions