Description
Motivation
Recently, we introduced Spark Load, which currently needs to upload many jar packages to the Yarn cluster before each load. These jar packages include $DORIS_HOME/lib/palo-fe.jar (the DPP runtime dependency) and all jars in the $SPARK_HOME/jars folder (the Spark dependencies), and uploading them usually takes 2~3 minutes.
Currently, these jars are uploaded to temporary directories in HDFS. The palo-fe.jar is uploaded to {working_dir}/jobs/DB_ID/LABEL/JOB_ID/configs, while the other jars are packaged into a zip file and uploaded to {stage_dir}/APPLICATION_ID/__spark_lib__.zip.
In most cases, the jar packages uploaded by two different loads are exactly the same, which means we do not have to upload them every time. Secondly, the jar packages should be stored in a single directory so that we can manage them easily. Moreover, we can package all jars into one zip file during the compilation phase.
Therefore, I propose creating a repository in HDFS for all dependencies of Spark Load.
The repository structure
Repository/
|-lib_{version}.zip
| {All spark dependencies}
| |-roaringbitmap.jar
| |-activation-1.1.1.jar
| |-aircompressor-0.10.jar
| |-...
| {All dpp dependencies}
| |-spark-dpp.jar
|-lib_{version}.zip
|-lib_{version}.zip
|-...
The Repository/ directory is the parent directory of all zip files. When a Spark load is submitted, the FE first compares the versions of the remote zip files in the repository with the version of the local zip file, and uploads the local file only when no matching version is found.
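As a rough illustration, this prepare step could look like the sketch below, which uses the Hadoop FileSystem API. The class and method names (SparkRepositoryHelper, getOrUploadArchive) are hypothetical, not the actual FE implementation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SparkRepositoryHelper {
    // Hypothetical helper: return the HDFS path of the dependency archive for
    // the given version, uploading the local zip only when no remote copy with
    // a matching version exists yet.
    public static Path getOrUploadArchive(Configuration hadoopConf, String repositoryDir,
                                          String version, String localZipPath) throws Exception {
        FileSystem fs = FileSystem.get(hadoopConf);
        Path remoteZip = new Path(repositoryDir, "lib_" + version + ".zip");
        if (!fs.exists(remoteZip)) {
            // First load with this version: upload once so that subsequent
            // loads can reuse the archive and skip the 2~3 minute upload.
            fs.copyFromLocalFile(new Path(localZipPath), remoteZip);
        }
        return remoteZip;
    }
}
```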
We must set the Spark config spark.yarn.archive to the zip file that the FE located in the step above, so that the Yarn cluster knows where to find these dependencies.
Note that spark-dpp.jar is built by the spark-dpp sub-module. The difference between palo-fe.jar and spark-dpp.jar is that spark-dpp.jar also bundles the third-party libraries that palo-fe.jar depends on. You can see the details about the spark-dpp sub-module of the FE in issue #4098.
Meanwhile, it is necessary to set the AppResourceHdfsPath argument of spark-submit to the lib.jar file.
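To make the wiring concrete, here is a hedged sketch using Spark's SparkLauncher API. Whether the FE actually uses SparkLauncher is an assumption, and the main class name below is a placeholder:

```java
import org.apache.spark.launcher.SparkLauncher;

public class SparkLoadSubmitter {
    // Hypothetical sketch: wire the archive and the dpp jar into the
    // spark-submit invocation built by the FE.
    public static SparkLauncher buildLauncher(String archiveHdfsPath, String dppJarHdfsPath) {
        return new SparkLauncher()
                // Point Yarn at the lib_{version}.zip found (or uploaded) in the repository.
                .setConf("spark.yarn.archive", archiveHdfsPath)
                // Use the dpp jar already in HDFS as the application resource,
                // so it is not re-uploaded on every load.
                .setAppResource(dppJarHdfsPath)
                .setMainClass("org.apache.doris.load.loadv2.dpp.SparkDpp"); // placeholder main class
    }
}
```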
Work to do
[ ] Build lib.zip in the compilation phase
[ ] Add a repository property for Spark Load
[ ] Add a prepare phase for Spark Load to check and upload dependencies