[Feature][Datax] DataX configuration builder module #4288

@felix-thinkingdata

#3885

Summary

DataX, an open-source project released by Alibaba, is an efficient offline data synchronization tool that is commonly used to synchronize data between heterogeneous data sources.
DataX adopts a Framework + plugin architecture. Reading from and writing to a data source are handled by Reader and Writer plug-ins respectively, and each kind of data source has its own Reader or Writer. By default, DataX provides a rich set of Readers and Writers that cover the mainstream data sources. The Framework connects the Reader and the Writer and is responsible for the core steps of a synchronization job, such as data processing and transfer.
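
As a rough illustration of how one Reader is paired with one Writer, here is a minimal sketch of a DataX job description built as a Python dict and printed as JSON. It uses the `streamreader` and `streamwriter` plug-ins with illustrative parameter values; a real job would use the plug-ins and parameters of the actual data sources.

```python
import json

# Minimal DataX job: one Reader plug-in paired with one Writer plug-in.
# The parameter values are illustrative only.
job = {
    "job": {
        "setting": {"speed": {"channel": 1}},
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [{"type": "string", "value": "hello"}],
                        "sliceRecordCount": 10,
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"print": True, "encoding": "UTF-8"},
                },
            }
        ],
    }
}

print(json.dumps(job, indent=2))
```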

Requirements

DolphinScheduler already supports running this type of task, along with simple configuration. In practice, however, writing the DataX JSON configuration by hand has become the main pain point of using DataX.

Hence the requirement: simplify DataX configuration and integrate it with the DolphinScheduler resource center and data source center, so that, backed by DolphinScheduler's powerful scheduling capabilities, DataX tasks become easier and more convenient to use.

Module design

1. Integrate with the resource center and data source center

DataX is no longer configured only as an individual task within a workflow. Instead, the data flow from a particular reader to a particular writer is treated as a data path template. Once configured, the template is saved in the DataX template module of the resource center.

When configuring a DataX task on the DAG page, you simply select a DataX template. This also addresses the problem that the current DataX configuration page cannot display overly complex DataX configurations.

2. A standalone DataX configuration generation page

As mentioned above, the current DataX configuration page is not well suited to complex DataX configuration work. The DataX configuration generation page should therefore become a separate menu entry, with the configuration carried out step by step.

For example:

  1. DataX basic parameters
  2. DataX reader type and related metadata
  3. DataX writer type and related metadata
  4. Mapping between reader and writer
  5. Build the DataX configuration file and save it as a template.
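
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `build_job` and `save_template` are hypothetical helper names, the template is assumed to be stored as a plain JSON file, and connection parameters for `mysqlreader` and `mysqlwriter` are left out for brevity.

```python
import json
import tempfile
from pathlib import Path


def build_job(base, reader, writer, mapping):
    """Assemble a DataX job dict from the step-by-step inputs.

    base:    the job "setting" block (basic parameters)
    reader:  {"name": ..., "parameter": {...}}
    writer:  {"name": ..., "parameter": {...}}
    mapping: ordered (reader_column, writer_column) pairs
    """
    reader_param = dict(reader["parameter"])
    writer_param = dict(writer["parameter"])
    # Columns are matched by position, so both lists follow the mapping order.
    reader_param["column"] = [src for src, _ in mapping]
    writer_param["column"] = [dst for _, dst in mapping]
    return {
        "job": {
            "setting": base,
            "content": [
                {
                    "reader": {"name": reader["name"], "parameter": reader_param},
                    "writer": {"name": writer["name"], "parameter": writer_param},
                }
            ],
        }
    }


def save_template(job, directory, name):
    """Hypothetical storage: write the job as <name>.json and return the path."""
    path = Path(directory) / f"{name}.json"
    path.write_text(json.dumps(job, indent=2), encoding="utf-8")
    return path


template = build_job(
    base={"speed": {"channel": 1}},
    reader={"name": "mysqlreader", "parameter": {}},  # connection details omitted
    writer={"name": "mysqlwriter", "parameter": {}},  # connection details omitted
    mapping=[("id", "user_id"), ("name", "user_name")],
)

with tempfile.TemporaryDirectory() as tmp:
    saved = save_template(template, tmp, "mysql_to_mysql")
    print(saved.name)
```

Keeping the mapping as ordered pairs means the reader and writer column lists cannot drift out of step when a template is edited later.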


3. Plug-in implementation

This mainly addresses the following problems:

3.1 There are relatively many DataX plug-ins, so the community needs to contribute the configuration generation logic for each of them.
3.2 Different plug-ins depend on different packages. Without switching the classloader per plug-in, dependency conflicts arise easily.

https://github.com/felix-thinkingdata/dolphinscheduler-datax-generator.git
This is the companion plug-in project for configuration file generation.

To use it, add the following to common.properties in DolphinScheduler:

datax.config.generator.path=/data/app/dolphinscheduler-data-generator/

Put the compiled plug-ins in that directory.
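
For illustration only, the following sketch shows how a tool might read that property from the text of a properties file. `read_property` is a hypothetical helper, and it handles only simple `key=value` lines, not the full Java properties syntax.

```python
def read_property(text, key):
    """Return the value of `key` from simple key=value properties text, or None."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", "!")):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


props = """
# DataX configuration generator plug-in directory
datax.config.generator.path=/data/app/dolphinscheduler-data-generator/
"""

print(read_property(props, "datax.config.generator.path"))
```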
You can then pass the relevant parameters on the page to get the column configuration:

image

{"defaultFS":"hdfs://ta2",
"demoFile":"/wjd/parquet/3e4d9ae66082ea20-a03a193600000000_1355680585_data.3.parq",
"fileType":"PAR"}

In the example above, the HDFS URL, the demo file path, and the file type are passed in. The file's metadata is parsed, and the column configuration for the DataX HDFS reader is generated from it in reverse.
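
A minimal sketch of that reverse generation, under explicit assumptions: the schema is given as a hard-coded list of (name, Parquet physical type) pairs rather than parsed from a real file, the type mapping in `PARQUET_TO_DATAX` is an assumed one, and `columns_from_schema` and `hdfs_reader_config` are hypothetical names. The `fileType` value is passed through as it appears in the request; whether a given hdfsreader build accepts it is not confirmed here.

```python
import json

# Assumed mapping from Parquet physical types to DataX column types.
PARQUET_TO_DATAX = {
    "BOOLEAN": "boolean",
    "INT32": "long",
    "INT64": "long",
    "FLOAT": "double",
    "DOUBLE": "double",
    "BYTE_ARRAY": "string",
}


def columns_from_schema(schema):
    """Turn (name, parquet_type) pairs into index/type column entries."""
    return [
        # Unknown types fall back to "string" in this sketch.
        {"index": i, "type": PARQUET_TO_DATAX.get(ptype, "string")}
        for i, (_, ptype) in enumerate(schema)
    ]


def hdfs_reader_config(request, schema):
    """Build a reader block from the page request and the parsed schema."""
    return {
        "name": "hdfsreader",
        "parameter": {
            "defaultFS": request["defaultFS"],
            "path": request["demoFile"],
            "fileType": request["fileType"],
            "column": columns_from_schema(schema),
        },
    }


request = {
    "defaultFS": "hdfs://ta2",
    "demoFile": "/wjd/parquet/3e4d9ae66082ea20-a03a193600000000_1355680585_data.3.parq",
    "fileType": "PAR",
}
# Hypothetical schema standing in for the metadata parsed from the demo file.
schema = [("user_id", "INT64"), ("event", "BYTE_ARRAY"), ("amount", "DOUBLE")]

reader = hdfs_reader_config(request, schema)
print(json.dumps(reader, indent=2))
```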

Simplified page design

DataX configuration design
