.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.

Concepts
========

This section covers the core concepts of *PyDolphinScheduler*.

Process Definition
------------------

A process definition describes the whole workflow except `tasks`_ and `tasks dependence`_, including its
name, schedule interval, and schedule start time and end time.

A process definition can be initialized with a normal assignment statement or with a context manager.

.. code-block:: python

    # Initialization with an assignment statement
    pd = ProcessDefinition(name="my first process definition")

    # Or with a context manager
    with ProcessDefinition(name="my first process definition") as pd:
        pd.submit()

The process definition is the main object for communication between *PyDolphinScheduler* and the DolphinScheduler daemon.
After the process definition and its tasks are declared, you can use `submit` and `run` to notify the server of your definition.

If you just want to submit your definition and create the workflow without running it, use the method `submit`.
But if you want to run the workflow right after you submit it, use the method `run`.

.. code-block:: python

    # Just submit the definition, without running it
    pd.submit()

    # Both submit and run the definition
    pd.run()

Schedule
~~~~~~~~

The parameter `schedule` determines the schedule interval of the workflow. *PyDolphinScheduler* supports
seven-field cron expressions, where the meaning of each position is as below:

.. code-block:: text

    * * * * * * *
    ┬ ┬ ┬ ┬ ┬ ┬ ┬
    │ │ │ │ │ │ │
    │ │ │ │ │ │ └─── year
    │ │ │ │ │ └───── day of week (0 - 7) (0 to 6 are Sunday to Saturday, or use names; 7 is Sunday, the same as 0)
    │ │ │ │ └─────── month (1 - 12)
    │ │ │ └───────── day of month (1 - 31)
    │ │ └─────────── hour (0 - 23)
    │ └───────────── min (0 - 59)
    └─────────────── second (0 - 59)

Here are some example crontab expressions:

- `0 0 0 * * ? *`: The workflow runs every day at 00:00:00.
- `10 2 * * * ? *`: The workflow runs every hour, at two minutes and ten seconds past the hour (HH:02:10).
- `10,11 20 0 1,2 * ? *`: The workflow runs on the first and second day of each month, at 00:20:10 and 00:20:11.

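The field positions above can be illustrated with a short, self-contained sketch. This is illustrative only and not part of the *PyDolphinScheduler* API; the helper name `parse_cron` is made up:

.. code-block:: python

    # Illustrative helper, not part of PyDolphinScheduler: split a
    # seven-field cron expression into named fields, in the order
    # shown in the diagram above (second, minute, hour, ...).
    FIELD_NAMES = (
        "second", "minute", "hour",
        "day_of_month", "month", "day_of_week", "year",
    )

    def parse_cron(expression: str) -> dict:
        """Map each cron field to its name; raise if not seven fields."""
        fields = expression.split()
        if len(fields) != len(FIELD_NAMES):
            raise ValueError(f"expected 7 fields, got {len(fields)}")
        return dict(zip(FIELD_NAMES, fields))

    # Example: every day at 00:00:00
    print(parse_cron("0 0 0 * * ? *"))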
Tenant
~~~~~~

The tenant is the user who runs the task command on the machine or in the virtual machine. It can be assigned by a simple string.

.. code-block:: python

    # Assign an existing tenant to the process definition
    pd = ProcessDefinition(name="process definition tenant", tenant="tenant_exists")

.. note::

   Make sure the tenant exists on the target machine, otherwise an error will be raised when you try to run the command.

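On a Unix-like worker machine, you can check in advance whether a tenant user exists by using the Python standard library. This is a general sketch unrelated to the *PyDolphinScheduler* API; the helper name `tenant_exists` is invented, and the `pwd` module is Unix-only:

.. code-block:: python

    import pwd

    def tenant_exists(tenant: str) -> bool:
        """Return True if `tenant` is a Unix user on this machine (Unix-only)."""
        try:
            pwd.getpwnam(tenant)  # raises KeyError if the user is unknown
            return True
        except KeyError:
            return False

    # "root" exists on virtually every Unix system
    print(tenant_exists("root"))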
Tasks
-----

A task is the minimum unit that runs an actual job, and tasks are the nodes of the DAG, aka directed acyclic graph. You can define
what you want to do in the task. A task has some required parameters that make it unique and well defined.

Here we use :py:class:`pydolphinscheduler.tasks.Shell` as an example. The parameters `name` and `command` are required and must be provided. Parameter
`name` sets the name of the task, and parameter `command` declares the command you wish to run in this task.

| 101 | + |
.. code-block:: python

    # We name this task "shell", and it just runs the command `echo shell task`
    shell_task = Shell(name="shell", command="echo shell task")

If you want to see all types of tasks, see :doc:`tasks/index`.

Tasks Dependence
~~~~~~~~~~~~~~~~

You can define many tasks in one single `Process Definition`_. If all those tasks run in parallel,
you can leave them alone without adding any additional information. But if some tasks should
not run until their upstream tasks in the workflow are done, you should set task dependence on them. There are
two main ways to set task dependence, and both of them are easy: you can use the bitwise operators `>>` and `<<`,
or the task methods `set_downstream` and `set_upstream`.

.. code-block:: python

    # Set task1 as task2's upstream
    task1 >> task2
    # You can use the method `set_downstream` too, which is the same as `task1 >> task2`
    task1.set_downstream(task2)

    # Set task1 as task2's downstream
    task1 << task2
    # It is the same as the method `set_upstream`
    task1.set_upstream(task2)

    # Besides, we can set dependence between a task and a sequence of tasks;
    # here we set `task1` as upstream of both `task2` and `task3`. It is useful
    # when several tasks share the same dependence.
    task1 >> [task2, task3]

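Under the hood, `>>` and `<<` are plain Python operator overloads. The toy class below sketches how such an API can be built with `__rshift__` and `__lshift__`; it is an illustration of the pattern, not *PyDolphinScheduler*'s actual implementation:

.. code-block:: python

    # Toy illustration of the `>>` / `<<` dependence pattern; this is
    # NOT PyDolphinScheduler's real Task class.
    class Task:
        def __init__(self, name: str):
            self.name = name
            self.downstream: list["Task"] = []
            self.upstream: list["Task"] = []

        def set_downstream(self, other):
            # Accept a single task or a sequence of tasks
            tasks = other if isinstance(other, (list, tuple)) else [other]
            for task in tasks:
                self.downstream.append(task)
                task.upstream.append(self)

        def set_upstream(self, other):
            tasks = other if isinstance(other, (list, tuple)) else [other]
            for task in tasks:
                self.upstream.append(task)
                task.downstream.append(self)

        def __rshift__(self, other):
            # `self >> other`: self becomes upstream of other
            self.set_downstream(other)
            return other

        def __lshift__(self, other):
            # `self << other`: self becomes downstream of other
            self.set_upstream(other)
            return other

    task1, task2, task3 = Task("t1"), Task("t2"), Task("t3")
    task1 >> [task2, task3]  # t1 is upstream of both t2 and t3

Returning `other` from `__rshift__` is what makes chains like `task1 >> task2 >> task3` read left to right.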
Task With Process Definition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In most data orchestration cases, you should assign the attribute `process_definition` to a task instance to
decide which workflow the task belongs to. You can set `process_definition` in both normal assignment mode and context manager mode.

.. code-block:: python

    # Normal assignment: you have to explicitly pass the `ProcessDefinition` instance to the task
    pd = ProcessDefinition(name="my first process definition")
    shell_task = Shell(name="shell", command="echo shell task", process_definition=pd)

    # Context manager: the `ProcessDefinition` instance pd is implicitly assigned to the task
    with ProcessDefinition(name="my first process definition") as pd:
        shell_task = Shell(name="shell", command="echo shell task")

With `Process Definition`_, `Tasks`_ and `Tasks Dependence`_ together, we can build a workflow with multiple tasks.

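The overall pattern (declare a definition, attach tasks, wire up dependence, then hand the DAG over for execution) can be sketched with a self-contained toy. The names `ToyWorkflow` and `ToyTask` are invented for illustration and are not the *PyDolphinScheduler* API:

.. code-block:: python

    # Self-contained toy sketch of the overall pattern; `ToyWorkflow`
    # and `ToyTask` are invented and NOT part of PyDolphinScheduler.
    from collections import deque

    class ToyTask:
        def __init__(self, name, workflow):
            self.name = name
            self.downstream = []
            workflow.tasks.append(self)

        def __rshift__(self, other):
            # `self >> other`: self runs before other
            self.downstream.append(other)
            return other

    class ToyWorkflow:
        def __init__(self, name):
            self.name = name
            self.tasks = []

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            return False

        def topological_order(self):
            """Return task names in an order that respects dependence (Kahn's algorithm)."""
            indegree = {t: 0 for t in self.tasks}
            for t in self.tasks:
                for d in t.downstream:
                    indegree[d] += 1
            queue = deque(t for t in self.tasks if indegree[t] == 0)
            order = []
            while queue:
                t = queue.popleft()
                order.append(t.name)
                for d in t.downstream:
                    indegree[d] -= 1
                    if indegree[d] == 0:
                        queue.append(d)
            return order

    with ToyWorkflow(name="demo") as wf:
        extract = ToyTask("extract", wf)
        transform = ToyTask("transform", wf)
        load = ToyTask("load", wf)
        extract >> transform >> load

    print(wf.topological_order())  # → ['extract', 'transform', 'load']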
DolphinScheduler daemon
-----------------------