Reliable Task Framework: Raju Pandey
Questions
• Why do we need another workflow system?
• What doesn’t get solved by Oklahoma or Flyte?
• Azkaban-like DAGs
• Which workflow to choose?
• Deployment?
• How to migrate?
Organization
• Programming Model
• Execution and Failure Model
• Under the hood: Architecture
At the most basic level:
DAG/Workflow
• Partially ordered set of tasks & workflows
[Figure: example DAG with tasks T1; T2, T3, T4; T5]
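A minimal sketch of what "partially ordered set of tasks" implies for execution: any order that respects the edges is valid. The edge set below (T1 before T2, T3, T4, which all precede T5) is an assumption read off the diagram, and `DagSketch` and its methods are hypothetical names, not part of any framework discussed here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a DAG stored as "task -> tasks that must run after it".
class DagSketch {

    // Kahn's algorithm: repeatedly emit a task with no unfinished predecessors.
    static List<String> topologicalOrder(Map<String, List<String>> successors) {
        // Count unfinished predecessors for every task.
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String task : successors.keySet()) {
            pending.putIfAbsent(task, 0);
        }
        for (List<String> next : successors.values()) {
            for (String task : next) {
                pending.merge(task, 1, Integer::sum);
            }
        }

        Deque<String> ready = new ArrayDeque<>();
        pending.forEach((task, count) -> {
            if (count == 0) {
                ready.add(task);
            }
        });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.poll();
            order.add(task);
            for (String next : successors.getOrDefault(task, List.of())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    // Assumed edges: T1 -> {T2, T3, T4} -> T5.
    static Map<String, List<String>> exampleDag() {
        Map<String, List<String>> dag = new LinkedHashMap<>();
        dag.put("T1", List.of("T2", "T3", "T4"));
        dag.put("T2", List.of("T5"));
        dag.put("T3", List.of("T5"));
        dag.put("T4", List.of("T5"));
        dag.put("T5", List.of());
        return dag;
    }

    public static void main(String[] args) {
        System.out.println(topologicalOrder(exampleDag()));
    }
}
```

T2, T3, and T4 are unordered relative to each other, so a scheduler is free to run them in parallel; the sketch simply emits them in insertion order.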
What do we look for?
• How are DAGs defined?
• What kind of development support?
• Implications
Airflow DAG definition
Build DAG
[Figure: DAG with tasks T1; T2, T3, T4; T5]
Schedule DAG
Workflow as Code
• Java (Temporal):
• Workflow Interface and Workflow Implementation
• Task interface and Task implementation
• Workers that execute workflows and tasks
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;

@ActivityInterface
public interface T1 {
  @ActivityMethod
  String t1(P1 x);
}
[Figure: workflow/tasks, T1 through T5]
Why does that matter?
[Figure: flow from Begin through "Book a car" and "Book a flight" to End, plus a compensating-transactions path to End]
https://github.com/temporalio/samples-java/blob/main/src/main/java/io/temporal/samples/bookingsaga/TripBookingWorkflowImpl.java
Saga pattern
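public class BookTripWorkflowImpl implements BookTripWorkflow {
  …
  @Override
  public void bookTrip(String name) {
    // Configure SAGA to run compensation activities in parallel
    Saga.Options sagaOptions =
        new Saga.Options.Builder().setParallelCompensation(true).build();
    Saga saga = new Saga(sagaOptions);
    try {
      String carReservationID = activities.reserveCar(name);
      // Register the undo step as soon as the forward step succeeds
      saga.addCompensation(activities::cancelCar, carReservationID, name);
      // Remaining bookings (e.g. the flight) follow the same pattern
    } catch (ActivityFailure e) {
      // Run the registered compensations, then surface the failure
      saga.compensate();
      throw e;
    }
  }
}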
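The compensation idea can be illustrated without the SDK. The sketch below is a hypothetical, simplified stand-in (`SagaSketch` is not Temporal's `Saga` class): it runs compensations sequentially in reverse order rather than in parallel, and the `flightAvailable` flag is an assumption used only to trigger a failure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, SDK-free sketch of the saga idea; not Temporal's Saga class.
class SagaSketch {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    // Remember how to undo a step that has already succeeded.
    void addCompensation(Runnable undo) {
        compensations.push(undo);
    }

    // Undo completed steps, most recent first.
    void compensate() {
        while (!compensations.isEmpty()) {
            compensations.pop().run();
        }
    }

    // Books a car, then a flight; the flight step fails when flightAvailable is false.
    static List<String> bookTrip(boolean flightAvailable) {
        List<String> log = new ArrayList<>();
        SagaSketch saga = new SagaSketch();
        try {
            log.add("reserveCar");
            saga.addCompensation(() -> log.add("cancelCar"));

            if (!flightAvailable) {
                throw new IllegalStateException("no flight available");
            }
            log.add("bookFlight");
            saga.addCompensation(() -> log.add("cancelFlight"));
        } catch (IllegalStateException e) {
            // A later step failed: undo everything that already succeeded.
            saga.compensate();
            log.add("failed");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(bookTrip(true));
        System.out.println(bookTrip(false));
    }
}
```

When the flight step fails, only the car reservation has a registered compensation, so only `cancelCar` runs. Expressing this as ordinary try/catch code is the point of the slide: the control flow for the failure path is hard to express as static DAG edges.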
Temporal workflows
• MPs in Java, Python, Go, …
• Integration with:
  • Config2, InGraph, DataVault, …
• Debugging, Observability
[Figure labels: Workflow Code, SDK, MP, mint build, mint deploy]
Execution Model
Execution Model: Airflow
• Schedule Tasks
• Run Tasks
• Store execution state information
• Replay a DAG
• Use state to determine which tasks to re-execute
[Figure: DAG with tasks T1; T2, T3, T4; T5]
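A simplified sketch of state-based re-execution, assuming per-task state is stored as one of three values. This is not Airflow's API: `RerunSketch`, the `State` enum, and the example states are all hypothetical and only illustrate that the decision is made from stored task state, not from program state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: choose tasks to re-execute from stored per-task state.
class RerunSketch {
    enum State { SUCCESS, FAILED, NOT_RUN }

    // Any task whose stored state is not SUCCESS has to run (again).
    static List<String> tasksToRerun(Map<String, State> stored) {
        List<String> rerun = new ArrayList<>();
        for (Map.Entry<String, State> entry : stored.entrySet()) {
            if (entry.getValue() != State.SUCCESS) {
                rerun.add(entry.getKey());
            }
        }
        return rerun;
    }

    // Assumed scenario: T3 failed, so T5 never started.
    static Map<String, State> exampleState() {
        Map<String, State> stored = new LinkedHashMap<>();
        stored.put("T1", State.SUCCESS);
        stored.put("T2", State.SUCCESS);
        stored.put("T3", State.FAILED);
        stored.put("T4", State.SUCCESS);
        stored.put("T5", State.NOT_RUN);
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(tasksToRerun(exampleState()));
    }
}
```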
Execution Model for Temporal
• Event sourcing
• Capture workflow/task events – begin, end, fail, etc.
• Replay
• Recreate program state and execution state:
• Variable values
• Stacks, Threads
• Skip what has already been executed
• Execute unknowns
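The replay steps above can be sketched in plain Java. This is a toy model under stated assumptions, not Temporal's implementation: `ReplaySketch`, its `activity` method, and the three-step workflow are hypothetical, and the "history" holds only activity results rather than full begin/end/fail events.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of replay from an event history; not Temporal's implementation.
class ReplaySketch {
    private final List<String> history;               // recorded activity results, in order
    private int cursor = 0;                           // next recorded event to consume
    final List<String> executed = new ArrayList<>();  // activities actually run this time

    ReplaySketch(List<String> recorded) {
        this.history = new ArrayList<>(recorded);
    }

    String activity(String name, Supplier<String> body) {
        if (cursor < history.size()) {
            // Already executed in a previous run: skip it and reuse the recorded result.
            return history.get(cursor++);
        }
        // Unknown step: execute it and append the result to the history.
        String result = body.get();
        history.add(result);
        cursor++;
        executed.add(name);
        return result;
    }

    // The same workflow code runs on first execution and on replay.
    // Local variables a and b are rebuilt from history during replay.
    static String workflow(ReplaySketch ctx) {
        String a = ctx.activity("T1", () -> "a");
        String b = ctx.activity("T2", () -> a + "b");
        return ctx.activity("T3", () -> b + "c");
    }

    public static void main(String[] args) {
        // Assume a previous run recorded T1 and T2, then crashed before T3.
        ReplaySketch ctx = new ReplaySketch(List.of("a", "ab"));
        System.out.println(workflow(ctx));
        System.out.println(ctx.executed);
    }
}
```

Replay only reproduces the same state if the workflow code makes the same sequence of calls each time, which is why a change to workflow code has to be reconciled with existing histories.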
[Figure: parent workflow W1 (steps 1 to 4) and child workflow W2 (steps 5 to 8), with activities T1, T2, T3, T4]
Nested Workflow – failure at 3 in parent workflow
[Figure: parent workflow W1 (steps 1 to 4) and child workflow W2 (steps 5 to 8), with activities T1, T2, T3, T4]
Update = Change + Replay
[Figure: workflows W1 (steps 1 to 4) and W2 (steps 5 to 8) with activities T1 to T4, plus an added workflow W3 and an added activity T5]
Architecture
Architecture - Airflow
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html
Oklahoma Job Execution
https://docs.google.com/document/d/1sXax1Rs-hs7qt2NL1JlAt2t0WwHcNDUhDK18DqO8eU4/edit#
RTF Architecture @ LinkedIn
Temporal Server Cluster
• Rain Instances (three shown, …), each running:
  • Frontend service
  • History service
  • Matching service
  • Worker service
App & Worker Cluster
• Rain Instances (two shown, …), each running:
  • App. Code
  • Runtime
Storage Cluster
• MYSQL, MYSQL, …, MYSQL
Monitoring & Alerting
• InGraphs
Worker Performance Monitoring & Alerting
Server Performance
Scalability
https://docs.google.com/presentation/d/1x0ETmVVJcbluTSnJGo8F2sNL1GKJPwOh-2s53x_UKLg/edit#slide=id.g1157260aeaa_0_386
Other features
• Airflow: support for several utilities
  • Sensors: Check for some conditions
    • Files
    • SQLSensor
    • HivePartition
    • DateTime
  • Operators: Predefined tasks
    • BashOperator
    • PythonOperator
    • Email, mysql, postgres
  • Make it easier to integrate with external sources
• RTF: Forthcoming integration with events (Kafka, etc.)
  • DB, Hive Table: none yet, but can existing integration code from other services be used?
Other features
• ML workflow needs (Flyte)
• Dynamic and High frequency pipelines
• Data lineage
• Resource management (GPU allocation, etc.)
• Resource isolation – one task cannot affect others
• Results caching
Questions
• Why do we need another workflow system?
• What doesn’t get solved by Oklahoma or Flyte?
• Azkaban-like DAGs
• Which workflow to choose?
• Deployment?
• How to migrate?