Passive and partially active fault tolerance for massively parallel stream processing engines

Li Su; Yongluan Zhou

doi:10.1109/TKDE.2017.2720602

Abstract

Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime state and can recover a failed task by restoring that state from its latest checkpoint. An active approach, on the other hand, usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency, especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks, while only a selected set of tasks is actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs are generated before the recovery process completes. We also propose effective and efficient algorithms to optimize a partially active replication plan so as to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE, and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.
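
The PPA idea described in the abstract (checkpoint every task, actively replicate only as many tasks as resources allow) can be illustrated with a minimal Python sketch. The abstract does not specify the paper's optimization algorithms, so the greedy rule below is a stand-in heuristic and not the authors' method. The names `Task`, `replica_cost`, `quality_gain`, `plan_partial_replication`, and `budget` are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    replica_cost: int    # hypothetical: resource units an active replica needs (must be > 0)
    quality_gain: float  # hypothetical: tentative-output quality preserved if replicated


def plan_partial_replication(tasks, budget):
    """Pick tasks to replicate actively; every task is still checkpointed."""
    chosen = []
    remaining = budget
    # Greedy stand-in (an assumption, not the paper's algorithm):
    # consider tasks in order of quality gain per resource unit.
    ranked = sorted(tasks, key=lambda t: t.quality_gain / t.replica_cost, reverse=True)
    for task in ranked:
        if task.replica_cost <= remaining:
            chosen.append(task.name)
            remaining -= task.replica_cost
    return {
        "checkpointed": [t.name for t in tasks],  # passive approach covers all tasks
        "actively_replicated": chosen,            # limited by the available budget
    }


tasks = [
    Task("source", replica_cost=2, quality_gain=0.9),
    Task("join", replica_cost=3, quality_gain=0.6),
    Task("sink", replica_cost=1, quality_gain=0.3),
]
plan = plan_partial_replication(tasks, budget=3)
```

With a budget of 3 units, this sketch replicates `source` and `sink` and leaves `join` protected by checkpoints only, so a failure of `join` would fall back to passive recovery with tentative outputs in the meantime.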

Original language	English
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	31
Issue number	1
Pages (from-to)	32-45
ISSN	1041-4347
DOIs	https://doi.org/10.1109/TKDE.2017.2720602
Publication status	Published - 1 Jan 2019

Keywords

Data models
Distributed Stream Processing
Engines
Fault Tolerance
Fault tolerance
Fault tolerant systems
Semantics
Storms
Topology

Cite this

@article{9487dd4809d442c79b5042f0ee0eea27,
title = "Passive and partially active fault tolerance for massively parallel stream processing engines",
abstract = "Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.",
keywords = "Data models, Distributed Stream Processing, Engines, Fault Tolerance, Fault tolerance, Fault tolerant systems, Semantics, Storms, Topology",
author = "Li Su and Yongluan Zhou",
year = "2019",
month = jan,
day = "1",
doi = "10.1109/TKDE.2017.2720602",
language = "English",
volume = "31",
pages = "32--45",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society Press",
number = "1",
}

TY - JOUR
T1 - Passive and partially active fault tolerance for massively parallel stream processing engines
AU - Su, Li
AU - Zhou, Yongluan
PY - 2019/1/1
Y1 - 2019/1/1
AB - Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task's runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.
KW - Data models
KW - Distributed Stream Processing
KW - Engines
KW - Fault Tolerance
KW - Fault tolerance
KW - Fault tolerant systems
KW - Semantics
KW - Storms
KW - Topology
UR - http://www.scopus.com/inward/record.url?scp=85023742527&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2017.2720602
DO - 10.1109/TKDE.2017.2720602
M3 - Journal article
SN - 1041-4347
VL - 31
SP - 32
EP - 45
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 1
ER -
