Regular Expression Matching with Multi-Strings and Intervals

Philip Bille; Mikkel Thorup

Regular Expression Matching with Multi-Strings and Intervals

32 Citations (Scopus)

Abstract

Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of a regular expression is that we combine characters with union, concatenation, and kleene star operators. The length m is proportional to the number of characters. However, often the initial operation is to concatenate characters in fairly long strings, e.g., if we search for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream of characters in O(k log w/w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound of O(m log w/w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x,y}. Here, C is a set of characters and C{x,y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps which are frequently used for pattern matching in computational biology applications.

Original language	Undefined/Unknown
Title of host publication	SODA
Number of pages	12
Publisher	SIAM
Publication date	2010
Pages	1297-1308
Publication status	Published - 2010
Externally published	Yes

Cite this

@inproceedings{51f5452a3f574485ad81c79f13ffba29,

title = "Regular Expression Matching with Multi-Strings and Intervals",

abstract = "Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of a regular expression is that we combine characters with union, concatenation, and kleene star operators. The length m is proportional to the number of characters. However, often the initial operation is to concatenate characters in fairly long strings, e.g., if we search for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream of characters in O(k log w/w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound of O(m log w/w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x,y}. Here, C is a set of characters and C{x,y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps which are frequently used for pattern matching in computational biology applications.",

author = "Philip Bille and Mikkel Thorup",

year = "2010",

language = "Udefineret/Ukendt",

pages = "1297--1308",

booktitle = "SODA",

publisher = "SIAM",

}

TY - GEN

T1 - Regular Expression Matching with Multi-Strings and Intervals

AU - Bille, Philip

AU - Thorup, Mikkel

PY - 2010

Y1 - 2010

N2 - Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of a regular expression is that we combine characters with union, concatenation, and kleene star operators. The length m is proportional to the number of characters. However, often the initial operation is to concatenate characters in fairly long strings, e.g., if we search for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream of characters in O(k log w/w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound of O(m log w/w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x,y}. Here, C is a set of characters and C{x,y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps which are frequently used for pattern matching in computational biology applications.

AB - Regular expression matching is a key task (and often computational bottleneck) in a variety of software tools and applications. For instance, the standard grep and sed utilities, scripting languages such as perl, internet traffic analysis, XML querying, and protein searching. The basic definition of a regular expression is that we combine characters with union, concatenation, and kleene star operators. The length m is proportional to the number of characters. However, often the initial operation is to concatenate characters in fairly long strings, e.g., if we search for certain combinations of words in a firewall. As a result, the number k of strings in the regular expression is significantly smaller than m. Our main result is a new algorithm that essentially replaces m with k in the complexity bounds for regular expression matching. More precisely, after an O(m log k) time and O(m) space preprocessing of the expression, we can match it in a string presented as a stream of characters in O(k log w/w + log k) time per character, where w is the number of bits in a memory word. For large w, this corresponds to the previous best bound of O(m log w/w + log m). Prior to this work no O(k) bound per character was known. We further extend our solution to efficiently handle character class interval operators C{x,y}. Here, C is a set of characters and C{x,y}, where x and y are integers such that 0 ≤ x ≤ y, represents a string of length between x and y from C. These character class intervals generalize variable length gaps which are frequently used for pattern matching in computational biology applications.

M3 - Konferencebidrag i proceedings

SP - 1297

EP - 1308

BT - SODA

PB - SIAM

ER -