Two-pass greedy regular expression parsing

Niels Bjørn Bugge Grathwohl, Fritz Henglein, Lasse Nielsen, Ulrik Terp Rasmussen

6 citations (Scopus)

Abstract

We present new algorithms for producing greedy parses for regular
expressions (REs) in a semi-streaming fashion. Our lean-log
algorithm executes in time O(mn) for REs of size m and input
strings of size n and outputs a compact bit-coded parse tree
representation. It improves on previous algorithms by:
operating in only 2 passes; using only O(m) words of
random-access memory (independent of n); requiring only kn
bits of sequentially written and read log storage, where k <
1/3 m is the number of alternatives and Kleene stars in the
RE; processing the input string as a symbol stream and not
requiring it to be stored at all. Previous RE parsing
algorithms either do not scale linearly with input size; or
require substantially more log storage and employ 3 passes,
the first of which reverses the input; or do not produce, or
are not known to produce, a greedy parse. The performance of our
unoptimized C-based prototype indicates that the superior
performance of our lean-log algorithm can also be observed in
practice; it is also surprisingly competitive with RE tools
not performing full parsing, such as Grep.
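
To make the terms "greedy parse" and "bit-coded parse tree" concrete, the following is a minimal Python sketch. It is not the lean-log algorithm: it is a plain backtracking parser with exponential worst-case time, given only to show what output a greedy parser is expected to produce. The class and function names (Lit, Seq, Alt, Star, count_choice_nodes, greedy_bitcode) are hypothetical, and the bit convention is an assumption: 0 selects the left alternative or one more star iteration, 1 selects the right alternative or the end of a star. Star iterations that consume no input are skipped, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    char: str


@dataclass(frozen=True)
class Seq:
    left: object
    right: object


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    body: object


def count_choice_nodes(re):
    """Count the alternatives and Kleene stars in re (the k of the abstract)."""
    if isinstance(re, Lit):
        return 0
    if isinstance(re, Star):
        return 1 + count_choice_nodes(re.body)
    extra = 1 if isinstance(re, Alt) else 0
    return extra + count_choice_nodes(re.left) + count_choice_nodes(re.right)


def parses(re, s, i):
    """Yield (end, bits) for each way re matches a prefix of s[i:].

    Results come out in greedy priority order: left alternative before
    right, one more star iteration before stopping.
    """
    if isinstance(re, Lit):
        if i < len(s) and s[i] == re.char:
            yield i + 1, ""
    elif isinstance(re, Seq):
        for j, left_bits in parses(re.left, s, i):
            for k, right_bits in parses(re.right, s, j):
                yield k, left_bits + right_bits
    elif isinstance(re, Alt):
        for j, bits in parses(re.left, s, i):
            yield j, "0" + bits
        for j, bits in parses(re.right, s, i):
            yield j, "1" + bits
    elif isinstance(re, Star):
        for j, body_bits in parses(re.body, s, i):
            if j > i:  # simplification: skip iterations that consume nothing
                for k, rest_bits in parses(re, s, j):
                    yield k, "0" + body_bits + rest_bits
        yield i, "1"


def greedy_bitcode(re, s):
    """Return the bit code of the greedy parse of the whole string, or None."""
    for end, bits in parses(re, s, 0):
        if end == len(s):
            return bits
    return None


if __name__ == "__main__":
    # (a | ab) b*  is ambiguous on "ab"; the greedy parse takes the left 'a'.
    ambiguous = Seq(Alt(Lit("a"), Seq(Lit("a"), Lit("b"))), Star(Lit("b")))
    text = "ab"
    print(greedy_bitcode(ambiguous, text))  # prints 001
    k = count_choice_nodes(ambiguous)
    print(k * len(text))  # the kn log-size bound from the abstract, in bits
```

In the example, the string "ab" has two parses under (a | ab) b*: the left alternative followed by one star iteration (bits 001), or the right alternative followed by an empty star (bits 11). The greedy parse is the first of these. The expression has k = 2 choice nodes, so the kn figure quoted in the abstract works out to 4 bits of log for this 2-symbol input.
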

Original language: English
Title: Implementation and Application of Automata: 18th International Conference, CIAA 2013, Halifax, NS, Canada, July 16-19, 2013. Proceedings
Editors: Stavros Konstantinidis
Number of pages: 12
Publisher: Springer
Publication date: 2013
Pages: 60-71
ISBN (Print): 978-3-642-39273-3
ISBN (Electronic): 978-3-642-39274-0
Status: Published - 2013
Event: 18th International Conference on Implementation and Application of Automata - Halifax, Canada
Duration: 16 Jul 2013 to 19 Jul 2013
Conference number: 18

Conference

Conference: 18th International Conference on Implementation and Application of Automata
Number: 18
Country/Territory: Canada
City: Halifax
Period: 16/07/2013 to 19/07/2013
