Abstract
In the first part of this thesis, we investigate methods for regular expression parsing.
We present an O(mn) two-pass algorithm for greedy regular expression parsing in a semi-streaming fashion for expressions of size m and input of size n.
Pass one outputs a log of k bits per input symbol, where k is the number of alternatives and Kleene stars in the expression. This log is used in the second pass to produce a full parse tree.
The two-pass algorithm is extended to an O(2^{m log m}+mn)-time optimally streaming parsing algorithm: parts of the parse tree are output as soon as it is semantically possible to do so.
To be optimal, the algorithm performs a PSPACE-complete preprocessing step; for a fixed RE the running time is linear in the input size.
Finally, we present and implement a determinization procedure, omitting the preprocessing step, and a surface language, Kleenex, for expressing general string transductions.
We have implemented a compiler that translates Kleenex programs into efficient C code.
The resulting programs are essentially optimally streaming, run in worst-case linear time in the input size, and show consistent high performance in the 1 Gbps range on various use cases.
In the second part of this thesis, we study two extensions to Kleene algebra.
Chomsky algebra is an algebra with a structure similar to Kleene algebra, but with a generalized mu-operator for recursion instead of the Kleene star.
We show that the axioms of idempotent semirings along with continuity of the mu-operator completely axiomatize the equational theory of the context-free languages.
KAT+B! is an extension to Kleene algebra with tests (KAT) that adds mutable state.
We describe a test algebra B! for mutable tests and give a commutative coproduct between KATs.
Combining the axioms of B! with those of KAT and some commutativity conditions completely axiomatize the equational theory of an arbitrary KAT enriched with mutable state.
We present an O(mn) two-pass algorithm for greedy regular expression parsing in a semi-streaming fashion for expressions of size m and input of size n.
Pass one outputs a log of k bits per input symbol, where k is the number of alternatives and Kleene stars in the expression. This log is used in the second pass to produce a full parse tree.
The two-pass algorithm is extended to an O(2^{m log m}+mn)-time optimally streaming parsing algorithm: parts of the parse tree are output as soon as it is semantically possible to do so.
To be optimal, the algorithm performs a PSPACE-complete preprocessing step; for a fixed RE the running time is linear in the input size.
Finally, we present and implement a determinization procedure, omitting the preprocessing step, and a surface language, Kleenex, for expressing general string transductions.
We have implemented a compiler that translates Kleenex programs into efficient C code.
The resulting programs are essentially optimally streaming, run in worst-case linear time in the input size, and show consistent high performance in the 1 Gbps range on various use cases.
In the second part of this thesis, we study two extensions to Kleene algebra.
Chomsky algebra is an algebra with a structure similar to Kleene algebra, but with a generalized mu-operator for recursion instead of the Kleene star.
We show that the axioms of idempotent semirings along with continuity of the mu-operator completely axiomatize the equational theory of the context-free languages.
KAT+B! is an extension to Kleene algebra with tests (KAT) that adds mutable state.
We describe a test algebra B! for mutable tests and give a commutative coproduct between KATs.
Combining the axioms of B! with those of KAT and some commutativity conditions completely axiomatize the equational theory of an arbitrary KAT enriched with mutable state.
Original language | English |
---|
Publisher | Department of Computer Science, Faculty of Science, University of Copenhagen |
---|---|
Number of pages | 197 |
Publication status | Published - 2015 |