1. Introduction
The congruence closure data structure, also known as the e-graph, is a central component of SMT solvers (simplify; z3; moskal; cvc4) and equality saturation-based optimizers (eqsat; egg). An e-graph compactly represents a set of terms and an equivalence relation over those terms. An important operation on e-graphs is e-matching, which finds the set of terms in an e-graph matching a given pattern. In SMT solvers, e-matching is used to instantiate quantified formulas over ground terms. In equality saturation, e-matching is used to match rewrite rules against an e-graph to discover new equivalent programs. The efficiency of e-matching greatly affects the overall performance of an SMT solver (z3; cvc4), and slow e-matching is a major bottleneck in equality saturation (egg; tensat; szalinski). In a typical application of equality saturation, e-matching is responsible for 60–90% of the overall run time (egg).
Several algorithms have been proposed for e-matching (efficientematching; moskal; simplify). However, due to the NP-completeness of e-matching (ematchingnph), most algorithms implement some form of backtracking search, which is inefficient in many cases. In particular, backtracking search exploits only structural constraints, i.e., constraints about the shape of a pattern, and defers checking equality constraints, i.e., constraints that variables be mapped consistently. This leads to suboptimal run time when the equality constraints dominate the structural constraints.
To improve the performance of backtracking-based e-matching, existing systems implement various optimizations. Some of these optimizations handle only patterns of certain simple shapes and are therefore ad hoc in nature (moskal). Others attempt to incrementalize e-matching upon changes to the e-graph, or match multiple similar patterns together to eliminate duplicated work (efficientematching). However, these optimizations are complex to implement and fail to generalize to workloads where the e-graph changes rapidly or where the patterns are complex.
To tackle the inefficiency and complexity involved in e-matching, we propose a systematic approach called relational e-matching. Our approach is based on the observation that e-matching is an instance of a well-studied problem in the databases community, namely answering conjunctive queries. We therefore propose to solve e-matching on an e-graph by reducing it to answering conjunctive queries on a relational database. This approach has several benefits. First, by reducing e-matching to conjunctive queries, we simplify e-matching by taking advantage of decades of study by the databases community. Second, the relational representation provides a unified way to express both the structural constraints and the equality constraints in patterns, allowing query optimizers to leverage both kinds of constraints to generate asymptotically faster query plans. Finally, by leveraging generic join, a novel algorithm developed in the databases community, our technique achieves the first worst-case optimal bound for e-matching.
Relational e-matching is provably optimal despite the NP-hardness of e-matching. The databases community makes a clear distinction between query complexity, which depends on the size of the query, and data complexity, which depends on the size of the database. The NP-hardness result (ematchingnph) is stated over the size of the pattern, yet in practice only small patterns are matched against a large e-graph. When we hold the size of each pattern constant, relational e-matching runs in time polynomial in the size of the e-graph.
Our approach is widely applicable. For example, multipatterns are typically framed as an extension to e-matching that allows the user to find matches satisfying multiple patterns simultaneously. Efficient support for multipatterns requires modifying the basic backtracking algorithm (efficientematching). In contrast, relational e-matching inherently supports multipatterns for free. The relational model also opens the door to entirely new kinds of optimizations, such as persistent or incremental e-graphs.
To evaluate our approach, we implemented relational e-matching for egg, a state-of-the-art implementation of e-graphs. Relational e-matching is simpler, more modular, and orders of magnitude faster than egg's e-matching implementation.
In summary, we make the following contributions in this paper:

We propose relational e-matching, a systematic approach to e-matching that is simple, fast, and optimal.

We adapt generic join to implement relational e-matching, and provide the first data complexity results for e-matching.

We prototyped relational e-matching in egg, a state-of-the-art e-graph implementation, and we show that relational e-matching can be orders of magnitude faster. (We will open source our implementation.)
The rest of the paper is organized as follows: Section 2 reviews relevant background on the e-graph data structure, the e-matching problem, conjunctive queries, and join algorithms. Section 3 presents our relational view of e-graphs, our e-matching algorithm, and the complexity results. Section 4 discusses optimizations to our core algorithm and addresses various practical concerns. Section 5 evaluates our algorithm and implementation with a set of experiments in the context of equality saturation. Section 6 discusses how the relational model opens up many avenues for future work on e-graphs and e-matching, and Section 7 concludes.
2. Background
Throughout the paper we follow the notation in Figure 1. We define the e-graph data structure and the e-matching problem, and review background on relational queries and the join algorithms that form the foundation of our e-matching algorithm.
2.1. E-Graphs and E-Matching
(Figure 1 defines the syntax of function symbols, variables, e-class ids, ground terms, patterns, e-nodes, and e-classes.)
Terms.
Let Σ be a set of function symbols with associated arities. A function symbol is called a constant if it has zero arity. Let V be a set of variables. We define Term(Σ, V) to be the set of terms constructed using function symbols from Σ and variables from V. More formally, Term(Σ, V) is the smallest set such that (1) all variables and constants are in Term(Σ, V), and (2) t_1, ..., t_k ∈ Term(Σ, V) implies f(t_1, ..., t_k) ∈ Term(Σ, V), where f ∈ Σ has arity k. A ground term is a term that contains no variables; all terms in Term(Σ, ∅) are ground terms. A non-ground term is also called a pattern. We call a term of the form f(t_1, ..., t_k) an application term.
Congruence relation.
An equivalence relation ≡ is a binary relation over Term(Σ, V) that is reflexive, symmetric, and transitive. A congruence relation ≅ is an equivalence relation that additionally respects function application: t_i ≅ t_i' for all i implies f(t_1, ..., t_k) ≅ f(t_1', ..., t_k'). We write ≡ and ≅ without qualification when the underlying relation is clear from the context.
E-graph.
An e-graph is a set of e-classes, where each e-class is a set of e-nodes. Each e-node consists of a function symbol and a list of children e-class ids. Similar to terms, we call an e-node of the form f(i_1, ..., i_k) an application e-node.
Given the definitions and syntax in Figure 1, we can more formally define an e-graph as a tuple of the following components:

A union-find (unionfind) data structure stores an equivalence relation (denoted ≡) over e-class ids. The union-find provides a function find that canonicalizes e-class ids such that find(i) = find(j) if and only if i ≡ j. An e-class id i is canonical if find(i) = i.

The e-class map maps e-class ids to e-classes. All equivalent e-class ids map to the same e-class, i.e., i ≡ j if and only if i and j map to the same e-class. An e-class id is said to refer to the e-class it maps to.

A function lookup maps each e-node to the id of the e-class that contains it.
No two e-nodes have the same symbol and children, i.e., an e-node's symbol and children together uniquely identify the e-class that contains it. This property is necessary for lookup to be a function. Section 4.3 explains how this property also translates to a functional dependency in the e-graph's relational representation, which can be leveraged to further optimize relational e-matching.
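The components above can be sketched concretely. Below is a minimal, hypothetical e-graph in Python (all names are ours, not egg's), combining a union-find over e-class ids, an e-class map, and the lookup function realized as a hashcons dictionary; merging of e-classes is omitted for brevity:

```python
class EGraph:
    """Hypothetical minimal e-graph sketch; e-class merging is omitted."""

    def __init__(self):
        self.parent = []     # union-find parent pointers over e-class ids
        self.classes = {}    # canonical id -> set of e-nodes (symbol, child ids)
        self.hashcons = {}   # canonical e-node -> e-class id (the lookup function)

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def add(self, sym, children):
        node = (sym, tuple(self.find(c) for c in children))
        if node in self.hashcons:   # symbol + children uniquely identify the e-class
            return self.hashcons[node]
        i = len(self.parent)
        self.parent.append(i)       # a fresh id is its own parent (canonical)
        self.classes[i] = {node}
        self.hashcons[node] = i
        return i
```

Note how the hashcons realizes the uniqueness property from the text: adding the same e-node twice returns the same e-class id.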
An e-graph efficiently represents sets of ground terms in a congruence relation. An e-graph, e-class, or e-node is said to represent a term if the term can be “found” within it. Representation is defined recursively:

An e-graph represents a term if any of its e-classes does.

An e-class represents a term if any of its e-nodes does.

An e-node f(i_1, ..., i_k) represents a term f(t_1, ..., t_k) if they have the same function symbol and the e-class referred to by i_j represents the term t_j for each j.
Figure 2 shows an e-graph and the set of terms it represents, grouped into equivalence classes. Note that the e-graph is small, yet it represents many terms. In general, an e-graph can represent exponentially many terms in polynomial space. If the e-graph has cycles, it can even represent an infinite set of terms. For example, an e-graph with a single e-class containing a constant a and an e-node g whose child is that same e-class represents the infinite set of terms a, g(a), g(g(a)), and so on.
E-matching.
E-matching finds the set of terms in an e-graph matching a given pattern. Specifically, e-matching finds a set of substitutions, each paired with a root e-class. An e-matching substitution σ is a function that maps every variable in a pattern to an e-class. For convenience, we use σ(p) to denote the set of terms obtained by replacing every occurrence of each variable x in the pattern p with terms represented by the e-class σ(x).
Given an e-graph and a pattern p, e-matching finds the set of all pairs (σ, c) such that every term in σ(p) is represented in the e-class c. Terms in σ(p) are said to be matched by pattern p, and c is said to be the root of the matched terms. For example, matching the pattern f(x, g(x)) against the e-graph in Figure 2 produces the following substitutions, each with the same root:
Existing e-matching algorithms perform backtracking search directly on the e-graph (efficientematching; simplify; egg). Figure 3 shows an abstract backtracking-based e-matching algorithm; most e-matching algorithms using backtracking search can be viewed as optimizations of this abstract algorithm. Specifically, it performs a top-down search following the shape of the pattern and prunes the set of candidate substitutions when necessary. To match the pattern f(x, g(x)), backtracking search visits candidate terms top-down, backtracking whenever a structural constraint fails.
For each term visited, the algorithm yields a match whenever the equality constraint on x is satisfied. Despite there being only a small number of matches, backtracking search may enumerate a number of candidates quadratic in the size of the e-graph.
This inefficiency is due to the fact that naïve backtracking does not use the equality constraints to prune the search space globally. Specifically, the above e-matching pattern corresponds to three constraints on a potential matching term t:

t should have function symbol f.

t’s second child should have function symbol g.

t’s first child should be equivalent to the child of t’s second child.
We can categorize these constraints into two kinds:

Structural constraints are derived from the structure of the pattern. The structure of the pattern f(x, g(x)) constrains the root symbol and the second child’s symbol to be f and g respectively (i.e., constraints 1 and 2).

Equality constraints are implied by multiple occurrences of the same variable. Here, the two occurrences of x imply that the terms at these positions must be equivalent in every match (i.e., constraint 3). Following moskal, we call patterns without equality constraints linear patterns.
Backtracking search exploits the structural constraints first and defers checking the equality constraints to the end. For our example pattern f(x, g(x)), backtracking search enumerates every term of the shape f(_, g(_)), regardless of whether the two children bound to x are equivalent, only to discard inequivalent matches later. Complex patterns may involve many variables that each occur in several places, which makes naïve backtracking search enumerate a very large number of candidates even when the result size is small.
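To make the wasted work concrete, here is a sketch (encoding and names are hypothetical) of naïve backtracking e-matching for the pattern f(x, g(x)): it enumerates all structurally matching candidates and checks the equality constraint on x only at the end:

```python
def backtrack_match(classes, find):
    # classes: e-class id -> set of (symbol, tuple of child ids); find canonicalizes ids.
    matches = []
    for root, nodes in classes.items():
        for sym, kids in nodes:
            if sym != "f" or len(kids) != 2:
                continue                         # structural constraint: root is f/2
            x1, c2 = kids
            for sym2, kids2 in classes[find(c2)]:
                if sym2 != "g" or len(kids2) != 1:
                    continue                     # structural constraint: 2nd child is g/1
                x2 = kids2[0]
                if find(x1) == find(x2):         # equality constraint, checked last
                    matches.append((root, find(x1)))
    return matches
```

On an e-graph with many terms of the shape f(_, g(_)) but few where the two sides agree, the inner loops still enumerate every candidate pair before discarding it.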
2.2. Conjunctive Queries
Conjunctive queries are the subset of relational algebra queries that use only the select, project, and join operators (as opposed to union, difference, or aggregation). Conjunctive queries have many desirable theoretical properties (such as decidable equivalence checking), and they enjoy efficient execution thanks to decades of research by the databases community.
Relational databases.
A relational schema over a domain D is a set of relation symbols with associated arities. A relation under a schema is a set of tuples; a tuple in a relation of arity k consists of k elements of D. A database instance (or simply database) of a schema is a set of relations under that schema.
We write π_S(R) for the projection of the relation R onto the attributes in S.
Conjunctive queries.
A conjunctive query over a schema is a formula of the form:
Q(x_1, ..., x_k) ← R_1(v_1), ..., R_n(v_n)
where the R_i are relation symbols in the schema and each v_i is a tuple of variables matching the arity of R_i. (Some definitions of conjunctive queries allow both variables and constants. We only allow variables, without loss of generality: any constant can be specified with a distinguished singleton relation.) We call the part Q(x_1, ..., x_k) the head of the query; the remainder is the body. Each R_i(v_i) is called an atom. Variables that appear in the head are called free variables, and they must appear in the body. Variables that appear in the body but not the head are called bound variables, since they are implicitly existentially quantified.
Semantics of conjunctive queries.
Similar to e-matching, evaluating a conjunctive query yields substitutions. Specifically, evaluation yields substitutions that map the free variables of the query to elements of the domain such that there exists a mapping of the bound variables under which every substituted atom is present in the database. Bound variables are projected out and are not present in the resulting substitutions.
More formally, let D be a database and let Q be a conjunctive query over the same schema. Evaluating Q over D yields a substitution σ of the free variables if and only if σ can be extended to a mapping of all variables in Q such that, for every atom in the body, the substituted tuple is in the corresponding relation of D.
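This semantics can be captured by a deliberately naïve evaluator (exponential in the number of variables; for illustrating the semantics only, with our own hypothetical encoding): enumerate all assignments and keep those under which every atom is in the database.

```python
from itertools import product

def eval_cq(db, head_vars, atoms):
    # db: relation name -> set of tuples; atoms: list of (relation name, variable tuple)
    all_vars = sorted({v for _, vs in atoms for v in vs})
    domain = {x for rel in db.values() for t in rel for x in t}
    out = set()
    for vals in product(domain, repeat=len(all_vars)):
        m = dict(zip(all_vars, vals))        # a candidate mapping of all variables
        if all(tuple(m[v] for v in vs) in db[r] for r, vs in atoms):
            out.add(tuple(m[v] for v in head_vars))  # project onto the free variables
    return out
```

Real engines never enumerate the full domain like this; the rest of this section reviews how they evaluate conjunctive queries efficiently.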
In practice, conjunctive queries are often evaluated according to a query plan that dictates each step of execution. For example, many industrial database systems construct tree-like query plans, where each node describes an operation like scanning a relation or joining two intermediate relations. Industrial database systems typically construct query plans based on binary join algorithms, such as hash join and sort-merge join, which process two relations at a time. The quality of a query plan critically determines the performance of evaluating a conjunctive query.
We observe that conjunctive queries and e-matching are structurally similar: both are defined as finding substitutions whose instantiations are present in a database. It is therefore tempting to reduce e-matching to answering a conjunctive query over a relational database, thereby benefiting from well-studied techniques from the databases community, including join algorithms and query optimization. We achieve exactly this in Section 3.
2.3. Worst-Case Optimal Join Algorithms
The run time of any algorithm for answering conjunctive queries is lower-bounded by the output size, assuming the output must be materialized. How large can the output of a conjunctive query be on a particular database? A naïve bound is simply the product of the sizes of the relations, i.e., the size of their Cartesian product. Such a naïve bound fails to consider the query’s structure. The AGM bound (agm) gives a bound on the worst-case output size. In fact, the AGM bound is tight: there always exists a database on which the query’s output size equals the AGM bound.
The AGM bound and worst-case optimal joins are recent developments in databases research. We do not attempt to provide comprehensive background on these topics here; familiarity with the existence of the AGM bound and the generic join algorithm is sufficient for this paper.
Consider the query Q(x, y, z) ← R(x, y), S(y, z), T(z, x), also known as the “triangle query,” since its output tuples are the triangles formed by the edge relations R, S, and T. The trivial bound is |R| × |S| × |T|; if |R| = |S| = |T| = N, this gives N^3. We can derive a tighter bound of N^2 from the pairwise join R ⋈ S, which contains no fewer tuples than the query output, since the query further requires tuples to be in T. The AGM bound for the triangle query is even smaller: N^{3/2}. It is computed from a fractional edge cover of the query hypergraph.
Query Hypergraph
The hypergraph of a query has a vertex for each variable and a (hyper)edge for each atom. The edge for an atom R(x_1, ..., x_k) connects the vertices corresponding to the variables x_1, ..., x_k. Figure 4 illustrates the triangle query’s hypergraph.
Cyclic and Acyclic Queries
Certain queries can be represented by a tree, called the join tree, where each node corresponds to an atom in the query and, for each variable, the nodes whose atoms contain that variable form a connected subtree. Queries that admit such a join tree are said to be acyclic; otherwise, the query is cyclic. (A cycle in the hypergraph does not necessarily entail a cyclic query, since the hypergraph may still admit a join tree.) The triangle query is cyclic because it cannot be represented by a join tree. Acyclic queries can be answered more efficiently than cyclic ones.
Fractional Edge Cover
A set of edges covers a graph if the edges touch all vertices. For the triangle query’s hypergraph, any two edges form a cover. A fractional edge cover assigns a weight in the interval [0, 1] to each edge such that, for each vertex, the weights of the edges containing that vertex sum to at least 1. Every edge cover is a fractional edge cover in which each edge in the cover is assigned weight 1 and every other edge weight 0. For the triangle query’s hypergraph, assigning weight 1/2 to each of the three edges yields the fractional edge cover with the lowest total weight.
The AGM Bound
The AGM bound (agm) for a query with body atoms R_1, ..., R_n is defined as |R_1|^{w_1} × ... × |R_n|^{w_n}, where (w_1, ..., w_n) forms a fractional edge cover. For example, the AGM bound for the triangle query is |R|^{1/2} |S|^{1/2} |T|^{1/2} = N^{3/2} when each weight is 1/2. This is an upper bound on the query’s output size; i.e., in the worst case the triangle query outputs N^{3/2} tuples.
Generic Join
A desirable algorithm for answering conjunctive queries should run in time linear in the worst-case output size. Recent developments in the databases community have led to such worst-case optimal algorithms (wcoj), one of which is generic join (gj). Generic join has one parameter: an ordering of the variables in the query. Any ordering guarantees a run time linear in the worst-case output size, but different orderings can lead to dramatically different run times in practice (emptyheaded).
Algorithm 1 shows the generic join algorithm. Generic join is recursive, proceeding in two steps. First, it chooses a variable from the query and computes the set of possible values for that variable. Then, for each of those values, it builds a residual query by replacing occurrences of the variable with that value. The residual queries are solved recursively, and when no variables remain, the algorithm yields the substitution it has accumulated so far.
Generic join requires two performance properties to hold in order for its run time to meet the AGM bound. First, the intersection of candidate value sets must run in time proportional to the size of the smallest set. Second, residual relations must be computed in constant time; i.e., fixing an attribute of a relation to a concrete value must take constant time. Both requirements can be met by using tries (sometimes called prefix trees) as the indexing data structure. Figure 5 shows an example trie. Tries allow fast intersection because each node is a map, and maps can be intersected in time linear in the size of the smaller map. Tries also allow constant-time access to residual relations, provided they are built according to a compatible variable ordering.

A useful way to understand generic join is to observe the loops and intersections it performs on a specific query. Algorithm 2 shows generic join computing the triangle query. Given a variable ordering (x, y, z in this case), generic join assumes the input relations are stored in tries according to that ordering, so R is stored in a trie with values of x on the first level and values of y on the second. This makes accessing the residual relations fast, since the replacement of variables with values follows the given variable ordering. Note how the algorithm is essentially just nested for loops. There is no explicit filtering step; the intersections guarantee that once a complete tuple of values is selected, it can be output immediately without additional checking.
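As a concrete sketch, here is generic join on the triangle query in Python, using nested dictionaries as tries for the variable ordering x, y, z (since T(z, x) mentions x, its trie is keyed by x first to match the ordering; the encoding is ours):

```python
def triangle_gj(R, S, T):
    # Index relations as tries per the ordering x, y, z:
    #   R(x, y): x -> set of y;  S(y, z): y -> set of z;  T(z, x): x -> set of z
    Rx, Sy, Tx = {}, {}, {}
    for x, y in R:
        Rx.setdefault(x, set()).add(y)
    for y, z in S:
        Sy.setdefault(y, set()).add(z)
    for z, x in T:
        Tx.setdefault(x, set()).add(z)
    out = []
    for x in Rx.keys() & Tx.keys():      # candidate values for x
        for y in Rx[x] & Sy.keys():      # residual query: candidates for y
            for z in Sy[y] & Tx[x]:      # innermost level: candidates for z
                out.append((x, y, z))    # no post-hoc filtering needed
    return out
```

The three nested loops mirror Algorithm 2: each level intersects the candidate sets for one variable, so every completed tuple is a triangle.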
3. Relational E-Matching
E-matching via backtracking search is inefficient because it handles equality constraints suboptimally. Backtracking search follows edges in the e-graph and visits only terms that satisfy the structural constraints; equality constraints are checked a posteriori, only after the search visits a (partial) term. (Backtracking e-matching can perform the check as soon as it has traversed enough of the pattern to encounter a variable more than once.) Whenever many terms satisfy the structural constraints but not the equality constraints, as in our example pattern f(x, g(x)), backtracking wastes time visiting terms that do not yield a match.
By reducing e-matching to evaluating conjunctive queries, we can use join algorithms that take advantage of both structural and equality constraints. Figure 6 conveys this intuition using the pattern f(x, g(x)) and the example e-graph and database from Figure 7. The backtracking approach considers every possible assignment to the variables, even those where the two occurrences of x do not agree.
We can instead formulate a conjunctive query that is equivalent to the pattern:
Q(root, x) ← R_f(root, x, v), R_g(v, x)
Later subsections detail how this conversion is done, but note how the auxiliary variable v captures the structural constraint from the pattern. Evaluating this query with a simple hash join strategy exemplifies the benefits of the relational approach: it considers structural and equality constraints together (in this case by doing a hash join keyed on the shared variables); indeed, the relational perspective sees no difference between the two kinds of constraints.

This observation leads us to a very simple algorithm for relational e-matching, shown in Algorithm 3. Relational e-matching takes an e-graph and a set of patterns. It first transforms the e-graph into a relational database. Then, it compiles every pattern into a conjunctive query. Finally, it evaluates the conjunctive queries over the database. These steps are detailed in the following subsections.
3.1. From the E-Graph to a Relational Database


The first step of relational e-matching is to transform the e-graph into a relational database. The domain of the database is the set of e-class ids, and its schema is determined by the function symbols in the e-graph. Every e-node with symbol f corresponds to a tuple in the relation R_f. If f has arity k, then R_f has arity k + 1; its first attribute is the id of the e-class containing the e-node, and the remaining attributes are the children of the e-node. Figure 7 shows an example e-graph and part of its corresponding database; only the relations for the function symbols f and g are shown there. Each of the remaining relations represents a constant and has exactly one tuple.
We construct the database by simply looping over every e-node in every e-class and adding a tuple to the corresponding relation.
Note that the tuples in the database contain only canonical e-class ids, as returned by the find function.
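This construction is only a few lines of code. A sketch, assuming an e-class map encoded as a dictionary from e-class id to a set of (symbol, children) e-nodes (the encoding and names are ours):

```python
def to_database(classes, find):
    # classes: e-class id -> set of (symbol, tuple of child ids); find canonicalizes ids.
    db = {}
    for cid, nodes in classes.items():
        for sym, kids in nodes:
            row = (find(cid),) + tuple(find(k) for k in kids)
            db.setdefault(sym, set()).add(row)   # one relation per function symbol
    return db
```

Canonicalizing every id with find as the tuples are built is what lets the later compilation step join directly on auxiliary variables.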
Our presentation in this paper targets e-matching use cases like equality saturation, where the cost of building the database can be amortized. In this setting, e-matching is done in large batches, and expensive work like congruence closure can be amortized across these batches using a technique called “rebuilding” (egg). The time complexity of building the database is always linear, which is subsumed by the time complexity of most nontrivial e-matching patterns. In Section 6.3, we discuss how this technique could be generalized to the non-amortized setting of a frequently updated e-graph as future work.
3.2. From Patterns to Conjunctive Queries
Once we have a database that corresponds to the e-graph, we must convert each pattern we wish to e-match into a conjunctive query. We use the algorithm in Figure 8 to “unnest” a pattern into a conjunctive query by connecting nested patterns with auxiliary variables.
The Aux function returns a variable and a list of conjunctive query atoms. For a non-variable pattern f(p_1, ..., p_k), Aux produces a fresh variable v and the concatenation of the atom R_f(v, v_1, ..., v_k) with the atoms produced by the recursive calls, where v_i is the variable returned by Aux(p_i). For a variable pattern x, Aux simply returns x and an empty list. Note that the auxiliary variables introduced by Aux are not included in the head of the query, and thus are not part of the output.
Given a pattern p, the Compile function returns a conjunctive query whose body atoms come from Aux(p) and whose head consists of the root variable and the variables of p. The compiled conjunctive query and the original e-matching pattern are equivalent because there is a one-to-one correspondence between their outputs: each e-matching output (σ, c) corresponds to one output tuple of the compiled query. The only difference is that returning the root e-class id is a special consideration in e-matching, whereas in the conjunctive query the root is just another variable.
The Compile function (specifically the Aux subroutine) relies on the fact that the database contains only canonical e-class ids. Without this fact, nested patterns would require an additional join on the equivalence relation over e-class ids. But since two canonical e-class ids are equivalent exactly when they are equal, we can omit the additional join and instead join nested patterns directly on the auxiliary variable.
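A sketch of this compilation in Python, assuming patterns are encoded as ("var", name) or (symbol, [child patterns]) and auxiliary variables are generated fresh (the encoding and all names are ours, not Figure 8's exact notation):

```python
import itertools

def compile_pattern(pat):
    counter = itertools.count()
    atoms, pvars = [], []
    def aux(p):
        if p[0] == "var":
            if p[1] not in pvars:
                pvars.append(p[1])       # collect pattern variables for the head
            return p[1]
        sym, args = p
        kids = [aux(a) for a in args]    # unnest children first (bottom-up)
        v = f"_aux{next(counter)}"       # fresh auxiliary variable for this e-class
        atoms.append((sym, (v, *kids)))  # atom R_sym(v, child variables...)
        return v
    root = aux(pat)
    return (root, pvars), atoms          # head: root variable plus pattern variables
```

For f(x, g(x)), encoded as ("f", [("var", "x"), ("g", [("var", "x")])]), this yields body atoms for g(_aux0, x) and f(_aux1, x, _aux0) with head (_aux1, x), i.e., query (1) up to variable names.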
Using this algorithm, the example pattern f(x, g(x)) is compiled to the following conjunctive query:
Q(root, x) ← R_f(root, x, v), R_g(v, x)   (1)
Compared to the original e-matching pattern, this flattened representation enables relational e-matching to use both the structural and the equality constraints. For example, a reasonable query plan that a database optimizer can synthesize is a hash join on both join variables (i.e., x and v), which runs in time linear in the size of the relations. In contrast, backtracking-based e-matching can take time quadratic in the size of the e-graph.
Figure 6 shows traces of running direct backtracking search on the e-graph and running a hash join on the relational representation. Every term enumerated by the hash join simultaneously satisfies all of the constraints. Conceptually, backtracking-based e-matching can be seen as a hash join that builds and probes on only the auxiliary variable v and then filters the outputs using the equality predicate on x. In other words, existing e-matching algorithms consider all structurally matching terms regardless of whether the two occurrences of x agree, while the generated conjunctive query gives the query optimizer the freedom to synthesize plans that consider only tuples satisfying the equality constraint.
3.3. Answering CQs with Generic Join
Finally, we consider the problem of efficiently answering the compiled conjunctive queries. We propose to use the generic join algorithm. Although traditional query plans, which are based on two-way joins such as hash join and sort-merge join, are extensively used in industrial relational database engines, they can be suboptimal on certain queries compiled from patterns: some patterns, like the classic triangle query, compile to cyclic conjunctive queries (Section 2.3). We call e-matching patterns that generate cyclic conjunctive queries cyclic patterns. For such cyclic queries, wcoj show there exist databases on which any two-way join plan is suboptimal. In contrast, generic join is guaranteed to run in time linear in the worst-case output size. Moreover, generic join has performance comparable to two-way join plans on acyclic queries. These properties make generic join an ideal solver for the conjunctive queries generated from e-matching patterns.
Using the generic join algorithm, suppose we fix a variable ordering on the generated conjunctive query (1). The algorithm below shows generic join instantiated on this particular CQ:
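A concrete sketch of this instantiation (assuming query (1) has the shape Q(root, x) ← R_f(root, x, v), R_g(v, x) and the ordering x, v, root; relations are sets of tuples and the trie encoding is ours):

```python
def match_f_g(Rf, Rg):
    # Tries per the ordering x, v, root:
    #   R_f(root, x, v) keyed x -> v -> set of roots;  R_g(v, x) keyed x -> set of v
    fx, gx = {}, {}
    for root, x, v in Rf:
        fx.setdefault(x, {}).setdefault(v, set()).add(root)
    for v, x in Rg:
        gx.setdefault(x, set()).add(v)
    out = []
    for x in fx.keys() & gx.keys():      # candidates for x from both relations
        for v in fx[x].keys() & gx[x]:   # v must head a g-e-node whose child is x
            for root in fx[x][v]:        # every surviving root is a match
                out.append((root, x))
    return out
```

As with the triangle query, the nested loops intersect candidate sets level by level, so structural and equality constraints are enforced simultaneously.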
3.4. Complexity of Relational E-Matching
Generic join guarantees worst-case optimality with respect to the output size, and relational e-matching preserves this optimality. In particular, we have the following theorem:
Theorem 1.
Relational e-matching is worst-case optimal; that is, for a fixed pattern, relational e-matching on an e-graph with n e-nodes runs in time linear in the worst-case number of substitutions yielded over all e-graphs with n e-nodes.
Proof.
Notice that there is a one-to-one correspondence between the output tuples of the generated conjunctive query and the matches of the e-matching pattern. Therefore, the worst-case output bound is the same for an e-matching pattern and the conjunctive query it generates. Because generic join is worst-case optimal, relational e-matching also runs in worst-case optimal time with respect to the output size. ∎
The structure of e-matching patterns allows us to derive an additional bound dependent on the actual output size rather than the worst-case output size.
Theorem 2.
Fix an e-graph with n e-nodes that compiles to a database D, and fix a pattern that compiles to a conjunctive query Q. Relational e-matching on Q runs in time bounded in terms of n and the actual output size of Q, as derived below.
Proof.
Let I be the set of isolated variables, those that occur in only one atom of Q. Note that I is small: it is precisely the pattern variables and the root, since auxiliary variables must occur in at least two atoms. Using these, define a new relation A = π_I(Q(D)), the output of Q projected onto the isolated variables, and a new query Q′ that extends Q with the atom A over the variables in I.
Since the new atom covers exactly the isolated variables, Q is the same query as Q′ but with zero or more variables projected out. Therefore, every tuple in Q′(D) corresponds to one in the output Q(D), so the two outputs are in one-to-one correspondence and have the same size.
Now we can compute the AGM bound for Q′. The new atom includes all the variables that appear in only one atom of Q. Therefore, every variable in Q′ occurs in at least two atoms, so assigning weight 1/2 to each edge is a fractional edge cover. The AGM bound of Q′ is thus the product of the square roots of the relation sizes: each relation of Q has size at most n, and the new atom A has size at most |Q(D)|.
Let T(Q, D) denote the running time of generic join with query Q on database D. We know that T(Q′, D) is bounded by the AGM bound of Q′. Because the outputs of Q and Q′ correspond one-to-one, this bound also bounds the output of Q. It remains to show that T(Q, D) ≤ T(Q′, D).
The query Q′ is just Q with an additional atom that covers the variables that appeared in only one atom of Q. Fix a variable ordering for generic join that puts the variables in I at the end. The loops of both instantiations are then the same, except that, for Q′, each loop corresponding to a variable in I performs an additional intersection with the new atom. But these intersections occur in the innermost loops, at which point all intersections with the atoms of Q have already been performed. So the intersections with the new atom do nothing, since A is precisely the output projected down to the variables in I! Since those intersections are not helpful and Q simply does not perform them, T(Q, D) ≤ T(Q′, D).
Putting the inequalities together yields the claimed bound. ∎
Example 3 (Complexity of relational e-matching).
Consider the pattern f(x, g(x)), which compiles to the query Q(root, x) ← R_f(root, x, v), R_g(v, x). Following the proof, the only isolated variable is root, and we extend the query with an atom A over it. The AGM bound for the extended query is the square root of |R_f| × |R_g| × |A|, which is at most n times the square root of the output size. This also bounds the run time of generic join on Q.
The above bound is tight for linear patterns, in which case each variable occurs exactly twice in Q′. For nonlinear patterns, we may find tighter covers than assigning the same weight to each atom, thereby improving the bound.
3.5. Supporting Multipatterns
Multipatterns are an extension to e-matching used in both SMT solvers (efficientematching) and program optimizers (tensat). A multipattern is a list of patterns that are to be simultaneously matched; i.e., the instantiation of each contained pattern must use the same substitution σ. For example, e-matching a multipattern of two applications that share a variable in their first argument searches for pairs of applications whose first arguments are equivalent. Efficient support for multipatterns on top of backtracking search requires complicated additions to state-of-the-art e-matching algorithms (efficientematching). Relational e-matching supports multipatterns “for free”: a multipattern is compiled to a single conjunctive query just like a single pattern, with the bodies of the sub-patterns’ queries concatenated and shared pattern variables joining them. This is one example of the wide applicability of the relational model adopted by relational e-matching.
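A standalone sketch of multipattern compilation, assuming patterns are encoded as ("var", name) or (symbol, [child patterns]) (the encoding and names are ours): each sub-pattern is unnested into the same atom list, so shared pattern variables join the sub-queries with no extra machinery.

```python
import itertools

def compile_multipattern(pats):
    counter = itertools.count()
    atoms, roots = [], []
    def aux(p):
        if p[0] == "var":
            return p[1]                  # shared pattern variables keep their names
        sym, args = p
        kids = [aux(a) for a in args]
        v = f"_aux{next(counter)}"
        atoms.append((sym, (v, *kids)))  # atom R_sym(class id, children...)
        return v
    for p in pats:
        roots.append(aux(p))             # one root variable per sub-pattern
    return roots, atoms                  # one conjunctive body for the whole multipattern
```

Because the variable x below appears in both sub-patterns' atoms, the single compiled query matches the two patterns simultaneously.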
4. Optimizations
Our implementation of relational e-matching using generic join is simple (under 500 lines of code), but several optimizations remain important for practical performance.
4.1. Degenerate Patterns
Not all patterns correspond to conjunctive queries that involve relational joins. Nonnested patterns (whether linear or nonlinear) produce relational queries without any joins.
The corresponding query plan is simply a scan of a single relation with possible filtering. For these queries, generic join (or any other join plan) offers no benefit, and building the indices for generic join incurs unnecessary overhead. A relational ematching implementation (or any other kind) should have a “fast path” for these relatively common types of queries that simply scans the egraph/database for matching enodes/tuples. For this reason, we exclude these kinds of patterns from our evaluation in Section 5.
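A minimal sketch of such a fast path, under the assumption that a function symbol’s relation is stored as rows of (eclass, child eclasses), is a single scan with a consistency filter for repeated variables:

```rust
/// A row of the relation for one function symbol: (eclass, child eclasses).
pub type Row = (u32, Vec<u32>);

/// Fast path for a non-nested pattern like (f x y) or (f x x): one scan over
/// the symbol's relation. `pattern_vars` names the child positions; positions
/// that share a variable name must hold equal eclasses (the nonlinear case).
pub fn scan_match(rel: &[Row], pattern_vars: &[&str]) -> Vec<Row> {
    rel.iter()
        .filter(|(_, children)| {
            // repeated variable names must be bound consistently
            for i in 0..pattern_vars.len() {
                for j in (i + 1)..pattern_vars.len() {
                    if pattern_vars[i] == pattern_vars[j] && children[i] != children[j] {
                        return false;
                    }
                }
            }
            true
        })
        .cloned()
        .collect()
}
```

For a linear pattern the filter never rejects anything and the fast path degenerates to a plain scan.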
4.2. Variable Ordering
Different variable orderings may result in dramatically different performance for generic join (evalwcoj; emptyheaded), so choosing a good variable ordering is important. Compared to join planning for binary joins, query planning for generic join is much less studied. In relational ematching we choose a variable ordering using two simple heuristics. First, we prioritize variables that occur in many relations, because the intersection of many relations is likely to be small. Second, we prioritize variables that occur in small relations, because intersections involving a small relation are also likely to be small. Performing smaller intersections first prunes the search space early.
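These two heuristics can be sketched as a simple scoring function; the names and input encoding below are illustrative assumptions, not our implementation:

```rust
use std::collections::HashMap;

/// Order query variables by the two heuristics in the text: variables that
/// occur in more atoms come first; ties are broken by preferring variables
/// whose smallest containing relation is smaller (then by name, for
/// determinism). Each atom is given as (relation size, variables mentioned).
pub fn order_vars(atoms: &[(usize, Vec<&str>)]) -> Vec<String> {
    // var -> (number of atoms it occurs in, size of smallest such relation)
    let mut stats: HashMap<&str, (usize, usize)> = HashMap::new();
    for (size, vars) in atoms {
        for v in vars {
            let e = stats.entry(*v).or_insert((0, usize::MAX));
            e.0 += 1;
            e.1 = e.1.min(*size);
        }
    }
    let mut ranked: Vec<(&str, (usize, usize))> = stats.into_iter().collect();
    // more occurrences first; then smaller minimum relation; then name
    ranked.sort_by(|&(an, (ac, asz)), &(bn, (bc, bsz))| {
        bc.cmp(&ac).then(asz.cmp(&bsz)).then(an.cmp(bn))
    });
    ranked.into_iter().map(|(v, _)| v.to_string()).collect()
}
```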
Using these two heuristics, the optimizer is able to find more efficient query plans than the topdown search of backtrackingbased ematching. This is true even for linear patterns, where relational ematching has no more information than ematching, but does have more flexibility. Consider the linear pattern compiled to the query . When there are very few application enodes in the egraph, will be small. The variable ordering takes advantage of this by intersecting first, resulting in an intersection no larger than . This “bottomup” traversal is not possible in conventional ematching.
4.3. Functional Dependencies
Functional dependencies describe dependencies between columns. For example, a functional dependency on relation of the form indicates that, for each tuple of , the values of , , and uniquely determine the value . Functional dependencies are ubiquitous in relational ematching. In fact, every transformed schema of the form has a functional dependency from to eclass. When the variable graph formed by functional dependencies is acyclic ^{5}^{5}5Cyclicity of functional dependencies is unrelated to cyclicity of the query., we can speed up generic join by ordering the variables to follow the topological order of the functional dependency graph. Every conjunctive query compiled from an ematching pattern has acyclic functional dependencies, because each dependency goes from an enode’s children to the enode’s parent eclass. Relational ematching can therefore always choose a variable ordering that is consistent with the functional dependencies. Our implementation tries to respect functional dependencies, but prioritizes larger intersections more.
As an example, consider conjunctive query 2 again. It is synthesized from pattern and, assuming each relation has size , has an AGM bound of . Suppose however that we pick the variable ordering to be . For every possible value of , there is at most one possible value for and root by functional dependency, which can be determined immediately. This reduces the run time from to .
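A variable ordering consistent with the functional dependencies can be computed with Kahn’s algorithm over the variable dependency graph, which is always acyclic as argued above. The sketch below is a hypothetical helper, not egg’s code; each atom contributes one (child variables, eclass variable) dependency:

```rust
use std::collections::HashMap;

/// Order variables so that an enode's child variables come before its eclass
/// variable; the eclass value is then fixed by the functional dependency
/// f(children) -> eclass and needs no real intersection.
pub fn fd_order(vars: &[&str], fds: &[(Vec<&str>, &str)]) -> Vec<String> {
    // indegree of each variable: one unit per child slot that determines it
    let mut indeg: HashMap<&str, usize> = vars.iter().map(|v| (*v, 0)).collect();
    for (children, out) in fds {
        *indeg.get_mut(out).unwrap() += children.len();
    }
    let mut order = Vec::new();
    while order.len() < vars.len() {
        // emit the first not-yet-emitted variable with no pending dependencies
        let v = *vars
            .iter()
            .find(|v| indeg.get(**v) == Some(&0))
            .expect("FD graph is acyclic");
        indeg.remove(v);
        order.push(v.to_string());
        for (children, out) in fds {
            let uses = children.iter().filter(|c| **c == v).count();
            if uses > 0 {
                if let Some(d) = indeg.get_mut(out) {
                    *d -= uses;
                }
            }
        }
    }
    order
}
```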
4.4. Batching
Generic join always processes one variable at a time, even if multiple consecutive variables are from the same atom. We find this strategy to be inefficient in practice, as it results in deeper recursion that does little useful work.
Consider the query . The right variable ordering places at the front, since it is the only variable requiring an intersection. We observe that variables that appear in only one atom can be “batched” with other variables that appear only in that same atom. Batched variables are treated as a single variable in the trie and in intersections. So instead of the variable ordering , we can use . This lowers the recursion depth of generic join (from 5 to 3) and improves data locality by reducing pointer chasing.
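The effect of batching on the index can be illustrated with a flat, tuple-keyed map: the batched variables are enumerated with one lookup instead of two nested trie levels. The encoding below (a hypothetical relation g(a, b, c) where b and c occur only in g’s atom) is an assumption for illustration:

```rust
use std::collections::HashMap;

/// With batching, variables (b, c) that occur only in g's atom become one
/// composite trie level: a single map lookup replaces two nested levels.
pub struct BatchedIndex {
    /// key: the join variable a; value: all (b, c) pairs for that a.
    map: HashMap<u32, Vec<(u32, u32)>>,
}

impl BatchedIndex {
    /// Build from rows (a, b, c) of the hypothetical relation g(a, b, c).
    pub fn build(rows: &[(u32, u32, u32)]) -> Self {
        let mut map: HashMap<u32, Vec<(u32, u32)>> = HashMap::new();
        for &(a, b, c) in rows {
            map.entry(a).or_default().push((b, c));
        }
        BatchedIndex { map }
    }

    /// Enumerate the batched variables in one step, given a binding for a.
    pub fn get(&self, a: u32) -> &[(u32, u32)] {
        self.map.get(&a).map(|v| v.as_slice()).unwrap_or(&[])
    }
}
```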
5. Evaluation
To empirically evaluate relational ematching, we implemented it inside the egg equality saturation toolkit (egg). Our implementation consists of about 80 lines of Rust inside egg itself to convert patterns into conjunctive queries, paired with a separate, egraphagnostic Rust library that implements generic join in fewer than lines.
egg’s existing ematching infrastructure is also about lines of Rust, and it is interconnected with various other parts of egg. Qualitatively, we claim that the relational approach is simpler to implement, especially since the CQ solver is completely modular. We could plug in a different generic join implementation ^{6}^{6}6 There is no reusable generic join implementation at the time of writing., or even a more conventional binary join implementation.
In this section, we refer to egg’s existing ematching implementation as “EM” and our relational approach as “GJ.”
5.1. Benchmarking setup
We use egg’s two largest benchmark suites as the basis for our two benchmark suites. The math suite implements a simple computer algebra system, including limited support for symbolic differentiation and integration. The lambda suite implements a partial evaluator for the lambda calculus. Each egg benchmark suite provides a set of rewrite rules, each with a left and right side pattern, and a set of starting terms.
To construct the egraphs used in our benchmarks we ran equality saturation on a set of terms selected from egg’s test suite, stopping before the egraph reached 1e5, 1e6, 2e6, and 3e6 enodes. The result is four increasingly large egraphs for each benchmark suite filled with terms that are generated by the suite’s rewrite rules. For each benchmark suite and each of the four egraph sizes, we then ran ematching on the egraph using both EM and GJ. We ran each approach 10 times and took the minimum run time.
For our GJ approach, we ran each trial twice. The first time builds the index tries necessary for generic join justintime, and the run time includes that. On the second trial, GJ uses the prebuilt index tries from the first run, so the time to build them is excluded. In both Figure 9 and Table 1, orange bars/rows show the first runs (including indexing), and blue bars/rows show the second runs (excluding indexing).
All benchmarks are singlethreaded, and they were executed on a 4.6GHz CPU with 32GB of memory.
Idx  Suite  EG Size  GJ  EM  Total  HMean  GMean  Best  Medn  Worst 
+  lambda  4,142  15  3  1.69  .84  1.71  13.62  1.60  .12 
–  lambda  4,142  18  0  2.58  2.99  4.23  39.17  3.68  1.10 
+  lambda  57,454  16  2  2.60  .95  2.66  136.54  2.65  .12 
–  lambda  57,454  18  0  2.87  3.33  9.11  406.70  4.05  1.03 
+  lambda  109,493  15  3  1.66  1.75  3.11  148.96  2.03  .65 
–  lambda  109,493  18  0  1.70  3.32  7.46  291.18  4.10  1.05 
+  lambda  213,345  15  3  2.20  1.55  3.40  304.33  1.72  .43 
–  lambda  213,345  18  0  2.21  2.96  8.23  501.12  5.04  1.04 
Idx  Suite  EG Size  GJ  EM  Total  HMean  GMean  Best  Medn  Worst 
+  math  8,205  30  2  5.49  0.64  4.61  66.54  2.79  0.03 
–  math  8,205  30  2  5.21  2.93  8.62  1,630.00  5.48  0.62 
+  math  53,286  29  3  311.23  2.61  13.50  50,030.29  3.62  0.74 
–  math  53,286  30  2  318.95  3.39  29.60  1,325,802.56  30.72  0.74 
+  math  132,080  29  3  96.55  2.66  15.18  61,488.73  4.02  0.60 
–  math  132,080  30  2  97.84  3.46  34.16  2,447,939.38  68.71  0.75 
+  math  217,396  30  2  119.82  2.83  18.34  101,023.37  3.91  0.72 
–  math  217,396  31  1  119.73  3.45  41.35  8,575,830.58  80.84  0.76 
Table 1. Summary statistics of speedup ratios for each pattern: harmonic mean, geometric mean, max, median, and min.
5.2. Results
Figure 9 shows the results of our benchmarking experiments. GJ can be over 6 orders of magnitude faster than traditional ematching on complex patterns. Speedup tends to be greater when the output size is smaller, and when the pattern is larger and nonlinear. A large output indicates the egraph is densely populated with terms matching the given pattern, so backtracking search wastes little time on unmatched terms, and relational ematching contributes little or no speedup. Large and complex (nonlinear) patterns require careful query planning to be processed efficiently. For example, the pattern experiencing the largest speedup in Figure 9 is 4 enodes deep with 4 occurrences of the variable . Relational ematching using generic join can devise a variable ordering that puts smaller relations with fewer children on the outer loop, thereby pruning a large search space early. In contrast, backtracking search must traverse the egraph topdown.
In some cases, index building takes a significant proportion of the run time in relational ematching, sometimes offsetting the gains. Overall, relational ematching remains competitive even with the index building overhead. In Section 6.3 we discuss potential remedies to alleviate this overhead.
Table 1 shows summary statistics across all patterns for each benchmark configuration. Notably, GJ is faster across patterns in every benchmark configuration (the “Total” column). Much of the total benchmarking run time is dominated by simple, linear patterns (e.g., (+ (+ a b) c)) that return many results. Terms matching such patterns come to dominate the egraph over time, due to the explosive expansion of associativity and commutativity. As a result, the total speedup does not necessarily increase as the egraph grows, whereas the best speedup and the various average statistics steadily increase.
In summary, relational ematching is almost always faster than backtracking search, and is especially effective in speeding up complex patterns.
6. Discussion
The relational model for ematching is not only simple and fast, but it opens the door to a wide range of future work to further improve egraphs and ematching.
6.1. Pushdown optimization
An ematching pattern may have an additional filtering condition associated with it. For example, a rewrite rule with left hand side (* (/ x y) y) may additionally require that . When the variables involved in the conditions all occur in a single relation (e.g., and ), that relation can be effectively filtered even before being joined (e.g., using predicate ), which can immediately prune a large search space. We call this pushdown optimization; it can be considered the egraph version of the relational query optimization that pushes filter operations down to the bottom of the join tree. Note that the ability to do pushdown optimization stems from relational ematching’s ability to consider the constraints in any order; backtracking ematching could not support this technique. We currently do not implement this, because it requires breaking changes to egg’s interface.
Conditions that involve multiple variables can be “pushed down” as well: in generic join, the filter can run as soon as all of its variables have been bound in the variable ordering. An implementation that supports such conditional filters should therefore take them into account when determining the variable ordering.
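The basic pushdown step is just a scan-and-filter applied before any join runs. The sketch below assumes rows of (eclass, x, y) for the relation being filtered; the concrete side condition used in the test (second child nonzero) is an illustrative assumption:

```rust
/// Pushdown sketch: restrict one relation with a pattern's side condition
/// before joining, so later intersections see only the surviving tuples.
pub fn pushdown_filter<F>(rel: &[(u32, u32, u32)], pred: F) -> Vec<(u32, u32, u32)>
where
    F: Fn(u32, u32) -> bool, // predicate over the child columns (x, y)
{
    rel.iter().copied().filter(|&(_, x, y)| pred(x, y)).collect()
}
```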
6.2. Join Algorithms
Research in databases has produced a myriad of join algorithms. For example, stateoftheart database systems implement twoway joins like hash join and sort-merge join. These have a longer history than generic join and benefit from various constant factor optimizations. Extensive research has focused on generating highly efficient query plans using twoway joins. On the other hand, Yannakakis’ algorithm (yannakakis) is proven optimal on the class of full acyclic queries, running in time linear in the total size of the input and the output. All linear patterns correspond to acyclic queries, but some nonlinear patterns correspond to cyclic ones. Recent research (mhedhbi19; freitag) has also experimented with combining traditional join algorithms with generic join, achieving good performance. In this paper we chose generic join for its simplicity; future work may consider other join algorithms for relational ematching.
6.3. Incremental Processing
We have focused on improving the core ematching algorithm in this paper, yet prior work has successfully sped up ematching by making it incremental (efficientematching). When the changes to the egraph are small and the results of ematching patterns are frequently queried, maintaining the alreadydiscovered matches becomes crucial for efficiency. From our relational perspective, incremental ematching is captured precisely by the classic problem of incremental view maintenance (IVM) (DBLP:conf/vldb/CeriW91; DBLP:conf/sigmod/SalemBCL00; DBLP:conf/sigmod/ZhugeGHW95) in databases. IVM aims to efficiently update an alreadycomputed query result upon changes to the database, without recomputing the query from scratch.
There is opportunity to improve relational ematching even without a fullfledged IVM solution. For example, we have shown in Figure 9 that index building can take up a significant portion of the run time. Our index implementation is based on hash tries, which are simple but difficult to update efficiently. We are experimenting with an alternative index design based on sort tries, in the hope that it makes updates as simple as inserting into a sorted array.
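A sorted-array index along these lines can be sketched as follows. This is an exploratory sketch, not our shipped index: tuples are kept sorted, so a point update is a binary-search insert, and all tuples sharing a key form a contiguous range found by two binary searches:

```rust
/// A minimal sorted-array index over (key, value) pairs.
pub struct SortedIndex {
    rows: Vec<(u32, u32)>, // kept sorted at all times
}

impl SortedIndex {
    pub fn new() -> Self {
        SortedIndex { rows: Vec::new() }
    }

    /// Insert via binary search, preserving sortedness.
    pub fn insert(&mut self, row: (u32, u32)) {
        let pos = self.rows.partition_point(|r| *r < row);
        self.rows.insert(pos, row);
    }

    /// All values for `key`: the contiguous run of rows with that key.
    pub fn get(&self, key: u32) -> Vec<u32> {
        let lo = self.rows.partition_point(|&(k, _)| k < key);
        let hi = self.rows.partition_point(|&(k, _)| k <= key);
        self.rows[lo..hi].iter().map(|&(_, v)| v).collect()
    }
}
```

A real trie would nest this structure per column, but the update story is the same: an insertion into a sorted array per level.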
6.4. Building on Existing Database Systems
Given our reduction from ematching to conjunctive query answering, one may wonder whether other egraph operations could also be reduced to relational operations, so that a fully functioning egraph engine could be implemented purely on top of an offtheshelf database system. There would be many benefits. For example, we would enjoy an industrialstrength query optimization and execution engine for free (although most industrial databases do not use worstcase optimal join algorithms) and eliminate the cost of transforming an egraph into a relational database. Moreover, this approach would inherit any properties of the host database system, including persistence, incremental maintenance, concurrency, and fault tolerance.
As a proof of concept, we implemented a prototype egraph on top of SQLite, an embedded relational database system, in 160 lines of Racket code. Egraph operations like insertion and merging are translated into highlevel SQL queries and executed by SQLite. This naïve prototype is not competitive with highly optimized implementations like egg, even with our relational ematching approach. However, with appropriate indices and query plans, it could achieve similar asymptotic performance. Specialized data structures for representing equivalence relations (nappa2019fast) could also help performance. Thus, not only ematching but also other egraph operations can be expressed as relational queries, which hints at the possibility of building realworld egraph engines on top of existing relational database systems.
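To illustrate what such a translation looks like, the hypothetical helper below emits the SQL join for a query’s atoms. The schema (one table per function symbol, with columns eclass, child0, child1, and so on) and all names are illustrative assumptions, not our prototype’s actual code; each variable shared between columns becomes an equality predicate in the WHERE clause:

```rust
/// Translate conjunctive-query atoms into a SQL self-join. Each atom is a
/// (table, [(column, variable)]) pair; columns bound to the same variable
/// are equated in the WHERE clause.
pub fn atoms_to_sql(atoms: &[(&str, Vec<(&str, &str)>)]) -> String {
    let mut from = Vec::new();
    let mut wheres = Vec::new();
    let mut seen: Vec<(&str, String)> = Vec::new(); // var -> first column bound to it
    for (i, (table, cols)) in atoms.iter().enumerate() {
        let alias = format!("t{}", i);
        from.push(format!("{} AS {}", table, alias));
        for &(col, var) in cols {
            let qualified = format!("{}.{}", alias, col);
            let first = seen.iter().find(|(v, _)| *v == var).map(|(_, q)| q.clone());
            match first {
                Some(f) => wheres.push(format!("{} = {}", f, qualified)),
                None => seen.push((var, qualified)),
            }
        }
    }
    let mut sql = format!("SELECT * FROM {}", from.join(", "));
    if !wheres.is_empty() {
        sql.push_str(" WHERE ");
        sql.push_str(&wheres.join(" AND "));
    }
    sql
}
```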
7. Conclusion
In this paper, we presented relational ematching, a novel ematching algorithm that is conceptually simpler, asymptotically faster, and worstcase optimal. We reduce ematching to conjunctive query answering, a wellstudied problem in database research. This relational formulation provides a unified way to exploit in query planning not only structural constraints, but also equality constraints, which traditional ematching algorithms cannot effectively leverage. We implement relational ematching with the worstcase optimal generic join algorithm, from which we derive the first data complexity result for ematching. We integrate our implementation into the stateoftheart equality saturation engine egg, and show relational ematching to be flexible (it readily supports multipatterns) and efficient (it achieves orders of magnitude speedup).