Annotation Basics

A numeric annotation specifies how many database entities must match a particular query element. Both vertices and edges may be annotated. (Subqueries are also annotated, as described in Chapter 6, Subqueries.) Numeric annotations can specify a range of values, which makes them far more flexible than the alternative of specifying exact structural matches.

There are three legal forms of numeric annotation: an exact count such as [2], an unbounded range such as [2..], and a bounded range such as [2..3].

We can use numeric annotations to restate the query with which we began this chapter, finding movies produced by exactly two studios:

Figure 4.3. Movies produced by exactly two studios [Annot_DB01_Q02.qg2.xml]


Figure 4.3 includes two numeric annotations, one on the vertex representing studio objects, and one on the adjacent edge. The [2] annotation on the studio vertex indicates that the query can only match subgraphs containing exactly two studio objects. Of course, all the other parts of the query must also be satisfied: those two studio objects must be linked to the same movie object by produced links. This annotation also serves to group the two studio-movie pairs in a single subgraph with one movie object and two linked studio objects, rather than returning the multiple subgraphs we saw in Chapter 2, Query Basics.

The [1..] edge annotation is included because we cannot assume that linked objects in a database are connected by only a single link. If a database contains multiple links between objects, then we usually want to group these links, in addition to grouping the objects, in the query results. Because we may not know how many links connect one object to another, we use the unbounded annotation [1..] on the edge. For now, we’ll note that this is usually the correct annotation for an adjacent edge and simply follow this convention in defining the next several queries. The section on “Understanding Multiple Annotations” later in this chapter provides a more complete explanation of this edge annotation.

An edge incident to an annotated vertex must itself be annotated, for the reason cited above. In addition, only one of any two adjacent vertices may be annotated, because annotating both can result in ambiguities in interpreting the query. Proximity enforces these requirements and will not execute queries with illegal annotations. See “Adjacency Requirements” later in this chapter for a more detailed explanation of the reasons behind these requirements.
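The two adjacency requirements can be sketched as a small validity check. This is only an illustration in Python, not Proximity's implementation: the QueryEdge class, the annotation_errors function, and the dictionary-based query representation are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class QueryEdge:
    """A query edge between two named vertices (hypothetical representation)."""
    source: str
    target: str
    annotation: Optional[str] = None  # e.g. "[1..]"; None means unannotated


def annotation_errors(vertex_annotations: Dict[str, Optional[str]],
                      edges: List[QueryEdge]) -> List[str]:
    """Return the adjacency-rule violations found in a query (empty if legal)."""
    errors = []
    for edge in edges:
        source_annotated = vertex_annotations.get(edge.source) is not None
        target_annotated = vertex_annotations.get(edge.target) is not None
        # Rule 1: two adjacent vertices may not both be annotated.
        if source_annotated and target_annotated:
            errors.append(f"{edge.source} and {edge.target} are both annotated")
        # Rule 2: an edge touching an annotated vertex must be annotated.
        if (source_annotated or target_annotated) and edge.annotation is None:
            errors.append(f"edge {edge.source}-{edge.target} needs an annotation")
    return errors


# The query from Figure 4.3: [2] on the studio vertex, [1..] on the edge.
legal = annotation_errors({"studio": "[2]", "movie": None},
                          [QueryEdge("studio", "movie", "[1..]")])
# The same query with the edge annotation left off.
missing_edge = annotation_errors({"studio": "[2]", "movie": None},
                                 [QueryEdge("studio", "movie")])
# Both adjacent vertices annotated.
both_vertices = annotation_errors({"studio": "[2]", "movie": "[2]"},
                                  [QueryEdge("studio", "movie", "[1..]")])
```

In this sketch the first query produces no errors, while each of the other two produces one.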

When executing an annotated query, the vertex annotation takes precedence over the edge annotation. That is, the query processor first satisfies requirements on the vertex and then checks to see if it can satisfy requirements on the corresponding edges.

To see how Proximity handles the query shown in Figure 4.3, consider the database fragment shown in Figure 4.4. This fragment contains information about studios that produced some recent Academy Award-winning pictures.

Figure 4.4. Database fragment [Annot_DB01.xml]


The above fragment includes four different movies: two produced by a single studio (Forrest Gump and Chicago), one produced by two studios (Titanic), and one produced by three studios (Shakespeare in Love). Executing the query shown in Figure 4.3 on this database fragment yields the matching subgraph shown in Figure 4.5.

Figure 4.5. Query results
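The effect of the [2] vertex annotation on this fragment can be approximated in a few lines of Python. This is a rough sketch of the grouping and counting behavior, not of how Proximity evaluates queries. The studio names are placeholders (the text above gives only the number of studios per movie), and group_by_movie and match_exact are hypothetical helper names.

```python
from collections import defaultdict

# (studio, movie) pairs standing in for the produced links in Figure 4.4.
# Studio names are placeholders, not data taken from the figure.
PRODUCED_LINKS = [
    ("studio_a", "Forrest Gump"),
    ("studio_b", "Chicago"),
    ("studio_c", "Titanic"),
    ("studio_d", "Titanic"),
    ("studio_e", "Shakespeare in Love"),
    ("studio_f", "Shakespeare in Love"),
    ("studio_g", "Shakespeare in Love"),
]


def group_by_movie(links):
    """Collect the studios linked to each movie into one group per movie."""
    grouped = defaultdict(list)
    for studio, movie in links:
        grouped[movie].append(studio)
    return dict(grouped)


def match_exact(links, count):
    """Keep only the movies whose group contains exactly `count` studios."""
    return {movie: studios
            for movie, studios in group_by_movie(links).items()
            if len(studios) == count}


two_studio_movies = match_exact(PRODUCED_LINKS, 2)
```

Here two_studio_movies holds a single entry, for Titanic, with both of its studios grouped together.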


Rather than returning two subgraphs, each with one movie and one studio, this query returns a single subgraph containing the same data that would have been spread across multiple matches had we omitted the annotations from the query. Because we used an exact annotation of [2] on the studio vertex, the query does not match subgraphs containing movies connected to a single studio or to more than two studios. If we instead want to find all the movies produced by two or more studios, we need to change the numeric annotation on the studio vertex to use the unbounded range [2..], as shown in Figure 4.6.

Figure 4.6. Movies produced by two or more studios [Annot_DB01_Q03.qg2.xml]


The results of executing this modified query on the data shown in Figure 4.4 are shown below:

Figure 4.7. Query results


This time, the unbounded annotation [2..] on the studio vertex matches both the subgraph containing the two studios that produced Titanic and the subgraph containing the three studios that produced Shakespeare in Love.
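The difference between [2] and [2..] comes down to the range of counts each one accepts. The Python sketch below parses an annotation string into a lower and upper bound and tests a count against it. It is illustrative only: parse_annotation and satisfies are hypothetical names, and the handling of a bounded form such as [2..3] assumes that form uses the same ".." separator as the unbounded one.

```python
import re
from typing import Optional, Tuple

# Matches "[n]", "[n..]" and (assumed notation) "[n..m]".
ANNOTATION = re.compile(r"^\[(\d+)(?:(\.\.)(\d+)?)?\]$")


def parse_annotation(text: str) -> Tuple[int, Optional[int]]:
    """Return (lower, upper); upper is None when the range is unbounded."""
    match = ANNOTATION.match(text)
    if match is None:
        raise ValueError(f"not a numeric annotation: {text!r}")
    lower = int(match.group(1))
    if match.group(2) is None:      # "[n]": exact count
        return lower, lower
    if match.group(3) is None:      # "[n..]": unbounded range
        return lower, None
    upper = int(match.group(3))     # "[n..m]": bounded range
    if upper < lower:
        raise ValueError(f"upper bound below lower bound: {text!r}")
    return lower, upper


def satisfies(count: int, annotation: str) -> bool:
    """True if `count` matching entities satisfy the annotation."""
    lower, upper = parse_annotation(annotation)
    return count >= lower and (upper is None or count <= upper)


# Number of studios per movie, as described for Figure 4.4.
STUDIO_COUNTS = {"Forrest Gump": 1, "Chicago": 1,
                 "Titanic": 2, "Shakespeare in Love": 3}

exactly_two = [m for m, n in STUDIO_COUNTS.items() if satisfies(n, "[2]")]
two_or_more = [m for m, n in STUDIO_COUNTS.items() if satisfies(n, "[2..]")]
```

With these counts, exactly_two contains only Titanic, while two_or_more contains both Titanic and Shakespeare in Love.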

A variation on this query structure forms one of the most common QGraph queries, the star query. Star queries find all database elements linked to a core object. Star queries are typically used to find subgraphs such as “all actors in a movie” or “all authors for a paper” (assuming the corresponding database contains the appropriate objects and links).

Figure 4.8. General star query


Star queries can use either directed or undirected edges.
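As a rough illustration of what a star query collects, the Python sketch below gathers every object linked to each core object. The star function, its directed flag, and the paper and author names are hypothetical. Treating "directed" as "follow only links that point at the core" is an assumption made for this example, as is the choice to keep core objects that have no linked objects at all.

```python
def star(links, cores, directed=False):
    """Return {core: set of objects linked to that core}.

    `links` is an iterable of (source, target) pairs. With directed=True,
    only links whose target is the core are followed (an assumption of this
    sketch); otherwise links are followed in either direction.
    """
    result = {core: set() for core in cores}
    for source, target in links:
        if target in result:
            result[target].add(source)
        if not directed and source in result:
            result[source].add(target)
    return result


# Placeholder data for an "all authors for a paper" query.
AUTHORED_LINKS = [
    ("author_1", "paper_1"),
    ("author_2", "paper_1"),
    ("author_3", "paper_2"),
]

authors_by_paper = star(AUTHORED_LINKS, ["paper_1", "paper_2"])
```

Each entry in authors_by_paper corresponds to one star-shaped subgraph: a core paper together with all of its linked authors.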

To create a star query for the movie and studio database, we need to determine which type of objects should serve as the core vertex for the query. Because this database links multiple studios to a single movie, we make the core vertex match movie objects.

Figure 4.9. Star query with movies as core objects [Annot_DB01_Star.qg2.xml]


The query above finds and returns a subgraph for each movie in the database. Each subgraph includes all the studios linked to that movie. The results of executing this query on the database fragment in Figure 4.4 are shown below:

Figure 4.10. Query results


Just as we can annotate vertices so that they match more than one object, we can also annotate edges so that they match more than one link. For example, the database fragment shown in Figure 4.11 contains information on several actors and the roles they played in the movie Angels in America.

Figure 4.11. Database fragment [Annot_DB02.xml]


The database fragment indicates that Al Pacino played a single role, Justin Kirk played two different roles, and Meryl Streep played four different roles in this movie.

It’s worth noting that the database fragment shown in Figure 4.11 uses a different schema to represent actors and roles than the one used in Figure 3.8. The example in Chapter 3, Conditions used multiple attribute values on a single link to indicate that an actor played multiple roles in a movie. The example in this chapter uses multiple links to represent the same kind of information. Proximity does not require any particular representational schema for a given dataset, although consistency within a dataset is important. You can determine the appropriate schema for your data.

A query that uses edge annotations to find actors playing multiple roles is shown in Figure 4.12.

Figure 4.12. Actors playing multiple roles [Annot_DB02_Q01.qg2.xml]


Here we include the annotation [2..] on the edge connecting the actor vertex to the movie vertex, indicating that the query matches actor-movie pairs connected by two or more role links. Annotated edges can stand alone; they do not require that any adjacent vertices be annotated. The existence condition on the role edge requires that matching edges have a Role attribute, but doesn’t place any requirements on the specific value of this attribute.

The results of executing this query on the database fragment shown in Figure 4.11 are shown below.

Figure 4.13. Query results


Just as vertex annotations group matching objects, the query’s edge annotation groups matching links into a single subgraph in the query’s results. Without the edge annotation, this query returns seven subgraphs—one for each unique actor-role-movie subgraph in the database.
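The same grouping idea applies to links. The Python sketch below counts role links per actor-movie pair and keeps the pairs joined by two or more links, which is the effect described for the [2..] edge annotation. It is a simplified illustration rather than Proximity's algorithm; the role names are placeholders, and group_role_links is a hypothetical helper.

```python
from collections import defaultdict

MOVIE = "Angels in America"

# One (actor, movie, role) tuple per role link in Figure 4.11.
# Role names are placeholders; only the number of roles comes from the text.
ROLE_LINKS = [
    ("Al Pacino", MOVIE, "role_1"),
    ("Justin Kirk", MOVIE, "role_1"),
    ("Justin Kirk", MOVIE, "role_2"),
    ("Meryl Streep", MOVIE, "role_1"),
    ("Meryl Streep", MOVIE, "role_2"),
    ("Meryl Streep", MOVIE, "role_3"),
    ("Meryl Streep", MOVIE, "role_4"),
]


def group_role_links(links):
    """Group the role links that connect the same actor-movie pair."""
    grouped = defaultdict(list)
    for actor, movie, role in links:
        grouped[(actor, movie)].append(role)
    return dict(grouped)


# Edge annotation [2..]: keep pairs connected by two or more role links.
multiple_roles = {pair: roles
                  for pair, roles in group_role_links(ROLE_LINKS).items()
                  if len(roles) >= 2}

# Without the annotation, every link is its own match.
ungrouped_matches = len(ROLE_LINKS)
```

The annotated version keeps two grouped matches, one for Justin Kirk and one for Meryl Streep, while the ungrouped count is seven.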

An annotation of [1] is not equivalent to no annotation. A [1] annotation requires that the query only match subgraphs that contain exactly one of the annotated entities. A query with no annotation will match each appropriate database entity regardless of number, although it will not group the matches into a single subgraph. This can be seen by comparing the results for the two queries below.

Figure 4.14. Annotated and unannotated queries


The query on the right includes an exact [1] annotation on the studio vertex (and the standard [1..] annotation on the incident edge to satisfy QGraph’s adjacency requirements for annotations). The query on the left has no annotations. Executing these queries on the database fragment shown in Figure 4.4 yields distinctly different results. Figure 4.15 shows the results from the unannotated query.

Figure 4.15. Query results


The query without annotations matches all the studio-movie pairs in the database. Studios are not grouped; each match forms a separate subgraph. Compare these results to those for the query containing the [1] vertex annotation.

Figure 4.16. Query results


The results of executing the annotated query include just two subgraphs, matching the two instances in the database where a movie is linked to exactly one studio.
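The contrast between the two queries can be reproduced with a short Python sketch over the same fragment. As before, this only mimics the counting and grouping behavior described in the text; the studio names are placeholders and the variable names are hypothetical.

```python
# Studios per movie for the fragment in Figure 4.4 (placeholder studio names).
STUDIOS_BY_MOVIE = {
    "Forrest Gump": ["studio_a"],
    "Chicago": ["studio_b"],
    "Titanic": ["studio_c", "studio_d"],
    "Shakespeare in Love": ["studio_e", "studio_f", "studio_g"],
}

# No annotation: every studio-movie pair is a separate, ungrouped match.
unannotated_matches = [(studio, movie)
                       for movie, studios in STUDIOS_BY_MOVIE.items()
                       for studio in studios]

# [1] annotation: only movies linked to exactly one studio match.
annotated_matches = {movie: studios
                     for movie, studios in STUDIOS_BY_MOVIE.items()
                     if len(studios) == 1}
```

The unannotated version yields seven studio-movie pairs, including those for Titanic and Shakespeare in Love, whereas the [1] version yields only the two single-studio movies.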