Biology Bdbms System
DB1_Gene

GID    | Ann_GID | GName | Ann_GName | GSequence    | Ann_GSequence
JW0080 |         | mraW  |           | ATGATGGAAAA… | A3
JW0082 | A1      | ftsI  | A1        | ATGAAAGCAGC… |
JW0055 | A1, A2  | yabP  | A1, A2    | ATGAAAGTATC… | A2
JW0078 | A2      | fruR  | A2        | GTGAAACTGGA… | A2

DB2_Gene

GID    | Ann_GID | GName | Ann_GName | GSequence    | Ann_GSequence
JW0080 | B1, B5  | mraW  | B1, B5    | ATGATGGAAAA… | B3, B5
JW0041 | B1      | fixB  | B1        | ATGAACACGTT… | B3
JW0037 | B1, B4  | caiB  | B1, B4    | ATGGATCATCT… | B3, B4
JW0055 |         | yabP  | B2        | ATGAAAGTATC… | B3

Figure 3: Simple annotation storage scheme: Every data column has a corresponding annotation column
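The scheme of Figure 3 can be tried out directly. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in DBMS, empty strings for cells without annotations, and comma-separated annotation lists; the helper names load, count_copies, and common_genes are hypothetical and not part of bdbms. It counts how often annotation A2 is physically repeated in DB1_Gene and shows that intersecting the two gene tables only finds the common genes when the annotation columns are left out.

```python
import sqlite3

COLUMNS = ["GID", "Ann_GID", "GName", "Ann_GName", "GSequence", "Ann_GSequence"]

# Rows transcribed from Figure 3; "" marks a cell without annotations (assumption).
DB1_GENE = [
    ("JW0080", "", "mraW", "", "ATGATGGAAAA…", "A3"),
    ("JW0082", "A1", "ftsI", "A1", "ATGAAAGCAGC…", ""),
    ("JW0055", "A1, A2", "yabP", "A1, A2", "ATGAAAGTATC…", "A2"),
    ("JW0078", "A2", "fruR", "A2", "GTGAAACTGGA…", "A2"),
]
DB2_GENE = [
    ("JW0080", "B1, B5", "mraW", "B1, B5", "ATGATGGAAAA…", "B3, B5"),
    ("JW0041", "B1", "fixB", "B1", "ATGAACACGTT…", "B3"),
    ("JW0037", "B1, B4", "caiB", "B1, B4", "ATGGATCATCT…", "B3, B4"),
    ("JW0055", "", "yabP", "B2", "ATGAAAGTATC…", "B3"),
]


def load(conn, table, rows):
    """Create `table` with one annotation column per data column and fill it."""
    cols = ", ".join(f"{c} TEXT" for c in COLUMNS)
    conn.execute(f"CREATE TABLE {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?, ?, ?)", rows)


def count_copies(conn, table, ann):
    """Count the annotation cells of `table` whose list contains `ann`."""
    total = 0
    query = f"SELECT Ann_GID, Ann_GName, Ann_GSequence FROM {table}"
    for row in conn.execute(query):
        total += sum(ann in cell.split(", ") for cell in row)
    return total


def common_genes(conn, columns):
    """Intersect DB1_Gene and DB2_Gene over the given columns."""
    cols = ", ".join(columns)
    rows = conn.execute(
        f"SELECT {cols} FROM DB1_Gene INTERSECT SELECT {cols} FROM DB2_Gene"
    ).fetchall()
    return sorted(rows)


conn = sqlite3.connect(":memory:")
load(conn, "DB1_Gene", DB1_GENE)
load(conn, "DB2_Gene", DB2_GENE)

a2_copies = count_copies(conn, "DB1_Gene", "A2")
data_only = common_genes(conn, ["GID", "GName", "GSequence"])
with_annotations = common_genes(conn, COLUMNS)

print("copies of A2:", a2_copies)
print("common genes (data columns only):", [row[0] for row in data_only])
print("common genes (all columns):", with_annotations)
```

Because the DBMS treats the annotation columns as ordinary data, two tuples that a user would consider the same gene no longer compare equal once their annotations differ.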
statement to update these columns by adding A2 to the desired annotation cells (Figure 3).

To support data annotation, the system has to provide new mechanisms for seamlessly adding annotations at various granularities. It is essential to provide new expressive commands as well as visualization tools that allow users to add their annotations graphically.

Storing annotations at multiple granularities: As in Figure 2, users may annotate a single cell, e.g., A3, a few cells, e.g., A1 and B2, entire rows, e.g., A2 and B4, or entire columns, e.g., B3. Multi-granularity annotations motivate the need for efficient storage and indexing schemes. Otherwise, storing and processing the annotations can be very expensive. For example, annotations A2 and B3 are repeated in the annotation columns 6 and 5 times, respectively. The need for such efficient schemes is especially important in the context of provenance, where a single provenance record can be attached to many tuples or even entire columns or tables.

Categorizing annotations: Although all annotations are metadata, they may have different importance, meaning, and credibility. For example, annotations that are added by a certain user or group of users can be more important than annotations added by the public or unknown users. Moreover, annotations that represent the lineage of the data have a different purpose and importance from the annotations that represent users' comments. For example, annotations A2 and B3 represent the lineage of some data, and users may be interested only in these annotations. As will be discussed later in the paper, the different types of annotations will also have an impact on the storage mechanism adopted for each type. This diversity in annotations motivates the need for separating or categorizing the annotations. bdbms provides a mechanism that allows users to categorize their annotations at the storage, query processing, and annotation propagation levels.

Archiving annotations: Users may need to archive or delete annotations as they become obsolete, old, or simply invalid. Archived annotations should not be propagated to users along with query answers. For example, annotation B5 in Figure 2 states that gene JW0080 has an unknown function. But if the function of this gene becomes known and gets added to the database, then B5 becomes invalid and users do not want to propagate this annotation along with query answers. Without a mechanism for archiving annotations, the archival operation may not be an easy task. For example, to archive annotation B5, the user needs to find out which tuples/cells in the database contain B5, and the contents of each of these cells must then be parsed to archive and then delete B5.

Propagating annotations: A key requirement in allowing annotation propagation is to simplify users' queries. This can only be achieved by providing database system support for annotation propagation; for example, by extending the query operators. Otherwise, users' queries may become complex and user-unfriendly. For example, consider a simple query that retrieves the genes that are common in DB1_Gene and DB2_Gene along with their annotations (Figure 3). To answer this query, the user has to write the following SELECT statements (a–c):

(a) R1(GID, GName, GSequence) =
    SELECT GID, GName, GSequence
    FROM DB1_Gene
    INTERSECT
    SELECT GID, GName, GSequence
    FROM DB2_Gene;

In Step (a), the user selects only the data columns from both gene tables, i.e., GID, GName, GSequence, and performs the intersection operation.

(b) R2(GID, GName, GSequence, Ann_GID, Ann_GName, Ann_GSequence) =
    SELECT R.GID, R.GName, R.GSequence,
           G.Ann_GID, G.Ann_GName, G.Ann_GSequence
    FROM R1 R, DB1_Gene G
    WHERE R.GID = G.GID;

In Step (b), the user joins the output from Step (a) back with Table DB1_Gene in order to retrieve the annotations from this table. Notice that we cannot select the annotation columns in Step (a): because the annotation values in the annotation attributes may differ between the two tables, the intersection operation would then not return any tuples.

(c) R3(GID, GName, GSequence, Ann_GID, Ann_GName, Ann_GSequence) =
    SELECT R.GID, R.GName, R.GSequence,
           R.Ann_GID + G.Ann_GID,
           R.Ann_GName + G.Ann_GName,
           R.Ann_GSequence + G.Ann_GSequence
    FROM R2 R, DB2_Gene G
    WHERE R.GID = G.GID;

In Step (c), a join is performed between R2 and DB2_Gene to consolidate the annotations from DB2_Gene with R2's annotations, where + is the annotation union operator.

The main reason for the complexity of querying and propagating the annotations is that users view annotations as metadata, whereas DBMSs view annotations as normal data. For example, from a user's viewpoint, the two tuples corresponding to genes JW0080 and JW0055 in Table DB1_Gene are identical to those in Table DB2_Gene (Figure 3); they only have different annotations. From the database viewpoint, however, these tuples are not identical because annotations are viewed as normal attribute data. As a result, users' queries may become complex in order to overcome the mismatch in interpreting the annotations.

In the following subsections, we introduce our initial investigations through bdbms to address the challenges and requirements highlighted above, along with some preliminary results.

3.1 Storing and Indexing Annotations

bdbms allows a user relation to have multiple annotation tables attached to it. For example, table DB1_Gene may have an annotation table that stores the provenance information and another annotation table that stores users' comments. To create an annotation table over a given user relation, the A-SQL command CREATE ANNOTATION TABLE (Figure 4) is used. CREATE ANNOTATION TABLE allows users to design and categorize their annotations at the storage level. This categorization will also facilitate annotation propagation (discussed in Section 3.4), where users may request propagating a certain type of annotations. To drop an annotation table, the DROP ANNOTATION TABLE command (Figure 4) is used.

    CREATE ANNOTATION TABLE <ann_table_name>
    ON <user_table_name>

    DROP ANNOTATION TABLE <ann_table_name>
    ON <user_table_name>

Figure 4: The A-SQL commands CREATE and DROP

To efficiently store the annotations, we are investigating several storage and indexing schemes. One possible direction is to consider a compact representation of annotations that would improve the system performance with respect to storage overhead, I/O cost to retrieve the annotations, and query processing time. For example, instead of storing the annotations at the cell level, we may store some of the annotations at coarser granularities. For instance, the annotations over Table DB2_Gene (Figure 2) can be represented as rectangles attached to groups of contiguous cells as illustrated in Figure 5, where DB2_Gene is viewed as a two-dimensional space, e.g., columns represent the X-axis and tuples represent the Y-axis. In this case, an annotation over any group of contiguous cells can be represented by a single annotation record. So, in general, an annotation over a subset of a table will map to multiple rectangular regions. Other annotation characteristics that may need to be taken into account include whether the annotation is linked to multiple data items in different tables or is linked to very few specific cells.

[Figure 5 labels: axes Columns, Tuples, and Time; annotation rectangles (B1, T1), (B2, T2), (B3, T3), (B4, T4), (B5, T5)]

Figure 5: Compact storage for annotations

3.2 Adding Annotations at Multiple Granularities

To add annotations using A-SQL, we propose the ADD ANNOTATION command (Figure 6(a)). The annotation_table_names clause specifies the annotation table(s) in which the added annotation will be stored. The annotation_body specifies the annotation value to be added. The output of the SQL_statement specifies the data to which the annotation is attached. Since annotations may contain important information that users want to query, we plan to support XML-formatted annotations. That is, annotation_body is an XML-formatted text. In this case, users can (semi-)structure their annotations and make use of XML querying capabilities over the annotations. The output of the SQL_statement can be at various granularities, e.g., entire tuples, columns, or groups of cells. For example, to add annotation B3 over the entire GSequence column in Table DB2_Gene (as illustrated in Figure 2), we execute the following ADD ANNOTATION command:

    ADD ANNOTATION
    TO DB2_Gene.GAnnotation
    VALUE '<Annotation>
           obtained from GenoBase
           </Annotation>'
    ON (SELECT G.GSequence
        FROM DB2_Gene G);
(a) ADD ANNOTATION
    TO <annotation_table_names>
    VALUE <annotation_body>
    ON <SQL_statement>

(b) ARCHIVE ANNOTATION
    FROM <annotation_table_names>
    [BETWEEN <time1> AND <time2>]
    ON <SELECT_statement>

(c) RESTORE ANNOTATION
    FROM <annotation_table_names>
    [BETWEEN <time1> AND <time2>]
    ON <SELECT_statement>

Figure 6: The A-SQL commands ADD, ARCHIVE, and RESTORE

    SELECT [DISTINCT] Ci [PROMOTE (Cj, Ck, …)], …
    FROM Relation_name [ANNOTATION(S1, S2, …)], …
    [WHERE <data_conditions>]
    [AWHERE <annotation_condition>]
    [GROUP BY <data_columns>
    [HAVING <data_condition>]
    [AHAVING <annotation_condition>] ]
    [FILTER <filter_annotation_condition>]

Figure 7: The A-SQL SELECT command
In this case, the annotation is attached to the entire GSequence column because no WHERE clause is specified. The annotation is stored in the annotation table GAnnotation. Notice that <Annotation> is the XML tag that encloses the annotation information.

Similarly, to annotate an entire tuple, e.g., annotation B5, we execute the following ADD ANNOTATION command:

    ADD ANNOTATION
    TO DB2_Gene.GAnnotation
    VALUE '<Annotation>
           This gene has an unknown function
           </Annotation>'
    ON (SELECT G.*
        FROM DB2_Gene G
        WHERE GID = 'JW0080');

In this case, the annotation is attached to the entire tuples returned by the query since all the attributes in the table are selected.

To allow users to link annotations to database operations, i.e., INSERT, UPDATE, or DELETE, the SQL_statement can be an INSERT, UPDATE, or DELETE statement. For example, instead of inserting a new tuple and then annotating it by issuing a separate ADD ANNOTATION command, users can insert and annotate the new tuple in one step by enclosing the insert statement inside the ADD ANNOTATION command. For the delete operation, the deleted tuples will be stored in separate log tables along with the annotation that specifies why these tuples have been deleted. Notice that the standard system recovery log cannot be used for this purpose, as users need the freedom to structure their annotation schemas the way they want, which system recovery logs do not support.

We plan to add a visualization tool to allow users to annotate their data in a transparent way. The visualization tool displays users' tables as grids or spreadsheets where users can select one or more cells to annotate. Oracle addresses the integration of database tables with Excel spreadsheets to make use of Excel's visualization and analysis power [2]. In bdbms, we plan to add this integration feature to facilitate adding and visualizing annotations.

3.3 Archiving and Restoring Annotations

Archival of annotations allows users to isolate old or invalid annotations from recent and valuable ones. In bdbms, we support archival of annotations instead of permanently deleting them because biological data usually has a degree of uncertainty and old values may turn out to be the correct values. Archiving annotations gives users the flexibility to restore the annotations if needed. Unlike other annotations, archived annotations are not propagated to users along with the query answers. However, if archived annotations are restored, then they will be propagated normally.

To archive and restore annotations, we introduce the ARCHIVE ANNOTATION (Figure 6(b)) and RESTORE ANNOTATION (Figure 6(c)) commands, respectively. The FROM clause specifies from which annotation table(s) the annotations will be archived/restored. The optional BETWEEN clause specifies a time range over which the annotations will be archived/restored. This time corresponds to the timestamp assigned to each annotation when it is first added to the database. The output from the SELECT statement specifies the data on which the annotations will be archived/restored. In addition, the output from the SELECT statement can be at multiple granularities, as explained for the ADD ANNOTATION command.

3.4 Annotation Propagation and Annotation-based Querying

To support the propagation of annotations and querying of the data based on their annotations, we introduce the A-SQL command SELECT, given in Figure 7. A-SQL SELECT extends the standard SELECT by introducing new operators and extending the semantics of the standard operators. We introduce the new operators ANNOTATION, PROMOTE, AWHERE, AHAVING, and FILTER.

The ANNOTATION operator allows users to specify which annotation table(s) to consider in the query. Using the ANNOTATION operator, users can propagate their annotations transparently. That is, users do not have to know how or where annotations are stored. Instead, users only specify which annotations are of interest.

The PROMOTE operator allows users to copy annotations from one or more columns, possibly not in the projection list, to a projected column. For example, if column GID is projected from Table DB1_Gene, then annotation A3 will not be propagated unless the annotations over GSequence are copied to GID.

The AWHERE and AHAVING clauses are analogous to the standard WHERE and HAVING clauses except that the conditions of AWHERE and AHAVING are applied over the annotations. That is, AWHERE and AHAVING pass a tuple along with all its annotations only if the tuple's annotations satisfy the given AWHERE and AHAVING conditions. On the other hand, the FILTER clause passes all the data tuples of the input relation (keeps user's data intact) but it
filters the annotations attached to each tuple. That is, any annotation that does not satisfy filter_annotation_condition is dropped.

[Figure 8 labels: Source S2; Source S3; Program P1; Local insertion; operations copy, update, and overwrite; questions "What is the source of this value at time T?" and "Where do these values come from?"]

Figure 8: Data provenance at multiple granularities

[Figure 9 panel contents: (a) Table Gene with tuple (JW0055, yabP, ATGAAAGTATC…) and Table Protein with tuple (yabP, JW0055, MKVSVPGM…, Hypothetical protein), linked by "Prediction tool P" and "Lab experiment"; (b) Table GeneMatching, shown below, linked to "Similarity matching procedure" BLAST-2.2.15]

Gene1      | Gene2      | Evalue
ATCCCGGTT… | ATCCTGGTT… | 3e-20
TTTGCCGGA… | TAAACCGGC… | 1e-102
ATTTCCCAC… | TTAAGCCCG… | 2e-04
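The difference between AWHERE and FILTER can be sketched in a few lines of Python. This is only an illustration of the described semantics, not the bdbms implementation. It assumes that each tuple carries the set of all annotations attached to any of its cells (taken from Figure 3 for DB2_Gene), that an annotation condition is a predicate over a single annotation, and that a tuple satisfies AWHERE when at least one of its annotations satisfies the condition; the function names awhere and filter_annotations are hypothetical.

```python
# DB2_Gene tuples paired with the annotations attached to them (from Figure 3).
DB2_GENE = [
    ({"GID": "JW0080", "GName": "mraW"}, {"B1", "B3", "B5"}),
    ({"GID": "JW0041", "GName": "fixB"}, {"B1", "B3"}),
    ({"GID": "JW0037", "GName": "caiB"}, {"B1", "B3", "B4"}),
    ({"GID": "JW0055", "GName": "yabP"}, {"B2", "B3"}),
]


def awhere(tuples, condition):
    """Keep a tuple, with ALL its annotations, only if some annotation matches.

    Assumption: "the tuple's annotations satisfy the condition" is read as
    "at least one annotation satisfies it".
    """
    return [(data, anns) for data, anns in tuples
            if any(condition(ann) for ann in anns)]


def filter_annotations(tuples, condition):
    """Keep every tuple, but drop the annotations that do not match."""
    return [(data, {ann for ann in anns if condition(ann)})
            for data, anns in tuples]


def is_b5(ann):
    return ann == "B5"


kept = awhere(DB2_GENE, is_b5)
filtered = filter_annotations(DB2_GENE, is_b5)

print([data["GID"] for data, _ in kept])
print([(data["GID"], sorted(anns)) for data, anns in filtered])
```

AWHERE changes which tuples appear in the answer, while FILTER leaves the data untouched and only narrows the annotations that travel with it.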
The standard operators, e.g., projection, selection, and duplicate elimination, are also extended to process the annotations attached to the tuples. For example, the projection operator selects some user attributes from the input relation and passes only the annotations attached to those attributes. For example, projecting column GID from Table DB2_Gene (Figure 2) results in reporting GID data along with annotations B1, B4, and B5 only. The selection operators in WHERE and HAVING select tuples from the input relation based on conditions applied over the data values. The selected tuples are passed along with all their annotations. For example, selecting the gene with GID = JW0080 from Table DB2_Gene results in reporting the first tuple in DB2_Gene along with annotations B1, B3, and B5. Operators that group or combine multiple tuples into one tuple, e.g., duplicate elimination, group by, union, intersect, and difference, are also extended to handle the annotations attached to the tuples. These operators union the annotations over the grouped or combined tuples and attach them to the output tuple that represents the group.

Defining the above commands and operators is only the first step in supporting annotations and other features within bdbms. We also need to define, for each A-SQL operator, its algebraic definition, cost estimate function, and algebraic properties that can be used by the query optimizer to generate efficient query plans.

4. PROVENANCE MANAGEMENT

Biologists commonly interact and exchange data with each other. Tracking the provenance (lineage) of data is very important in assessing the value and credibility of the data. Similar to annotations, data provenance can be attached to the database at multiple granularities, i.e., at the table, column, or tuple level, or at any sub-grouping or subset of the data. Also, biological data can be queried by its provenance. For example (refer to Figure 8), one table may contain data from multiple sources, e.g., S1 and S2, or data that is locally inserted. Then, some values may be updated by a certain program, e.g., P1, and some columns may be overwritten by data from another source, e.g., S3. Then, users may be interested to know the source of some values at a certain moment in time.

In bdbms, we treat provenance data as a kind of annotation, so all the requirements and functionalities discussed in Section 3 are also applicable to provenance data. However, provenance data has special requirements and characteristics that need to be addressed, including:

• Structure of provenance data: Unlike annotations that can be free text, provenance data usually has a well-defined structure. For example, the names of the source database and the source table draw their values from a list of pre-defined values. Supporting XML-formatted annotations can be beneficial in structuring provenance data. For example, provenance data can follow a predefined XML schema that needs to be stored and enforced by the database system.

• Authorization over provenance data: End-users are usually not allowed to insert or update the provenance data. Provenance data needs to be automatically inserted and maintained by the system. For example, integration tools that copy the data from one database to another can be the only tools that insert the provenance information. End-users can only retrieve or propagate this information. Therefore, we need to provide an access control mechanism over the provenance data (and annotations in general) to restrict the annotation operations, e.g., addition, archival, and propagation, to certain users or programs as required.

Figure 9: Local dependency tracking

5. LOCAL DEPENDENCY TRACKING

Biological databases are full of dependencies and derivations among data items. In many cases, these dependencies and derivations cannot be automatically computed using coded functions, e.g., stored procedures or functions inside the database. Instead, they may involve prediction tools, lab experiments, or instruments to derive the data. Using integrity constraints and triggers to maintain the consistency of the data is limited to computable dependencies, i.e., dependencies that can be computed via coded functions. However, non-computable dependencies cannot be directly handled using integrity constraints and triggers. In Figure 9, we give an example of the dependencies that can be found in biological databases. In Figure 9(a), protein sequences are derived from the gene sequences using a prediction tool P, whereas the function of the protein is derived from the

depend on a specific procedure. We can also derive new rules. For example, based on rules (1) and (2) above, we can derive the following rule:

$$\text{Gene.GSequence} \xrightarrow[\text{(non-executable, non-invertible)}]{\text{Prediction tool P, lab experiment}} \text{Protein.PFunction} \qquad (4)$$