bdbms – A Database Management System for Biological Data

Mohamed Y. Eltabakh Mourad Ouzzani Walid G. Aref

Department of Computer Science
Purdue University
West Lafayette, IN
{meltabak,mourad,aref}@cs.purdue.edu

This publication is licensed under a Creative Commons Attribution 2.5 License; see http://creativecommons.org/licenses/by/2.5/ for further details.
CIDR 2007 Asilomar, CA USA

ABSTRACT
Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs with: (1) Annotation and provenance management including storage, indexing, manipulation, and querying of annotation and provenance as first class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of compressed biological data types. This paper presents the design of bdbms along with the techniques proposed to support these functionalities including an extension to SQL. We also outline some open issues in building bdbms.

1. INTRODUCTION
Biological databases are essential to biological experimentation and analysis. They are used at different stages of life science research to deposit raw data, store interpretations of experiments and results of analysis processes, and search for matching structures and sequences. As such, they represent the backbone of life sciences discoveries. However, current database technology has not kept pace with the proliferation and specific requirements of biological databases [25, 37]. In fact, the limited ability of database engines to furnish the needed functionalities to manage and process biological data properly has become a serious impediment to scientific progress.

In many cases, biologists tend to store their data in flat files or spreadsheets mainly because current database systems lack several functionalities that are needed by biological databases, e.g., efficient support for sequences, annotations, and provenance. Once the data resides outside a database system, it loses effective and efficient manageability. Consequently, many of the advantages and functionalities that database systems offer are nullified and bypassed. It is thus important to break this inefficient and ineffective cycle by empowering database engines to operate directly on the data from within its natural habitat; the database system.

Biological databases evolve in an environment with rapidly changing experimental technologies and semantics of the information content and also in a social context that lacks absolute authority to verify correctness of information. Furthermore, because the only authority is the scientific community itself, biological databases often require some form of community-based curation. These characteristics make it difficult, even using good design strategies, to completely foresee the kinds of additional information (termed annotations) that, over time, may become necessary to attach to data in the database.

In this paper, we propose bdbms, an extensible prototype database engine for supporting and processing biological databases. While there are several functionalities of interest, we focus on the following key features: (1) Annotation and provenance management, (2) Local dependency tracking, (3) Update authorization, and (4) Non-traditional and novel access methods. bdbms will make fundamental advances in the use of biological databases through new native and transparent support mechanisms at the DBMS level.

Annotations and provenance data are treated as first-class objects inside bdbms. bdbms provides a framework that allows adding annotations/provenance at multiple granularities, i.e., table, tuple, column, and cell levels, archiving and restoring annotations, and querying the data based on the annotation/provenance values. In bdbms, we introduce an extension to SQL, termed Annotation-SQL, or A-SQL for short, to support the processing and querying of annotation and provenance information. A-SQL allows annotations and provenance data to be seamlessly propagated with query answers with minimal user programming.

In bdbms, we propose a systematic approach for tracking dependencies among database items. As a result, when a database item is modified, bdbms can track and mark any
other item that is affected by this modification and that needs to be re-verified. This feature is particularly desirable in biological databases because many dependencies cannot be computed using coded functions. For example (refer to Figure 1), protein sequences are derived from gene (DNA) sequences. If a gene sequence is modified, the corresponding protein sequence(s), derived calculated quantities (such as molecular weight), and annotations may become invalid. Similarly, we may store descriptions of chemical reactions, e.g., substrates, reaction parameters, and products. If any of the substrates in the reaction are modified, the products of the reaction are likely to require re-evaluation. However, since these dependencies are complex and involve lab experiments and external analysis, database systems cannot systematically re-compute the other affected items. Lack of system support to automatically track such dependencies raises significant concerns on the quality and the consistency of the data maintained in biological databases.

Figure 1: Local dependencies (diagram labels: Gene_seq, Protein_seq, Enzymes, Proteins, Chemical reaction, Product; derivation steps: complex procedures, prediction tools, lab experiments, use of equipment)

Authorizing database operations is also one of the features that we extended in bdbms. Current database systems support the GRANT/REVOKE access models that depend only on the identity of the user. In bdbms, we propose the concept of content-based authorization, i.e., the authorization is based not only on the identity of an updater but also on the content of the updated data. For example, lab members may have the authority to update a given data set. However, for credibility and reliability of the data, these updates have to be revised by the lab administrator. The lab administrator can then approve or disapprove the operations based on their contents. In the meantime, users may be allowed to view the data pending its approval/disapproval.

The other key feature in bdbms is to provide access methods for supporting various types of biological data. Our goal is to design and integrate non-traditional and novel access methods inside bdbms. For example, sequences and multi-dimensional data are very common in biological databases and hence there is a need to integrate new types of index structures such as SP-GiST [3, 4, 16, 22] and the SBC-tree [17] inside bdbms along with their supporting operators. SP-GiST is an extensible indexing framework for supporting multi-dimensional data while the SBC-tree is an index structure for indexing and querying compressed sequences without decompressing them.

The rest of the paper is organized as follows. In Section 2, we present the overall architecture of bdbms. In Sections 3–7, we present each of the bdbms key features. Section 8 overviews the related work, and Section 9 contains concluding remarks.

2. BDBMS SYSTEM ARCHITECTURE
The main components of bdbms are the annotation manager, the dependency manager, and the authorization manager. A-SQL is bdbms's extended SQL that supports annotation (Section 3) and authorization commands (Section 6). bdbms's annotation manager is responsible for handling the annotations in an annotation storage space (Section 3). The dependency manager is responsible for handling the dependencies and derivations among database items. These dependencies are stored in a dependency storage space (Section 5). The authorization manager handles content-based authorizations as well as the standard GRANT/REVOKE authorizations over the database (Section 6). Index structures are available in bdbms in support of the multidimensional and compressed data (Section 7).

3. ANNOTATION MANAGEMENT
Annotations are extra information linked to data items inside a database. They usually represent users' comments, experiences, related information that is not modeled by the database schema, or the provenance (lineage) of the data. Adding and retrieving annotations represent an important way of communication and interaction among database users. In biological databases, annotations are used extensively to allow users to have a better understanding of the data, e.g., how a piece of data is obtained, why some values are being added or modified, and which experiments or analyses are being performed to obtain a set of values. Annotations can also be used to track the provenance of the data, e.g., from which source a piece of data is obtained or which program is used to generate the data. Tracking the provenance of the data is very important in assessing the value and credibility of the data and in giving credit to the original data generators.

Users can annotate the data at multiple granularities, e.g., annotating an entire table, an entire column, a subset of the tuples, a few cells, or a combination of these.

Despite their importance, annotations are not systematically supported by most database systems. While annotation management has been addressed in previous works, e.g., [7, 8, 10, 35], most of the proposed techniques usually assume simple annotation schemes and focus mainly on annotation propagation, i.e., propagating annotations along with the query answers. Other aspects of annotation management, e.g., mechanisms for their insertion, archival, and indexing as well as more efficient annotation schemes such as multi-granularity schemes, have not been addressed.

In bdbms, we address several challenges and requirements of annotation management. We highlight these challenges and requirements through the following example. We consider two gene tables, DB1_Gene and DB2_Gene, that have been obtained from two different databases (refer to Figures 2 and 3 for illustration). Each table has a set of annotations attached to it. We assume a straightforward storage scheme for storing the annotations, e.g., the one used in [7], where each column in the database has an associated annotation column to store the annotations (Figure 3).

Adding annotations: Users want to annotate their data at various granularities in a transparent way. That is, how or where the annotations are stored should be transparent to end-users. However, current database systems do not provide a mechanism to facilitate annotating the data. For example, to add annotation A2 over table DB1_Gene (Figure 2), the user has to know that the annotations are stored in the same user table in columns Ann_GID, Ann_GName, and Ann_GSequence. Then, the user issues an UPDATE
statement to update these columns by adding A2 to the desired annotation cells (Figure 3).

DB1_Gene
GID     GName  GSequence
JW0080  mraW   ATGATGGAAAA…
JW0082  ftsI   ATGAAAGCAGC…
JW0055  yabP   ATGAAAGTATC…
JW0078  fruR   GTGAAACTGGA…

A1: These genes are published in …
A2: These genes were obtained from RegulonDB
A3: Involved in methyltransferase activity

DB2_Gene
GID     GName  GSequence
JW0080  mraW   ATGATGGAAAA…
JW0041  fixB   ATGAACACGTT…
JW0037  caiB   ATGGATCATCT…
JW0055  yabP   ATGAAAGTATC…
JW0027  ispH   ATGCAGATCCT…

B1: Curated by user admin
B2: possibly split by frameshift
B3: obtained from GenoBase
B4: pseudogene
B5: This gene has an unknown function

Figure 2: Annotating tables DB1_Gene and DB2_Gene

DB1_Gene
GID     Ann_GID  GName  Ann_GName  GSequence     Ann_GSequence
JW0080  -        mraW   -          ATGATGGAAAA…  A3
JW0082  A1       ftsI   A1         ATGAAAGCAGC…  -
JW0055  A1, A2   yabP   A1, A2     ATGAAAGTATC…  A2
JW0078  A2       fruR   A2         GTGAAACTGGA…  A2

DB2_Gene
GID     Ann_GID  GName  Ann_GName  GSequence     Ann_GSequence
JW0080  B1, B5   mraW   B1, B5     ATGATGGAAAA…  B3, B5
JW0041  B1       fixB   B1         ATGAACACGTT…  B3
JW0037  B1, B4   caiB   B1, B4     ATGGATCATCT…  B3, B4
JW0055  -        yabP   B2         ATGAAAGTATC…  B3
JW0027  -        ispH   B2         ATGCAGATCCT…  B3

Figure 3: Simple annotation storage scheme: Every data column has a corresponding annotation column (a dash marks an empty annotation cell)

To support data annotation, the system has to provide new mechanisms for seamlessly adding annotations at various granularities. It is essential to provide new expressive commands as well as visualization tools that allow users to add their annotations graphically.

Storing annotations at multiple granularities: As in Figure 2, users may annotate a single cell, e.g., A3, a few cells, e.g., A1 and B2, entire rows, e.g., A2 and B4, or entire columns, e.g., B3. Multi-granularity annotations motivate the need for efficient storage and indexing schemes. Otherwise, storing and processing the annotations can be very expensive. For example, annotations A2 and B3 are repeated in the annotation columns 6 and 5 times, respectively. The need for such efficient schemes is especially important in the context of provenance where a single provenance record can be attached to many tuples or even entire columns or tables.

Categorizing annotations: Although all annotations are metadata, they may have different importance, meaning, and credibility. For example, annotations that are added by a certain user or group of users can be more important than annotations added by the public or unknown users. Moreover, annotations that represent the lineage of the data have a different purpose and importance from the annotations that represent users' comments. For example, annotations A2 and B3 represent the lineage of some data, and users may be interested only in these annotations. As will be discussed later in the paper, the different types of annotations will also have an impact on the storage mechanism adopted for each type. This diversity in annotations motivates the need for separating or categorizing the annotations. bdbms provides a mechanism that allows users to categorize their annotations at the storage, query processing, and annotation propagation levels.

Archiving annotations: Users may need to archive or delete annotations as they become obsolete, old, or simply invalid. Archived annotations should not be propagated to users along with query answers. For example, annotation B5 in Figure 2 states that gene JW0080 has an unknown function. But if the function of this gene becomes known and gets added to the database, then B5 becomes invalid and users do not want to propagate this annotation along with query answers. Without providing a mechanism for archiving annotations, the archival operation may not be an easy task. For example, to archive annotation B5, the user needs to find out which tuples/cells in the database have B5, then the contents of each of these cells are parsed to archive and then delete B5.

Propagating annotations: A key requirement in allowing annotation propagation is to simplify users' queries. This can only be achieved by providing database system support for annotation propagation; for example, by extending the query operators. Otherwise, users' queries may become complex and user-unfriendly. For example, consider a simple query that retrieves the genes that are common in DB1_Gene and DB2_Gene along with their annotations (Figure 3). To answer this query, the user has to write the following SELECT statements (a–c):

(a) R1(GID, GName, GSequence) =
    SELECT GID, GName, GSequence
    FROM DB1_Gene
    INTERSECT
    SELECT GID, GName, GSequence
    FROM DB2_Gene;

In Step (a), the user selects only the data columns from both gene tables, i.e., GID, GName, GSequence, and performs the intersection operation.

(b) R2(GID, GName, GSequence, Ann_GID, Ann_GName, Ann_GSequence) =
    SELECT R.GID, R.GName, R.GSequence,
           G.Ann_GID, G.Ann_GName, G.Ann_GSequence
    FROM R1 R, DB1_Gene G
    WHERE R.GID = G.GID;

In Step (b), the user joins the output from Step (a) back with Table DB1_Gene in order to retrieve the annotations from this table. Notice that we cannot select the annotation columns in Step (a): because the annotation values in the annotation attributes may vary in the two tables, the intersection operation would not return any tuples in that case.

(c) R3(GID, GName, GSequence, Ann_GID, Ann_GName, Ann_GSequence) =
    SELECT R.GID, R.GName, R.GSequence,
           R.Ann_GID+G.Ann_GID,
           R.Ann_GName+G.Ann_GName,
           R.Ann_GSequence+G.Ann_GSequence
    FROM R2 R, DB2_Gene G
    WHERE R.GID = G.GID;

In Step (c), a join is performed between R2 and DB2_Gene to consolidate the annotations from DB2_Gene with R2's annotations, where + is the annotation union operator.

The main reason for the complexity of querying and propagating the annotations is that users view annotations as metadata, whereas the DBMSs view annotations as normal data. For example, from a user's viewpoint, the two tuples corresponding to genes JW0080 and JW0055 in Table DB1_Gene are identical to those in Table DB2_Gene (Figure 3). They only have different annotations. From the database viewpoint, however, these tuples are not identical because annotations are viewed as normal attribute data. As a result, users' queries may become complex in order to overcome the mismatch in interpreting the annotations.

In the following subsections, we introduce our initial investigations through bdbms to address the challenges and requirements highlighted above along with some preliminary results.

3.1 Storing and Indexing Annotations
bdbms allows a user relation to have multiple annotation tables attached to it. For example, table DB1_Gene may have an annotation table that stores the provenance information and another annotation table that stores users' comments. To create an annotation table over a given user relation, the A-SQL command CREATE ANNOTATION TABLE (Figure 4) is used. CREATE ANNOTATION TABLE allows users to design and categorize their annotations at the storage level. This categorization will also facilitate annotation propagation (discussed in Section 3.4), where users may request propagating a certain type of annotations. To drop an annotation table, the DROP ANNOTATION TABLE command (Figure 4) is used.

    CREATE ANNOTATION TABLE <ann_table_name>
    ON <user_table_name>

    DROP ANNOTATION TABLE <ann_table_name>
    ON <user_table_name>

Figure 4: The A-SQL commands CREATE and DROP

To efficiently store the annotations, we are investigating several storage and indexing schemes. One possible direction is to consider compact representation of annotations that would improve the system performance with respect to storage overhead, I/O cost to retrieve the annotations, and the query processing time. For example, instead of storing the annotations at the cell level, we may store some of the annotations at coarser granularities. For instance, the annotations over Table DB2_Gene (Figure 2) can be represented as rectangles attached to groups of contiguous cells as illustrated in Figure 5, where DB2_Gene is viewed as two-dimensional space, e.g., columns represent the X-axis and tuples represent the Y-axis. In this case, an annotation over any group of contiguous cells can be represented by a single annotation record. So, in general, an annotation over a subset of a table will map to multiple rectangular regions. Other annotation characteristics that may need to be taken into account include whether the annotation is linked to multiple data items in different tables or is linked to very few specific cells.

Figure 5: Compact storage for annotations (axes: Columns, Tuples, Time; regions labeled (B1, T1), (B2, T2), (B3, T3), (B4, T4), (B5, T5))

3.2 Adding Annotations at Multiple Granularities
To add annotations using A-SQL, we propose the ADD ANNOTATION command (Figure 6(a)). The annotation_table_names clause specifies the annotation table(s) in which the added annotation will be stored. The annotation_body specifies the annotation value to be added. The output of the SQL_statement specifies the data to which the annotation is attached. Since annotations may contain important information that users want to query, we plan to support XML-formatted annotations. That is, annotation_body is an XML-formatted text. In this case, users can (semi-)structure their annotations and make use of XML querying capabilities over the annotations. The output of the SQL_statement can be at various granularities, e.g., entire tuples, columns, or groups of cells. For example, to add annotation B3 over the entire GSequence column in Table DB2_Gene (as illustrated in Figure 2), we execute the following ADD ANNOTATION command:

    ADD ANNOTATION
    TO DB2_Gene.GAnnotation
    VALUE '<Annotation>
           obtained from GenoBase
           </Annotation>'
    ON (Select G.GSequence
        From DB2_Gene G);
In this case, the annotation is attached to the entire GSequence column because no WHERE clause is specified. The annotation is stored in the annotation table GAnnotation. Notice that <Annotation> is the XML tag that encloses the annotation information.

Similarly, to annotate an entire tuple, e.g., annotation B5, we execute the following ADD ANNOTATION command:

    ADD ANNOTATION
    TO DB2_Gene.GAnnotation
    VALUE '<Annotation>
           This gene has an unknown function
           </Annotation>'
    ON (Select G.*
        From DB2_Gene G
        WHERE GID = 'JW0080');

In this case, the annotation is attached to the entire tuples returned by the query since all the attributes in the table are selected.

(a) ADD ANNOTATION
    TO <annotation_table_names>
    VALUE <annotation_body>
    ON <SQL_statement>

(b) ARCHIVE ANNOTATION
    FROM <annotation_table_names>
    [BETWEEN <time1> AND <time2>]
    ON <SELECT_statement>

(c) RESTORE ANNOTATION
    FROM <annotation_table_names>
    [BETWEEN <time1> AND <time2>]
    ON <SELECT_statement>

Figure 6: The A-SQL commands ADD, ARCHIVE, and RESTORE

To allow users to link annotations to database operations, i.e., INSERT, UPDATE, or DELETE, the SQL statement will be an INSERT, UPDATE or DELETE statement. For example, instead of inserting a new tuple and then annotating it by issuing a separate ADD ANNOTATION command, users can insert and annotate the new tuple instantly by enclosing the insert statement inside the ADD ANNOTATION command. For the delete operation, the deleted tuples will be stored in separate log tables along with the annotation that specifies why these tuples have been deleted. Notice that the standard system recovery log cannot be used for this purpose as the users need the freedom to structure their annotation schemas the way they want, which system recovery logs do not support.

We plan to add a visualization tool to allow users to annotate their data in a transparent way. The visualization tool displays users' tables as grids or spreadsheets where users can select one or more cells to annotate. Oracle addresses the integration of database tables with Excel spreadsheets to make use of Excel visualization and analysis power [2]. In bdbms, we plan to add this integration feature to facilitate adding and visualizing annotations.

3.3 Archiving and Restoring Annotations
Archival of annotations allows users to isolate old or invalid annotations from recent and valuable ones. In bdbms, we support archival of annotations instead of permanently deleting them because biological data usually has a degree of uncertainty and old values may turn out to be the correct values. Archiving annotations gives users the flexibility to restore the annotations if needed. Unlike other annotations, archived annotations are not propagated to users along with the query answers. However, if archived annotations are restored, then they will be propagated normally.

To archive and restore annotations, we introduce the ARCHIVE ANNOTATION (Figure 6(b)) and RESTORE ANNOTATION (Figure 6(c)) commands, respectively. The FROM clause specifies from which annotation table(s) the annotations will be archived/restored. The optional clause BETWEEN specifies a time range over which the annotations will be archived/restored. This time corresponds to the timestamp assigned to each annotation when it is first added to the database. The output from the SELECT statement specifies the data on which the annotations will be archived/restored. In addition, the output from the SELECT statement can be at multiple granularities, as explained in the ADD ANNOTATION command.

3.4 Annotation Propagation and Annotation-based Querying
To support the propagation of annotations and querying of the data based on their annotations, we introduce the A-SQL command SELECT, given in Figure 7. A-SQL SELECT extends the standard SELECT by introducing new operators and extending the semantics of the standard operators. We introduce the new operators ANNOTATION, PROMOTE, AWHERE, AHAVING, and FILTER.

    SELECT [DISTINCT] Ci [PROMOTE (Cj, Ck, …)], …
    FROM Relation_name [ANNOTATION(S1, S2, …)], …
    [WHERE <data_conditions>]
    [AWHERE <annotation_condition>]
    [GROUP BY <data_columns>
      [HAVING <data_condition>]
      [AHAVING <annotation_condition>] ]
    [FILTER <filter_annotation_condition>]

Figure 7: The A-SQL SELECT command

The ANNOTATION operator allows users to specify which annotation table(s) to consider in the query. Using the ANNOTATION operator, users can propagate their annotations transparently. That is, users do not have to know how or where annotations are stored. Instead, users only specify which annotations are of interest.

The PROMOTE operator allows users to copy annotations from one or more columns, possibly not in the projection list, to a projected column. For example, if column GID is projected from Table DB1_Gene, then Annotation A3 will not be propagated unless the annotations over GSequence are copied to GID.

The AWHERE and AHAVING clauses are analogous to the standard WHERE and HAVING clauses except that the conditions of AWHERE and AHAVING are applied over the annotations. That is, AWHERE and AHAVING pass a tuple along with all its annotations only if the tuple's annotations satisfy the given AWHERE and AHAVING conditions. On the other hand, the FILTER clause passes all the data tuples of the input relation (keeps user's data intact) but it
filters the annotations attached to each tuple. That is, any annotation that does not satisfy filter_annotation_condition is dropped.

The standard operators, e.g., projection, selection, and duplicate elimination, are also extended to process the annotations attached to the tuples. For example, the projection operator selects some user attributes from the input relation and passes only the annotations attached to those attributes. For example, projecting column GID from Table DB2_Gene (Figure 2) results in reporting GID data along with annotations B1, B4, and B5 only. The selection operators in WHERE and HAVING select tuples from the input relation based on conditions applied over the data values. The selected tuples are passed along with all their annotations. For example, selecting the gene with GID = JW0080 from Table DB2_Gene results in reporting the first tuple in DB2_Gene along with annotations B1, B3, and B5. Operators that group or combine multiple tuples into one tuple, e.g., duplicate elimination, group by, union, intersect, and difference, are also extended to handle the annotations attached to the tuples. These operators union the annotations over the grouped or combined tuples and attach them to the output tuple that represents the group.

Defining the above commands and operators is only the first step in supporting annotations and other features within bdbms. We also need to define for each A-SQL operator its algebraic definition, cost estimate function, and algebraic properties that can be used by the query optimizer to generate efficient query plans.

4. PROVENANCE MANAGEMENT
Biologists commonly interact and exchange data with each other. Tracking the provenance (lineage) of data is very important in assessing the value and credibility of the data. Similar to annotations, data provenance can be attached to the database at multiple granularities, i.e., at the table, column, tuple levels, or any sub-groupings and subsets of the data. Also, biological data can be queried by its provenance. For example (refer to Figure 8), one table may contain data from multiple sources, e.g., S1 and S2, or data that is locally inserted. Then, some values may be updated by a certain program, e.g., P1, and some columns may be overwritten by data from another source, e.g., S3. Then, users may be interested to know the source of some values at a certain moment in time.

Figure 8: Data provenance at multiple granularities (diagram labels: source copy, local insertion, update by program P1, overwrite from source S3; example questions: "What is the source of this value at time T?" and "Where do these values come from?")

In bdbms, we treat provenance data as a kind of annotation where all the requirements and functionalities discussed in Section 3 are also applicable to provenance data. However, provenance data has special requirements and characteristics that need to be addressed including:

• Structure of provenance data: Unlike annotations that can be free text, provenance data usually has a well-defined structure. For example, the names of the source database and the source table draw their values from a list of pre-defined values. Supporting XML-formatted annotations can be beneficial in structuring provenance data. For example, provenance data can follow a predefined XML schema that needs to be stored and enforced by the database system.

• Authorization over provenance data: End-users are usually not allowed to insert or update the provenance data. Provenance data needs to be automatically inserted and maintained by the system. For example, integration tools that copy the data from one database to another can be the only tools that insert the provenance information. End-users can only retrieve or propagate this information. Therefore, we need to provide an access control mechanism over the provenance data (and annotations in general) to restrict the annotation operations, e.g., addition, archival, and propagation, to certain users or programs as required.

(a) Gene
GID     GName  GSequence
JW0080  mraW   ATGATGGAAAA…
JW0082  ftsI   ATGAAAGCAGC…
JW0055  yabP   ATGAAAGTATC…

    Protein
PName  GID     PSequence  PFunction
mraW   JW0080  MMENYKHT…  Exhibitor
ftsI   JW0082  MKAAAKTQ…  Cell wall formation
yabP   JW0055  MKVSVPGM…  Hypothetical protein

    (derivation labels in the diagram: Prediction tool P, Lab experiment)

(b) GeneMatching
Gene1       Gene2       Evalue
ATCCCGGTT…  ATCCTGGTT…  3e-20
TTTGCCGGA…  TAAACCGGC…  1e-102
ATTTCCCAC…  TTAAGCCCG…  2e-04

    (derivation label in the diagram: Similarity matching procedure BLAST-2.2.15)

Figure 9: Local dependency tracking

5. LOCAL DEPENDENCY TRACKING
Biological databases are full of dependencies and derivations among data items. In many cases, these dependencies and derivations cannot be automatically computed using coded functions, e.g., stored procedures or functions inside the database. Instead, they may involve prediction tools, lab experiments, or instruments to derive the data. Using integrity constraints and triggers to maintain the consistency of the data is limited to computable dependencies,
i.e., dependencies that can be computed via coded functions. However, non-computable dependencies cannot be directly handled using integrity constraints and triggers. In Figure 9, we give an example of the dependencies that can be found in biological databases. In Figure 9(a), protein sequences are derived from the gene sequences using a prediction tool P, whereas the function of the protein is derived from the protein sequence using lab experiments. If a gene sequence is modified, then all protein sequences that depend on that gene have to be marked as outdated until their values are re-verified. Moreover, the function of the outdated proteins has to be marked as outdated until their values are re-evaluated.

In Figure 9(b), we present another type of dependency where the value of the data in the database depends on the procedure or program that generated that data. For example, the values in the Evalue column (Figure 9(b)) depend on Procedure BLAST-2.2.15. If a newer version of BLAST is used or BLAST is replaced with another procedure, then we need to re-evaluate the values in the Evalue column. These values can be automatically evaluated if BLAST can be modeled as a database function. Otherwise, the values have to be marked as Outdated.

In bdbms, we propose to extend the concept of Functional Dependencies [5, 13] to Procedural Dependencies. In Procedural Dependencies, we not only track the dependency among the data, but also the type and characteristics of the dependency, e.g., the procedure on which the dependency is based, whether or not that procedure can be executed by the database, and whether or not that procedure is invertible. For example, we can model the dependencies in Figure 9 using the following rules.

Gene.GSequence --[Prediction tool P (Executable, non-invertible)]--> Protein.PSequence   (1)

Protein.PSequence --[Lab experiment (non-executable, non-invertible)]--> Protein.PFunction   (2)

GeneMatching.Gene1, GeneMatching.Gene2 --[BLAST-2.2.15 (Executable, non-invertible)]--> GeneMatching.Evalue   (3)

Rule 1 specifies that Column PSequence in Table Protein depends on Column GSequence in Table Gene through the prediction tool P that is executable by the database and is non-invertible. Rule 2 specifies that Column PFunction in Table Protein depends on Column PSequence through a lab experiment that is not executable by the database and is non-invertible. Rule 3 specifies that Column Evalue in Table GeneMatching depends on both columns Gene1 and Gene2 through Program BLAST-2.2.15 that is executable by the database and is non-invertible. For example, from Rule 2, we infer that when Column PSequence changes, the database can only mark PFunction as Outdated. In contrast, based on Rule 3, when either of the Gene1 or Gene2 columns or Procedure BLAST-2.2.15 changes, the database can automatically re-evaluate Evalue.

In addition, the notion of Procedural Dependencies allows us to reason about the dependency rules. For example, in addition to the closure of an attribute, we can compute the closure of a procedure, i.e., all data in the database that depend on a specific procedure. We can also derive new rules. For example, based on rules (1) and (2) above, we can derive the following rule:

Gene.GSequence --[Prediction tool P, lab experiment (non-executable, non-invertible)]--> Protein.PFunction   (4)

Rule 4 specifies that Column PFunction in Table Protein depends on Column GSequence in Table Gene through a chain of two procedures, a prediction tool P and a lab experiment. This chain is non-executable by the database and is non-invertible. Notice that the chain is non-executable because at least one of the procedures, namely the lab experiment, is non-executable.

In bdbms, we address the following functionalities to track local dependencies:

• Modeling dependencies: We use Procedural Dependencies to allow users to model the dependencies among the database items as well as for bdbms to reason about these dependencies, for example, to detect conflicts and cycles among dependency rules, and to compute the closure of procedures.

• Storing dependencies: Dependencies among the data can be either at the schema level, i.e., the entity level, or at the instance level, i.e., the cell level. Schema-level dependencies can be modeled using foreign key constraints, e.g., protein sequences depend on gene sequences and they are linked by a foreign key. Instance-level dependencies are more complex to model because they are on a cell-by-cell basis. In this case, we can use dependency graphs to model such dependencies.

• Tracking outdated data: When the database is modified, bdbms uses the dependency graphs to figure out which items, termed the outdated items, may be affected by this modification. Outdated items need to be marked such that these items can be identified in any future reference. We propose to associate a bitmap with each table in the database. A cell in the bitmap is set to 1 if the corresponding cell in the data table is outdated, otherwise the bitmap cell is set to 0. For example, assume that the sequences corresponding to genes JW0080 and JW0082 (Figure 9(a)) are modified, then the bitmap associated with Table Protein will be as illustrated in Figure 10. Notice that the bits corresponding to PSequence are not set to 1 because PSequence is automatically updated by executing Procedure P. In contrast, PFunction cannot be automatically updated, therefore its corresponding bits are set to 1 to indicate that these values are outdated. To reduce the storage overhead of the maintained bitmaps, data compression techniques such as Run-Length-Encoding [23] can be used to effectively compress the bitmaps.

• Reporting and annotating outdated data: The main objective of tracking local dependencies is that the database should be able to report at all times the items that need to be verified or re-evaluated. Moreover, when a query executes over the database and
involves outdated items, the database should propagate with those items an annotation specifying that the query answer may not be correct. Detecting the outdated items at query execution time is a challenging problem as it requires retrieving and propagating the status of each item, i.e., whether it is outdated or not, in the query pipeline. A proposed solution is to consider the status of the database items as annotations attached to those items. These annotations will be automatically propagated along with the query answers as discussed in Section 3.

• Validating outdated data: bdbms will provide a mechanism for users to validate outdated items. An outdated item may or may not need to be modified to become valid. For example, a modification to a gene sequence may not affect the corresponding protein sequence. In this case, the protein sequence will be revalidated without modifying its value.

Protein
PName   GID      PSequence    PFunction
mraW    JW0080   MKENYKNM…    Exhibitor
ftsI    JW0082   MTATTKTQ…    Cell wall formation
yabP    JW0055   MKVSVPGM…    Hypothetical protein

Protein-Bitmap
PName   GID   PSeq.   PFun.
0       0     0       1
0       0     0       1
0       0     0       0

Figure 10: Use of bitmaps to mark outdated data

6. UPDATE AUTHORIZATION

Changes over the database may have important consequences, and hence, they should be subject to authorization and approval by authorized entities before these changes become permanent in the database. Update authorization (also termed approval enforcement) in current database management systems is based on GRANT/REVOKE access models [18, 24], where a user may be granted an authorization to update a certain table or attribute. Although widely accepted, these authorization models are based only on the identity of the user, not on the content of the data being inserted or updated. In biological databases, it is often the case that a data item is admitted permanently into a database based on its value, not on the user who entered that value. For example, a lab administrator may allow his/her lab members to perform insert and update operations over the database. However, for reliability, these operations have to be reviewed by the lab administrator. If the lab administrator is the only user who has the right to update the database, then this person may become a bottleneck in the process of populating the database.

In bdbms, we introduce an approval mechanism, termed content-based approval, that allows the database to systematically track the changes over the database. The proposed content-based approval mechanism works with, not in place of, existing GRANT/REVOKE mechanisms. The content-based approval mechanism maintains a log of all update operations, i.e., INSERT, UPDATE, and DELETE, that occur in the database. The database administrator can turn the content-based approval feature ON or OFF for a certain table or columns using a Start Content Approval and End Content Approval commands (Figure 11), respectively. The table_name value specifies the user table on which the update operations will be monitored. The optional clause COLUMNS specifies which column(s) in table_name to monitor. For example, we can monitor the update operations over only Column GSequence of Table Gene (Figure 9(a)). The APPROVED BY clause specifies the user or group of users who can approve or disapprove the update operations.

START CONTENT APPROVAL
ON <table_name>
[COLUMNS <column_names>]
APPROVED BY <user/group>

STOP CONTENT APPROVAL
ON <table_name>
[COLUMNS <column_names>]

Figure 11: Content-based approval

If the content-based approval feature is turned ON over Table T, then bdbms stores all update operations over T in the log along with an automatically generated inverse statement that negates the effect of the original statement. More specifically, for INSERT, a DELETE statement will be generated, for DELETE, an INSERT statement will be generated, and for UPDATE, another UPDATE statement that restores the old values will be generated. The log also stores the identifier of the user who issued the update operation and the issuing time. The person in charge of the database, e.g., the lab administrator, can then view the maintained log and review the updates that occurred in the database. If an operation is disapproved, then bdbms executes the inverse statement of that operation to remove its effect from the database. Executing the inverse statement may affect other elements in the database, e.g., elements that depend on the currently existing values. It is the functionality of the Local Dependency Tracking feature (Section 5) to track and invalidate these elements.

7. INDEXING AND QUERY PROCESSING

Biological databases warrant the use of non-traditional indexing mechanisms beyond B+-trees and hash tables. To enable biological algorithms to operate efficiently on the database, we propose integrating non-traditional indexing techniques inside bdbms. We focus on two fronts: (1) Supporting multidimensional datasets via multidimensional indexing techniques (suitable for protein 3D structures and surface shape matching), and (2) Supporting compressed datasets via novel external-memory indexes that work over the compressed data without decompressing it (suitable for indexing large sequences).

In bdbms, we focus on introducing non-traditional index structures for supporting biological data. For example, compressing the data inside the database has been shown to improve system performance, e.g., in C-Store [33]. It significantly reduces the size of the data, the number of I/O operations required to retrieve the data, and the buffer requirements. In bdbms, we investigate how we can store biological data in compressed form and yet be able to operate, e.g., index, search, and retrieve, on the compressed data without decompressing it.

7.1 Indexing Multi-dimensional Data

Space-partitioning trees are a family of access methods that index objects in a multi-dimensional space, e.g., protein 3D structures. In [3, 4, 16, 22], we introduce an extensible indexing framework, termed SP-GiST, that broadens the class of supported indexes to include disk-based versions of space-partitioning trees, e.g., disk-based trie variants, quadtree variants, and kd-trees. As an extensible indexing framework, SP-GiST allows developers to instantiate a variety of index structures in an efficient way through pluggable modules and without modifying the database engine. The SP-GiST framework is implemented inside PostgreSQL [34] and we use it in bdbms. Several index structures have been instantiated using SP-GiST, e.g., variants of the trie [11, 20], the kd-tree [6], the point quadtree [19], and the PMR quadtree [29]. We implemented several advanced search operations, e.g., k-nearest-neighbor search, regular expression match search, and substring searching. The experimental results in [16] demonstrate the performance potential of the class of space-partitioning tree indexes over the B+-tree and R-tree indexes, for the operations above. In addition to the performance gains and the advanced search functionalities provided by SP-GiST indexes, it is the ability to rapidly prototype these indexes inside bdbms that is most attractive.

A key challenge is to integrate SP-GiST indexes inside biological analysis algorithms such as protein structure alignment algorithms. Providing the index structures is the first step to improve the querying and processing capabilities of the analysis algorithms.

7.2 Indexing Compressed Data

Biological databases consist of large amounts of sequence data, e.g., genes, alleles, and protein primary and secondary structures. These sequences need to be stored, indexed, and searched efficiently. In bdbms, we propose to investigate new techniques for compressing biological sequences and operating over the compressed data without decompressing it. Sequence compression has been addressed recently in the C-Store database management system [33], where some operators, e.g., aggregation operators, can operate directly over the compressed data. Sequence compression is demonstrated to improve system performance as it reduces the size of the data significantly.

In bdbms, as a first step, we investigate the processing, e.g., indexing and querying, of Run-Length-Encoded (RLE) sequences. RLE [23] is a compression technique that replaces the consecutive repeats of a character C by one occurrence of C followed by C’s frequency. One of the main challenges is how to operate on the compressed data without decompressing it. In [17], we proposed an index structure, termed the SBC-tree (String B-tree for Compressed sequences), for indexing and searching RLE-compressed sequences of arbitrary length. In Figure 12, we illustrate how protein secondary structure sequences are stored in bdbms. We first compress the sequences using RLE, and then build an SBC-tree index over the compressed sequences. Queries over the sequences will use the index to retrieve the desired data without decompression. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity as well as optimal search time for substring matching, prefix matching, and range search queries. More interestingly, the SBC-tree has been shown to be very practical to implement. The SBC-tree index is prototyped in PostgreSQL with an R-tree in place of the 3-sided structure. Preliminary performance results illustrate that using the SBC-tree to index RLE-compressed protein sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.

[Figure 12 shows a protein secondary structure sequence over the letters L, E, and H (beginning LLLEEEEEEEHHH...), the sequence compression step that produces the RLE compressed form below, and the SBC-tree index built over the compressed sequences.]

RLE compressed form:
L3E7H22E6L2E3L1H10L10H16L4E7H12E10L4H7L4H14E10H7E8H10E4L1E10L3E8L4H15E6L2E4L8H20E4L1E10L1E5L9E5L6E8L1E9L3E4H18E5L3E9L3H20L1H12E5L1E4H17E6L5E7L4E13L1E14H14L5E10H7E6H10L6H11E11H13L2E5L10H18L3E7H9L4E18L4E3L2H9L11H20E11L1E4H12L1H14L2E8L4E9L5E5L5E9L3E9L3E3H13L4

Figure 12: Indexing and querying RLE-compressed sequences

In bdbms, we plan to address the following challenges regarding the processing of compressed data:

• Full integration of the SBC-tree index: To fully integrate the SBC-tree index inside bdbms we plan to address several query processing and optimization issues including: (1) supporting subsequence matching, and (2) providing accurate cost functions for estimating the cost of the index. Subsequence matching is an important operation over biological sequences as it is used in many algorithms such as sequence alignment algorithms. We plan to extend the supported operations of the SBC-tree index to include subsequence matching.

• Processing various formats of compressed data: Currently, bdbms supports indexing and querying RLE-compressed sequence data. RLE is effective in the case of sequences where characters have long repeats in tandem. Compression techniques like gzip and Burrows-Wheeler Transform (BWT) can be more effective in compressing the other kinds of data. Our plan is to investigate indexing and querying other formats of compressed data in addition to RLE-compressed sequences to efficiently support these data inside bdbms.

8. RELATED WORK

Periscope [30, 36] is an ongoing project that aims at defining a declarative query language for querying biological data. Periscope/SQ [36], a component of Periscope, introduces
new operators and data types that facilitate the processing and querying of sequence data. While the main focus of Periscope is on defining and supporting a new declarative query language, bdbms focuses on other functionalities that are required by biological databases, e.g., annotation and provenance management, local dependency tracking, update authorization, and non-traditional access methods.

Several annotation systems have been built to manage annotations over the web, e.g., [1, 26, 27, 28, 31, 32]. Biodas (Biological Distributed Annotation System) [1, 32] and Human Genome Browser [27] are specialized biological annotation systems to annotate genome sequences. They allow users to integrate genome annotation information from multiple web servers. Managing annotations and provenance in relational databases has been addressed in [7, 8, 10, 12, 21, 35]. In these techniques, provenance data is pre-computed and stored inside the database as annotations. The main focus of these techniques is to propagate the annotations along with the query answer. Other aspects of annotation management, e.g., insertion, storage, and indexing, have not been addressed. Another approach for tracking provenance, termed the lazy approach, has been addressed in [9, 14, 15, 38], where provenance data is computed at query time. Lazy approach techniques require the derivation steps of the data to be known and invertible so that the provenance information can be computed. In bdbms, we treat provenance data as a kind of annotation because the derivation of biological data is usually ad hoc and does not necessarily follow certain functions or queries.

The access control and authorization process in current database systems is based on the GRANT/REVOKE model [18, 24]. Although widely accepted, this model is not content-based, i.e., the authorization is based only on the identity of the user. In bdbms, we propose the content-based approval model that is based on the data as well as on the identity of the user.

9. CONCLUDING REMARKS

Two applications have been driving the bdbms project: building a database resource for the Escherichia coli (E. coli) model organism and a protein structure database project. Through these two projects, we realized the need for the functionalities that we address in bdbms, namely (1) Annotation and provenance management, (2) Local dependency tracking, (3) Update authorization, and (4) Non-traditional and novel access methods.

bdbms is currently being prototyped using PostgreSQL. In parallel work, we have extended relational algebra to operate on “annotated” relations. The A-SQL language and the content-based authorization model are currently under development in PostgreSQL. The SP-GiST and SBC-tree access methods are already integrated inside PostgreSQL. We are currently studying several optimizations, cost estimates, and complex operations over these indexes.

10. REFERENCES

[1] biodas.org. http://biodas.org.
[2] Exploiting the power of Oracle using Microsoft Excel. Oracle White Paper, December 2004.
[3] W. G. Aref and I. F. Ilyas. An extensible index for spatial databases. In Statistical and Scientific Database Management, pages 49–58, 2001.
[4] W. G. Aref and I. F. Ilyas. SP-GiST: An extensible database index for supporting space partitioning trees. Journal of Intelligent Information Systems, 17(2-3):215–240, 2001.
[5] W. Armstrong. Dependency structures of database relationships. In International Federation for Information Processing (IFIP), pages 580–583, 1974.
[6] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[7] D. Bhagwat, L. Chiticariu, W. Tan, and G. Vijayvargiya. An annotation management system for relational databases. pages 900–911, 2004.
[8] P. Buneman, A. P. Chapman, and J. Cheney. Provenance management in curated databases. In ACM SIGMOD International Conference on Management of Data, 2006.
[9] P. Buneman, S. Khanna, and W.-C. Tan. Why and where: A characterization of data provenance. Lecture Notes in Computer Science, 1973:316–333, 2001.
[10] P. Buneman, S. Khanna, and W.-C. Tan. On propagation of deletions and annotations through views. In Principles of Database Systems (PODS), pages 150–158, 2002.
[11] W. A. Burkhard. Hashing and trie algorithms for partial match retrieval. ACM Transactions on Database Systems, 1(2):175–187, 1976.
[12] L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. DBNotes: A post-it system for relational databases based on provenance. In ACM SIGMOD International Conference on Management of Data, pages 942–944, 2005.
[13] E. Codd. A relational model for large shared data banks. Communications of the ACM, 13(6):377–387, 1970.
[14] Y. Cui and J. Widom. Practical lineage tracing in data warehouses. In International Conference on Data Engineering, pages 367–378, 2000.
[15] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. In International Conference on Very Large Data Bases, pages 471–480, 2001.
[16] M. Y. Eltabakh, R. H. Eltarras, and W. G. Aref. Space-partitioning trees in PostgreSQL: Realization and performance. In International Conference on Data Engineering, pages 100–111, 2006.
[17] M. Y. Eltabakh, W.-K. Hon, R. Shah, W. G. Aref, and J. S. Vitter. The SBC-tree: An index for run-length compressed sequences. Technical Report CSD TR05-030, 2005.
[18] R. Fagin. On an authorization mechanism. ACM Transactions on Database Systems (TODS), 3(3):310–319, 1978.
[19] R. A. Finkel and J. L. Bentley. Quad trees: A data structure for retrieval on composite keys. Acta Informatica, 4:1–9, 1974.
[20] E. Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960.
[21] F. Geerts, A. Kementsietsidis, and D. Milano. Mondrian: Annotating and querying databases through colors and blocks. In International Conference on Data Engineering, page 82, 2006.
[22] T. M. Ghanem, R. Shah, M. F. Mokbel, W. G. Aref, and J. S. Vitter. Bulk operations for space-partitioning trees. In International Conference on Data Engineering, pages 29–40, 2004.
[23] S. W. Golomb. Run-length encodings. IEEE Transactions on Information Theory, 12:399–401, 1966.
[24] P. P. Griffiths and B. W. Wade. An authorization mechanism for a relational database system. ACM Transactions on Database Systems (TODS), 1(3):242–255, 1976.
[25] H. V. Jagadish and F. Olken. Database management for life sciences research. SIGMOD Record, 33(2):15–20, 2004.
[26] J. Kahan and R. S. M. Koivunen, E. Prud’Hommeaux. Annotea: An open RDF infrastructure for shared web annotations. WWW10, pages 623–632, 2001.
[27] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Research, 12(5):996–1006, 2002.
[28] D. LaLiberte and A. Braverman. A protocol for scalable group and public annotations. WWW3, pages 911–918, 1995.
[29] R. C. Nelson and H. Samet. A population analysis for hierarchical data structures. In ACM SIGMOD International Conference on Management of Data, pages 270–277, 1987.
[30] J. M. Patel. The role of declarative querying in bioinformatics. 7(1):89–92, 2003.
[31] M. A. Schickler, M. S. Mazer, and C. Brooks. Pan-browser support for annotations and other meta-information on the world wide web. WWW5, pages 1063–1074, 1996.
[32] L. Stein, S. Eddy, and R. Dowell. Distributed sequence annotation system (DAS). Washington University, Technical Report WUCS-01-07, 2001.
[33] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A column oriented DBMS. In International Conference on Very Large Data Bases, 2005.
[34] M. Stonebraker and G. Kemnitz. The Postgres next generation database management system. Communications of the ACM, 34(10):78–92, 1991.
[35] W.-C. Tan. Containment of relational queries with annotation propagation. In International Symposium on Database Programming Languages, 2003.
[36] S. Tata, J. M. Patel, J. S. Friedman, and A. Swaroop. Declarative querying for biological sequences. In International Conference on Data Engineering, pages 87–96, 2006.
[37] T. Topaloglou. Biological data management: Research, practice and opportunities. In International Conference on Very Large Data Bases, pages 1233–1236, 2004.
[38] A. Woodruff and M. Stonebraker. Supporting fine-grained data lineage in a database visualization environment. In International Conference on Data Engineering, pages 91–102, 1997.
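
As a rough illustration of the local dependency tracking described in Section 5, the following Python sketch models Rules (1) to (3) as data, splits the columns affected by a change into those the database could re-evaluate and those it can only mark as outdated, and builds and run-length encodes the Protein bitmap of Figure 10. It is a minimal sketch under stated assumptions, not the bdbms implementation: the names Rule, propagate, procedure_closure, outdated_bitmap, and rle_encode are hypothetical, each Protein row is assumed to depend on the gene with the same GID, and the bitmap is flattened row by row before encoding.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Rule:
    """One procedural dependency: sources --[procedure]--> target (hypothetical model)."""
    sources: tuple
    target: str
    procedure: str
    executable: bool          # can the database re-run the procedure itself?
    invertible: bool = False


# Rules (1)-(3) from Section 5.
RULES = [
    Rule(("Gene.GSequence",), "Protein.PSequence", "Prediction tool P", True),
    Rule(("Protein.PSequence",), "Protein.PFunction", "Lab experiment", False),
    Rule(("GeneMatching.Gene1", "GeneMatching.Gene2"),
         "GeneMatching.Evalue", "BLAST-2.2.15", True),
]


def propagate(changed, rules):
    """Split the columns that depend on `changed` into (recomputed, outdated).

    A chain of procedures is executable only if every procedure on it is
    executable (cf. Rule 4); a column reached through any non-executable
    chain is treated as outdated.
    """
    status = {}                      # column -> True while auto-recomputable
    stack = [(changed, True)]
    while stack:
        column, chain_ok = stack.pop()
        for rule in rules:
            if column not in rule.sources:
                continue
            ok = chain_ok and rule.executable
            previous = status.get(rule.target)
            # Visit a column the first time, or again if it is downgraded.
            if previous is None or (previous and not ok):
                status[rule.target] = ok
                stack.append((rule.target, ok))
    recomputed = {c for c, ok in status.items() if ok}
    outdated = {c for c, ok in status.items() if not ok}
    return recomputed, outdated


def procedure_closure(procedure, rules):
    """All columns that depend, directly or transitively, on `procedure`."""
    closure = {r.target for r in rules if r.procedure == procedure}
    for column in list(closure):
        recomputed, outdated = propagate(column, rules)
        closure |= recomputed | outdated
    return closure


def outdated_bitmap(table, columns, row_keys, modified_keys, outdated_columns):
    """One row of bits per table row; 1 marks an outdated cell."""
    return [[int(key in modified_keys and f"{table}.{col}" in outdated_columns)
             for col in columns] for key in row_keys]


def rle_encode(bits):
    """Run-length encode a flat bit list as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


# Genes JW0080 and JW0082 are modified, as in the Figure 10 example.
recomputed, outdated = propagate("Gene.GSequence", RULES)
bitmap = outdated_bitmap("Protein", ["PName", "GID", "PSequence", "PFunction"],
                         ["JW0080", "JW0082", "JW0055"],
                         {"JW0080", "JW0082"}, outdated)
print(bitmap)
print(rle_encode([bit for row in bitmap for bit in row]))
```

In this sketch PSequence ends up in the recomputed set and only PFunction is marked in the bitmap, which mirrors the reasoning given for Figure 10. Flattening the bitmap column by column instead of row by row would give longer runs for this example; which layout compresses better in practice is left open here.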

35 pages
Online Biological Databases: A/Prof. Ly Le
No ratings yet
Online Biological Databases: A/Prof. Ly Le
64 pages
Bioinformatics PPT Section B Data Storage and Retrival Group 3
No ratings yet
Bioinformatics PPT Section B Data Storage and Retrival Group 3
36 pages
What Is Bioinformatics
No ratings yet
What Is Bioinformatics
10 pages
Ananya Jaiswal
No ratings yet
Ananya Jaiswal
20 pages
How To Write A Literature Review Revised1
100% (2)
How To Write A Literature Review Revised1
64 pages
Bioinformatics Overview
100% (1)
Bioinformatics Overview
18 pages
ML Project Movie Recommendation System
No ratings yet
ML Project Movie Recommendation System
2 pages
Tics - A Brief Introduction
No ratings yet
Tics - A Brief Introduction
4 pages
Introduction To Databases
No ratings yet
Introduction To Databases
7 pages
Unit 5-Introduction To Biological Databases
No ratings yet
Unit 5-Introduction To Biological Databases
14 pages
Exploring Database and Analyzing Protein Sequence
No ratings yet
Exploring Database and Analyzing Protein Sequence
70 pages
Practical Guide to Portable Batch System: Definitive Reference for Developers and Engineers
From Everand
Practical Guide to Portable Batch System: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Comprehensive Guide to BLAST: Definitive Reference for Developers and Engineers
From Everand
Comprehensive Guide to BLAST: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
B-Tree Algorithms and Applications: Definitive Reference for Developers and Engineers
From Everand
B-Tree Algorithms and Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
PostgreSQL Foundations: Definitive Reference for Developers and Engineers
From Everand
PostgreSQL Foundations: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Bigtable Architecture and Implementation: Definitive Reference for Developers and Engineers
From Everand
Bigtable Architecture and Implementation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Virtuoso Database Systems: The Complete Guide for Developers and Engineers
From Everand
Virtuoso Database Systems: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
KeyDB Administration and Performance Tuning: Definitive Reference for Developers and Engineers
From Everand
KeyDB Administration and Performance Tuning: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
PrestoDB in Practice: Definitive Reference for Developers and Engineers
From Everand
PrestoDB in Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Couchbase Essentials: Definitive Reference for Developers and Engineers
From Everand
Couchbase Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Database And Computer Management: SERIES 1, #3
From Everand
Database And Computer Management: SERIES 1, #3
Elias Mutegi
No ratings yet
Debezium in Action: Definitive Reference for Developers and Engineers
From Everand
Debezium in Action: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Databases: System Concepts, Designs, Management, and Implementation
From Everand
Databases: System Concepts, Designs, Management, and Implementation
Jonathan Rigdon
No ratings yet




Figure 2: Annotating tables DB1_Gene and DB2_Gene

