Griffiths v3
Metabolic and regulatory pathway induction
Development of Drugs, Vaccines, Diagnostics
Different types of Drugs, Vaccines, and Diagnostics
• Small molecules
• Protein therapeutics
• Gene therapy
• In vitro and in vivo diagnostics
Development requires
• Preclinical research
• Clinical trials
• Long-term clinical research
[Figure: an Integrated Application drawing on SEQUENCE sources (GenBank, proprietary), EXPRESSION sources (cDNA µArray DB, Oligo TxP DB), and LITERATURE sources (Medline).]
Data Warehousing for Integration
Data warehousing is a process as much as it is
a repository. Several primary concepts underlie
data warehousing:
• ETL (Extraction, Transformation, Load)
• Component-based (datamarts)
• Typically utilizes a dimensional model
• Metadata-driven
Data Warehousing
[Figure: source data flows through Extraction (E), Transformation (T), and Load (L) into the Data Warehouse of integrated datamarts.]
Data-level Integration Through
Data Warehousing
[Figure: SEQUENCE (GenBank, proprietary), EXPRESSION (cDNA µArray DB, Oligo TxP DB), and LITERATURE (Medline) sources feed the data warehouse through a metadata layer; presentation applications (SeqWeb, TxP app, PubMed, proprietary app) issue queries (Q1, Q2, Q3) against the warehouse.]
“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemia and normal pituitary cells, and are homologous to known transcription factors.”
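A hedged sketch of how such a question might be answered once the three sources are warehoused; every table and column name here is hypothetical, standing in for literature, expression, and homology datamarts joined through a shared gene dimension.

    -- Hypothetical warehouse query for the question above.
    SELECT DISTINCT g.gene_name
    FROM gene_dim g, literature_fact lit, expression_fact gx, homology_fact hom
    WHERE lit.gene_key = g.gene_key
      AND lit.disease_term = 'hyperprolactinemia'
      AND gx.gene_key = g.gene_key
      AND gx.fold_change > 3          -- hyperprolactinemia vs. normal pituitary
      AND hom.gene_key = g.gene_key
      AND hom.target_class = 'transcription factor';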
Data Staging
Storage area and set of processes that
• extracts source data
• transforms data
• cleans incorrect data, resolves missing elements, enforces standards conformance
• purges fields not needed
• combines data sources
• creates surrogate keys for data to avoid dependence on legacy keys
• builds aggregates where needed
• archives/logs
• loads and indexes data
Does not provide query or presentation services
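A minimal sketch of one staging step, with hypothetical stg_sample source and sample_dim target tables: rows are cleaned, conformed, given surrogate keys from a sequence generator (Oracle-style NEXTVAL shown), and loaded, with the legacy key kept only as an attribute.

    -- Hypothetical staging step: clean, conform, assign surrogate keys, load.
    INSERT INTO sample_dim (sample_key, legacy_sample_id, tissue, species)
    SELECT sample_seq.NEXTVAL,                  -- surrogate key: no dependence on legacy keys
           s.sample_id,                         -- legacy key retained as an attribute only
           COALESCE(TRIM(s.tissue), 'UNKNOWN'), -- resolve missing elements
           UPPER(s.species)                     -- conform to a naming standard
    FROM stg_sample s
    WHERE s.sample_id IS NOT NULL;              -- purge rows that cannot be keyed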
Data Staging (cont.)
• Sixty to seventy percent of development is here
• Engineering is generally done using database
automation and scripting technology
• Staging environment is often an RDBMS
• Generally done in a centralized fashion and as often as
desired, having no effect on source systems
• Solves the integration problem once and for all, for
most queries
Warehouse Development
and Deployment
Two development paradigms:
[Figure: sequence-indexed data sources, GxP-indexed data sources, and SNP information, connected through CORBA.]
Memory Maps: Pros and Cons
Advantages
• true “analytical” data integration
• quick access
• cleans analytical data
• simple matrix representation

Disadvantages
• typically does not put non-analytical data (gene names, tissue types, etc.) through the ETL process
• not easily extensible when adding new databases with descriptive information
• performance hit when accessing anything outside of memory (tough to optimize)
• scalability restricted by memory limitations of machine
• difficult to mine due to complicated architecture
The Need for Metadata
For all of the previous approaches, one underlying
concept plays a critical role in their success: Metadata.
Metadata is a concept that many people still do not
fully understand. Some common questions include:
• What is it?
• Where does it come from?
• Where do you keep it?
• How is it used?
Metadata
Formats that have been used to carry data and metadata in the life sciences include:
• EMBL format
• ASN.1
• XML
XML Emerges
• Allows uniform description of data and metadata
– Metadata described through DTDs
– XML::DOM - a Perl extension to XML::Parser. It adds a new 'Style' to XML::Parser, called 'Dom', that allows XML::Parser to build an object-oriented data structure with a DOM Level 1 compliant interface.
– XML::Dumper - a simple package for converting Perl data structures to XML, and XML back to Perl data structures.
– XML::Grove - provides simple objects for parsed XML documents. The objects may be modified, but no checking is performed.
– XML::QL - an early implementation of a note published by the W3C called "XML-QL: A Query Language for XML".
– XML::XQL - a Perl extension that allows you to perform XQL queries on XML object trees.
XML in Life Sciences
• Lots of momentum in Bio community
• …
CORBA: Pros and Cons

Advantages
• Distributed
• Component-based architecture
• Promotes reuse
• Doesn’t require knowledge of implementation
• Platform independent

Disadvantages
• Distributed
• Level of abstraction is sometimes not useful
• Can be slow to broker objects
• Different ORBs do different things
• Unreliable?
• OMG website is brutal
Modeling Techniques
E-R Modeling
• Optimized for transactional data
• Eliminates redundant data
• Preserves dependencies in UPDATEs
• Doesn’t allow for inconsistent data
• Useful for transactional systems
Dimensional Modeling
• Optimized for queryability and performance
• Does not eliminate redundant data, where appropriate
• Constraints unenforced
• Models data as a hypercube
• Useful for analytical systems
Illustrating Dimensional Data Space
Sample problem: monitoring a temperature-sensitive room for fluctuations
[Figure: probe positions in the room plotted along x, y, and z axes.]
x, y, z, and time uniquely determine a temperature value:
(x, y, z, t) → temperature
(independent variables) (dependent variable)
Nomenclature:
“x, y, z, and t are dimensions”
“temperature is a fact”
“the data space is a hypercube of size 4”
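This data space drops straight into SQL. A minimal sketch, assuming a single flat table with illustrative column types: the four dimensions form the key, and the temperature fact can be sliced with GROUP BY.

    -- The hypercube as a relation: the dimensions form the key, temperature is the fact.
    CREATE TABLE temperature_fact (
        x    INTEGER   NOT NULL,
        y    INTEGER   NOT NULL,
        z    INTEGER   NOT NULL,
        t    TIMESTAMP NOT NULL,
        temp NUMERIC(5,2),
        PRIMARY KEY (x, y, z, t)
    );

    -- Slice the cube: temperature fluctuation per height band, across all time.
    SELECT z, MIN(temp), AVG(temp), MAX(temp)
    FROM temperature_fact
    GROUP BY z;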
Dimensional Modeling Primer
• Represents the data domain as a collection of
hypercubes that share dimensions
– Allows for highly understandable data spaces
– Direct optimizations for such configurations are provided
through most DBMS frameworks
– Supports data mining and statistical methods such as multi-dimensional scaling, clustering, and self-organizing maps
– Ties in directly with most generalized visualization tools
– Only two types of entities - dimensions and facts
Dimensional Modeling Primer -
Relational Representation
• Contains a table for each dimension
• Contains one central table for all facts, with a multi-part key
• Each dimension table has a single-part primary key that corresponds to exactly one component of the multipart key in the fact table.
[Figure: the Star Schema, the basic component of Dimensional Modeling -- X, Y, Z, and Time dimension tables (each with its own PK) surround a central Temperature fact table whose multipart key is composed of foreign keys to each dimension.]
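A minimal DDL sketch of this star, restating the room-monitoring example in star form; the names and types are illustrative, with surrogate keys in each dimension.

    -- Dimension tables: single-part primary keys plus descriptive attributes.
    CREATE TABLE x_dimension    (x_key INTEGER PRIMARY KEY, x_coord INTEGER);
    CREATE TABLE y_dimension    (y_key INTEGER PRIMARY KEY, y_coord INTEGER);
    CREATE TABLE z_dimension    (z_key INTEGER PRIMARY KEY, z_coord INTEGER, height_band VARCHAR(20));
    CREATE TABLE time_dimension (t_key INTEGER PRIMARY KEY, sample_time TIMESTAMP, day_of_week VARCHAR(9));

    -- Fact table: a multipart key of foreign keys, one per dimension, plus the fact itself.
    CREATE TABLE temperature_fact (
        x_key INTEGER NOT NULL REFERENCES x_dimension,
        y_key INTEGER NOT NULL REFERENCES y_dimension,
        z_key INTEGER NOT NULL REFERENCES z_dimension,
        t_key INTEGER NOT NULL REFERENCES time_dimension,
        temp  NUMERIC(5,2),
        PRIMARY KEY (x_key, y_key, z_key, t_key)
    );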
Dimensional Modeling Primer -
Relational Representation
• Each dimension table most often contains descriptive textual information
about a particular scientific object. Dimension tables are typically the entry
points into a datamart. Examples: “Gene”, “Sample”, “Experiment”
• The fact table relates the dimensions that surround it, expressing a many-to-many relationship. The more useful fact tables also contain “facts” about the relationship -- additional information not stored in any of the dimension tables.
[Figure: the Star Schema, as before -- dimension tables with single-part primary keys surrounding the central fact table.]
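A sketch of the “entry point” idea using the example objects named above (Gene, Sample); the table and column names are hypothetical. The query enters through the dimension tables and aggregates over the fact table that relates them.

    -- Enter through the Sample and Gene dimensions, aggregate over the fact table.
    SELECT g.gene_name, AVG(f.expression_level)
    FROM gene_dimension g, sample_dimension s, expression_fact f
    WHERE f.gene_key = g.gene_key
      AND f.sample_key = s.sample_key
      AND s.tissue = 'pituitary'
    GROUP BY g.gene_name;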
Dimensional Modeling Primer -
Relational Representation
• Dimension tables are typically small, on the order of 100 to 100,000 records. Each record describes a physical or conceptual entity.
• The fact table is typically very large, on the order of 1,000,000 or more records. Each record measures a fact about a grouping of physical or conceptual entities.
[Figure: the Star Schema, as before -- X, Y, Z, and Time dimensions around the Temperature fact table.]
Dimensional Modeling Primer -
Relational Representation
Neither dimension tables nor fact tables are necessarily normalized!
• Normalization increases complexity of design, worsens performance with joins
• Non-normalized tables can easily be understood with SELECT and GROUP BY
• Database tablespace must therefore be larger to store the same data -- the gain in overall performance and understandability outweighs the cost of extra disks!
[Figure: the Star Schema, as before.]
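A sketch of the browsability claim: with a denormalized dimension, the distinct descriptive values can be surveyed with a bare SELECT and GROUP BY, with no joins. Names are illustrative.

    -- Browse a denormalized dimension directly -- no joins required.
    SELECT tissue, species, COUNT(*)
    FROM sample_dimension
    GROUP BY tissue, species;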
Case in Point:
Sequence Clustering
An E-R model of clustering results: Sequence (seq_id, bases, length), Membership (start, length, orientation), Subcluster (subcluster_id), ParamSet (paramset_id), Parameters (param_name, param_value), plus Cluster, Result, and Run tables.

PROBLEMS
• not browsable (confusing)
• poor query performance
• little or no data mining support

Finding all sequences in the same cluster as sequence XA501 from the last run requires a nested multi-way join:

SELECT SEQ_ID
FROM SEQUENCE, MEMBERSHIP, SUBCLUSTER
WHERE SEQUENCE.SEQKEY = MEMBERSHIP.SEQKEY
AND MEMBERSHIP.SUBCLUSTERKEY = SUBCLUSTER.SUBCLUSTERKEY
AND SUBCLUSTER.CLUSTERKEY = (
    SELECT CLUSTER.CLUSTERKEY
    FROM SEQUENCE, MEMBERSHIP, SUBCLUSTER, CLUSTER, RESULT, RUN
    WHERE SEQUENCE.RESULTKEY = RESULT.RESULTKEY
    AND RESULT.RUNKEY = RUN.RUNKEY
    AND SEQUENCE.SEQKEY = MEMBERSHIP.SEQKEY
    AND MEMBERSHIP.SUBCLUSTERKEY = SUBCLUSTER.SUBCLUSTERKEY
    AND SUBCLUSTER.CLUSTERKEY = CLUSTER.CLUSTERKEY
    AND SEQUENCE.SEQID = 'XA501'
    AND RESULT.RUNID = 'my last run'
)
Dimensionally Speaking…
Sequence Clustering

“Show me all sequences in the same cluster as sequence XA501 from my last run.”

CONCEPTUAL IDEA - The Star Schema: a historical, denormalized, subject-oriented view of scientific facts -- the datamart. A centralized fact table stores the single scientific fact of sequence membership in a cluster and a subcluster. Smaller dimension tables around the fact table represent key scientific objects (e.g., sequence).

Membership Facts: seq_id, run_id, paramset_id, cluster_id, subcluster_id, seq_start, seq_end, seq_orientation, cluster_size, subcluster_size
Sequence: seq_id, bases, length, type
Parameters: paramset_id, param_name, param_value
Run: run_id, run_date, run_initiator, run_purpose, run_remarks

SELECT SEQ_ID
FROM MEMBERSHIP_FACTS
WHERE CLUSTER_ID IN (
    SELECT CLUSTER_ID
    FROM MEMBERSHIP_FACTS
    WHERE SEQ_ID = 'XA501'
    AND RUN_ID = 'my last run'
)

Benefits
• Highly browsable, understandable model for scientists
• Vastly improved query performance
• Immediate data mining support
• Extensible “database componentry” model
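A minimal DDL sketch of this datamart; the column lists follow the slide, while the types and key choices are assumptions.

    -- Dimension tables: one per key scientific object.
    CREATE TABLE sequence_dim (
        seq_id VARCHAR(20) PRIMARY KEY,
        bases  CLOB,
        length INTEGER,
        type   VARCHAR(20)
    );
    CREATE TABLE run_dim (
        run_id        VARCHAR(40) PRIMARY KEY,
        run_date      DATE,
        run_initiator VARCHAR(40),
        run_purpose   VARCHAR(200),
        run_remarks   VARCHAR(400)
    );
    CREATE TABLE parameter_dim (
        paramset_id INTEGER,
        param_name  VARCHAR(40),
        param_value VARCHAR(80)
    );

    -- Central fact table: one row per sequence per clustering run.
    CREATE TABLE membership_facts (
        seq_id          VARCHAR(20) NOT NULL REFERENCES sequence_dim,
        run_id          VARCHAR(40) NOT NULL REFERENCES run_dim,
        paramset_id     INTEGER     NOT NULL,
        cluster_id      INTEGER     NOT NULL,
        subcluster_id   INTEGER     NOT NULL,
        seq_start       INTEGER,
        seq_end         INTEGER,
        seq_orientation CHAR(1),
        cluster_size    INTEGER,
        subcluster_size INTEGER,
        PRIMARY KEY (seq_id, run_id)
    );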
Dimensional Modeling -
Strengths
• Predictable, standard framework allows database systems and
end user query tools to make strong assumptions about the data
• Star schemas withstand unexpected changes in user behavior --
every dimension is equivalent: symmetrically equal entry points
into the fact table.
• Gracefully extensible to accommodate unexpected new data
elements and design decisions
• High performance, optimized for analytical queries
The Need for Standards
In order for any integration effort to be
successful, there needs to be agreement on
certain topics:
• Ontologies: concepts, objects, and their
relationships
• Object models: how the ontologies are represented as objects
• Data models: how the objects and data are stored
persistently
Standard Bio-Ontologies
Currently, there are efforts being undertaken
to help identify a practical set of technologies
that will aid in the knowledge management
and exchange of concepts and representations
in the life sciences.
GO Consortium: http://genome-www.stanford.edu/GO/
The third annual Bio-Ontologies meeting is
being held after ISMB 2000 on August 24th.
Standard Object Models
Currently, there is an effort being undertaken
to develop object models for the different
domains in the Life Sciences. This is primarily
being done by the Life Science Research
(LSR) working group within the OMG (Object
Management Group). Please see their
homepage for further details:
http://www.omg.org/homepages/lsr/index.html
In Conclusion
• Data integration is the problem to solve to support human and
computer discovery in the Life Sciences.
• There are a number of approaches one can take to achieve data
integration.
• Each approach has advantages and disadvantages associated
with it. Particular problem spaces require particular solutions.
• Regardless of the approach, Metadata is a critical component for
any integrated repository.
• Many technologies exist to support integration.
• Technologies do nothing without syntactic and semantic
standards.
Accessing Integrated Data
Once you have an integrated repository of
information, access tools enable future
experimental design and discovery. They can
be categorized into four types:
– browsing tools
– query tools
– visualization tools
– mining tools
Browsing
One of the most critical yet most commonly overlooked requirements is the ability to “browse” the integrated repository, since users typically do not know what is in it and are not familiar with other investigators’ projects. Requirements include:
• ability to view summary data
• ability to view high level descriptive information on
a variety of objects (projects, genes, tissues, etc.)
• ability to dynamically build queries while browsing
(using a wizard or drag and drop mechanism)
Querying
Along with browsing, retrieving the data from the repository
is one of the most underdeveloped areas in bioinformatics.
All of the visualization tools that are currently available are
great at visualizing data. But if users cannot get their data into
these tools, how useful are they? Requirements include:
• ability to intelligently help the user build ad-hoc queries
(wizard paradigm, dynamic filtering of values)
• provide a “power user” interface for analysts (query
templates with the ability to edit the actual SQL)
• should allow users to iterate over the queries so they do not
have to build them from scratch each time
• should be tightly integrated with the browser to allow for
easier query construction
Visualizing
There are a number of visualization tools currently
available to help investigators analyze their data.
Some are easier to use than others and some are
better suited for either smaller or larger data sets.
Regardless, they should all:
• be easy to use
• save templates which can be used in future visualizations
• view different slices of the data simultaneously
• apply complex statistical rules and algorithms to the data
to help elucidate associations and relationships
Data Mining
Life science has large volumes of data that, in their rawest form, are not easy to use to drive new experimentation. Ideally, one would like to automate data mining tools to extract “information” by letting them take advantage of a predictable database architecture. This is more easily attainable with dimensional modeling (star schemas) than with E-R modeling, since E-R schemas vary greatly from database to database and do not conform to any standard architecture.
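A sketch of what a predictable architecture buys a mining tool: knowing only that facts sit in the middle and dimensions hang off it, the tool can generate dimension-constrained aggregates mechanically. All names below are hypothetical.

    -- A tool that knows the star shape can emit queries like this for any fact
    -- table and any pair of dimensions, with no per-database custom code.
    SELECT s.tissue, g.gene_name,
           AVG(f.fold_change),
           COUNT(*)
    FROM expression_fact f, sample_dim s, gene_dim g
    WHERE f.sample_key = s.sample_key
      AND f.gene_key = g.gene_key
    GROUP BY s.tissue, g.gene_name;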
Database Schemas for 3 Independent Genomics Systems
[Figure: three independently designed E-R schemas.
Homology Data: SEQUENCE, SEQUENCE_DATABASE, ORGANISM, ALIGNMENT, ALGORITHM, SCORE, QUALIFIER, MAP_POSITION, PARAMETER_SET.
Gene Expression: GE_RESULTS, PARAMETER_SET, RNA_SOURCE, TREATMENT, CELL_LINE, TISSUE, DISEASE, GENOTYPE, CHIP, ANALYSIS.
SNP Data: ALLELE, SNP_FREQUENCY, SNP_METHOD, SNP_POPULATION, STS_SOURCE, PCR_PROTOCOL, PCR_BUFFER, LINKAGE.
Each system has its own keys and models overlapping concepts (sequence, parameter sets, species) differently.]
The Warehouse
Three star schemas of heterogeneous data joined through a conformed dimension
[Figure: three fact tables share the conformed SEQUENCE dimension (Sequence_Key; Sequence, Seq_Type, Seq_ID, Seq_Database, Map_Position, Species, Gene_Name, Description, Qualifier).
GENE_EXPRESSION_RESULT (RNA_Source_Key_Exp, RNA_Source_Key_Bas, Sequence_Key, Parameter_Set_Key; Expression_Level_Exp, Expression_Level_Bas, Absent_Present_Exp, Absent_Present_Bas, Analysis_Decision, Chip_Type, Fold_Change) with RNA_SOURCE (Treatment, Disease, Tissue, Cell_Line, Genotype, Species) and GENECHIP_PARAMETER_SET dimensions.
SEQUENCE_HOMOLOGY_RESULT (Query_Sequence_Key, Target_Sequence_Key, Parameter_Set_Key, Database_Key; Score, P_Value, Alignment, Percent_Homology) with HOMOLOGY_PARAMETER_SET (Algorithm_Name) and HOMOLOGY_DATABASE (Seq_DB_Name, Species, Last_Updated) dimensions.
SNP_RESULT (Sequence_Key, STS_Source_Key, STS_Protocol_Key; Allele_Frequency, Sample_Size, Allele_Name, Base_Change, Disease_Link, Linkage_Distance) with STS_SOURCE and STS_PROTOCOL (PCR_Protocol, PCR_Buffer) dimensions.]
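A sketch of the payoff of the conformed dimension: one query can cross all three domains because each fact table carries a foreign key into the same SEQUENCE table. Table and key names follow the figure; the predicates are illustrative.

    -- Cross-domain query through the conformed SEQUENCE dimension:
    -- sequences showing >3-fold expression change that also carry a disease-linked SNP.
    SELECT DISTINCT s.Seq_ID, s.Gene_Name
    FROM SEQUENCE s, GENE_EXPRESSION_RESULT ge, SNP_RESULT snp
    WHERE ge.Sequence_Key = s.Sequence_Key
      AND snp.Sequence_Key = s.Sequence_Key
      AND ge.Fold_Change > 3
      AND snp.Disease_Link IS NOT NULL;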