Survey On ETL Processes
Journal of Theoretical and Applied Information Technology
20th August 2013. Vol. 54 No.2
© 2005 - 2013 JATIT & LLS. All rights reserved.
• DW: a central repository that stores the data produced by the ETL layer. The DW is a database containing fact tables as well as dimension tables. Together, these tables are combined in a specific schema, which may be a star schema or a snowflake schema.
• Reporting and Analysis layer: its mission is to capture end-user requests and translate them into queries over the DW. Collected data are served to end users in several formats; for example, data may be formatted into reports, histograms, dashboards, etc.

ETL is a critical component of the DW environment. Indeed, it is widely recognized that building ETL processes during a DW project is expensive in terms of time and money: ETL consumes up to 70% of project resources [3], [5], [4], [2]. Interestingly, [2] reports and analyzes a set of studies confirming this fact. On the other hand, it is also well known that the accuracy and correctness of data, which are part of ETL's responsibility, are key factors in the success or failure of DW projects.

Given the importance of ETL expressed above, the next section presents ETL's missions and responsibilities.

1.2 ETL Mission
As its name indicates, ETL performs three operations (also called steps), which are Extraction, Transformation and Loading. The upper part of figure 2 shows an example of an ETL process built with the Talend open source tool. The second part of figure 2 shows the palette of the Talend tool, that is, the set of components used to build ETL processes. In what follows, we present each ETL step separately.

The Extraction step faces the problem of acquiring data from a set of sources, which may be local or remote. Logically, data sources come from operational applications, but there is also the option of using external data sources for enrichment; an external data source means data coming from external entities. Thus, during the extraction step, ETL tries to access the available sources, pull out the relevant data, and reformat it into a specified format. This step comes at the cost of source instability as well as source diversity. Finally, according to figure 2, this step is performed over the sources.

The Loading step, conversely to the previous step, faces the problem of storing data into a set of targets. During this step, ETL loads data into the targets, which, in a DW context, are fact and dimension tables. However, intermediate results may be written to temporary data stores. The main challenges of the loading step are to access the available targets and to write the outcome data (transformed and integrated data) into them. This step can be very time-consuming due to the indexing and partitioning techniques used in data warehouses [6]. Finally, according to figure 2, this step is performed over the targets.

The Transformation step is the most laborious one, and it is where ETL adds value [3]. Indeed, during this step ETL carries out the logic of the business process, instantiated as business rules (sometimes called mapping rules), and deals with all types of conflicts (syntactic conflicts, semantic conflicts, etc.). This step is associated with two words: clean and conform. On the one hand, cleaning data aims to fix erroneous data and to deliver clean data to end users (decision makers); dealing with missing data and rejecting bad data are examples of data cleaning operations. On the other hand, conforming data aims to make data correct and compatible with other master data; checking business rules, checking keys and looking up referential data are examples of conforming operations. At the technical level, in order to perform this step, ETL should supply a set of data transformations or operators such as filter, sort, inner join, outer join, etc. Finally, this step involves flow schema management, because the structure of the processed data changes step by step, as attributes are added or removed.

We refer interested readers to [3] and [5] for more details and explanations of each step.

Figure 2: ETL Example and Environment with Talend Open Source
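The clean-and-conform logic of the transformation step can be sketched in a few lines of plain Python, independently of any ETL tool. The rules and the country lookup table below are hypothetical examples, not taken from any cited work:

```python
# Illustrative clean-and-conform step: reject rows with missing data,
# then conform a free-text attribute against reference (master) data.
# The rules and the lookup table are hypothetical examples.

country_lookup = {"FRANCE": "FR", "MOROCCO": "MA", "GERMANY": "DE"}

def clean_and_conform(rows):
    clean, rejected = [], []
    for row in rows:
        # Cleaning: reject records with a missing business key.
        if not row.get("customer_id"):
            rejected.append(row)
            continue
        # Cleaning: fix erroneous data (stray whitespace, inconsistent case).
        name = row.get("country", "").strip().upper()
        # Conforming: look up referential data; unknown values are rejected.
        code = country_lookup.get(name)
        if code is None:
            rejected.append(row)
            continue
        clean.append({"customer_id": row["customer_id"], "country": code})
    return clean, rejected

clean, rejected = clean_and_conform([
    {"customer_id": "C1", "country": " france "},
    {"customer_id": None, "country": "Morocco"},
    {"customer_id": "C3", "country": "Atlantis"},
])
# clean    -> [{'customer_id': 'C1', 'country': 'FR'}]
# rejected -> the row with the missing key and the row with the unknown country
```

A real ETL tool wraps exactly this kind of logic in reusable operators (filter, lookup, reject flow) and routes the rejected rows to an error target instead of discarding them.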
The remainder of this paper is organized as follows. In section 2, we review some open source and some commercial ETL tools, along with some ETL prototypes coming from the academic world. Section 3 deals with research works in the field of ETL modeling and design, while section 4 approaches the ETL maintenance issue. Section 5 covers ETL optimization and incremental ETL, and we then present and outline challenges and research opportunities around ETL processes. We conclude and present our future work in section 6.

2. ETL TOOLS AND RESEARCH PROTOTYPES

In Section 1, we presented the technique of materialized views, originally created to refresh the DW. In this section, we review ETL tools, with examples of commercial tools as well as open source tools.

2.1 Commercial ETL Tools
A variety of commercial tools crowds the ETL market, which is a promising market. A study [7] conducted by TDWI identifies a list of indicators and criteria for their comparison and evaluation. Each commercial ETL tool adopts its own notation and its own formalism; consequently, metadata are not exchangeable between these tools. Among their commonalities, in contrast, they all offer a graphical language for the implementation of ETL processes.

We distinguish two subfamilies in the commercial ETL field. On the one hand, there is the subfamily of paid ETL tools, such as DataStage [8] and Informatica [9]. On the other hand, the second subfamily of commercial ETL comes at no charge; in fact, these tools are free under certain conditions. Indeed, although ETL licenses are expensive, major DBMS (Database Management System) editors like Oracle and Microsoft offer their ETL solution freely with each DBMS license purchased. In other words, the ETL solution is included in the DBMS package license. In the following, we present an example of each subfamily.

DataStage [8] is the ETL solution of the IBM editor. Its basic element for data manipulation is called a "stage"; thus, for this tool, an ETL process is a combination of stages. We therefore speak of transformation stages and stages for extracting and loading data (called connectors since release 8), which are interconnected via links. The IBM solution DataStage provides two other tools, Manager and Administrator, designed respectively for supervising the execution of ETL processes and for dealing with ETL project configuration. It should also be noted that IBM offers two versions of its ETL solution: the DataStage SERVER version and the DataStage PX version. The latter has the advantage of managing partitioning in data processing. Finally, DataStage generates OSH code from the ETL job built by placing stages.

SSIS stands for SQL Server Integration Services. This is the ETL solution of the Microsoft editor [10]. As mentioned above, SSIS is free with any SQL SERVER DBMS license, which also includes two extra tools, SSRS and SSAS (for reporting and OLAP analysis, respectively). The atomic or basic element in SSIS is called a "task"; thus, for this tool, an ETL process is a combination of tasks. More precisely, SSIS imposes two levels of task combination. The first level is called "Control Flow", and the second level, controlled by the first, is called "Data Flow". Indeed, the first level is dedicated to preparing the execution environment (deleting, controlling and moving files, etc.) and supplies tasks for this purpose. The second level (data flow), which is a particular task of the first level, performs the classical ETL mission. Thus, the Data Flow task offers various tasks for data extraction, transformation and loading.

In conclusion of this section, we have presented commercial ETL tools, along with some examples of them. In the next section, we present another category of ETL: open source tools.

2.2 Open Source ETL
Some open source tools are leaders in their area of interest; for example, Linux in the operating system area and the Apache server in the web server area. But the trend is not the same for open source business intelligence (BI) tools: they are less used in the industrial world [11].

To understand this restricted use of open source BI tools, Thomsen and Pedersen study this fact in [11]. In this perspective, the authors start by surveying the BI tools available as open source, illustrating the features of each tool. Then the criteria adopted in the study are defined. Thus, in the ETL scope, the criteria taken into account are:

• The ROLAP or MOLAP aspect of the tool, that is, whether the tool can load relational databases or multidimensional cubes.
• The incremental mode of the tool, that is, the ability to load modified or newly created data.
• The manner of using the tool: via a graphical interface, via an XML configuration file, etc.
• The richness of the offered features.
• The parallelism or partitioning useful for processing massive data.

In summary, ten open source ETL tools are presented, and the study concludes that the most notable are Talend and Kettle, because of their large user communities, their literature, their features and their inclusion in BI suites. Finally, let us note that another version of this study is available in [12].

In the following, we take a closer look at one of these tools. More precisely, we limit ourselves to the Talend tool.

Talend is an open source editor offering a wide range of middleware solutions that meet the needs of data management and application integration [13]. Among these solutions there are two for data integration: Talend Open Studio for Data Integration and Talend Enterprise Data Integration. They are, respectively, a free open source development tool and a non-free development tool that comes with advanced deployment and management features. The Talend data integration solution is based on the following main modules:

1. Modeler, a tool for creating diagrams. To this end, it offers a range of components independent of any technology.
2. Metadata Manager, a repository for storing and managing metadata.
3. Job Designer, a graphical development environment for creating ETL jobs. This tool in turn provides a rich palette of components for data extraction, transformation and loading.

From the point of view of a developer, the implementation of an ETL process with Talend Open Studio for Data Integration consists of inserting components from the palette offered by Job Designer (more than 400 components). From the point of view of a designer, the use of Talend consists of modeling with the Modeler module. But the link between these two levels is not available; in other words, the transition from ETL diagrams built with Modeler to ETL jobs written with Job Designer is not offered.

In the next section, we see another type of ETL tools: those coming from the research world.

2.3 Frameworks and Tools from Research
In this section, we review some projects from the academic world dealing with the problem of ETL. These projects are SIRIUS, ARKTOS, DWPP and PYGRAMETL.

SIRIUS (Supporting the Incremental Refreshment of Information Warehouses) is a project developed at the information technology department of Zurich University. SIRIUS [14] looks at the refreshment of data warehouses. It develops a metadata-oriented approach that allows the modeling and execution of ETL processes. It is based on the SIRIUS meta-model component, which represents metadata describing the operators or features necessary for implementing ETL processes. In other words, SIRIUS provides functions to describe the sources, the targets and the mapping between these two parts. In the end, this description leads to the generation of a global schema. At the execution level, SIRIUS generates Java code ready for execution after a successful compilation. Finally, SIRIUS includes techniques for detecting changes in the operational sources (databases, precisely), as well as supervision support for the execution of the defined ETL processes.

ARKTOS is another framework that focuses on the modeling and execution of ETL processes. Indeed, ARKTOS provides primitives to capture frequently used ETL tasks [15]. More exactly, to describe a certain ETL process, this framework offers three ways: a GUI and two languages, XADL (an XML variant) and SADL (an SQL-like language). An extension of this work is available in [16], where the authors discuss the meta-model of their prototype. The basic element in ARKTOS is called an activity, whose logic is described by an SQL query. At the execution level, ARKTOS manages the execution of each activity and associates it with the action to take when an error occurs. Six types of errors are taken into account. The first three are 1) violation of the primary key, 2) violation of uniqueness and 3) violation of reference; these three types are special cases of integrity constraint violations. The fourth type of error is 4) NULL EXISTENCE, for the elimination of missing values. Finally, the two remaining error types are 5) FIELD MISMATCH and 6) FORMAT MISMATCH, related respectively to domain errors and data format errors.

PYGRAMETL (a Python based framework) is a programming framework for ETL
processes [17]. In this work, the authors attempt to show that ETL development based on a graphical tool is less efficient than conventional ETL development (code editing). In other words, instead of using a graphical palette of components and then proceeding by inserting and customizing the elements of this palette, it is more appropriate to write code and create scripts via a programming language. In this perspective, PYGRAMETL proposes coding with the Python language. It also offers a set of functions commonly used in the context of populating data warehouses, such as filtering data and feeding slowly changing dimensions. Finally, the authors of PYGRAMETL present, as an illustration, a comparison between PYGRAMETL (their tool) and PENTAHO, an open source ETL tool.

Another interesting work that touches on ETL is available in [18]. This work, which focuses on ETL performance, a fundamental aspect of ETL among others, presents DWPP (Data Warehouse Population Platform). DWPP is a set of modules designed to solve the typical problems that occur in any ETL project. DWPP is not a tool but a platform: precisely, it is a C function library shared under the UNIX operating system for the implementation of ETL processes. Consequently, DWPP provides a set of useful features for data manipulation. These features are structured in two levels: the first level is based on the features of the operating system and those of the Oracle DBMS (the target chosen for DWPP), while the second level groups developed features that are specific and useful in ETL processing. Finally, as this work comes from feedback on the development and deployment of DWPP-based large-scale ETL projects, it does not address the variety of sources or targets; indeed, DWPP is limited to flat files as sources and the Oracle DBMS as target. Another work associated with DWPP is available in [6], where the authors give more details on the functional and technical architecture of their solution. In addition, the authors discuss some of the factors impacting the performance of ETL processes. Thus, they address the physical implementation of the database, with a focus on data partitioning as well as pipelining and parallelism. These aspects are addressed using UNIX operating system functionalities.

In this section, we have presented open source ETL tools as well as commercial tools, along with some prototypes from the academic world. In the next section, we deal with ETL design and modeling and present the works related to these issues.

3. ETL MODELING AND DESIGN

Research in the data warehouse field is dominated by DW design and DW modeling, and the ETL field is no exception to this rule. Indeed, in the literature one can note several research efforts that treat DW population performed by ETL processes. ETL processes are areas with high added value, yet labeled costly and risky. In addition, software engineering states that any project is doomed to switch to maintenance mode. For these reasons, it is essential to get through the ETL modeling phase with elegance, in order to produce simple and understandable models. Finally, as noted by Booch et al. in [19], we model a system in order to:

• Express its structure and behavior.
• Understand the system.
• View and control the system.
• Manage the risk.

In the following, we go through research works associated with ETL design.

3.1 Meta-model Based Works on ETL Design
During the DOLAP 2002 conference, Vassiliadis et al. [20] presented an attempt at conceptual modeling of ETL processes. The authors based their argument on the following two points: at the beginning of a BI project, the designer needs to

1. analyze the structure and the content of the sources;
2. define mapping rules between sources and targets.

The proposed model, based on a meta-model, provides a graphical notation to meet this need. Also, a range of activities commonly used by the designer is introduced.

The model in question continues by defining two concepts: the Concept and the Provider Relationship. Respectively, they are a source or a target with its attributes, and the relationships between attributes (fields of a source or target). Transformations are dissociated from mapping rules (Provider Relationships). Indeed, in order to transform attributes, for instance by removing spaces or concatenating attributes, the Provider Relationship uses
an extra node (an element of the model). Since it may not be known which source should be prioritized when extracting data, the model introduces the candidate relationship to designate all the sources likely to participate in DW population; the selected source is denoted the active relationship.

The authors complement their model in an extended paper [21] by proposing a method for the design of ETL processes. Indeed, in this work the authors expose a method to establish a mapping between the sources and the targets of an ETL process. This method is spread over four steps:

1. Identification of sources.
2. Distinction between candidate sources and active sources.
3. Attribute mapping.
4. Annotation of the diagram (conceptual model) with execution constraints.

Work around this model is reinforced by an attempt to transition from the conceptual model to a logical model. This is the subject of the paper [22], where the authors define a list of steps to follow to ensure this transition. In addition, this work proposes an algorithm that groups transformations and controls (at the conceptual level) into stages that are logical activities. Finally, a semi-automatic method determining the execution order of the logical activities is defined as well.

Another work around ETL presents KANTARA, a framework for modeling and managing ETL processes [4]. This work exposes the different participants that an ETL project involves, particularly the designer and the developer. After analyzing the interaction between the main participants, the authors conclude that the designer needs a helpful tool that eases the design and maintenance of ETL processes. In response, the authors introduce a new graphical notation based on a meta-model. It consists mainly of three core components, namely the Extract, Transform and Load components, chosen according to the ETL steps. Besides, each component manages a set of specific metadata close to its associated step. This work is consolidated by another work presenting a method for modeling and organizing ETL processes [23]. The authors start by showing the functional modules that should be distinguished in each ETL process. This leads to distinguishing several modules, especially the Reject Management module, for which a set of metadata to manage is defined. The proposed approach takes five inputs (namely mapping rules, conforming rules, cleaning rules, and specific rules) and produces a conceptual model of an ETL process using the graphical notation presented previously.

3.2 UML Based Works
In 2003, Trujillo [24] proposed a new UML-based approach for the design of ETL processes. Finding that the model of Vassiliadis [20] is based on an ad hoc method, the authors attempt to simplify their model and to base it on a standard tool. In this model, an ETL process is considered as a class diagram. More precisely, a basic ETL activity, which can be seen as a component, is associated with a UML class, and the interconnection between classes is defined by UML dependencies. Finally, the authors decided to restrict their model to ten types of popular ETL activities, such as Aggregation, Conversion, Filter and Join, which in turn are associated with graphical icons.

In 2009 (DOLAP 2009), Munoz et al. modernized this model through a new publication [25] dealing with the automatic generation of code for ETL processes from their conceptual models. In fact, the authors presented an MDA-oriented solution, structured as follows:

• For the PIM (Platform Independent Model) layer, which corresponds to the conceptual level, ETL models are designed using the UML formalism, more precisely the result of the previous effort.
• For the PSM (Platform Specific Model) layer, which corresponds to the logical level, the chosen platform is the Oracle platform.
• The Transformation layer, which allows the transition from the PIM model to the PSM model, is built with the QVT (Query View Transformation) language. These transformations bind PIM meta-model elements to PSM meta-model elements.

Another research team presents in [26] another UML-based research effort about ETL. But this work is restricted to the extraction phase, omitting the transformation and loading phases. Thus, this work presents an object-oriented approach for modeling the extraction phase of an ETL process using UML 2. To this end, the authors present and illustrate the mechanism of this phase through three diagrams: the class diagram, the sequence diagram and the use case diagram of the extraction phase. Finally, six classes, which are data staging area, data sources, wrappers, monitors, integrator and source identifier, are shown and used in the above diagrams.
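To give the flavor of the class-per-activity idea used in the UML-based models, the following sketch renders it in plain Python rather than UML; the activity names echo those cited above, but the interface and the example data are hypothetical, not taken from [24]:

```python
# Sketch of the "one class per ETL activity" idea: each activity is a
# component with a uniform interface, and a process is a chain of
# interconnected activities. Names and data are illustrative only.

class Activity:
    def run(self, rows):
        raise NotImplementedError

class Filter(Activity):
    def __init__(self, predicate):
        self.predicate = predicate
    def run(self, rows):
        return [r for r in rows if self.predicate(r)]

class Conversion(Activity):
    def __init__(self, field, func):
        self.field, self.func = field, func
    def run(self, rows):
        return [{**r, self.field: self.func(r[self.field])} for r in rows]

class Aggregation(Activity):
    def __init__(self, key, measure):
        self.key, self.measure = key, measure
    def run(self, rows):
        totals = {}
        for r in rows:
            totals[r[self.key]] = totals.get(r[self.key], 0) + r[self.measure]
        return [{self.key: k, "total": v} for k, v in totals.items()]

# A process is the composition (the "dependencies") of its activities.
process = [Filter(lambda r: r["amount"] > 0),
           Conversion("amount", float),
           Aggregation("region", "amount")]

rows = [{"region": "EU", "amount": 10}, {"region": "EU", "amount": 5},
        {"region": "US", "amount": -3}, {"region": "US", "amount": 7}]
for activity in process:
    rows = activity.run(rows)
# rows -> [{'region': 'EU', 'total': 15.0}, {'region': 'US', 'total': 7.0}]
```

The uniform `run` interface is what makes the graphical composition possible: any activity can feed any other, which is exactly what the UML dependencies between activity classes express.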
3.3 Web Technologies Based Works
The advent of the web has disrupted information systems, and ETL is in turn affected by this wave. Indeed, semantic web technologies are used in the development of data warehouse refreshment processes. Thus, in 2006, Skoutas et al. presented an ontology-based approach for the design of ETL processes [27]. Applied to a given domain whose concepts are identified, OWL (Web Ontology Language) is used to construct the domain ontology in question. The constructed OWL ontology is then exploited to annotate and mark the schemas of both the sources and the DW (target). Finally, a reasoning technique is put into action to infer the connections and conflicts between the two extremes (between sources and DW). This approach is semi-automatic and yields a part of the elements necessary for the design of the ETL, namely the mapping rules and the transformations between source and target attributes. However, this work suffers from the two following drawbacks:

1. The scope of the data is limited to relational sources.
2. Ontology construction is not efficient.

Therefore, the same authors enhanced their proposal in 2007 via an extension of their work [28]. The scope of the data handled is expanded to include structured and unstructured data. Finally, a prototype of the solution is implemented in Java.

It is clear that the contribution of this approach is relevant, but it is still insufficient. Indeed, a conceptual model is not limited to the mapping rules; there is also (among other things) the data flow to manage, which consists of the orchestration and arrangement of activities. Finally, let us note that the accuracy of the results of applying this approach is closely related to the degree of matching between the ontology and the schemas of the sources and the DW. Furthermore, in reality, the schemas of the sources are often not expressive, or even non-existent [3].

Another approach based on web technologies, more precisely on the marriage between BPMN (Business Process Modeling Notation) and BPEL (Business Process Execution Language), is presented in [29]. This work aims to align with the MDA standard: indeed, it proposes to use BPMN for the PIM layer and BPEL for the PSM layer. Nevertheless, the transformation layer is not well developed, even though the transition from BPMN to BPEL is mentioned; in fact, this work considers that the mapping between BPMN and BPEL is established and provided by multiple tools. Finally, this effort provides an extensible palette of useful features for ETL design using the BPMN notation.

Another interesting work, related to the design of ETL based on RDF (Resource Description Framework) and OWL, two web technologies, is detailed in [30]. This work has the advantage of presenting a method spanning several steps, ranging from the creation of models for ROLAP cubes to the creation of the ETL layer. The basic idea behind this work is to convert the data sources to an RDF file format that is consistent with an OLAP ontology, which is constructed in turn. Then the target tables are populated with data extracted via queries generated over the OLAP ontology.

The scientific community has enriched the field of ETL modeling with the different approaches presented above. These proposals differ in the formalism and the technology used, but they share the same drawback: the lack of support and approaches for change management in ETL. In the next section, we cover this issue.

4. MAINTENANCE OF ETL PROCESSES

An ETL process can be subject to changes for several reasons: for instance, data source changes, new requirements, bug fixes, etc. When changes happen, analyzing the impact of the change is mandatory in order to avoid errors and mitigate the risk of breaking existing treatments.

Generally, change is neglected, although it is a fundamental aspect of information systems and databases [31]. Often, the focus is on building and running systems; less attention is paid to easing the management of change in systems [32]. As a consequence, without a helpful tool and an effective approach for change management, the cost of the maintenance task will be high, particularly for ETL processes, previously judged expensive and costly.

The research community has caught this need and supplies, in response, a few solutions. One can start with a general entry point to the change issue, [31], a report on the management of evolutions and data changes. Indeed, the authors summarize the problems associated with this issue as well as a categorization of these problems. Finally, the authors discuss the change issue according to six
aspects: What, Why, Where, When, Who and How.

Regarding the data warehouse layer, change can occur at two levels: either in the schemas or in the data stored since the first load of the data warehouse. Managing data evolution over time, contrary to schema evolution, is a basic mission of the DW. Consequently, research efforts in DW evolution and change are oriented toward schema versioning. In this perspective, the authors of [33] present an approach to schema versioning in DWs. Based on a graph formalism, they represent schema parts as a graph and define an algebra to derive new DW schemas, given a change event. The formulation of queries invoking multiple schema versions is sketched. The same authors rework their proposal in [34] by further investigating data migration. Finally, X-Time [35] is a prototype resulting from these efforts.

Using ETL terminology, the previous research efforts focus on the target, unlike the proposal of [36], which focuses on changes in the sources. In this work, the authors consider the ETL process as a set of activities (a kind of component). Thus, they represent the ETL parts as graphs, which are annotated (by the designer) with the actions to perform when a change event occurs. An algorithm to rehabilitate the graph, given a change event in the sources, is provided as well. However, this approach is difficult to implement because of the enormous amount of additional information required in nontrivial cases [37]. The authors extend their work in [38] by detailing the above algorithm for graph adaptation. The architecture of a prototype solution, modular and centralized around the Evolution Manager component, is introduced as well. This prototype is called HECATAEUS [39] and aims to enable the regulation of schema evolution in relational databases. In other words, this approach does not take into account other kinds of data stores or sources, such as flat files.

Another approach dealing with change management in ETL is available in [32]. In this paper, the authors present a matrix-based approach for handling change impact analysis in ETL processes. Indeed, the ETL parts are represented as matrices, and a new matrix, called the K matrix, is derived by applying multiplication operations. The authors also expose how this K matrix summarizes the relationship between the input fields and the output fields, and how it synthesizes the attribute dependencies. In particular, the K matrix makes it possible to know "which attributes are involved in the population of a certain attribute" and which attributes are the "customers" of a given one. In addition, an algorithm to detect the affected part of an ETL process when a deletion change event occurs, either in the sources, in the targets or inside the ETL, is presented. Finally, this matrix-based approach constitutes a submodule of a whole solution, a framework called KANTARA.

These proposals dealing with change management in ETL are interesting and offer a solution to detect the impact of changes on ETL processes. However, change absorption is not addressed. In the next section, we present research works taking the performance aspect into account.

5. OPTIMIZATION AND INCREMENTAL ETL

ETL feeds the DW with data. In this mechanism, and depending on the context, ETL performance can be a critical fact. For example, [2] reports a study on mobile network traffic data where a tight time window is allowed for ETL to perform its missions (4 hours to process about 2 TB of data, where the main target fact table contains about 3 billion records).

In situations like the one described above, ETL optimization is much appreciated. Concerning this issue at the research level, unfortunately, works and proposals are few. The first work dealing with this issue treats the logical optimization of ETL processes [40]. In this proposal, the authors model the problem of ETL optimization, at the logical level, as a state space search problem. In particular, an ETL process is conceived as a state, and a state space is generated via a set of defined transitions. The approach is independent of any cost model; however, the discrimination criterion for choosing the optimal state is based on the total cost. The total cost of a state is obtained by summing the costs of all its activities.

Another solution to achieve performance consists of extracting and processing only modified or new data. This is called the incremental mode of ETL processes. More precisely, this style of ETL is suitable for contexts where the demand for fresh data from end users is very pressing.

By definition, incremental ETL faces two challenges:

1. Detecting modified or new data at the sources level.
2. Integrating the data detected in the previous step.
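The first of these two challenges is commonly addressed with a high-water-mark scan over a modification timestamp. The following minimal sketch illustrates the idea with hypothetical column names; it is a generic illustration, not the formalization of [41]:

```python
# Minimal sketch of change detection for incremental ETL: only the rows
# modified after the last successful run (the "high-water mark") are
# extracted. Column names are hypothetical; ISO date strings compare
# correctly as plain strings.

def extract_incremental(rows, high_water_mark):
    """Return the rows changed since the last run, plus the new mark."""
    changed = [r for r in rows if r["updated_at"] > high_water_mark]
    new_mark = max((r["updated_at"] for r in changed), default=high_water_mark)
    return changed, new_mark

source = [
    {"id": 1, "updated_at": "2013-08-01"},
    {"id": 2, "updated_at": "2013-08-10"},
    {"id": 3, "updated_at": "2013-08-15"},
]

# First incremental run after a full load taken on 2013-08-05:
changed, mark = extract_incremental(source, "2013-08-05")
# changed -> rows 2 and 3; mark -> '2013-08-15'
```

In practice, the mark must be persisted between runs, and sources without a reliable modification timestamp require other detection techniques (snapshot comparison, triggers, or log scanning), which is precisely why this first challenge is non-trivial.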
Incremental ETL is associated with near real-time ETL. On one hand, the historical and business contexts of this category of ETL are sketched in [3], where the authors motivate and explain why this new category of ETL emerged. On the other hand, the authors of [41] formalize the problem of incremental ETL under the following assumption: for a given target, there are two types of ETL jobs feeding it, namely an initial ETL job for the first load, and a second ETL job for the recurrent loads in incremental mode. This work presents an approach to derive the incremental ETL jobs (second type) from the first one (initial load) based on equational reasoning. Transformation rules for this purpose are defined too. As input, they take an ETL job as an expression E described according to a defined grammar G. As output, four ETL expressions, E_ins, E_del, E_un and E_uo, are produced, dealing with the change events occurring at the sources (insertion, deletion, update, etc.). Another work dealing with incremental ETL is available in [42], where the authors present an approach, based on an existing method for the incremental maintenance of materialized views, to automatically obtain incremental ETL processes from existing ones.

Both approaches presented above require the existence of initial ETL processes to transform into incremental mode. They are therefore suitable for existing ETL projects wishing to migrate and profit from the incremental mode, but not for new ETL projects starting from scratch.

All the previous sections review the literature in the ETL field. In the next section we present the main research opportunities in ETL processes.

6. RESEARCH OPPORTUNITIES

• Business rules: we consider that, in order to use business rules efficiently in ETL, it is desirable to express them in a standard language compatible with ETL constraints and aspects.

• Big Data and ETL: Big data technologies arrive with exciting research opportunities. In particular, the performance issue seems solvable with this novelty. Thus, works and proposals using these technologies while taking into account ETL specificities (partitioning, data transformation operations, orchestration, etc.) are desirable.

• Testing: Tests are a fundamental aspect of software engineering. In spite of this importance, they are neglected where ETL is concerned. Thus, an automatic, or even semi-automatic, approach for validating ETL processes or obtaining test data is very desirable.

• Unstructured data and metadata: These two topics are not specific to ETL processes; they are issues common to the whole data integration area. Thus, they are open challenges which can also be addressed in the ETL context.

• Change absorption: As said in the previous sections, only a few approaches handling the impact of changes on ETL exist. It is even more challenging to automatically, or semi-automatically, absorb changes once they are detected. In other words, an approach is needed to adapt running ETL jobs according to changes occurring in the sources, in the targets or in the business rules (transformation rules).

7. CONCLUSION
We have covered modeling and design works in the ETL field; several works using different formalisms and technologies, such as UML and web technologies, were reviewed. This survey then approached the ETL maintenance issue: after defining the problem, we reviewed works dealing with changes in ETL processes using either a graph formalism or a matrix formalism. Before concluding, we gave an illustration of the performance issue along with a review of some works dealing with it, in particular ETL optimization and incremental ETL. Finally, this survey ends with a presentation of the main challenges and research opportunities around ETL processes.

At the end of this survey, and as a conclusion, we believe that research in the ETL area is not dead but very much alive. Each issue addressed above remains open to review and investigation.

REFERENCES:

[1] W. Inmon, D. Strauss and G. Neushloss, "DW 2.0: The Architecture for the Next Generation of Data Warehousing", Morgan Kaufmann, 2007.

[2] A. Simitsis, P. Vassiliadis, S. Skiadopoulos and T. Sellis, "Data Warehouse Refreshment", Data Warehouses and OLAP: Concepts, Architectures and Solutions, IRM Press, 2007, pp 111-134.

[3] R. Kimball and J. Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data", Wiley Publishing, Inc., 2004.

[4] A. Kabiri, F. Wadjinny and D. Chiadmi, "Towards a Framework for Conceptual Modelling of ETL Processes", Proceedings of the First International Conference on Innovative Computing Technology (INCT 2011), Communications in Computer and Information Science, Volume 241, pp 146-160.

[5] P. Vassiliadis and A. Simitsis, "Extraction, Transformation, and Loading", http://www.cs.uoi.gr/~pvassil/publications/2009_DB_encyclopedia/Extract-Transform-Load.pdf

[6] J. Adzic, V. Fiore and L. Sisto, "Extraction, Transformation, and Loading Processes", Data Warehouses and OLAP: Concepts, Architectures and Solutions, IRM Press, 2007, pp 88-110.

[7] W. Eckerson and C. White, "Evaluating ETL and Data Integration Platforms", TDWI Report Series, 101communications LLC, 2003.

[8] IBM InfoSphere DataStage, http://www-01.ibm.com/software/data/infosphere/datastage/

[9] Informatica, http://www.informatica.com

[10] Microsoft SSIS, http://msdn.microsoft.com/en-us/library/ms169917.aspx

[11] C. Thomsen and T.B. Pedersen, "A Survey of Open Source Tools for Business Intelligence", DB Tech Reports, September 2008, homepage: www.cs.aau.dk/DBTR

[12] C. Thomsen and T.B. Pedersen, "A Survey of Open Source Tools for Business Intelligence", International Journal of Data Warehousing and Mining, Volume 5, Issue 3, 2009.

[13] Talend Open Studio, www.talend.com

[14] A. Vavouras, "A Metadata-Driven Approach for Data Warehouse Refreshment", PhD Thesis, Universität Zürich, Zürich, 2002.

[15] P. Vassiliadis, Z. Vagena, S. Skiadopoulos and N. Karayannidis, "ARKTOS: A Tool for Data Cleaning and Transformation in Data Warehouse Environments", Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.

[16] P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis and T. Sellis, "ARKTOS: towards the modeling, design, control and execution of ETL processes", Information Systems, Vol. 26, 2001, pp 537-561.

[17] C. Thomsen and T.B. Pedersen, "pygrametl: A Powerful Programming Framework for Extract-Transform-Load Programmers", Proceedings of DOLAP'09, 2009.

[18] J. Adzic and V. Fiore, "Data Warehouse Population Platform", Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'03), 2003.

[19] G. Booch, J. Rumbaugh and I. Jacobson, "The Unified Modeling Language User Guide", Addison-Wesley, 1998.

[20] P. Vassiliadis, A. Simitsis and S. Skiadopoulos, "Conceptual Modeling for ETL Processes", Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP, DOLAP'02, 2002.

[21] A. Simitsis and P. Vassiliadis, "A Methodology for the Conceptual Modelling of ETL Processes", Proceedings of DSE, 2003.

[22] A. Simitsis, "Mapping Conceptual to Logical Models for ETL Processes", Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, DOLAP'05, 2005.

[23] A. Kabiri and D. Chiadmi, "A Method for Modelling and Organizing ETL Processes",
Proceedings of the Second International Conference on Innovative Computing Technology (INTECH 2012), IEEE, 2012, pp 146-160.

[24] J. Trujillo and S. Lujan-Mora, "A UML Based Approach for Modeling ETL Processes in Data Warehouses", in I.-Y. Song, S. W. Liddle, T. W. Ling and P. Scheuermann, editors, ER, volume 2813 of Lecture Notes in Computer Science, Springer, 2003.

[25] L. Muñoz, J. Mazón and J. Trujillo, "Automatic Generation of ETL Processes from Conceptual Models", Proceedings of the ACM International Workshop on Data Warehousing and OLAP, DOLAP'09, 2009.

[26] M. Mrunalini, T.V. Suresh Kumar, D. Evangelin Geetha and K. Rajanikanth, "Modelling of Data Extraction in ETL Processes Using UML 2.0", DESIDOC Bulletin of Information Technology, Vol. 26, No. 5, 2006, pp 3-9.

[27] D. Skoutas and A. Simitsis, "Designing ETL Processes Using Semantic Web Technologies", Proceedings of DOLAP'06, 2006.

[28] D. Skoutas and A. Simitsis, "Ontology-Based Conceptual Design of ETL Processes for Both Structured and Semi-Structured Data", International Journal on Semantic Web and Information Systems, 2007.

[29] Z. El Akkaoui and E. Zimanyi, "Defining ETL Workflows Using BPMN and BPEL", Proceedings of DOLAP'09, 2009, pp 41-48.

[30] M. Niinimaki and T. Niemi, "An ETL Process for OLAP Using RDF/OWL Ontologies", Journal on Semantic Web and Information Systems, Vol. 3, No. 4, 2007.

[31] J.F. Roddick et al., "Evolution and Change in Data Management - Issues and Directions", SIGMOD Record, Vol. 29, 2000, pp 21-25.

[32] A. Kabiri, F. Wadjinny and D. Chiadmi, "Towards a Matrix Based Approach for Analyzing the Impact of Change on ETL Processes", IJCSI International Journal, Volume 8, Issue 4, July 2011.

[33] M. Golfarelli, J. Lechtenborger, S. Rizzi and G. Vossen, "Schema Versioning in Data Warehouses", ER Workshops, LNCS 3289, 2004, pp 415-428.

[34] M. Golfarelli, J. Lechtenborger, S. Rizzi and G. Vossen, "Schema Versioning in Data Warehouses: Enabling Cross Version Querying via Schema Augmentation", Data and Knowledge Engineering, 2006, pp 435-459.

[35] S. Rizzi and M. Golfarelli, "X-time: Schema Versioning and Cross-Version Querying in Data Warehouses", International Conference on Data Engineering (ICDE), 2007, pp 1471-1472.

[36] G. Papastefanatos, P. Vassiliadis, A. Simitsis and Y. Vassiliou, "What-If Analysis for Data Warehouse Evolution", Proceedings of the DaWaK Conference, LNCS 4654, 2007, pp 23-33.

[37] A. Dolnik, "ETL Evolution from Data Sources to Data Warehouse Using Mediator Data Storage", Managing Evolution of Data Warehouses (MEDWa) Workshop, 2009.

[38] G. Papastefanatos, P. Vassiliadis, A. Simitsis and Y. Vassiliou, "Policy-Regulated Management of ETL Evolution", Journal on Data Semantics XIII, LNCS 5530, 2009, pp 146-176.

[39] G. Papastefanatos, P. Vassiliadis, A. Simitsis and Y. Vassiliou, "HECATAEUS: Regulating Schema Evolution", International Conference on Data Engineering (ICDE), 2010, pp 1181-1184.

[40] A. Simitsis, P. Vassiliadis and T. Sellis, "Logical Optimization of ETL Workflows", Proceedings of the International Conference on Data Engineering (ICDE), 2010, pp 1181-1184.

[41] T. Jorg and S. Dessloch, "Formalizing ETL Jobs for Incremental Loading of Data Warehouses", Proceedings of BTW, 2009.

[42] X. Zhang, W. Sun, W. Wang, Y. Feng and B. Shi, "Generating Incremental ETL Processes Automatically", Proceedings of the First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06), 2006.