
Journal of Medical Systems (2018) 42: 59
https://doi.org/10.1007/s10916-018-0894-9

TRANSACTIONAL PROCESSING SYSTEMS

Medical Big Data Warehouse: Architecture and System Design, a Case Study: Improving Healthcare Resources Distribution

Abderrazak Sebaa¹ · Fatima Chikh² · Amina Nouicer¹ · AbdelKamel Tari¹

Received: 23 September 2016 / Accepted: 8 January 2018 / Published online: 19 February 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
The huge increase in medical devices and clinical applications, which generate enormous amounts of data, has raised a big issue in managing, processing, and mining this massive amount of data. Indeed, traditional data warehousing frameworks cannot be effective when managing the volume, variety, and velocity of current medical applications. As a result, several data warehouses face many issues over medical data, and many challenges need to be addressed. New solutions have emerged, and Hadoop is one of the best examples; it can be used to process these streams of medical data. However, without an efficient system design and architecture, this processing capacity will not be significant and valuable for medical managers. In this paper, we provide a short review of the literature about the research issues of traditional data warehouses, and we present some important Hadoop-based data warehouses. In addition, a Hadoop-based architecture and a conceptual data model for designing a medical Big Data warehouse are given. In our case study, we provide the implementation details of a big data warehouse based on the proposed architecture and data model on the Apache Hadoop platform, to ensure an optimal allocation of health resources.

Keywords Data warehouse · Hadoop · Big data · Decision support · Medical resources allocation

This article is part of the Topical Collection on Transactional Processing Systems

* Abderrazak Sebaa
  balzak.sebaa@gmail.com

1 LIMED Laboratory, Faculty of Exact Sciences, University of Bejaia, Bejaia, Algeria
2 Department of Computer Science, Faculty of Exact Sciences, University of Bejaia, Bejaia, Algeria

Introduction

Nowadays, the volume of data generated from different sources such as hospital information systems, medical sensors, medical devices, websites, and medical applications increases rapidly by terabytes and petabytes. Moreover, the users of such data may be distant from, or inside of, hospitals. Integrating and handling this huge amount of data, known as "Big Data", is a real challenge. Since health data involves large collections of structured and unstructured datasets, and as demonstrated in [1], integrating, storing, and managing this huge amount of data efficiently on a single hard drive using traditional data warehousing platforms is very difficult. As well, such traditional data management systems cannot handle the unstructured data of medical applications; they are limited to handling only data in the order of megabytes, their performance goes down when data increases, and they fail at scaling up. Indeed, several issues and problems arise over big data when using OLAP and data warehousing, as detailed in [2]. Among these problems, we quote:

1. The fact table can easily grow in size, leading to severe computational issues.
2. The complexity and hardness of designing methodologies for OLAP and data warehousing.
3. Issues in the computing methodologies of OLAP data cubes.
4. Complications in integrating models, techniques, algorithms, and computational platforms for OLAP over big data with classical platforms.
5. Query languages and optimization.

An alternative emerging solution that is able to handle and manage such volumes of structured and unstructured data is the Hadoop ecosystem. It is the best alternative since it scales up well by distributing processing across several host servers, which are not necessarily high-performance computers; it is based on a distributed file system
[3]. Indeed, a major advantage of the Hadoop framework, apart from its scalability and strength, is its low implementation cost, since it utilizes only multiple existing ordinary computer servers instead of one high-performance cluster. Hadoop pushes processing to the data instead of data to the processing. Furthermore, it supports ETL (Extract, Transform, and Load) processes in parallel. However, several issues arise when integrating and warehousing medical big data, such as the lack of a standard and powerful architecture for integration, which makes implementing a big data warehouse a big issue. Moreover, former design methodologies for data models cannot meet all the needs of the new data analysis applications over big data warehousing, given the new constraints such as the nature and complexity of medical big data.

The purpose of this paper is to address the problems arising when using OLAP and data warehousing over big data, especially medical big data. The contributions of this paper can be summarized as follows:

- Firstly, we give an overview of some previous traditional medical DWs (Data Warehouses), their limitations, and some recent Hadoop-based DWs.
- Secondly, we propose a system architecture and a conceptual data model for a MBDW (Medical Big Data Warehouse).
- Thirdly, we offer a solution to overcome both the growth of the fact table size and the lack of primary and foreign keys in the Apache Hive framework required by the conceptual data model. This solution is based on nested partitioning according to the dimension table keys.
- Finally, we apply our solution to implement a MBDW to improve medical resources distribution for the health sector in the Bejaia region (in Algeria).

The remainder of this paper is organized as follows: a brief review related to medical DWs, traditional and modern ones, is given in Section 2. Section 3 details some concepts and tools used in the rest of this work. A system architecture for the MBDW and a conceptual data model which exploits partitioning and bucketing are presented in Sections 4 and 5. The last section discusses the implementation and experimental results of our case study.

Related works

In this section, we introduce some medical DWs, then we give their limitations and their common drawbacks.

Traditional medical data warehouses

Despite the widespread interest in data integration and historization in the medical field [4], this field has been slow to adopt data warehousing. However, once adopted, several studies and projects have been conducted on medical DWs. Among the first ones, Ewen et al. [5] highlighted the need for a DW in the health sector. The authors in [6] showed the main differences between conventional business-oriented DWs and clinical DWs; they also identified key research issues in clinical data warehousing. GEDAW [7] is a bioinformatics project that consisted of building a DW on the hepatic transcriptome. This project aimed to bring together, within a single knowledge base, the complex, varied, and numerous data on liver genes for analysis. Its objective was to provide researchers with a comprehensive tool for integrating the transcriptome and to provide decision support to guide biological research. The DWs of different health insurance organizations in Austria were merged in an evidence-based medicine collaboration project [8] called HEWAF (Healthcare Warehouse Federation). Kerkri et al. [9] presented the EPIDWARE architecture for building DWs for epidemiological and medical studies. It improved the medical care of the patients in care units. Pavalam et al. [10] proposed a DW-based architecture for EHR (Electronic Health Record) for the Rwandan health sector. It can be accessed from different applications and allows data analysis for a swift decision-making process, and also helps in alerting about epidemics. A data warehouse which gives information on epidemiology and public health, about various diseases, their concentration, and resources repartition in the Bejaia department is proposed in [11–13]. The aim of this DW is to improve the medical resources allocation of the Bejaia region.

Issues and limitations

The dramatic increases in devices and applications generating enormous medical data have raised big issues in managing, processing, and mining these massive amounts of data. New solutions are needed, since traditional frameworks can no longer be used to handle the volume, variety, and velocity of current medical applications. As a result, the previously presented DWs, and several others, face many common issues and limitations over medical big data. Among these issues, we highlight the most important ones. Firstly, the impact of the huge volume and fast growth of medical data: indeed, in the medical field, a large amount of information about patients' medical histories, symptomatology, diagnoses, and responses to treatments and therapies is generated, which necessitates its collection. Secondly, the unstructured nature of medical data is another important issue. In fact, unstructured documents, images, and clinical notes represent over 80% of current medical data [14]; such unstructured data should be converted into analysis-ready datasets. Thirdly, the complexity of data modeling: it is not easy to deal with computing OLAP data cubes over big data, mainly due to the explosive size of the data sets and the complexity of multidimensional data models [2]. For instance, if we have 10,000 diagnoses, this would amount to 2^10,000 dimension combinations; therefore, this solution is
practically unusable [6]. Fourthly, the integration of very complex medical data: this aspect not only raises the typical integration problems, mainly coming from the literature on data and schema integration issues, it also has deep consequences for the kind of analytics to be designed [15], since effective large-scale analysis often needs collection from multiple, strongly heterogeneous data sources. For instance, obtaining an overall health view of a patient requires integrating and analyzing the medical health record along with readings from multiple meter types such as accelerometers, glucose meters, heart meters, etc. [14]. Finally, the temporal aspects of medical data, known as valid time and transaction time, must both be supported to provide bi-temporal support. Indeed, it is important to know when data is considered to be valid in the real world and when it is stored and changed in the database [6]. For instance, a blood test can be made multiple times for the same patient; however, it is considered valid only for a small period of time. For more details about medical big data warehousing issues, see [16].

Hadoop-based data warehousing

[Table 1 Big data medical frameworks. For each study (Yao et al. [17], Istephan et al. [18], Aji et al. [19], Saravanakumar et al. [20], Rodger [21], Sundvall et al. [22], Raja and Sivasankar [23], Yang et al. [24]), the table lists the used tools (combinations of Hadoop, Map-Reduce, Hortonworks Hadoop, Hive, HBase, Sqoop, Mahout, archetypes, and electronic health records), the main objectives (for instance: scalable analytical spatial queries over large-scale spatial datasets, with Map-Reduce improved by spatial data partitioning and R*-tree indexes and Hive extended with spatial query support [19]; prediction of the type of diabetes [20]; improved prediction of traumatic brain injury survival rates using data classification [21]; implementation and testing of an archetype-aware Dewey encoding optimization [22]; inferring knowledge from distributed, multivariate data of healthcare centers in different geographic locations [23]; a cloud storage platform with HBase for integrating, storing, and analyzing big data of medical records [24]), and the application field or limits (for instance: specific to the intelligent design of hospital information systems [17], to the epilepsy field [18], to spatial queries in a data warehousing system [19], to the diabetes field [20], or to the traumatic brain injury field [21]; manual translation of AQL to Hadoop queries [22]; no evaluation of the proposed solution performed).]

Aside from the above issues and limitations of traditional data warehousing technologies, health care systems have rapidly adopted electronic health records, which has dramatically increased the size of clinical data. As a result, there are opportunities to use a big data framework such as Hadoop to process and store medical big data, since the most important aspect of big data analysis is to process the data and obtain the result within a time constraint. Researchers and industrials have tried to follow such new platforms. Thus, a small number of studies were undertaken, as depicted in Table 1. Yao et al. [17] built a five-node Hadoop cluster to execute distributed Map-Reduce algorithms to study user behaviors regarding the various data produced by different hospital information systems in daily work. They have shown that medical big data makes the design of hospital information systems more intelligent and easier to use by making personalized recommendations. In order to interact with structured and unstructured medical data in the epilepsy field in an efficient way to fulfill related queries, Istephan and Siadat [18] proposed a framework which dynamically integrates user-defined modules. Hadoop-GIS [19] is a warehousing system for spatial data, built over Map-Reduce for medical image processing. It supports various analytical queries. Saravanakumar et al. [20] developed a predictive analysis algorithm over a Hadoop environment to predict and classify the type of diabetes and the type of treatment to be provided. According to the authors, this system helps to provide an effective cure and care for patients while enhancing outcomes like affordability and availability. Rodger [21] used Hadoop and Hive as a data warehouse infrastructure to improve the prediction of traumatic brain injury survival rates using data classification into predefined classes. Indeed, to


determine how the set of collected variables relates to the body injuries, he collected data on three ship variables (Byrd, Boxer, and Kearsage) and injuries to different body regions such as head, torso, extremities, and abrasions. He proposed a hybrid approach on multiple ship databases where various scalable machine-learning algorithms were employed. Raja and Sivasankar [23] proposed a framework based on Hadoop to modernize healthcare informatics systems by inferring knowledge from various healthcare centers in different geographic locations.

Yang et al. [24] built a cloud-based storage and distributed processing platform to store and handle medical records data using HBase of the Hadoop ecosystem, providing diverse functions such as keyword search, data filtering, and basic statistics. Based on the size of the data, the authors used either the single-threaded Put method or the Complete-Bulkload mechanism to improve the performance of data import into the database.

The different listed works [17–24] used high-performance computers and very powerful IT infrastructures, which required a very substantial investment to exploit the great processing capabilities of Hadoop technology. In this work, our goal is to re-use the health institution's own IT system infrastructure and workstations, without having to make a great investment in its material capabilities, and also to adopt data warehousing and OLAP methodologies. However, as indicated in [25], data warehousing using Hadoop and Hive faces many challenges. Firstly, Hadoop does not support OLAP operations, and the multi-dimensional model cannot meet all the needs of the new data analysis applications. Secondly, there are the limitations of the Apache Hive query language, called HiveQL, which does not support many standard SQL queries. Overcoming such limitations and challenges in the medical field is the objective of this work.

Background review

This section details some concepts and tools used in this study.

Big data concept

'Big data' as a concept is often described by the three Vs and, more recently, by the six Vs, which mean Volume, Velocity, Variety, Veracity, Variability, and Value of data. Indeed, the high volume of data generation refers to the volume (1st V). The rate of data generation and the speed at which it should be processed refer to the velocity (2nd V). The heterogeneity and the diversity of the data types refer to the variety (3rd V). The degree of reliability and quality of the data sources refers to the veracity (4th V). The disparity and variation of the data flow rates refer to the variability (5th V). Finally, the value (6th V) of data depends on the volume of analyzed data. Traditional data management systems cannot handle data with such characteristics. Therefore, new platforms and technologies were developed.

Big data technologies

To manage the volume, velocity, variety, and variability of big data, several technologies, tools, and frameworks were developed. The most important ones are: the Map-Reduce based systems, for instance BigTable, HadoopDB, and the Hadoop ecosystem with its most important components such as HBase, Pig, and Hive; the NoSQL databases, for instance MongoDB, Cassandra, and VoldemortDB; the in-memory frameworks, for instance Apache Spark; the graph databases, for instance Neo4J, AllegroGraph, InfiniteGraph, HyperGraphDB, InfoGrid, and Google Pregel; the RDBMS-based systems, for example Greenplum, Aster, and Vertica; and the stream data processing frameworks, for instance Apache S4, Apache Kafka, Apache Flume, Twitter Storm, and Spark Streaming.

Map-reduce paradigm

The Map-Reduce paradigm [26] is a programming model allowing the parallel computation of tasks using a two-phase primitive: the map and reduce phases defined by users. In the first, "Map", phase, a mapper defined by the user loads a data chunk from the DFS and transforms it into a list of intermediate key/value pairs (ki, vi). Then, the key/value pairs are buffered into (r) files, and all key/value files are sorted by key. In the second, "Reduce", phase, a reducer defined by the user combines the files from the different mappers together. The final results are written back to the DFS [27].
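As an illustration (a hypothetical sketch: the table and column names are assumptions, and HiveQL itself is presented in the next section), a simple HiveQL aggregation is compiled by Hive into exactly this kind of map/reduce pipeline:

-- Hive compiles this GROUP BY into a Map-Reduce job: mappers scan
-- chunks of the table and emit (id_region, 1) key/value pairs; the
-- shuffle sorts the pairs by key; reducers sum the values per key.
SELECT id_region, COUNT(*) AS nb_consultations
FROM consultation
GROUP BY id_region;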

Hadoop ecosystem

The Hadoop ecosystem consists of several projects and libraries, such as the massive-scale database management solution called HBase, the data warehousing solution called Hive, the machine learning library called Mahout, the cross-node coordination and synchronization service called ZooKeeper, the dataflow scripting platform Pig, and other related libraries and packages for massively parallel and distributed data management.

Hadoop

Apache Hadoop [28] is open-source software allowing large-scale parallel distributed data analysis. It automatically replicates and collocates data across multiple different nodes, then allows parallel processing across clusters. It is an open-source implementation of Google's Map-Reduce computing model. It is based on HDFS (the Hadoop Distributed File System), which provides high-throughput access to application data. Hadoop provides the robust, fault-tolerant HDFS, and
reliability. Doug Cutting implemented the first version of Hadoop in 2004, and it became an Apache Software Foundation project in 2008 [29]. On November 17, 2017, release 2.9.0 of Apache Hadoop became available [28].

Hive

Apache Hive [30] was initially a subproject developed by the Facebook team. Hive is used in conjunction with the map-reduce model of Hadoop to structure the data and to run queries, allowing the creation of a variety of reports and summaries and the performance of historical analyses over this data. In Hive, tables are used to store data; such a structure consists of a number of rows, and each row comprises a specified number of columns. Queries are expressed in an SQL-like declarative language called HiveQL; these queries are compiled into map-reduce jobs and executed using Hadoop. Hive supports primitive and complex types. It processes data for analysis, not to serve users, so it does not need ACID guarantees (as in a traditional relational database) for data storage.

System architecture and data flow

Implementing an efficient medical data warehouse based on the Hadoop ecosystem requires a flexible and modular architecture for big data warehousing. This section describes the proposed architecture along with the functionalities of each layer. The overall architecture is depicted in Fig. 1. It is a scalable, reliable, and distributed architecture to extract, store, analyze, and visualize healthcare data extracted from various HIS (Hospital Information Systems) resources. In the following, we give some details about the components and levels of the proposed architecture and the data flow within these components.

[Fig. 1 Hadoop-based system architecture of medical big data warehousing. From bottom to top: (2) distributed integration (database servers fed by the HIS, logs, and medical records of each establishment); (3) the Hadoop ecosystem (NameNode, Secondary NameNode, DataNodes 1..n, and Hive with its Metastore and engine); (4) web-based access; (5) analysis and reporting tools (reporting, data mining, dashboards, planning); and the users (doctors, administrators, managers, governments).]

(1) Medical data source

In each region (the provinces of a department, according to the geographical division), medical data is extracted from the multiple distributed computers of the HIS (Hospital Information Systems), information systems, laboratory software, radiology data, and from the regional directorate of Health.

(2) Distributed ETL

An ETL is the component responsible for collecting data from the different data sources, transforming it, and cleaning it based on business rules and requirements defined by the final user [31]. Our ETL approach is based on partitioning the input data horizontally according to the nested partitioning technique (Section 5.1). In order to run in a distributed and parallel way, the extraction phase is achieved in each medical establishment of each geographic division (region) to capture data. During this phase, all source tables are extracted from the appropriate data sources and then converted into a column structure (CSV format). The transformation phase involves data cleansing to comply with the target schema, based on the description of the multi-dimensional data model (partitioned schema), which is stored as metadata in an HDFS file. Finally, in the loading phase, the data is propagated into the regional storage system (database server). As we will see in the next section, the data model is partitioned according to the dimension tables (in our case study, the keys of the 'Region' dimension table, then the 'Health-establishment' dimension table, and then the 'Date' dimension table). The integrated and stored data of the database server are then replicated in the DataNodes of the Hadoop cluster according to a replication strategy.

New database servers, and automatically new DataNodes on the Hadoop cluster, will be created for new sources (for instance, new hospitals). Such a feature ensures the scalability of this architecture.
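As a minimal sketch of this extraction step (the table, its columns, and the HDFS path are illustrative assumptions, not part of the original system), the per-establishment CSV extracts can be exposed to Hive through an external staging table:

-- Hypothetical staging table over the CSV files produced by the
-- extraction phase of one establishment; Hive reads them in place
-- from HDFS without copying the data.
CREATE EXTERNAL TABLE staging_hospitalization (
  id_hospitalization STRING,
  id_patient         STRING,
  id_illness         STRING,
  id_doctor          STRING,
  id_service         STRING,
  id_equipment       STRING,
  admission_date     STRING,
  nb_days            INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/staging/bejaia/hospitalization';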
(3) Hadoop ecosystem

The core of our architecture is the Apache Hadoop [28] framework. It is an open-source implementation of Google's Map-Reduce computing model. It is based on HDFS (the Hadoop Distributed File System), which provides high-throughput access to application data.

(3.1) Hadoop. A Hadoop cluster is the corpus of server nodes, with different physical locations, within a group on Hadoop [32]. It automatically replicates and collocates data across multiple nodes, then performs parallel processing across the cluster using the Map-Reduce paradigm. Thus, Hadoop reduces infrastructure cost. Its most important components are:

- NameNode: acts as a metadata repository describing the location of the data blocks of each file stored in HDFS. This makes massive data processing easier.
- Secondary NameNode: is in charge of periodically checking the persistent status of the NameNode and of downloading its current snapshot and log files.
- DataNode: runs separately on every machine in the Hadoop cluster. In the proposed architecture, the data servers in which data integration is done (point 2 of Section 4) are automatically replicated in one or more DataNodes. Furthermore, it is responsible for storing the structured and unstructured data and for answering data reading and writing queries.

(3.2) Hive. Apache Hive [30] is a data warehousing structure. It is used in conjunction with the map-reduce model of Hadoop to process the data. Indeed, Hive stores data in tables, which consist of a number of
rows, and each row comprises a specified number of columns. HiveQL is used to express queries over Hive. User queries over Hive are compiled into map-reduce jobs, which Hadoop (point 3.1 of Section 4) then executes. Such queries allow creating a variety of reports, summaries, and historical analyses. The main functional components of Hive are:

- Metastore: it contains metadata about the tables stored in Hive, such as data formats, partitioning, and bucketing, as well as storage information including the location of each table's data in the underlying file system. It is specified during table creation and reused every time the table is referenced in HiveQL.
- Hive engine: it converts queries written in HiveQL into map-reduce code and then executes it on the Hadoop cluster.
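For illustration (a hypothetical sketch: it assumes the partitioned fact table defined in our case study below), the metadata kept by the Metastore can be inspected directly from HiveQL:

-- Inspecting the Metastore entries of a table.
SHOW PARTITIONS hospitalization;     -- partitions recorded for the table
DESCRIBE FORMATTED hospitalization;  -- storage format, location, bucketing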
(4) Web-based Access

Our system allows trusted users (point 6 of Section 4) to access the data in order to create a variety of reports and data summaries, after authentication through the Web browser.

(5) Analysis and reporting tools

This layer includes several tools for reporting, planning, dashboards, and data mining. These tools allow data to
be analyzed statistically, to track trends and patterns over time, and then to produce regular reports tailored to the needs of the various users of the decision support system.

(6) MBDW users

They can be doctors, medical researchers, hospital managers, health administrators, and governments. All such users can interact with the system.

Design and data model adaptation

It is necessary to design an efficient data model allowing a better understanding of the information about the health system and providing better performance for the execution of complex analytical queries. In this section, we describe an optimized strategy for data modeling and its implementation in the Hive framework.

The multi-dimensional modeling technique has been widely adopted in data warehouse modeling [5, 12, 13, 20]. Such a technique can be extended to big data warehousing. Usually, data warehouses based on multi-dimensional data models are constituted from fact and dimension tables, whether for a star, constellation, or snowflake schema. Fact tables contain the keys of the dimension tables, as foreign keys, and the measurable facts to examine. Dimension tables describe dimensions and contain attributes and the dimension key. However, directly adopting the multi-dimensional modeling technique in big data scenarios is not very relevant, because the tables, and especially the fact tables, grow very large in size, with millions or even billions of rows [2], and query performance becomes unacceptable to end users.

To overcome the previous issues, the growing size of the fact table and query performance, we propose a nested partitioning technique for the fact tables.

Nested partitioning technique

Our technique consists of applying a nested partitioning, where the data warehouse fact table file is split into several tables of smaller size while keeping the same multi-dimensional diagram semantics.

Let (F, D1, D2, …, Dn) be a multi-dimensional diagram (in star, snowflake, or constellation form), where F = (d1, …, dn) is the fact table and D1 = (d11, …, d1m), …, Dn = (dn1, …, dnk) are the dimension tables, such that dj1 (1 ≤ j ≤ n) is the primary key of the dimension table Dj and dj (1 ≤ j ≤ n) is the foreign key of F referring to dj1. Let c, such that 1 ≤ c ≤ n, be the number of foreign keys by which the fact table F is partitionable.

We define nested partitioning on the fact table F based on c dimension tables Di (1 ≤ i ≤ c), using their primary keys, as the process which takes as input the diagram composed of the data warehouse table files (F, D1, D2, …, Dn) and outputs k diagrams:

(F1, D1, D2, …, Dn), …, (Fk, D1, D2, …, Dn)

The table F is fragmented into k fragments F1, …, Fk, computed in the following way: each fragment corresponds to one combination (v1, …, vc) of key values of the c partitionable dimension tables, and

Fj = F ⋉ σ(d11 = v1)(D1) ⋉ σ(d21 = v2)(D2) ⋉ … ⋉ σ(dc1 = vc)(Dc),  1 ≤ j ≤ k

where the number of fragments k satisfies 1 ≤ k ≤ ∏_{i=1}^{c} n_i, with n_i the number of key values of the i-th partitionable dimension table, σ the selection operator, and ⋉ the semi-join operator.
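For illustration (hypothetical numbers): with c = 2, a 'Region' dimension with n1 = 3 key values and a 'Health-establishment' dimension with n2 = 5 key values, the fact table is split into at most k = 3 × 5 = 15 fragments, one per (region, establishment) pair that actually occurs in F; each fragment is far smaller than F, while the dimension tables are left intact.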
Data partitioning techniques in Hive

Using Apache Hive implies that we do not have primary and foreign keys during the implementation and management of the data. This is largely due to the fact that Hive focuses on the analytical aspects, the bulkiness, and the diversity of the data's nature. In addition, Hive is not intended to run complex relational queries; rather, it is used to get data in a simple and efficient manner. Nevertheless, these two concepts (primary and foreign keys) are important if we want to store and manipulate data from many facets and points of view: it is essential to be able to uniquely identify each tuple, manage relationships between tables, and ensure data consistency.

To overcome the previous issue, the lack of primary and foreign keys in Hive, we propose to use two Hive concepts: partitions and buckets. Hive allows dividing tables into partitions and buckets. The first technique, called partitioning, is a way of decomposing a table into parts based on a data value. With the second technique, called bucketing, tables or partitions may be further subdivided into buckets, an extra structure added to the data [3].

Partitioning In Hive, each table can have one or more partitions, which determine the distribution of data within sub-directories of the table directory: each partition has its own directory, according to the partitioning field, and can have its own columns and storage information [33]. Using partitions can make queries faster. For instance, a table can be partitioned by date, and records with the same date will be stored in the same partition.

Bucketing This is another technique to decompose Hive tables (data sets) into a defined number of parts, which will be stored in files. Indeed, the data in each partition can be divided into buckets based on the hashing of a column of the table. Each bucket is stored in the partition directory as a file. Metadata about the bucketing information of each table is stored in the Metastore [33]. Among the advantages of bucketing, the number of buckets is fixed, so it does not vary with the data.

As shown in Fig. 2, nested partitioning means dividing the fact table into many levels of partitions based on the different dimensions. Indeed, the fact table will be partitioned using Hive partitions based on a first dimension (based on the attribute which should be considered as the primary key of the first dimension table). Then, the partitioned tables will be divided using buckets based on a second dimension, and so on. Using partitioned tables allows distributing and reducing the data volume from one fact table to many distributed fact tables. Therefore, these partitioned fact tables optimize the use of data warehouse resources (distributed processing, memory) and improve query execution performance.
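As a minimal illustration of these two mechanisms (a hypothetical sketch: the tables, column names, and bucket count are assumptions, not part of the case-study schema):

-- Partitioning: each distinct visit_date value gets its own
-- sub-directory under the table's HDFS directory.
CREATE TABLE visits_partitioned (
  id_patient STRING,
  nb_acts    INT
)
PARTITIONED BY (visit_date STRING);

-- Bucketing: rows are hashed on id_patient into a fixed number of
-- bucket files, a number that does not vary as the data grows.
CREATE TABLE visits_bucketed (
  id_patient STRING,
  nb_acts    INT
)
CLUSTERED BY (id_patient) INTO 16 BUCKETS;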
[Fig. 2 Nested partitioning: the fact table is first split into partitions by the key of a first dimension, and each partition is further divided into buckets by the key of a second dimension.]

Case study: Improving healthcare resources distribution in the Bejaia health sector

In the two previous sections, we presented our proposal for a medical big data warehouse (MBDW) architecture and a data model built on a Hadoop cluster to perform analytical processing. The purpose of this section is to apply and validate our solution on a case of medical data. An MBDW is designed for the health sector of the Wilaya of Bejaia in Algeria to help physicians and health managers understand, predict, and improve the availability and distribution of healthcare resources. In this section, we first describe the study objectives and settings. Then we give the implementation details of the proposed MBDW solution, we give some preliminary results, and finally a discussion of our solution compared to the previous ones.

Study objectives

Decision-making about the alternative uses of healthcare resources is an issue of critical concern for governments and administrators in all health care systems [34]. On the other hand, data from medical sources can be used to identify healthcare trends, prevent diseases, combat social inequality, and so on [1]. Thus, using medical data to improve health resources distribution is an important challenge in emerging countries.

The purpose of this case study is to provide decision makers with a clear view of the Bejaia health sector. It will guide leaders to make better decisions, leading to equity in the distribution of medical resources and to a significant reduction of costs, to offer a more efficient method for accurate health management, to improve the availability of clinical material and human resources, and to increase the quality of services and patient safety. It will allow managing different orientations during the planning stages and having a complete predictive vision, using a better repartition of care offerings (hospital specialization, the number of health centers by town, the allocation of hospital equipment, the number of specialists per hospital, the number of beds by specialty and service). These facilities have to be regularly adapted according to changing demand (changing health techniques, diseases, health structures, population age, and the geographical location of the population).

Our decision support system gives information about: the place and date of health center construction and the required
specialty; the best strategy for medical professionals' recruitment for each hospital; the required equipment and the most appropriate for a hospital (e.g., a CT scanner or a radiological unit); screening dates and places (e.g., breast cancer screening, cervical screening, etc.). It also aims to give information to help medical research.

Study settings

The plan of the Bejaia Health Sector is a part of the Algerian Health master plan. The latter provides, for the period 2009–2025, investments of 20 billion Euros for constructing new health facilities and modernizing existing hospitals. Such investments have been initiated for the construction and maintenance of infrastructure and hospital equipment, and for the education of medical professionals [35]. The outline of this program project is to achieve 172 new hospitals, 45 specialized health complexes, 377 polyclinics, 1000 outpatient rooms, 17 paramedical training schools, and more than 70 specialized institutions for persons with disabilities.

Currently, the Bejaia Health Sector manages healthcare for a population of around 1 million people of the Wilaya (department) of Bejaia. The Wilaya of Bejaia is located in the north of Algeria, on the Mediterranean coast. It is administratively divided into 19 Daïras (sets of municipalities) and covers 51 municipalities which stretch over an area of 3268 km². As shown in Fig. 3, it includes the following health structures: one university hospital, five public hospital institutions, and one specialized public hospital establishment, with a total of 1533 technical beds; 8 nearby public health establishments with 51 polyclinics; one paramedical school; and several laboratories. It employs 33 hospital-university practitioners, 245 specialist medical practitioners, 734 general medical practitioners, and 2742 paramedical staff.

To achieve analytical processing of medical data, we developed the MBDW by extracting, transferring, and collecting data from the different operating systems and software of the Bejaia health structures, which include PATIENT 3.40 (patient data management software), Microsoft EXCEL, and EPIPHARM (a software for the management of drug stock); the data is automatically stored in the DW.

[Fig. 3 Geographic location of Bejaia medical institutions]

MBDW implementation

The implementation of the Hadoop-based architecture is built on the following hardware resources of the computer equipment of the medical institutions: memory (RAM) capacity ranging from 2 to 10 GB, processor speeds ranging from

Table 2 Software specification

Role                  Software    Version  Purpose
Required software     Java        7        Required to run the Hadoop jobs
Required software     OpenSSH     6.3      Required to manage the different Hadoop cluster nodes and user access
Processing tool       Hadoop      2.6.0    Framework for large-scale processing and storage of data on clusters
Warehousing tool      Hive        1.2.1    Compatible with version 2.6.0 of Hadoop
Storage system        HDFS        2.6.0    Distributed file system that handles large data sets, based on a block size of 64 MB
Synchronization tool  ZooKeeper   3.4.6    Provides an infrastructure for cross-node synchronization
Programming paradigm  Map-Reduce  2.6.0    The two distinct tasks that Hadoop programs perform

1.6 GHz to 2.4 GHz, and disk space ranging from 250 GB to 2 TB for data storage. The computer nodes are connected and networked with RJ45 LAN cables and switches.

Several software packages were used to build the MBDW platform by deploying Apache Hadoop 2.6.0, Hive 1.2.1, ZooKeeper, and HDFS under the Ubuntu Linux operating system, as depicted in Table 2.

Table 3 shows some of the Hadoop cluster configurations used; such configuration parameters are essential for the operation and performance of our system.

Table 3 Configurations of the Hadoop cluster

Configuration parameter           Value
Replication factor                3
Block size                        64 MB
Local cache capacity              18-24 MB
Buffer size                       32 KB
Cache block report interval       10 s
Maximum number of map tasks       4
Maximum number of reduce tasks    4

MBDW data model

In this section, we detail the conceptual data model of our case study and its optimized version based on nested partitioning. We explain the data model used, considering two main subjects: the hospitalization of patients and outpatient visits. In our proposed data model, we used the constellation schema.

MBDW data model

Figure 4a describes the most important dimensions of the constellation schema, which consists of two fact tables (Hospitalization and Consultation) that share the following dimension tables: Patient, Illness, Date, Doctor, Region, Health-Establishment, Equipment, Service, and Consultation-Center. The details of the two fact tables are:

- Fact table 'Hospitalization': it represents all information about the patients' hospitalization periods. It holds the primary key Id-hospitalization; the foreign keys of the dimension tables: Id-patient, Id-illness, Id-doctor, Id-service, Id-equipment, Id-H-establishment, Id-region, and Id-date; and several measures.
- Fact table 'Consultation': it exposes all information about outpatient visits in a health facility. It contains the primary key Id-consultation; the foreign keys of the dimension tables: Id-patient, Id-illness, Id-doctor, Id-C-center, Id-H-establishment, Id-region, and Id-date; and also several measures about the outpatient visit.

MBDW adapted data model

Nested partitioning is used in both fact tables: Consultation and Hospitalization. As shown in Fig. 4b, we have used the two concepts offered by Hive: partitioning and bucketing. Each partitioning level used to partition the fact tables corresponds to a foreign key (which is not declared in Hive) of the multi-dimensional model given in the previous section.

Thus, the first level of partitioning is applied according to the values in the column of the foreign key Id-region of the dimension table 'Region', which is the most overall dimension. The dimension at the level underneath, 'Health-establishment', represents the second partitioning level (i.e., partitioning is applied according to the values in the column of the foreign key Id-H-establishment). The dimension at the next level underneath, 'Date', represents the third partitioning level. Indeed, we use bucketing with the 'Date' dimension table; in particular, bucketing is applied to the values of the column Id-date, since it belongs to a fixed interval. At this level, we could add another level of partitioning according to another dimension table, but only if necessary. In our case, we have not applied partitioning based on the foreign keys of the other tables (Patient, Doctor, Service, Equipment, and Consultation-Center), to avoid a large number of partitions with little data, which would produce a large number of subdirectories and unnecessary overhead for the NameNode of HDFS.
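As a minimal HiveQL sketch of this adapted layout (hypothetical: the measure column, bucket count, and storage format are illustrative assumptions, not the exact production definition):

-- 'Hospitalization' fact table with the nested layout described
-- above: partitions on the Region and Health-establishment keys,
-- then buckets on the Date key.
CREATE TABLE hospitalization (
  id_hospitalization STRING,
  id_patient         STRING,
  id_illness         STRING,
  id_doctor          STRING,
  id_service         STRING,
  id_equipment       STRING,
  id_date            INT,
  nb_days            INT            -- example measure
)
PARTITIONED BY (id_region STRING, id_h_estab STRING)
CLUSTERED BY (id_date) INTO 32 BUCKETS
STORED AS ORC;

Each (id_region, id_h_estab) pair then maps to its own HDFS sub-directory, and within each such partition the rows are hashed on id_date into 32 bucket files.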

[Fig. 4 Constellation schema. (a) Without nested partitioning: the two fact tables, Consultation (Id-consultation; foreign keys Id-patient, Id-doctor, Id-illness, Id-C-center, Id-H-estab, Id-region, Id-date; measures) and Hospitalization (Id-hospitalization; foreign keys Id-patient, Id-doctor, Id-illness, Id-equipment, Id-service, Id-H-estab, Id-region, Id-date; measures), share the dimension tables Date (Id-date, Month, Year), Illness (Id-illness, Family-illness, Name-illness), Patient (Id-patient, Age, Gender, …), Doctor (Id-doctor, Specialty, #Id-H-estab), Consultation-Center (Id-C-center, Name-C-center, #Id-H-estab, #Id-region), Equipment (Id-equipment, Name-equipment, #Id-H-estab, #Id-service), Service (Id-service, Pole, Name-service, Service-capacity, #Id-H-estab), Region (Id-region, Name-region, Department, City), and Health-Establishment (Id-H-estab, Type, Name-H-estab, #Id-region). (b) With nested partitioning: the fact tables are nested under Region, then H-Estab, then Date, alongside the remaining dimensions (Illness, Equipment, Patient, Consultation-Center, Service, Doctor).]



In the next sub-section, we give some important information about the keys of the dimension tables. Such information is essential for the system users.

Keys of dimension tables

'Illness' dimension table key The proposed Id-illness code is an attribute identifier of the Illness dimension table, designed using group coding, which involves several fields that possess specific meanings. Id-illness consists of three fields: the technical use field (3 + 4 digits and characters), the administrative use field (3 characters), and a future-use field reserved for later needs (2 characters), as shown in Fig. 5.

[Fig. 5 Id-illness code signification]

The detail of each field is as follows:

1) Technical use field: this field matches the current medical classification advocated by the World Health Organization, ICD-10, to make it easier for specialists and doctors to identify diseases.
2) Administrative use field: this field consists of three parts, α, β, and δ, as shown in Fig. 5.

- Occupational disease (α): allows identifying occupational diseases, health problems that occur during working or occupational activity and are contracted under certain conditions. The Algerian social security system deals with the identification of these diseases.
- Notifiable disease (β): identifies notifiable diseases, which are diseases under national surveillance, subject to compulsory declaration to the national health authority in accordance with the procedure laid down in order number 179 of 17 November 1990, and also diseases under international surveillance, subject to mandatory reporting to the national health authority and mandatory notification to the WHO (World Health Organization). Any doctor, whatever his type of practice, is required to declare notifiable diseases.
- Chronic disease (δ): this part of the field identifies and indicates chronic diseases. Indeed, patients with chronic conditions benefit from broader rights under Algerian social security.

3) Future use field: a reserved field for needs that may occur in the future.

'Doctor' dimension table key The proposed Id-doctor code, illustrated in Fig. 6, is an attribute identifier of the Doctor dimension table (medical staff information), also designed using group coding. Id-doctor consists of four fields. The first three fields (2 + 3 characters + 4 digits) are used to indicate the category and sub-category of the medical staff, and the fourth field indicates the recruitment date (6 digits). Indeed, the Algerian medical profession consists of several categories, which are: university hospital staff, general medical practitioners, specialist medical practitioners, paramedics, laboratory staff, administrative staff, and technical staff. We detail the most important ones:

[Fig. 6 Id-doctor code signification]

1) University hospital staff: they are physicians acting in public institutions of a scientific nature providing training in the medical sciences, and also in medical institutions and hospital-university centers. There exist three sub-categories (assistant, lecturer, and university hospital professor).
2) Public health general medical practitioners: they are medical practitioners without a specialty. There are three
categories of general medical practitioners, which are: general practitioners, general pharmacists, and general dentists. For example, general practitioners include three (3) subcategories: the general practitioner, the primary general practitioner, and the general practitioner-in-chief. They can also take senior positions.

3) Public health specialist medical practitioners: they are specialized physicians. There are three (3) subcategories of specialized medical practitioners, which are: assistant specialist, senior specialist, and chief specialist. They can also take senior positions.

'Patient' dimension table key The proposed Id-patient code, illustrated in Fig. 7, is an attribute identifier of the Patient dimension table, designed using group coding. It consists of three fields. The first field, 'patient type' (1 character), is used to identify the category of the patient (insured, uninsured, and foreigner); the second field indicates the social security registration number (12 digits) if it exists, otherwise a number is attributed; and the third field (2 characters) is used to identify the insured's rightful claimants. We note that 80% of the Algerian population is covered by insurance and therefore have a social security registration number.

[Fig. 7 Id-patient code signification]
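As an illustration (a hypothetical sketch: the field positions follow the layout described above), the parts of this group-coded key can be recovered in HiveQL with string functions:

-- Decomposition of the group-coded Id-patient key: 1-character
-- patient type, 12-digit social security number, 2-character
-- rightful-claimant code.
SELECT substr(id_patient, 1, 1)  AS patient_type,
       substr(id_patient, 2, 12) AS social_security_number,
       substr(id_patient, 14, 2) AS rightful_claimant
FROM patient;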
After the partitioning and bucketing operations over the data are performed as explained in subsection 5.4.B, the dimension tables are created and loaded using Hive, and then the fact tables are loaded by joining the necessary dimension tables.
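A hypothetical sketch of such a load for one (region, establishment) partition (the staging table from the ETL sketch above, the 'BJA' and 'CHU-BJA' key values, and the full_date join column are illustrative assumptions):

-- Hive 1.x: make INSERTs respect the declared bucketing.
SET hive.enforce.bucketing = true;

-- Load one partition of the fact table from the staging extract,
-- joined with the Date dimension to resolve the Id-date key.
INSERT OVERWRITE TABLE hospitalization
PARTITION (id_region = 'BJA', id_h_estab = 'CHU-BJA')
SELECT s.id_hospitalization, s.id_patient, s.id_illness, s.id_doctor,
       s.id_service, s.id_equipment, d.id_date, s.nb_days
FROM staging_hospitalization s
JOIN date_dim d ON s.admission_date = d.full_date;  -- assumed column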
Table 4 shows the estimated size of the tables stored during the year 2015. The first nine table rows show the minimum and maximum (according to the storing DataNodes) sizes of the dimension tables (Illness, Patient, Doctor, Region, Date, Health-Establishment, Equipment, Service, and Consultation-Center). The last two show the minimum and maximum sizes of the fact tables (Hospitalization and Consultation).

Table 4 Fact and dimension tables' size

Table                                           Minimal size  Maximal size
Source file of Illness dimension                75 MB         375 MB
Source file of Patient dimension                15 MB         41 MB
Source file of Equipment dimension              0.5 MB        1.4 MB
Source file of Date dimension                   9 MB          26 MB
Source file of Doctor dimension                 0.4 MB        3.6 MB
Source file of Service dimension                0.6 MB        1.4 MB
Source file of Region dimension                 0.2 MB        0.6 MB
Source file of Health-establishment dimension   0.1 MB        0.3 MB
Source file of Consultation-center dimension    0.1 MB        1.8 MB
Source file of Hospitalization fact table       1.6 GB        12.7 GB
Source file of Consultation fact table          4.8 GB        38.2 GB

Data replication strategy

Our proposed Hadoop-based warehouse must handle the case of node failures, which can be frequent in such large clusters. To address this problem, we implemented data placement and replication strategies, depicted in Table 5, to improve data reliability, availability, and network bandwidth utilization, and to reduce the effect of certain failures such as single-node and whole-rack failures. As shown in Table 5, each establishment's data is stored in its own DataNode server and replicated in two other DataNode servers. For instance, the data of the CHU-BJA establishment is stored in its own DataNode (DataNode 1) and replicated in two other DataNode servers (DataNode 9 and DataNode 15).

[Table 5 Data replication strategy: for each establishment, the DataNode holding its data and the two DataNodes holding its replicas.]

Preliminary results

In this sub-section, we give examples of how to use the framework to address the problem of medical resources distribution. Indeed, we used the set of data described in Table 4. We give two reports; the first one is based on the 'Hospitalization' fact table, and the second one is based on the 'Consultation' fact table.

Figure 8 shows one of the first results of the reporting phase, which consists of a comparison between the daily average of patients requiring hospitalization and the number of available hospitalization places (empty beds) in the university hospital and the five public hospital institutions of different cities.

[Fig. 8 Graphic representation of the patients requiring hospitalization and the hospitals' empty beds]
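A hypothetical sketch of the kind of HiveQL query behind such a report (the 'BJA' region code, the integer yyyymmdd encoding of Id-date, and the 365-day averaging window are illustrative assumptions):

-- Daily average of admissions per establishment of one region in
-- 2015. Because id_region is the first partition level, Hive prunes
-- all partitions of the other regions before scanning any data.
SELECT h.id_h_estab,
       COUNT(*) / 365.0 AS avg_daily_admissions
FROM hospitalization h
WHERE h.id_region = 'BJA'
  AND h.id_date BETWEEN 20150101 AND 20151231
GROUP BY h.id_h_estab;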

[Fig. 9 Graphic representation of the rates of required medical visits and the outpatient centers' capacity]

From the report of Fig. 8, we notice the lack of sufficient empty beds in the hospitals of two cities, AMIZOUR and AOKAS, for patients requiring hospitalization (usually, patients' hospitalization is postponed if possible, or they are transferred to other bordering hospitals). From the previous result, managers can generate analytical and informative reports to enhance the yield of the health sector in Bejaia and make the right decisions, with the aim of ensuring an optimal distribution of the health care system's components: for instance, increasing the capacity of the hospitals in the cities of AMIZOUR and AOKAS by giving them a high priority in the regional health master plan.

The report presented in Fig. 9 illustrates a comparison between the daily average of outpatient visits and the capacity of the health outpatient centers per city. Figure 9 shows that in two cities, BEJAIA and TAZMALT, the rate of outpatient visits exceeds the outpatient centers' capacity. This situation reflects an inequitable distribution of the health resources. To address this shortfall, the decision-makers have to take the necessary action by increasing the capacity of the outpatient centers to meet the needs of both regions. Therefore, in this situation, the decision to take is to add new consultation rooms in both cities, BEJAIA and TAZMALT.

Discussion and comparisons to previous work

Several studies have suggested using medical data to ensure equity and equality in healthcare. For instance, Kuo et al. [1] argue that medical data can be used to identify healthcare trends, to prevent diseases, to combat social and health inequality, to unlock new sources of economic value, to provide fresh insights into science, and to hold governments accountable. However, few studies have actually attempted to address the issue of an equitable and equal healthcare resources distribution using data warehousing and big data technologies. This could be explained by the slow adoption of data warehousing technology in the clinical field, and by the fact that most of the studies on clinical data warehousing are directed towards specific diseases, as detailed in Section 2.

Indeed, to ensure the availability and equitable distribution of health resources to people, most countries use the WHO guideline expressed as a ratio of the resource to a number of inhabitants. For instance, the WHO recommended critical threshold for the personal health ratio (doctors and nurses) providing patient care is 2.5 per 1000 population. Although this option is very important, it is not sufficient for ensuring greater equity in the distribution of medical resources, since it does not take into account the specificities of each region and each population, as is well expressed in Fig. 10.

In our previous work [11–13, 16], we proposed a data warehousing based framework to address the problem of medical resources allocation. However, that framework fails at scaling up and does not consider unstructured medical data. Through this work, we have demonstrated that using a Hadoop-based architecture combined with our nested partitioning technique solves the scaling, heterogeneity, and data size issues. We proposed a scalable, cost-effective, highly available, and fault-tolerant solution, through a scalable architecture designed in such a way as to allow extending the nodes of the cluster as per requirement. It is cost effective since the nodes are not necessarily high-performance computers, so there is no need to invest much in the hardware. Availability and fault tolerance are guaranteed through a replication strategy.
[Fig. 10 Equality and equity of medical resources distribution]
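As a rough arithmetic illustration of this guideline (using the population figure given above): for a population of about one million, such as that of the Wilaya of Bejaia, the 2.5-per-1000 threshold corresponds to at least 2500 doctors and nurses providing patient care; equity, as Fig. 10 suggests, further requires that they be distributed according to the needs of each region and population.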

Conclusion

The recent work and projects on Hadoop-based medical data warehousing described in this study show that the Hadoop community in the medical field is growing. This is essentially due to the cost-effectiveness of Hadoop-based solutions, which also address the issues of traditional medical DWs. In this paper, we have developed a Hadoop-based architecture and a conceptual data model for a medical big data warehouse, building on current research on big data modeling and tools. We have shown that the problem of primary and foreign keys in Apache Hive can be resolved using nested partitioning. The proposed solution was applied to the presented case study by designing and implementing a DW platform to ensure an equitable distribution of health resources.
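To illustrate the idea behind nested partitioning (a sketch of the mechanism under assumed table names, not our platform's exact DDL), the columns that would otherwise serve as foreign keys can be declared as nested partition columns, so that Hive's partition pruning plays the role of a key-based lookup:

    -- The would-be foreign keys become nested partition columns;
    -- HDFS lays the data out as .../city=BEJAIA/center_id=12/...
    CREATE TABLE visit_fact (
      patient_id BIGINT,
      visit_date DATE,
      diagnosis  STRING
    )
    PARTITIONED BY (city STRING, center_id BIGINT)
    STORED AS ORC;

    -- A filter on the partition columns reads only the matching
    -- sub-directories, much as a key index would:
    SELECT COUNT(*) FROM visit_fact
    WHERE city = 'BEJAIA' AND center_id = 12;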
Acknowledgements This work was partially supported by the Ministry of Higher Education and Scientific Research of Algeria and the University of Bejaia, under the project CNEPRU (Ref. B*00620140066/2015-2018).

Compliance with Ethical Standards

Conflict of Interest The authors declare that they have no conflict of interest.

Ethical Approval This article does not contain any studies with human participants or animals performed by any of the authors.
References

1. Kuo, M.H., Sahama, T., Kushniruk, A.W., Borycki, E.M., and Grunwell, D.K., Health big data analytics: Current perspectives, challenges and potential solutions. Int. J. Big Data Intell. 1(1–2):114–126, 2014. https://doi.org/10.1504/IJBDI.2014.063835.
2. Cuzzocrea, A., Warehousing and protecting big data: State-of-the-art analysis, methodologies, future challenges. In Proceedings of the International Conference on Internet of Things and Cloud Computing (p. 14). ACM, 2016. https://doi.org/10.1145/2896387.2900335.
3. White, T., Hadoop: The definitive guide (third edition). O'Reilly, 2012. ISBN: 978-1-449-322252-0.
4. Sumathi, S., and Esakkirajan, S., Fundamentals of relational database management systems (Vol. 47). Springer, 2007. ISBN: 978-3-540-48397-7.
5. Ewen, E.F., Medsker, C.E., and Dusterhoft, L.E., Data warehousing in an integrated health system: Building the business case. In Proceedings of the 1st ACM International Workshop on Data Warehousing and OLAP (pp. 47–53). ACM, 1998. https://doi.org/10.1145/294260.294271.
6. Pedersen, T.B., and Jensen, C.S., Research issues in clinical data warehousing. In Proceedings of the Tenth International Conference on Scientific and Statistical Database Management (pp. 43–52). IEEE, 1998. https://doi.org/10.1109/SSDM.1998.688110.
7. Guérin, E., Moussouni, F., Courselaud, B., and Loréal, O., UML modeling of Gedaw: A gene expression data warehouse specialised in the liver. In Proceedings of the 3rd French Bioinformatics Conference: JOBIM 2002 (pp. 319–334), Saint-Malo, France, 2002.
8. Banek, M., Tjoa, A.M., and Stolba, N., Integrating different grain levels in a medical data warehouse federation. In International Conference on Data Warehousing and Knowledge Discovery (pp. 185–194). Springer Berlin Heidelberg, 2006. https://doi.org/10.1007/11823728_18.
9. Kerkri, E.M., Quantin, C., Allaert, F.A., Cottin, Y., Charve, P., Jouanot, F., and Yétongnon, K., An approach for integrating heterogeneous information sources in a medical data warehouse. J. Med. Syst. 25(3):167–176, 2001. https://doi.org/10.1023/A:1010728915998.
10. Pavalam, S.M., Jawahar, M., and Akorli, F.K., Data warehouse based architecture for electronic health records for Rwanda. In International Conference on Education and Management Technology (ICEMT) (pp. 253–255). IEEE, 2010. https://doi.org/10.1109/ICEMT.2010.5657660.
11. Sebaa, A., Nouicer, A., Tari, A., Ramtani, T., and Ouhab, A., Decision support system for health care resources allocation. Electron. Physician. 9(6):4661–4668, 2017. https://doi.org/10.19082/4661.
12. Sebaa, A., Nouicer, A., Tari, A., Ramtani, T., and Ouhab, A., Decision support system for health care resources allocation. In Abstracts Book of ICHSMT'16 – International Conference on Health Sciences and Medical Technologies, 2016 Sep 27–29, Tlemcen, Algeria. Mehr Publishing, p. 8, 2016. ISBN: 978-600-96661-0-2.
13. Sebaa, A., Tari, A., Ramtani, T., and Ouhab, A., DW RHSB: A framework for optimal allocation of health resources. Int. J. Comput. Sci. Commun. Inf. Technol. 2(1):12–17, 2015.
14. Wang, L., and Alexander, C.A., Big data in medical applications and health care. Am. Med. J. 6(1):1, 2015. https://doi.org/10.3844/amjsp.2015.1.8.
15. Cuzzocrea, A., Song, I.Y., and Davis, K.C., Analytics over large-scale multidimensional data: The big data revolution. In Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP (pp. 101–104). ACM, 2011. https://doi.org/10.1145/2064676.2064695.
16. Sebaa, A., Nouicer, A., Chikh, F., and Tari, A., Big data technologies to improve medical data warehousing. In Proceedings of the 2nd International Conference on Big Data, Cloud and Applications. ACM, 2017. https://doi.org/10.1145/3090354.3090376.
17. Yao, Q., Tian, Y., Li, P.F., Tian, L.L., Qian, Y.M., and Li, J.S., Design and development of a medical big data processing system based on Hadoop. J. Med. Syst. 39(3):23, 2015. https://doi.org/10.1007/s10916-015-0220-8.
18. Istephan, S., and Siadat, M.R., Unstructured medical image query using big data – an epilepsy case study. J. Biomed. Inform. 59:218–226, 2016. https://doi.org/10.1016/j.jbi.2015.12.005.
19. Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X., and Saltz, J., Hadoop GIS: A high performance spatial data warehousing system over MapReduce. Proc. VLDB Endowment. 6(11):1009–1020, 2013. https://doi.org/10.14778/2536222.2536227.
20. Saravanakumar, N.M., Eswari, T., Sampath, P., and Lavanya, S., Predictive methodology for diabetic data analysis in big data. In 2nd ISBCC. Procedia Computer Science. 50:203–208, 2015. https://doi.org/10.1016/j.procs.2015.04.069.
21. Rodger, J.A., Discovery of medical big data analytics: Improving the prediction of traumatic brain injury survival rates by data mining patient informatics processing software hybrid Hadoop Hive. Informatics in Medicine Unlocked. 1:17–26, 2015. https://doi.org/10.1016/j.imu.2016.01.002.
22. Sundvall, E., Wei-Kleiner, F., Freire, S.M., and Lambrix, P., Querying archetype-based electronic health records using Hadoop and Dewey encoding of openEHR models. Stud. Health Technol. Inform. 235:406, 2017. https://doi.org/10.3233/978-1-61499-753-5-406.
23. Raja, P.V., and Sivasankar, E., Modern framework for distributed healthcare data analytics based on Hadoop. In Information and Communication Technology – EurAsia Conference (pp. 348–355). Springer Berlin Heidelberg, 2014. https://doi.org/10.1007/978-3-642-55032-4_34.
24. Yang, C.T., Liu, J.C., Chen, S.T., and Lu, H.W., Implementation of a big data accessing and processing platform for medical records in cloud. J. Med. Syst. 41(10):149, 2017. https://doi.org/10.1007/s10916-017-0777-5.
25. Sebaa, A., Chikh, F., Nouicer, A., and Tari, A., Research in big data warehousing using Hadoop. J. Inform. Syst. Eng. Manag. 2(2), 2017. https://doi.org/10.20897/jisem.201710.
26. Dean, J., and Ghemawat, S., MapReduce: A flexible data processing tool. Commun. ACM. 53(1):72–77, 2010. https://doi.org/10.1145/1629175.1629198.
27. Wu, S., Li, F., Mehrotra, S., and Ooi, B.C., Query optimization for massively parallel data processing. In Proceedings of the 2nd ACM Symposium on Cloud Computing (p. 12). ACM, 2011. https://doi.org/10.1145/2038916.2038928.
28. Apache Hadoop: http://hadoop.apache.org/, viewed in 02/2015.
29. Taylor, R.C., An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. 11(12):S1, 2010. https://doi.org/10.1186/1471-2105-11-S12-S1.
30. Apache Hive: https://hive.apache.org/, viewed in 02/2015.
31. Liu, X., Thomsen, C., and Pedersen, T.B., ETLMR: A highly scalable dimensional ETL framework based on MapReduce. In Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII (pp. 1–31). Springer Berlin Heidelberg, 2013. https://doi.org/10.1007/978-3-642-37574-3_1.
32. Gao, S., Li, L., Li, W., Janowicz, K., and Zhang, Y., Constructing gazetteers from volunteered big geo-data based on Hadoop. Comput. Environ. Urban. Syst. 61:172–186, 2017. https://doi.org/10.1016/j.compenvurbsys.2014.02.004.
33. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., et al., Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endowment. 2(2):1626–1629, 2009. https://doi.org/10.14778/1687553.1687609.
34. Ross, J., The use of economic evaluation in health care: Australian decision makers' perceptions. Health Policy. 31(2):103–110, 1995. https://doi.org/10.1016/0168-8510(94)00671-7.
35. ANDI: National Agency for Investment Development of Algeria, http://www.andi.dz/index.php/en/secteur-de-sante, viewed in 02/2015.
