Medical Big Data Warehouse: Architecture and System Design, A Case Study: Improving Healthcare Resources Distribution
https://doi.org/10.1007/s10916-018-0894-9
Received: 23 September 2016 / Accepted: 8 January 2018 / Published online: 19 February 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
The huge increase in medical devices and clinical applications generating enormous data has raised a big issue in managing, processing, and mining this massive amount of data. Indeed, traditional data warehousing frameworks cannot be effective when managing the volume, variety, and velocity of current medical applications. As a result, several data warehouses face many issues over medical data, and many challenges need to be addressed. New solutions have emerged, and Hadoop is one of the best examples; it can be used to process these streams of medical data. However, without an efficient system design and architecture, this performance will be neither significant nor valuable for medical managers. In this paper, we provide a short review of the literature on research issues of traditional data warehouses, and we present some important Hadoop-based data warehouses. In addition, a Hadoop-based architecture and a conceptual data model for designing a medical Big Data warehouse are given. In our case study, we provide the implementation details of a big data warehouse based on the proposed architecture and data model in the Apache Hadoop platform, to ensure an optimal allocation of health resources.
Keywords: Data warehouse · Hadoop · Big data · Decision support · Medical resources allocation
[3]. Indeed, a major advantage of the Hadoop framework, apart from its scalability and robustness, is its low implementation cost: it uses multiple existing ordinary computer servers instead of one high-performance cluster. Hadoop pushes processing to the data instead of data to the processing. Furthermore, it supports ETL (Extract, Transform, and Load) processes in parallel. However, several issues arise when integrating and warehousing medical big data, such as the lack of a standard and powerful integration architecture, which makes implementing a big data warehouse a major challenge. Moreover, former design methodologies for data models cannot meet all the needs of the new data analysis applications over big data warehousing, given new constraints such as the nature and complexity of medical big data.

The purpose of this paper is to address the problems that arise when using OLAP and data warehousing over big data, especially medical big data. The contributions of this paper can be summarized as follows:

- Firstly, we give an overview of some previous traditional medical DWs (Data Warehouses), their limitations, and some recent Hadoop-based DWs.
- Secondly, we propose a system architecture and a conceptual data model for an MBDW (Medical Big Data Warehouse).
- Thirdly, we offer a solution to overcome both the growth of the fact table size and the lack of the primary and foreign keys required by the conceptual data model in the Apache Hive framework. This solution is based on nested partitioning according to the dimension table keys.
- Finally, we apply our solution to implement an MBDW that improves the distribution of medical resources in the health sector of the Bejaia region (in Algeria).

The remainder of this paper is organized as follows: A brief review of traditional and modern medical DWs is given in Section 2. Section 3 details some concepts and tools used in the rest of this work. A system architecture for the MBDW and a conceptual data model that exploits partitioning and bucketing are presented in Sections 4 and 5. The last section discusses the implementation and experimental results of our case study.

Related works

In this section, we introduce some medical DWs, then we give their limitations and their common drawbacks.

Traditional medical data warehouses

Despite the widespread interest in data integration and historization in the medical field [4], this field has been slow to adopt data warehousing. However, once adopted, several studies and projects were conducted on medical DWs. Among the first ones, Ewen et al. [5] highlighted the need for a DW in the health sector. The authors in [6] showed the main differences between conventional business-oriented DWs and clinical DWs; they also identified key research issues in clinical data warehousing. GEDAW [7] is a bioinformatics project that consisted of building a DW on the hepatic transcriptome. This project aimed to bring together, within a single knowledge base, the complex, varied, and numerous data on liver genes for analysis. Its objective was to provide researchers with a comprehensive tool for integrating the transcriptome and to provide decision support to guide biological research. The DWs of different health insurance organizations in Austria were merged in an evidence-based medicine collaboration project [8] called HEWAF (Healthcare Warehouse Federation). Kerkri et al. [9] presented the EPIDWARE architecture to build DWs for epidemiological and medical studies; it improved the medical care of patients in care units. Pavalam et al. [10] proposed a DW-based architecture for EHR (Electronic Health Record) data in the Rwandan health sector; it can be accessed from different applications and allows data analysis for a swift decision-making process as well as epidemic alerting. A data warehouse that gives information on epidemiology and public health, on various diseases, their concentration, and the repartition of resources in the Bejaia department is proposed in [11–13]. The aim of this DW is to improve the allocation of medical resources in the Bejaia region.

Issues and limitations

The dramatic increase in devices and applications generating enormous medical data has raised big issues in managing, processing, and mining these massive amounts of data. New solutions are needed, since traditional frameworks can no longer handle the volume, variety, and velocity of current medical applications. As a result, the previously presented DWs and several others face many common issues and limitations over medical big data. Among these issues, we highlight the most important ones. Firstly, the huge volume and fast growth of medical data: in the medical field, a large amount of information about patients' medical histories, symptomatology, diagnoses, and responses to treatments and therapies is generated, and it must be collected. Secondly, the unstructured nature of medical data: unstructured documents, images, and clinical notes represent over 80% of current medical data [14], and such unstructured data should be converted into analysis-ready datasets. Thirdly, the complexity of data modeling: it is not easy to compute OLAP data cubes over big data, mainly due to the explosive size of the data sets and the complexity of multidimensional data models [2]. For instance, if we have 10,000 diagnoses, this would amount to 2^10,000 dimensions; therefore, this solution is …
[Table 1: Summary of recent Hadoop-based data warehousing works, including: a cloud storage platform with HBase of Hadoop for integrating and storing data; the implementation and testing of an archetype-aware Dewey encoding optimization; Map-Reduce improved by spatial data partitioning and R*-tree indexes, with Hive extended with spatial query support; and a scalable solution for analytical spatial queries over large-scale spatial datasets in different geographic locations.]

Hadoop-based data warehousing
… follow such new platforms. Thus, a small number of studies … the type of diabetes. Rodger [21] … to determine how the set of collected variables relates to the body injuries: he collected data on three ship variables (Byrd, Boxer, and Kearsage) and injuries to different body regions such as head, torso, extremities, and abrasions, and he proposed a hybrid approach on multiple ship databases where various scalable machine-learning algorithms have been employed. Raja and Sivasankar [23] proposed a framework based on Hadoop to modernize healthcare informatics systems by inferring knowledge from various healthcare centers in different geographic locations.

Yang et al. [24] built a cloud-based storage and distributed processing platform to store and handle medical records data using HBase of the Hadoop ecosystem, providing diverse functions such as keyword search, data filtering, and basic statistics. Depending on the size of the data, the authors used either the single-threaded Put method or the Complete-Bulkload mechanism to improve the performance of data import into the database.

The listed works [17–24] used high-performance computers and very powerful IT infrastructures that required a very substantial investment to exploit the great processing capabilities of Hadoop technology. In this work, our goal is to re-use the health institutions' own IT infrastructure and workstations, without making a great investment in hardware capabilities, while still adopting data warehousing and OLAP methodologies. However, as indicated in [25], data warehousing using Hadoop and Hive faces many challenges. Firstly, Hadoop does not support OLAP operations, and the multi-dimensional model cannot meet all the needs of new data analysis applications. Secondly, the Apache Hive query language, HiveQL, does not support many standard SQL queries. Overcoming such limitations and challenges in the medical field is the objective of this work.
Background review

This section details some concepts and tools used in this study.

Big data concept

The 'big data' concept is often described by the three Vs and, more recently, by the six Vs, which stand for Volume, Velocity, Variety, Veracity, Variability, and Value of data. Indeed, the high volume of data generation refers to the volume (1st V). The rate of data generation and the speed at which it should be processed refer to the velocity (2nd V). The heterogeneity and diversity of the data types refer to the variety (3rd V). The degree of reliability and quality of the data sources refers to the veracity (4th V). The disparity and variation of the data flow rates refer to the variability (5th V). Finally, the value (6th V) of data depends on the volume of analyzed data. Traditional data management systems cannot handle data with such characteristics. Therefore, new platforms and technologies were developed.
Big data technologies

To manage the volume, velocity, variety, and variability of big data, several technologies, tools, and frameworks were developed. The most important ones are: the Map-Reduce-based systems, for instance BigTable, HadoopDB, and the Hadoop ecosystem with its most important components such as HBase, Pig, and Hive; the NoSQL databases, for instance MongoDB, Cassandra, and VoldemortDB; the in-memory frameworks, for instance Apache Spark; the graph databases, for instance Neo4J, AllegroGraph, InfiniteGraph, HyperGraphDB, InfoGrid, and Google Pregel; the RDBMS-based systems, for example Greenplum, Aster, and Vertica; and the stream data processing frameworks, for instance Apache S4, Apache Kafka, Apache Flume, Twitter Storm, and Spark Streaming.
Map-reduce paradigm

The Map-Reduce paradigm [26] is a programming model allowing parallel computation of tasks using two user-defined phases: the map phase and the reduce phase. In the first, "Map", phase, a mapper defined by the user loads a data chunk from the DFS and transforms it into a list of intermediate key/value (ki, vi) pairs. The key-value pairs are then buffered into r files, and all key-value files are sorted by key. In the second, "Reduce", phase, a reducer defined by the user combines the files coming from the different mappers. The final results are written back to the DFS [27].
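As a concrete illustration (our own sketch, in the HiveQL used later in this paper, with a hypothetical consultation table), a simple aggregation executes as exactly such a map-reduce job:

```sql
-- Sketch of the map-reduce execution of a HiveQL aggregation.
-- Map phase: each mapper reads a chunk of the table from the DFS and
-- emits an intermediate pair (id_region, 1) for every row it scans.
-- Shuffle: pairs are sorted and grouped by key (id_region).
-- Reduce phase: each reducer sums the values of one key group.
SELECT id_region, COUNT(*) AS nb_visits
FROM consultation          -- hypothetical fact table
GROUP BY id_region;
```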
Hadoop ecosystem

The Hadoop ecosystem consists of several projects and libraries, such as the massive-scale database management solution HBase, the data warehousing solution Hive, the machine learning library Mahout, the job tracking and scheduling suite ZooKeeper, Pig, and other related libraries and packages for massively parallel and distributed data management.

Hadoop

Apache Hadoop [28] is open source software allowing large-scale parallel distributed data analysis. It automatically replicates and collocates data across multiple different nodes, then allows parallel processing across clusters. It is an open-source implementation of Google's Map-Reduce computing model. It is based on HDFS (the Hadoop Distributed File System), which provides high-throughput access to application data. Hadoop provides the robust, fault-tolerant HDFS, ensuring reliability. Doug Cutting implemented the first version of Hadoop in 2004, and it became an Apache Software Foundation project in 2008 [29]. On November 17, 2017, release 2.9.0 of Apache Hadoop became available [28].
Hive

Apache Hive [30] was initially a subproject developed by the Facebook team. Hive is used in conjunction with the map-reduce model of Hadoop to structure the data, to run queries allowing the creation of a variety of reports and summaries, and to perform historical analysis over this data. In Hive, tables are used to store data; such a structure consists of a number of rows, and each row comprises a specified number of columns. Queries are expressed in an SQL-like declarative language called HiveQL; these queries are compiled into map-reduce jobs and executed using Hadoop. Hive supports primitive and complex types. It processes data for analysis but not to serve users, so it does not need the ACID guarantees (as in a traditional relational database) for data storage.
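A minimal HiveQL session illustrating these points might look as follows; the admission table and its columns are hypothetical examples, not the paper's schema:

```sql
-- Create a plain Hive table backed by delimited files in HDFS.
CREATE TABLE IF NOT EXISTS admission (
  id_patient INT,
  admission_day STRING,
  nb_days INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- An SQL-like query; Hive compiles it into map-reduce jobs.
SELECT admission_day, AVG(nb_days) AS avg_stay
FROM admission
GROUP BY admission_day;
```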
System architecture and data flow

Implementing an efficient medical data warehouse based on the Hadoop ecosystem requires a flexible and modular architecture for big data warehousing. This section describes the proposed architecture along with the functionalities of each layer. The overall architecture is depicted in Fig. 1. It is a scalable, reliable, and distributed architecture to extract, store, analyze, and visualize healthcare data extracted from various HIS (Hospital Information Systems) resources. In the following, we give some details about the components and levels of the proposed architecture and the data flow within these components.

(1) Medical data source

In each region (the provinces of a department, according to the geographical division), medical data is extracted from multiple distributed computers of the HIS (Hospital Information Systems), information systems, laboratory software, radiology data, and from the regional directorate of Health.
(2) Distributed ETL

An ETL is the component responsible for collecting data from different data sources, transforming it, and cleaning it based on business rules and requirements defined by the final user [31]. Our ETL approach is based on partitioning the input data horizontally according to the nested partitioning technique (Section 5.1). In order to run in a distributed and parallel way, the extraction phase is achieved in each medical establishment of each geographic division (region) to capture data. During this phase, all source tables are extracted from the appropriate data sources and then converted into a columnar structure (CSV format). The transformation phase involves data cleansing to comply with the target schema, based on the description of the multi-dimensional data model (partitioned schema), which is stored as meta-data in an HDFS file. Finally, in the loading phase, the data is propagated into the regional storage system (database server). As we will see in the next section, the data model is partitioned according to the dimension tables (in our case study, the keys of the 'Region' dimension table, then the 'Health-establishment' dimension table, and then the 'Date' dimension table). The integrated and stored data of each database server are then replicated in the DataNodes of the Hadoop cluster according to a replication strategy.

New database servers, and automatically new DataNodes on the Hadoop cluster, will be created for new sources (for instance, new hospitals). Such a feature ensures the scalability of this architecture.
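As a sketch of this loading phase (our assumption of how it could be expressed, with hypothetical paths and names), the cleansed CSV extract of one establishment could be pushed into the corresponding partition of a Hive fact table:

```sql
-- Load one establishment's cleansed CSV extract into the partition
-- matching its region key (the table is assumed to be declared with
-- PARTITIONED BY (id_region INT), as sketched in Section 5).
LOAD DATA INPATH '/staging/region_06/hosp.csv'
INTO TABLE hospitalization
PARTITION (id_region = 6);
```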
[Fig. 1: Overall architecture of the proposed MBDW — regional database servers and Web-based access.]
(3) Hadoop ecosystem

The core of our architecture is the Apache Hadoop [28] framework. It is an open-source implementation of Google's Map-Reduce computing model. It is based on HDFS (the Hadoop Distributed File System), which provides high-throughput access to application data.

(3.1) Hadoop. A Hadoop cluster is the corpus of server nodes with different physical locations within a group on Hadoop [32]. It automatically replicates and collocates data across multiple nodes, then performs parallel processing across clusters using the Map-Reduce paradigm. Thus, Hadoop reduces infrastructure cost. Its most important components are:

- NameNode acts as a metadata repository describing the location of the data blocks of each file stored in HDFS. This makes massive data processing easier.
- Secondary NameNode is in charge of periodically checking the persistent status of the NameNode and of downloading its current snapshot and log files.
- DataNode runs separately on every machine of the Hadoop cluster. In the proposed architecture, the data servers in which data integration is done (point 2 of Section 4) are automatically replicated in one or more DataNodes. Furthermore, a DataNode is responsible for storing the structured and unstructured data and for answering data reading and writing queries.

(3.2) Hive. Apache Hive [30] is a data warehousing structure. It is used in conjunction with the map-reduce model of Hadoop to process the data. Indeed, Hive stores data in tables, which consist of a number of rows, each row comprising a specified number of columns. HiveQL is used to express queries over Hive. User queries over Hive are compiled into map-reduce jobs, which Hadoop (point 3.1 of Section 4) then executes. Such queries allow creating a variety of reports, summaries, and historical analyses. The main functional components of Hive are:

- Metastore: it contains metadata about the tables stored in Hive, such as data formats, partitioning and bucketing, and storage information including the location of each table's data in the underlying file system. It is specified during table creation and reused every time the table is referenced in HiveQL.
- Hive engine: it converts queries written in HiveQL into map-reduce code and then executes it on the Hadoop cluster.

(4) Web-based Access

Our system allows trusted users (point 6 of Section 4) to access data in order to create a variety of reports and data summaries, after authentication through the Web browser.

(5) Analysis and reporting tools

This layer includes several tools for reporting, planning, dashboards, and data mining. These tools allow data to be analyzed statistically to track trends and patterns over time, and then to produce regular reports tailored to the needs of the various users of the decision support system.
(6) MBDW users

They can be doctors, medical researchers, hospital managers, health administrators, and governments. All such users can interact with the system.

… the process takes as input the schema composed of the data warehouse table files (F, D1, D2, …, Dn) and outputs k schemas: (F1, D1, D2, …, Dn), …, (Fk, D1, D2, …, Dn). The table F is fragmented into k fragments F1, …, Fk, which are computed in the following way:
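The computation rule itself does not survive in this extract; a standard formulation of such derived horizontal fragmentation, consistent with the nested partitioning described below, would be the semi-join form (our reconstruction, not necessarily the authors' exact formula):

```latex
% F is the fact table, D_1 the driving dimension table (e.g. Region),
% and p_i the predicate selecting the i-th dimension key; each
% fragment F_i keeps the fact rows joining with that key.
F_i = F \ltimes \sigma_{p_i}(D_1), \qquad i = 1, \dots, k
```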
Partitioning In Hive, each table can have one or more partitions, which determine the distribution of the data within sub-directories of the table directory: each partition has its own directory, according to the partitioning field, and can have its own columns and storage information [33]. Using partitions can make queries faster. For instance, a table can be partitioned by date, and records with the same date would be stored in the same partition.
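For instance, such a date-partitioned table could be declared and queried as follows (a sketch with hypothetical names):

```sql
-- Each distinct id_date value gets its own sub-directory; the
-- equality predicate lets Hive prune every other partition.
CREATE TABLE visit (id_patient INT, motive STRING)
PARTITIONED BY (id_date STRING);

SELECT COUNT(*) FROM visit WHERE id_date = '2015-06-01';
```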
Bucketing It is another technique to decompose Hive tables (data sets) into a defined number of parts, which will be stored as files. Indeed, the data in each partition can be divided into buckets based on the hashing of a column of the table. Each bucket is stored in the partition directory as a file. Metadata about the bucketing information of each table is stored in the Metastore [33]. Among the advantages of bucketing, the number of buckets is fixed, so it does not vary with the data.
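A bucketed variant of the same hypothetical table might be declared as follows; rows are assigned to one of the fixed number of files by hashing the bucketing column:

```sql
-- 16 files per partition, chosen by hash(id_patient) mod 16.
CREATE TABLE visit_bucketed (id_patient INT, motive STRING)
PARTITIONED BY (id_date STRING)
CLUSTERED BY (id_patient) INTO 16 BUCKETS;

-- In Hive 1.x, enforce bucketed writes during inserts.
SET hive.enforce.bucketing = true;
```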
As shown in Fig. 2, nested partitioning means dividing the fact table into many levels of partitions based on the different dimensions. Indeed, the fact table is first partitioned, using a Hive partition, based on a first dimension (on the attribute that should be considered as the primary key of the first dimension table). Then, the partitioned tables are divided using buckets based on a second dimension, and so on.

Using partitioned tables allows distributing, and reducing the data volume of, a single fact table into many distributed fact tables. Therefore, these partitioned fact tables optimize the use of the data warehouse resources (distributed processing, memory) and improve query execution performance.
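One possible HiveQL realization of this scheme for the 'Hospitalization' fact table of our case study is sketched below; the lowercase names, measure column, and bucket count are assumptions, and deeper levels (e.g., the 'Date' key) can be expressed as additional partition columns:

```sql
-- First level: one partition per 'Region' key; second level:
-- buckets on the 'Health-establishment' key.
CREATE TABLE hospitalization (
  id_hospitalization INT,
  id_patient INT,
  id_doctor INT,
  id_illness INT,
  id_h_estab INT,
  id_date STRING,
  nb_days INT                                 -- illustrative measure
)
PARTITIONED BY (id_region INT)                -- first dimension
CLUSTERED BY (id_h_estab) INTO 16 BUCKETS;    -- second dimension
```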
Case study: Improving healthcare resources distribution in Bejaia health sector

In the two previous sections, we presented our proposal for a medical big data warehouse (MBDW) architecture and a data model built on a Hadoop cluster to perform analytical processing. The purpose of this section is to apply and validate our solution on a case of medical data. An MBDW is designed for the health sector of the Wilaya of Bejaia in Algeria, to help physicians and health managers understand, predict, and improve the availability and distribution of healthcare resources. In this section, we first describe the study objectives and settings. Then we give the implementation details of the proposed MBDW solution and some preliminary results, and finally we discuss our solution compared to the previous ones.

Study objectives

Decision-making about the alternative uses of healthcare resources is an issue of critical concern for governments and administrators in all health care systems [34]. On the other hand, data from medical sources can be used to identify healthcare trends, prevent diseases, combat social inequality, and so on [1]. Thus, using medical data to improve the distribution of health resources is an important challenge in emerging countries.

The purpose of this case study is to provide decision makers with a clear view of the Bejaia health sector. It will guide leaders to make better decisions leading to equity in the distribution of medical resources and to a significant reduction of costs; to offer a more efficient method for accurate health management; to improve the availability of clinical material and human resources; and to increase the quality of services and patient safety. It will allow managing the different orientations during the planning stages and having a complete predictive vision, using a better repartition of the care offering (hospital specialization, the number of health centers per town, the allocation of hospital equipment, the number of specialists per hospital, and the number of beds per specialty and service). These facilities have to be regularly adapted according to the changes in demand (changing health techniques, diseases, health structures, population age, and geographical location of the population).

Our decision support system gives information about: the place and date of health center construction and the required specialty; the best strategy for recruiting medical professionals for each hospital; the equipment required and most appropriate for a hospital (e.g., a CT scanner or a radiological unit); and screening dates and places (e.g., breast cancer screening, cervical screening, etc.). It also aims to give information that helps medical research.

Study settings

The plan of the Bejaia Health Sector is a part of the Algerian Health master plan. The latter provides, for the period 2009–2025, investments of 20 billion Euros for constructing new health facilities and modernizing existing hospitals. Such investments have been initiated for the construction and maintenance of infrastructure and hospital equipment and for the education of medical professionals [35]. The outline of this program is to build 172 new hospitals, 45 specialized health complexes, 377 polyclinics, 1000 outpatient rooms, 17 paramedical training schools, and more than 70 specialized institutions for persons with disabilities.

Currently, the Bejaia Health Sector manages healthcare for a population of around 1 million people of the Wilaya (department) of Bejaia. The Wilaya of Bejaia is located in the north of Algeria, on the Mediterranean coast. It is administratively divided into 19 Daïras (sets of municipalities) and covers 51 municipalities, which stretch over an area of 3268 km². As shown in Fig. 3, it includes the following health structures: one university hospital, five public hospital institutions, and one specialized public hospital establishment, with a total of 1533 technical beds, 8 nearby public health establishments with 51 polyclinics, one paramedical school, and several laboratories. It employs 33 hospital-university practitioners, 245 specialist medical practitioners, 734 general medical practitioners, and 2742 paramedical staff.

To achieve analytical processing of medical data, we developed the MBDW by extracting, transferring, and collecting data from the different operating systems and software of the Bejaia health structures, which include PATIENT 3.40 (patient data management software), Microsoft EXCEL, and EPIPHARM (a drug stock management software); the data is automatically stored in the DW.
MBDW implementation

The implementation of the Hadoop-based architecture is built on the following hardware resources of the computer equipment of the medical institutions: memory (RAM) capacity ranging from 2 to 10 GB, processor speed ranging from 1.6 GHz to 2.4 GHz, and disk space ranging from 250 GB to 2 TB for data storage. Computer nodes are connected and networked with RJ45 LAN cables and switches.

Several software packages were used to build the MBDW platform, by deploying Apache Hadoop 2.6.0, Hive 1.2.1, ZooKeeper, and HDFS under the Ubuntu Linux operating system, as depicted in Table 2. Table 3 shows some of the Hadoop cluster configurations used; such configuration parameters are essential for our system's operation and performance.

MBDW data model

- Fact table 'Hospitalization'. It represents all information about the patients' hospitalization periods. It holds the primary key Id-hospitalization; the foreign keys of the dimension tables: Id-patient, Id-illness, Id-doctor, Id-service, Id-equipment, Id-H-establishment, Id-region, and Id-date; and several measures.
- Fact table 'Consultation'. It exposes all information about outpatient visits in a health facility. It contains the primary key Id-consultation; the foreign keys of the dimension tables: Id-patient, Id-illness, Id-doctor, Id-C-center, Id-H-establishment, Id-region, and Id-Date; and also several measures about the outpatient visit.
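As an illustration, a typical star query over these fact and dimension tables (a sketch; lowercase names and the counted measure are assumed) could count hospital stays per establishment and illness family:

```sql
-- Join the fact table with a dimension; the id_region predicate
-- prunes every partition except the targeted region.
SELECT h.id_h_estab, i.family_illness, COUNT(*) AS nb_stays
FROM hospitalization h
JOIN illness i ON h.id_illness = i.id_illness
WHERE h.id_region = 6
GROUP BY h.id_h_estab, i.family_illness;
```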
[Fig. 4: Star schemas of the MBDW. Fact table 'Consultation': Id-consultation; foreign keys #Id-patient, #Id-doctor, #Id-illness, #Id-C-center, #Id-H-estab, #Id-Region, #Id-Date; measures. Fact table 'Hospitalization': Id-Hospitalization; foreign keys #Id-patient, #Id-doctor, #Id-illness, #Id-equipment, #Id-service, #Id-H-estab, #Id-Region, #Id-Date; measures. Dimensions: Illness (Id-illness, Family-illness, Name-illness); Patient (Id-patient, Age, Gender, …); Equipment (Id-equipment, Name-equipment, #Id-H-estab); Consultation-Center (Id-C-center, Name-C-center, #Id-H-estab); Doctor (Id-doctor, Specialty, #Id-H-estab); Service (Id-service, Pole, Name-service, Service-capacity, #Id-H-estab); Region (Id-region, Name-region, Department, City); Health-Establishment (Id-H-estab, Type, Name-H-estab, #Id-region).]

[Figure: nested partitioning of the 'Hospitalization' and 'Consultation' fact tables along the Region, H-Estab, and Date keys, surrounded by the Illness, Equipment, Patient, Consultation-Center, Service, and Doctor dimensions.]
In the next sub-section, we give some important information about the keys of the dimension tables. Such information is essential for the system users.

Keys of dimension tables

[Fig. 5: 'Id-illness' code signification.]

… - Chronic disease (δ): this part of the field identifies and indicates the chronic diseases. Indeed, patients with chronic conditions benefit from broader rights under Algerian social security.

3) Future use field: a field reserved for needs that may occur in the future.
… categories of general medical practitioners, which are: general practitioners, general pharmacists, and general dentists. For example, general practitioners include three (3) subcategories: the general practitioner, the primary general practitioner, and the general practitioner-in-chief. They can also take senior positions.

3) Public health specialist medical practitioners: they are specialized physicians. There are three (3) subcategories of specialized medical practitioners: assistant specialist, senior specialist, and chief specialist. They can also take senior positions.

Our proposed Hadoop-based warehouse must handle the case of node failures, which can be frequent in such large clusters. To address this problem, we implemented data placement and replication strategies, depicted in Table 5, to improve data reliability, availability, and network bandwidth utilization, and to reduce the effect of certain failures such as single-node and whole-rack failures. As shown in Table 5, each establishment's data is stored in its own DataNode server and replicated in two other DataNode servers. For instance, the data of the CHU-BJA establishment is stored in its own DataNode (DataNode 1) and replicated in two other DataNode servers (DataNode 9 and DataNode 15).
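On the HDFS side, such a strategy relies on the block replication factor; a minimal hdfs-site.xml excerpt keeping three copies of every block (one primary DataNode plus two replicas, as in Table 5) is shown below, while the rack-aware placement itself is configured through Hadoop's topology script mechanism:

```xml
<!-- hdfs-site.xml: keep 3 copies of each block (primary + 2 replicas) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```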
[Fig. 8: Graphic representation of the patients requiring hospitalization and the hospitals' empty beds.]

From the report of Fig. 8, we notice the lack of sufficient empty beds in the hospitals of the two cities AMIZOUR and AOKAS for patients requiring hospitalization (usually these patients' hospitalization is postponed if possible, or they are transferred to other bordering hospitals). From this result, managers can generate analytical and informative reports to enhance the yield of the Bejaia health sector and make the right decisions, with the aim of ensuring an optimal distribution of the health care system's components; for instance, increasing the capacity of the hospitals of AMIZOUR and AOKAS by giving them a high priority in the regional health master plan.

The report presented in Fig. 9 illustrates a comparison between the daily average of outpatient visits and the capacity of the outpatient health centers per city. Figure 9 shows that in the two cities BEJAIA and TAZMALT, the rate of outpatient visits exceeds the outpatient centers' capacity. This situation reflects an inequitable distribution of the health resources. To address this shortfall, decision-makers have to take the necessary action by increasing the capacity of the outpatient centers to meet the needs of both regions. Therefore, in this situation, the decision to take is to add new consultation rooms in both cities, BEJAIA and TAZMALT.

Indeed, to ensure the availability and equitable distribution of health resources to people, most countries use the WHO guideline, expressed as a ratio of a resource to a number of inhabitants. For instance, the WHO recommended critical threshold for the health personnel (doctors and nurses) providing patient care is 2.5 per 1000 population. Although this indicator is very important, it is not sufficient for ensuring greater equity in the distribution of medical resources, since it does not take into account the specificities of each region and each population, as is well expressed in Fig. 10.

In our previous work [11–13, 16], we proposed a data warehousing based framework to address the problem of medical resources allocation. However, that framework fails to scale up and does not consider unstructured medical data. Through this work, we have demonstrated that using a Hadoop-based architecture combined with our nested partitioning technique solves the scaling, heterogeneity, and data size issues. We proposed a scalable, cost-effective, highly available, and fault-tolerant solution, through a scalable architecture that allows extending the nodes of the cluster as required. It is cost-effective since the nodes are not necessarily high-performance computers, so there is no need to invest much in hardware. Availability and fault tolerance are guaranteed through a replication strategy.
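Such a per-region indicator is straightforward to compute in the warehouse; the following is a hedged sketch, assuming lowercase table names and a hypothetical population column in the Region dimension:

```sql
-- Doctors per 1,000 inhabitants by region, to be compared with the
-- WHO critical threshold of 2.5 (doctors and nurses per 1,000).
SELECT r.name_region,
       1000.0 * COUNT(d.id_doctor) / MAX(r.population) AS doctors_per_1000
FROM doctor d
JOIN health_establishment e ON d.id_h_estab = e.id_h_estab
JOIN region r ON e.id_region = r.id_region
GROUP BY r.name_region;
```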
… Springer Berlin Heidelberg, 2014. https://doi.org/10.1007/978-3-642-55032-4_34
24. Yang, C.T., Liu, J.C., Chen, S.T., and Lu, H.W., Implementation of a big data accessing and processing platform for medical records in cloud. J. Med. Syst. 41(10):149, 2017. https://doi.org/10.1007/s10916-017-0777-5
25. Sebaa, A., Chick, F., Nouicer, A., and Tari, A., Research in big data warehousing using Hadoop. J. Inform. Syst. Eng. Manag. 2(2), 2017. https://doi.org/10.20897/jisem.201710
26. Dean, J., and Ghemawat, S., MapReduce: A flexible data processing tool. CACM 53(1):72–77, 2010. https://doi.org/10.1145/1629175.1629198
27. Wu, S., Li, F., Mehrotra, S., and Ooi, B.C., Query optimization for massively parallel data processing. In: Proceedings of the 2nd ACM Symposium on Cloud Computing (p. 12). ACM, 2011. https://doi.org/10.1145/2038916.2038928
28. Apache Hadoop: http://hadoop.apache.org/, viewed in 02/2015.
29. Taylor, R.C., An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. 11(12):S1, 2010. https://doi.org/10.1186/1471-2105-11-S12-S1
30. Apache Hive: https://hive.apache.org/, viewed in 02/2015.
31. Liu, X., Thomsen, C., and Pedersen, T.B., ETLMR: A highly scalable dimensional ETL framework based on MapReduce. In: Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII (pp. 1–31). Springer Berlin Heidelberg, 2013. https://doi.org/10.1007/978-3-642-37574-3_1
32. Gao, S., Li, L., Li, W., Janowicz, K., and Zhang, Y., Constructing gazetteers from volunteered big geo-data based on Hadoop. Comput. Environ. Urban. Syst. 61:172–186, 2017. https://doi.org/10.1016/j.compenvurbsys.2014.02.004
33. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., et al., Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endowment 2(2):1626–1629, 2009. https://doi.org/10.14778/1687553.1687609
34. Ross, J., The use of economic evaluation in health care: Australian decision makers' perceptions. Health Policy 31(2):103–110, 1995. https://doi.org/10.1016/0168-8510(94)00671-7
35. ANDI: National Agency for Investment Development of Algeria, http://www.andi.dz/index.php/en/secteur-de-sante, viewed in 02/2015.