Medical Big Data Warehouse: Architecture and System Design, A Case Study: Improving Healthcare Resources Distribution
https://doi.org/10.1007/s10916-018-0894-9
Received: 23 September 2016 / Accepted: 8 January 2018 / Published online: 19 February 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
The huge increase in medical devices and clinical applications generating enormous data has raised a big issue in managing, processing, and mining this massive amount of data. Indeed, traditional data warehousing frameworks cannot be effective when managing the volume, variety, and velocity of current medical applications. As a result, several data warehouses face many issues over medical data, and many challenges need to be addressed. New solutions have emerged, and Hadoop is one of the best examples; it can be used to process these streams of medical data. However, without an efficient system design and architecture, this performance will be neither significant nor valuable for medical managers. In this paper, we provide a short review of the literature on research issues of traditional data warehouses, and we present some important Hadoop-based data warehouses. In addition, a Hadoop-based architecture and a conceptual data model for designing a medical Big Data warehouse are given. In our case study, we provide the implementation details of a big data warehouse based on the proposed architecture and data model in the Apache Hadoop platform, to ensure an optimal allocation of health resources.
Keywords: Data warehouse · Hadoop · Big data · Decision support · Medical resources allocation
[3]. Indeed, a major advantage of the Hadoop framework, apart from its scalability and robustness, is its low implementation cost: it uses multiple existing ordinary computer servers instead of one high-performance cluster. Hadoop pushes processing to the data instead of data to the processing. Furthermore, it supports ETL (Extract, Transform, and Load) processes in parallel. However, several issues arise when integrating and warehousing medical big data, such as the lack of a standard and powerful integration architecture, which makes implementing a big data warehouse a major challenge. Moreover, former design methodologies for data models cannot meet all the needs of the new data analysis applications over big data warehousing, given new constraints such as the nature and complexity of medical big data.

The purpose of this paper is to address the problems that arise when using OLAP and data warehousing over big data, especially medical big data. The contributions of this paper can be summarized as follows:

- Firstly, we give an overview of some previous traditional medical DWs (Data Warehouses), their limitations, and some recent Hadoop-based DWs.
- Secondly, we propose a system architecture and a conceptual data model for an MBDW (Medical Big Data Warehouse).
- Thirdly, we offer a solution to overcome both the growth of the fact table size and the lack of the primary and foreign keys required by the conceptual data model in the Apache Hive framework. This solution is based on nested partitioning according to the dimension table keys.
- Finally, we apply our solution to implement an MBDW that improves the distribution of medical resources in the health sector of the Bejaia region (in Algeria).

The remainder of this paper is organized as follows: A brief review of traditional and modern medical DWs is given in Section 2. Section 3 details some concepts and tools used in the rest of this work. A system architecture for the MBDW and a conceptual data model that exploits partitioning and bucketing are presented in Sections 4 and 5. The last section discusses the implementation and experimental results of our case study.

Related works

In this section, we introduce some medical DWs, then we give their limitations and their common drawbacks.

Traditional medical data warehouses

Despite the widespread interest in data integration and historization in the medical field [4], this field has been slow to adopt data warehousing. However, once adopted, several studies and projects were conducted on medical DWs. Among the first ones, Ewen et al. [5] highlighted the need for a DW in the health sector. The authors in [6] showed the main differences between conventional business-oriented DWs and clinical DWs; they also identified key research issues in clinical data warehousing. GEDAW [7] is a bioinformatics project that consisted of building a DW on the hepatic transcriptome. This project aimed to bring together, within a single knowledge base, the complex, varied, and numerous data on liver genes for analysis. Its objective was to provide researchers with a comprehensive tool for integrating the transcriptome and to provide decision support to guide biological research. The DWs of different health insurance organizations in Austria were merged in an evidence-based medicine collaboration project [8] called HEWAF (Healthcare Warehouse Federation). Kerkri et al. [9] presented the EPIDWARE architecture to build DWs for epidemiological and medical studies; it improved the medical care of patients in care units. Pavalam et al. [10] proposed a DW-based architecture for EHR (Electronic Health Record) data in the Rwandan health sector; it can be accessed from different applications and allows data analysis for a swift decision-making process as well as epidemic alerting. A data warehouse that gives information on epidemiology and public health, on various diseases, their concentration, and the repartition of resources in the Bejaia department is proposed in [11–13]. The aim of this DW is to improve the allocation of medical resources in the Bejaia region.

Issues and limitations

The dramatic increase in devices and applications generating enormous medical data has raised big issues in managing, processing, and mining these massive amounts of data. New solutions are needed, since traditional frameworks can no longer handle the volume, variety, and velocity of current medical applications. As a result, the previously presented DWs and several others face many common issues and limitations over medical big data. Among these issues, we highlight the most important ones. Firstly, the huge volume and fast growth of medical data: in the medical field, a large amount of information about patients' medical histories, symptomatology, diagnoses, and responses to treatments and therapies is generated, and it must be collected. Secondly, the unstructured nature of medical data: unstructured documents, images, and clinical notes represent over 80% of current medical data [14], and such unstructured data should be converted into analysis-ready datasets. Thirdly, the complexity of data modeling: it is not easy to compute OLAP data cubes over big data, mainly due to the explosive size of the data sets and the complexity of multidimensional data models [2]. For instance, if we have 10,000 diagnoses, this would amount to 2^10,000 dimensions; therefore, this solution is …
[Table 1: Summary of recent Hadoop-based data warehousing works, including: a cloud storage platform with HBase of Hadoop for integrating and storing data; the implementation and testing of an archetype-aware Dewey encoding optimization; Map-Reduce improved by spatial data partitioning and R*-tree indexes, with Hive extended with spatial query support; and a scalable solution for analytical spatial queries over large-scale spatial datasets in different geographic locations.]

Hadoop-based data warehousing
… follow such new platforms. Thus, a small number of studies … the type of diabetes. Rodger [21] … to determine how the set of collected variables relates to the body injuries: he collected data on three ship variables (Byrd, Boxer, and Kearsage) and injuries to different body regions such as head, torso, extremities, and abrasions, and he proposed a hybrid approach on multiple ship databases where various scalable machine-learning algorithms have been employed. Raja and Sivasankar [23] proposed a framework based on Hadoop to modernize healthcare informatics systems by inferring knowledge from various healthcare centers in different geographic locations.

Yang et al. [24] built a cloud-based storage and distributed processing platform to store and handle medical records data using HBase of the Hadoop ecosystem, providing diverse functions such as keyword search, data filtering, and basic statistics. Depending on the size of the data, the authors used either the single-threaded Put method or the Complete-Bulkload mechanism to improve the performance of data import into the database.

The listed works [17–24] used high-performance computers and very powerful IT infrastructures that required a very substantial investment to exploit the great processing capabilities of Hadoop technology. In this work, our goal is to re-use the health institutions' own IT infrastructure and workstations, without making a great investment in hardware capabilities, while still adopting data warehousing and OLAP methodologies. However, as indicated in [25], data warehousing using Hadoop and Hive faces many challenges. Firstly, Hadoop does not support OLAP operations, and the multi-dimensional model cannot meet all the needs of new data analysis applications. Secondly, the Apache Hive query language, HiveQL, does not support many standard SQL queries. Overcoming such limitations and challenges in the medical field is the objective of this work.
Background review

This section details some concepts and tools used in this study.

Big data concept

The 'big data' concept is often described by the three Vs and, more recently, by the six Vs, which stand for Volume, Velocity, Variety, Veracity, Variability, and Value of data. Indeed, the high volume of data generation refers to the volume (1st V). The rate of data generation and the speed at which it should be processed refer to the velocity (2nd V). The heterogeneity and diversity of the data types refer to the variety (3rd V). The degree of reliability and quality of the data sources refers to the veracity (4th V). The disparity and variation of the data flow rates refer to the variability (5th V). Finally, the value (6th V) of data depends on the volume of analyzed data. Traditional data management systems cannot handle data with such characteristics. Therefore, new platforms and technologies were developed.
Big data technologies

To manage the volume, velocity, variety, and variability of big data, several technologies, tools, and frameworks were developed. The most important ones are: the Map-Reduce-based systems, for instance BigTable, HadoopDB, and the Hadoop ecosystem with its most important components such as HBase, Pig, and Hive; the NoSQL databases, for instance MongoDB, Cassandra, and VoldemortDB; the in-memory frameworks, for instance Apache Spark; the graph databases, for instance Neo4J, AllegroGraph, InfiniteGraph, HyperGraphDB, InfoGrid, and Google Pregel; the RDBMS-based systems, for example Greenplum, Aster, and Vertica; and the stream data processing frameworks, for instance Apache S4, Apache Kafka, Apache Flume, Twitter Storm, and Spark Streaming.
Map-reduce paradigm

The Map-Reduce paradigm [26] is a programming model allowing parallel computation of tasks using two user-defined phases: the map phase and the reduce phase. In the first, "Map", phase, a mapper defined by the user loads a data chunk from the DFS and transforms it into a list of intermediate key/value (ki, vi) pairs. The key-value pairs are then buffered into r files, and all key-value files are sorted by key. In the second, "Reduce", phase, a reducer defined by the user combines the files coming from the different mappers. The final results are written back to the DFS [27].
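As a concrete illustration (our own sketch, in the HiveQL used later in this paper, with a hypothetical consultation table), a simple aggregation executes as exactly such a map-reduce job:

```sql
-- Sketch of the map-reduce execution of a HiveQL aggregation.
-- Map phase: each mapper reads a chunk of the table from the DFS and
-- emits an intermediate pair (id_region, 1) for every row it scans.
-- Shuffle: pairs are sorted and grouped by key (id_region).
-- Reduce phase: each reducer sums the values of one key group.
SELECT id_region, COUNT(*) AS nb_visits
FROM consultation          -- hypothetical fact table
GROUP BY id_region;
```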
Hadoop ecosystem

The Hadoop ecosystem consists of several projects and libraries, such as the massive-scale database management solution HBase, the data warehousing solution Hive, the machine learning library Mahout, the job tracking and scheduling suite ZooKeeper, Pig, and other related libraries and packages for massively parallel and distributed data management.

Hadoop

Apache Hadoop [28] is open source software allowing large-scale parallel distributed data analysis. It automatically replicates and collocates data across multiple different nodes, then allows parallel processing across clusters. It is an open-source implementation of Google's Map-Reduce computing model. It is based on HDFS (the Hadoop Distributed File System), which provides high-throughput access to application data. Hadoop provides the robust, fault-tolerant HDFS, ensuring reliability. Doug Cutting implemented the first version of Hadoop in 2004, and it became an Apache Software Foundation project in 2008 [29]. On November 17, 2017, release 2.9.0 of Apache Hadoop became available [28].
Hive

Apache Hive [30] was initially a subproject developed by the Facebook team. Hive is used in conjunction with the map-reduce model of Hadoop to structure the data, to run queries allowing the creation of a variety of reports and summaries, and to perform historical analysis over this data. In Hive, tables are used to store data; such a structure consists of a number of rows, and each row comprises a specified number of columns. Queries are expressed in an SQL-like declarative language called HiveQL; these queries are compiled into map-reduce jobs and executed using Hadoop. Hive supports primitive and complex types. It processes data for analysis but not to serve users, so it does not need the ACID guarantees (as in a traditional relational database) for data storage.
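A minimal HiveQL session illustrating these points might look as follows; the admission table and its columns are hypothetical examples, not the paper's schema:

```sql
-- Create a plain Hive table backed by delimited files in HDFS.
CREATE TABLE IF NOT EXISTS admission (
  id_patient INT,
  admission_day STRING,
  nb_days INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- An SQL-like query; Hive compiles it into map-reduce jobs.
SELECT admission_day, AVG(nb_days) AS avg_stay
FROM admission
GROUP BY admission_day;
```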
System architecture and data flow

Implementing an efficient medical data warehouse based on the Hadoop ecosystem requires a flexible and modular architecture for big data warehousing. This section describes the proposed architecture along with the functionalities of each layer. The overall architecture is depicted in Fig. 1. It is a scalable, reliable, and distributed architecture to extract, store, analyze, and visualize healthcare data extracted from various HIS (Hospital Information Systems) resources. In the following, we give some details about the components and levels of the proposed architecture and the data flow within these components.

(1) Medical data source

In each region (the provinces of a department, according to the geographical division), medical data is extracted from multiple distributed computers of the HIS (Hospital Information Systems), information systems, laboratory software, radiology data, and from the regional directorate of Health.
(2) Distributed ETL

An ETL is the component responsible for collecting data from different data sources, transforming it, and cleaning it based on business rules and requirements defined by the final user [31]. Our ETL approach is based on partitioning the input data horizontally according to the nested partitioning technique (Section 5.1). In order to run in a distributed and parallel way, the extraction phase is achieved in each medical establishment of each geographic division (region) to capture data. During this phase, all source tables are extracted from the appropriate data sources and then converted into a columnar structure (CSV format). The transformation phase involves data cleansing to comply with the target schema, based on the description of the multi-dimensional data model (partitioned schema), which is stored as meta-data in an HDFS file. Finally, in the loading phase, the data is propagated into the regional storage system (database server). As we will see in the next section, the data model is partitioned according to the dimension tables (in our case study, the keys of the 'Region' dimension table, then the 'Health-establishment' dimension table, and then the 'Date' dimension table). The integrated and stored data of each database server are then replicated in the DataNodes of the Hadoop cluster according to a replication strategy.

New database servers, and automatically new DataNodes on the Hadoop cluster, will be created for new sources (for instance, new hospitals). Such a feature ensures the scalability of this architecture.
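As a sketch of this loading phase (our assumption of how it could be expressed, with hypothetical paths and names), the cleansed CSV extract of one establishment could be pushed into the corresponding partition of a Hive fact table:

```sql
-- Load one establishment's cleansed CSV extract into the partition
-- matching its region key (the table is assumed to be declared with
-- PARTITIONED BY (id_region INT), as sketched in Section 5).
LOAD DATA INPATH '/staging/region_06/hosp.csv'
INTO TABLE hospitalization
PARTITION (id_region = 6);
```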
[Fig. 1: Overall architecture of the proposed MBDW — regional database servers and Web-based access.]
(3) Hadoop ecosystem

The core of our architecture is the Apache Hadoop [28] framework. It is an open-source implementation of Google's Map-Reduce computing model. It is based on HDFS (the Hadoop Distributed File System), which provides high-throughput access to application data.

(3.1) Hadoop. A Hadoop cluster is the corpus of server nodes with different physical locations within a group on Hadoop [32]. It automatically replicates and collocates data across multiple nodes, then performs parallel processing across clusters using the Map-Reduce paradigm. Thus, Hadoop reduces infrastructure cost. Its most important components are:

- NameNode acts as a metadata repository describing the location of the data blocks of each file stored in HDFS. This makes massive data processing easier.
- Secondary NameNode is in charge of periodically checking the persistent status of the NameNode and of downloading its current snapshot and log files.
- DataNode runs separately on every machine of the Hadoop cluster. In the proposed architecture, the data servers in which data integration is done (point 2 of Section 4) are automatically replicated in one or more DataNodes. Furthermore, a DataNode is responsible for storing the structured and unstructured data and for answering data reading and writing queries.

(3.2) Hive. Apache Hive [30] is a data warehousing structure. It is used in conjunction with the map-reduce model of Hadoop to process the data. Indeed, Hive stores data in tables, which consist of a number of rows, each row comprising a specified number of columns. HiveQL is used to express queries over Hive. User queries over Hive are compiled into map-reduce jobs, which Hadoop (point 3.1 of Section 4) then executes. Such queries allow creating a variety of reports, summaries, and historical analyses. The main functional components of Hive are:

- Metastore: it contains metadata about the tables stored in Hive, such as data formats, partitioning and bucketing, and storage information including the location of each table's data in the underlying file system. It is specified during table creation and reused every time the table is referenced in HiveQL.
- Hive engine: it converts queries written in HiveQL into map-reduce code and then executes it on the Hadoop cluster.

(4) Web-based Access

Our system allows trusted users (point 6 of Section 4) to access data in order to create a variety of reports and data summaries, after authentication through the Web browser.

(5) Analysis and reporting tools

This layer includes several tools for reporting, planning, dashboards, and data mining. These tools allow data to be analyzed statistically to track trends and patterns over time, and then to produce regular reports tailored to the needs of the various users of the decision support system.
(6) MBDW users

They can be doctors, medical researchers, hospital managers, health administrators, and governments. All such users can interact with the system.

… the process takes as input the schema composed of the data warehouse table files (F, D1, D2, …, Dn) and outputs k schemas: (F1, D1, D2, …, Dn), …, (Fk, D1, D2, …, Dn). The table F is fragmented into k fragments F1, …, Fk, which are computed in the following way:
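The computation rule itself does not survive in this extract; a standard formulation of such derived horizontal fragmentation, consistent with the nested partitioning described below, would be the semi-join form (our reconstruction, not necessarily the authors' exact formula):

```latex
% F is the fact table, D_1 the driving dimension table (e.g. Region),
% and p_i the predicate selecting the i-th dimension key; each
% fragment F_i keeps the fact rows joining with that key.
F_i = F \ltimes \sigma_{p_i}(D_1), \qquad i = 1, \dots, k
```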
Partitioning In Hive, each table can have one or more partitions, which determine the distribution of the data within sub-directories of the table directory: each partition has its own directory, according to the partitioning field, and can have its own columns and storage information [33]. Using partitions can make queries faster. For instance, a table can be partitioned by date, and records with the same date would be stored in the same partition.
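For instance, such a date-partitioned table could be declared and queried as follows (a sketch with hypothetical names):

```sql
-- Each distinct id_date value gets its own sub-directory; the
-- equality predicate lets Hive prune every other partition.
CREATE TABLE visit (id_patient INT, motive STRING)
PARTITIONED BY (id_date STRING);

SELECT COUNT(*) FROM visit WHERE id_date = '2015-06-01';
```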
Bucketing It is another technique to decompose Hive tables (data sets) into a defined number of parts, which will be stored as files. Indeed, the data in each partition can be divided into buckets based on the hashing of a column of the table. Each bucket is stored in the partition directory as a file. Metadata about the bucketing information of each table is stored in the Metastore [33]. Among the advantages of bucketing, the number of buckets is fixed, so it does not vary with the data.
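A bucketed variant of the same hypothetical table might be declared as follows; rows are assigned to one of the fixed number of files by hashing the bucketing column:

```sql
-- 16 files per partition, chosen by hash(id_patient) mod 16.
CREATE TABLE visit_bucketed (id_patient INT, motive STRING)
PARTITIONED BY (id_date STRING)
CLUSTERED BY (id_patient) INTO 16 BUCKETS;

-- In Hive 1.x, enforce bucketed writes during inserts.
SET hive.enforce.bucketing = true;
```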
As shown in Fig. 2, nested partitioning means dividing the fact table into many levels of partitions based on the different dimensions. Indeed, the fact table is first partitioned, using a Hive partition, based on a first dimension (on the attribute that should be considered as the primary key of the first dimension table). Then, the partitioned tables are divided using buckets based on a second dimension, and so on.

Using partitioned tables allows distributing, and reducing the data volume of, a single fact table into many distributed fact tables. Therefore, these partitioned fact tables optimize the use of the data warehouse resources (distributed processing, memory) and improve query execution performance.
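One possible HiveQL realization of this scheme for the 'Hospitalization' fact table of our case study is sketched below; the lowercase names, measure column, and bucket count are assumptions, and deeper levels (e.g., the 'Date' key) can be expressed as additional partition columns:

```sql
-- First level: one partition per 'Region' key; second level:
-- buckets on the 'Health-establishment' key.
CREATE TABLE hospitalization (
  id_hospitalization INT,
  id_patient INT,
  id_doctor INT,
  id_illness INT,
  id_h_estab INT,
  id_date STRING,
  nb_days INT                                 -- illustrative measure
)
PARTITIONED BY (id_region INT)                -- first dimension
CLUSTERED BY (id_h_estab) INTO 16 BUCKETS;    -- second dimension
```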
Case study: Improving healthcare resources distribution in Bejaia health sector

In the two previous sections, we presented our proposal for a medical big data warehouse (MBDW) architecture and a data model built on a Hadoop cluster to perform analytical processing. The purpose of this section is to apply and validate our solution on a case of medical data. An MBDW is designed for the health sector of the Wilaya of Bejaia in Algeria, to help physicians and health managers understand, predict, and improve the availability and distribution of healthcare resources. In this section, we first describe the study objectives and settings. Then we give the implementation details of the proposed MBDW solution and some preliminary results, and finally we discuss our solution compared to the previous ones.

Study objectives

Decision-making about the alternative uses of healthcare resources is an issue of critical concern for governments and administrators in all health care systems [34]. On the other hand, data from medical sources can be used to identify healthcare trends, prevent diseases, combat social inequality, and so on [1]. Thus, using medical data to improve the distribution of health resources is an important challenge in emerging countries.

The purpose of this case study is to provide decision makers with a clear view of the Bejaia health sector. It will guide leaders to make better decisions leading to equity in the distribution of medical resources and to a significant reduction of costs; to offer a more efficient method for accurate health management; to improve the availability of clinical material and human resources; and to increase the quality of services and patient safety. It will allow managing the different orientations during the planning stages and having a complete predictive vision, using a better repartition of the care offering (hospital specialization, the number of health centers per town, the allocation of hospital equipment, the number of specialists per hospital, and the number of beds per specialty and service). These facilities have to be regularly adapted according to the changes in demand (changing health techniques, diseases, health structures, population age, and geographical location of the population).

Our decision support system gives information about: the place and date of health center construction and the required specialty; the best strategy for recruiting medical professionals for each hospital; the equipment required and most appropriate for a hospital (e.g., a CT scanner or a radiological unit); and screening dates and places (e.g., breast cancer screening, cervical screening, etc.). It also aims to give information that helps medical research.

Study settings

The plan of the Bejaia Health Sector is a part of the Algerian Health master plan. The latter provides, for the period 2009–2025, investments of 20 billion Euros for constructing new health facilities and modernizing existing hospitals. Such investments have been initiated for the construction and maintenance of infrastructure and hospital equipment and for the education of medical professionals [35]. The outline of this program is to build 172 new hospitals, 45 specialized health complexes, 377 polyclinics, 1000 outpatient rooms, 17 paramedical training schools, and more than 70 specialized institutions for persons with disabilities.

Currently, the Bejaia Health Sector manages healthcare for a population of around 1 million people of the Wilaya (department) of Bejaia. The Wilaya of Bejaia is located in the north of Algeria, on the Mediterranean coast. It is administratively divided into 19 Daïras (sets of municipalities) and covers 51 municipalities, which stretch over an area of 3268 km². As shown in Fig. 3, it includes the following health structures: one university hospital, five public hospital institutions, and one specialized public hospital establishment, with a total of 1533 technical beds, 8 nearby public health establishments with 51 polyclinics, one paramedical school, and several laboratories. It employs 33 hospital-university practitioners, 245 specialist medical practitioners, 734 general medical practitioners, and 2742 paramedical staff.

To achieve analytical processing of medical data, we developed the MBDW by extracting, transferring, and collecting data from the different operating systems and software of the Bejaia health structures, which include PATIENT 3.40 (patient data management software), Microsoft EXCEL, and EPIPHARM (a drug stock management software); the data is automatically stored in the DW.
MBDW implementation

The implementation of the Hadoop-based architecture is built on the following hardware resources of the computer equipment of the medical institutions: memory (RAM) capacity ranging from 2 to 10 GB, processor speed ranging from 1.6 GHz to 2.4 GHz, and disk space ranging from 250 GB to 2 TB for data storage. Computer nodes are connected and networked with RJ45 LAN cables and switches.

Several software packages were used to build the MBDW platform, by deploying Apache Hadoop 2.6.0, Hive 1.2.1, ZooKeeper, and HDFS under the Ubuntu Linux operating system, as depicted in Table 2. Table 3 shows some of the Hadoop cluster configurations used; such configuration parameters are essential for our system's operation and performance.

MBDW data model

- Fact table 'Hospitalization'. It represents all information about the patients' hospitalization periods. It holds the primary key Id-hospitalization; the foreign keys of the dimension tables: Id-patient, Id-illness, Id-doctor, Id-service, Id-equipment, Id-H-establishment, Id-region, and Id-date; and several measures.
- Fact table 'Consultation'. It exposes all information about outpatient visits in a health facility. It contains the primary key Id-consultation; the foreign keys of the dimension tables: Id-patient, Id-illness, Id-doctor, Id-C-center, Id-H-establishment, Id-region, and Id-Date; and also several measures about the outpatient visit.
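As an illustration, a typical star query over these fact and dimension tables (a sketch; lowercase names and the counted measure are assumed) could count hospital stays per establishment and illness family:

```sql
-- Join the fact table with a dimension; the id_region predicate
-- prunes every partition except the targeted region.
SELECT h.id_h_estab, i.family_illness, COUNT(*) AS nb_stays
FROM hospitalization h
JOIN illness i ON h.id_illness = i.id_illness
WHERE h.id_region = 6
GROUP BY h.id_h_estab, i.family_illness;
```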
[Fig. 4: Star schemas of the MBDW. Fact table 'Consultation': Id-consultation; foreign keys #Id-patient, #Id-doctor, #Id-illness, #Id-C-center, #Id-H-estab, #Id-Region, #Id-Date; measures. Fact table 'Hospitalization': Id-Hospitalization; foreign keys #Id-patient, #Id-doctor, #Id-illness, #Id-equipment, #Id-service, #Id-H-estab, #Id-Region, #Id-Date; measures. Dimensions: Illness (Id-illness, Family-illness, Name-illness); Patient (Id-patient, Age, Gender, …); Equipment (Id-equipment, Name-equipment, #Id-H-estab); Consultation-Center (Id-C-center, Name-C-center, #Id-H-estab); Doctor (Id-doctor, Specialty, #Id-H-estab); Service (Id-service, Pole, Name-service, Service-capacity, #Id-H-estab); Region (Id-region, Name-region, Department, City); Health-Establishment (Id-H-estab, Type, Name-H-estab, #Id-region).]

[Figure: nested partitioning of the 'Hospitalization' and 'Consultation' fact tables along the Region, H-Estab, and Date keys, surrounded by the Illness, Equipment, Patient, Consultation-Center, Service, and Doctor dimensions.]
In the next sub-section, we give some important information about the keys of the dimension tables. Such information is essential for the system users.

Keys of dimension tables

[Fig. 5: 'Id-illness' code signification.]

… - Chronic disease (δ): this part of the field identifies and indicates the chronic diseases. Indeed, patients with chronic conditions benefit from broader rights under Algerian social security.

3) Future use field: a field reserved for needs that may occur in the future.
… categories of general medical practitioners, which are: general practitioners, general pharmacists, and general dentists. For example, general practitioners include three (3) subcategories: the general practitioner, the primary general practitioner, and the general practitioner-in-chief. They can also take senior positions.

3) Public health specialist medical practitioners: they are specialized physicians. There are three (3) subcategories of specialized medical practitioners: assistant specialist, senior specialist, and chief specialist. They can also take senior positions.

Our proposed Hadoop-based warehouse must handle the case of node failures, which can be frequent in such large clusters. To address this problem, we implemented data placement and replication strategies, depicted in Table 5, to improve data reliability, availability, and network bandwidth utilization, and to reduce the effect of certain failures such as single-node and whole-rack failures. As shown in Table 5, each establishment's data is stored in its own DataNode server and replicated in two other DataNode servers. For instance, the data of the CHU-BJA establishment is stored in its own DataNode (DataNode 1) and replicated in two other DataNode servers (DataNode 9 and DataNode 15).
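On the HDFS side, such a strategy relies on the block replication factor; a minimal hdfs-site.xml excerpt keeping three copies of every block (one primary DataNode plus two replicas, as in Table 5) is shown below, while the rack-aware placement itself is configured through Hadoop's topology script mechanism:

```xml
<!-- hdfs-site.xml: keep 3 copies of each block (primary + 2 replicas) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```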
[Fig. 8: Graphic representation of the patients requiring hospitalization and the hospitals' empty beds.]

From the report of Fig. 8, we notice the lack of sufficient empty beds in the hospitals of the two cities AMIZOUR and AOKAS for patients requiring hospitalization (usually these patients' hospitalization is postponed if possible, or they are transferred to other bordering hospitals). From this result, managers can generate analytical and informative reports to enhance the yield of the Bejaia health sector and make the right decisions, with the aim of ensuring an optimal distribution of the health care system's components; for instance, increasing the capacity of the hospitals of AMIZOUR and AOKAS by giving them a high priority in the regional health master plan.

The report presented in Fig. 9 illustrates a comparison between the daily average of outpatient visits and the capacity of the outpatient health centers per city. Figure 9 shows that in the two cities BEJAIA and TAZMALT, the rate of outpatient visits exceeds the outpatient centers' capacity. This situation reflects an inequitable distribution of the health resources. To address this shortfall, decision-makers have to take the necessary action by increasing the capacity of the outpatient centers to meet the needs of both regions. Therefore, in this situation, the decision to take is to add new consultation rooms in both cities, BEJAIA and TAZMALT.

Indeed, to ensure the availability and equitable distribution of health resources to people, most countries use the WHO guideline, expressed as a ratio of a resource to a number of inhabitants. For instance, the WHO recommended critical threshold for the health personnel (doctors and nurses) providing patient care is 2.5 per 1000 population. Although this indicator is very important, it is not sufficient for ensuring greater equity in the distribution of medical resources, since it does not take into account the specificities of each region and each population, as is well expressed in Fig. 10.

In our previous work [11–13, 16], we proposed a data warehousing based framework to address the problem of medical resources allocation. However, that framework fails to scale up and does not consider unstructured medical data. Through this work, we have demonstrated that using a Hadoop-based architecture combined with our nested partitioning technique solves the scaling, heterogeneity, and data size issues. We proposed a scalable, cost-effective, highly available, and fault-tolerant solution, through a scalable architecture that allows extending the nodes of the cluster as required. It is cost-effective since the nodes are not necessarily high-performance computers, so there is no need to invest much in hardware. Availability and fault tolerance are guaranteed through a replication strategy.
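Such a per-region indicator is straightforward to compute in the warehouse; the following is a hedged sketch, assuming lowercase table names and a hypothetical population column in the Region dimension:

```sql
-- Doctors per 1,000 inhabitants by region, to be compared with the
-- WHO critical threshold of 2.5 (doctors and nurses per 1,000).
SELECT r.name_region,
       1000.0 * COUNT(d.id_doctor) / MAX(r.population) AS doctors_per_1000
FROM doctor d
JOIN health_establishment e ON d.id_h_estab = e.id_h_estab
JOIN region r ON e.id_region = r.id_region
GROUP BY r.name_region;
```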
… Springer Berlin Heidelberg, 2014. https://doi.org/10.1007/978-3-642-55032-4_34
24. Yang, C.T., Liu, J.C., Chen, S.T., and Lu, H.W., Implementation of a big data accessing and processing platform for medical records in cloud. J. Med. Syst. 41(10):149, 2017. https://doi.org/10.1007/s10916-017-0777-5
25. Sebaa, A., Chick, F., Nouicer, A., and Tari, A., Research in big data warehousing using Hadoop. J. Inform. Syst. Eng. Manag. 2(2), 2017. https://doi.org/10.20897/jisem.201710
26. Dean, J., and Ghemawat, S., MapReduce: A flexible data processing tool. CACM 53(1):72–77, 2010. https://doi.org/10.1145/1629175.1629198
27. Wu, S., Li, F., Mehrotra, S., and Ooi, B.C., Query optimization for massively parallel data processing. In: Proceedings of the 2nd ACM Symposium on Cloud Computing (p. 12). ACM, 2011. https://doi.org/10.1145/2038916.2038928
28. Apache Hadoop: http://hadoop.apache.org/, viewed in 02/2015.
29. Taylor, R.C., An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. 11(12):S1, 2010. https://doi.org/10.1186/1471-2105-11-S12-S1
30. Apache Hive: https://hive.apache.org/, viewed in 02/2015.
31. Liu, X., Thomsen, C., and Pedersen, T.B., ETLMR: A highly scalable dimensional ETL framework based on MapReduce. In: Transactions on Large-Scale Data- and Knowledge-Centered Systems VIII (pp. 1–31). Springer Berlin Heidelberg, 2013. https://doi.org/10.1007/978-3-642-37574-3_1
32. Gao, S., Li, L., Li, W., Janowicz, K., and Zhang, Y., Constructing gazetteers from volunteered big geo-data based on Hadoop. Comput. Environ. Urban. Syst. 61:172–186, 2017. https://doi.org/10.1016/j.compenvurbsys.2014.02.004
33. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., et al., Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endowment 2(2):1626–1629, 2009. https://doi.org/10.14778/1687553.1687609
34. Ross, J., The use of economic evaluation in health care: Australian decision makers' perceptions. Health Policy 31(2):103–110, 1995. https://doi.org/10.1016/0168-8510(94)00671-7
35. ANDI: National Agency for Investment Development of Algeria, http://www.andi.dz/index.php/en/secteur-de-sante, viewed in 02/2015.