Unit 4 - Big Data

This unit discusses the significance of Big Data in socio-economic analysis, highlighting the need for a specialized architecture to manage and analyze diverse data sources effectively. It presents a comprehensive data lifecycle model and a structured Big Data architecture designed for forecasting social and economic variables, incorporating data governance and persistence layers. The aim is to enable organizations to leverage vast amounts of data for accurate and timely predictions, ultimately improving decision-making and policy planning.

Big Data Sources and Methods for Social and Economic Analyses

Introduction: The Age of Big Data in Socio-Economic Analysis


We live in what is often called the “Digital Era,” characterized by the widespread use of
technologies such as the internet, smartphones, and smart sensors by individuals and
organizations in nearly every aspect of daily life. These technologies constantly generate
enormous amounts of fresh, digitized data about how people, companies, and other entities
interact, a phenomenon sometimes referred to as the “Data Big Bang.”
Analyzing this data effectively can reveal valuable insights into social and economic
behaviors, trends, and patterns. The term “Big Data” emerged in the late 1990s and was
initially defined by three characteristics, often called the 3Vs: Volume (the sheer size of the
data), Velocity (the speed at which data is generated and transferred), and Variety (the
different types and structures of data, from text and logs to videos). This model has evolved
to include two more Vs: Value (the process of extracting meaningful information, known as
Big Data Analytics) and Veracity (ensuring data quality, proper governance, and addressing
privacy concerns), making it the 5Vs model.
This new data paradigm presents both significant opportunities and challenges for socio-
economic research, policy-making, and business decision-making. To harness the potential
of Big Data, organizations need to understand the available data sources, the types of data
they provide, and how to process and analyze them effectively. A well-designed Big Data
architecture, tailored to the specific needs of socio-economic analysis, is crucial for
managing the entire data lifecycle—from collection and ingestion to analysis, storage, and
visualization. Designing such an architecture requires addressing numerous challenges,
including scalability, data quality, heterogeneity (mixing structured and unstructured
data), integration from diverse sources, data privacy, and governance. While Big Data offers
immense potential for improving forecasts (like unemployment rates), detecting market
trends, and monitoring policy impacts, no architecture has yet been proposed specifically
for the demands of social and economic forecasting. The paper aims to fill that gap
by providing a framework of new data sources and methods, proposing a data lifecycle
model for Big Data, and developing a specialized Big Data architecture designed specifically
for forecasting social and economic changes.

Related Work: Existing Big Data Architectures


Before proposing a new architecture specifically for socio-economic analysis, the paper
reviews existing Big Data architectures developed for other domains. Research in this area
is relatively new, spurred by the “Data Big Bang” and the need for systems capable of
handling the unique properties of Big Data. Early architectures drew from distributed
computing concepts like grid computing. More recent proposals often incorporate
technologies like cloud computing due to its scalability for storage and processing.
Several reference architectures exist. Pääkkönen and Pakkala (2015) proposed a general
reference architecture identifying common functionalities like data sources, extraction,
preprocessing, processing, analysis, transformation, visualization, storage, and model
specification. Assunção et al. (2015) outlined a typical Big Data analytics workflow with
four phases: data sources, data management (preprocessing/filtering), modeling, and
result analysis/visualization, often linked to cloud computing.
Specific domain architectures have also been developed. For instance, Zhang et al. (2017)
created an architecture for industrial data analytics aimed at optimizing production
processes and product lifecycle management. This involved stages for acquiring industrial
data (like sensor data), processing/storing it, and then mining it through layers for data,
methods, results, and application. Similarly, Wang et al. (2016a) developed a detailed
architecture for healthcare analytics, featuring layers for data sources, data aggregation
(acquisition, transformation, storage), analytics (processing/analysis), information
exploration (generating outputs like clinical decision support), and a crucial data
governance layer to manage security and privacy for sensitive health data.
Reviewing these existing architectures reveals common modules: a data module (sources),
a preprocessing module (extraction, integration, transformation), an analytics module
(modeling, knowledge discovery), and a results/visualization module. However, the
placement of functionalities like data storage and data governance varies. Storage, essential
for data reuse, is sometimes placed within specific modules or treated as a cross-cutting
function. Data governance, while critical (especially as the 5Vs model emphasizes Veracity),
is not consistently included in all proposed architectures. The paper argues that given the
unique characteristics of socio-economic data (uncertainty, human behavior complexity)
and the potential benefits of Big Data for forecasting, a dedicated architecture is needed,
which the subsequent sections will detail.

Non-Traditional Sources of Social and Economic Data


The digital footprint left by individuals and organizations has created a wealth of new data
sources beyond traditional surveys and official records. The paper proposes a taxonomy to
classify these non-traditional sources based on the purpose behind the data generation:
1. Information Search: Data generated when users actively look for information. The
prime example is search engine query data, such as that provided by Google Trends.
Google Trends reports the volume of searches for specific terms over time and has
proven useful for nowcasting (predicting the present) and forecasting economic
indicators such as unemployment, consumer spending (car/home sales, tourism), stock
market activity, and even political interest, although it has limitations (a minimal
retrieval sketch appears after this list).
2. Transactions: Data created during an exchange. This can be financial (e-commerce
purchases, e-banking, sensor data from tolls or card readers) or non-financial
(providing information to access a service, like in e-government or e-recruiting).
3. Information Diffusion: Data generated when users (individuals or organizations)
spread information, often for marketing or establishing a public image. Examples
include corporate websites, apps disseminating information, and Wiki pages.
4. Social Interaction: Data arising from users sharing opinions, ideas, and experiences
with others. Social Networking Sites (SNS) like Twitter and Facebook are major
sources here, along with opinion platforms (e.g., Ciao, TripAdvisor) and blogs. Twitter
data (tweets) has been used to predict elections, stock markets, and public opinion.
Facebook data, though harder to analyze due to heterogeneity, shows potential for
understanding consumer profiles and political leanings. Opinion platforms provide
valuable consumer reviews but face challenges such as fake reviews. Blogs also offer
insights, though research on them is still developing.
5. Non-Deliberate Generation: Data generated passively as a byproduct of using digital
devices, without the user actively intending to create it. This includes:
– Usage Data: Information about how, when, and where devices are used (e.g.,
web cookies, IP addresses, self-tracking sensor data).
– Location Data: Positional information from GPS, mobile phone signals (GSM,
Call Detail Records - CDRs), Bluetooth, or WiFi points. CDRs, for instance, have
been used to estimate population density and socio-economic levels.
– Personal Data: Information like age or sex, sometimes provided consciously
(filling forms) or inferred unconsciously (from search/purchase history).
These sources, particularly those leveraging the internet, provide timely and granular data
but also come with challenges like potential bias (e.g., representing only certain
demographics), noise, privacy concerns, and the need for sophisticated methods to extract
meaningful insights.
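As an illustration of the information-search category above, the following minimal sketch pulls a Google Trends series with the third-party pytrends package, an unofficial client for Google Trends; the search term, geography, and timeframe are arbitrary choices for the example, not taken from the paper.

```python
# Minimal sketch: retrieve search-interest data with pytrends,
# an unofficial third-party client for Google Trends.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
# Interest in an example term over the last five years, US only.
pytrends.build_payload(["unemployment benefits"], timeframe="today 5-y", geo="US")
interest = pytrends.interest_over_time()  # pandas DataFrame indexed by date
print(interest.head())
```

A series like this is typically merged with official statistics and used as an extra regressor in a nowcasting model.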

New Methods and Analytics for Big Data


Traditional statistical and econometric methods often struggle with the scale (Volume),
speed (Velocity), and diverse formats (Variety) of Big Data. Therefore, new and adapted
analytical methods are required to extract value from these non-traditional sources. The
paper reviews several key methods and analytics techniques relevant to socio-economic
analysis using Big Data:
1. Machine Learning (ML): This is a broad category of algorithms that allow computers
to learn patterns from data without being explicitly programmed. For Big Data, ML is
crucial for tasks like classification (e.g., identifying sentiment in text), regression
(predicting numerical values like sales), clustering (grouping similar data points, like
customer segments), and dimensionality reduction (simplifying complex datasets).
Techniques need to be scalable to handle large volumes.
2. Natural Language Processing (NLP): Since much Big Data is unstructured text
(social media posts, reviews, news articles), NLP techniques are essential. These
methods enable computers to understand, interpret, and process human language. Key
NLP tasks include text mining (extracting information), topic modeling (identifying
themes in large text collections), and sentiment analysis.
3. Sentiment Analysis (Opinion Mining): A specific application of NLP, sentiment
analysis aims to determine the emotional tone (positive, negative, or neutral) expressed
in text. This is highly valuable for gauging public opinion, brand perception, or
reactions to events and policies, using data from social media, reviews, and similar
sources (a short sketch follows this list).
4. Network Analysis: This involves studying relationships between entities (e.g., people,
organizations, webpages). Using graph theory, it analyzes connections and structures
within networks. In a socio-economic context, it can be used to understand social
interactions, information diffusion, or economic relationships using data from SNS,
communication logs, or trade data (also sketched after this list).
5. Spatial Analysis: With the rise of location data (GPS, mobile data), spatial analysis
methods are increasingly important. These techniques analyze geographic patterns
and relationships, helping to understand phenomena like population mobility, urban
development, or the geographic spread of economic activity.
6. Visualization: Effectively communicating insights from complex Big Data requires
powerful visualization techniques. Beyond simple charts, this includes interactive
dashboards, heat maps, network graphs, and geospatial visualizations that help
analysts and decision-makers understand patterns and trends intuitively.
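To make the sentiment-analysis item concrete, here is a minimal sketch using NLTK's VADER lexicon, one common open-source option (the section does not prescribe a specific tool); the two tweets are invented examples.

```python
# Minimal sentiment-analysis sketch with NLTK's VADER lexicon,
# which is tuned for short social-media text.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

tweets = ["The new jobs report looks great!", "Prices keep rising, this is awful."]
for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)  # neg/neu/pos plus a compound score
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05 else "neutral")
    print(f"{label:8s} {scores['compound']:+.2f}  {tweet}")
```

And a small network-analysis sketch with the networkx library (again, one possible tool among several): a toy interaction graph whose accounts and "mention" edges are invented, ranked by degree centrality.

```python
# Minimal network-analysis sketch: rank accounts in a toy mention
# graph by degree centrality (the fraction of other nodes each
# node is connected to).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("ana", "ben"), ("ana", "cho"), ("ben", "cho"), ("cho", "dia")])
for node, score in sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.2f}")
```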
Dealing with Big Data often necessitates distributed computing frameworks like Apache
Hadoop (with MapReduce) and Apache Spark, which allow processing tasks to be split
across many computers, enabling scalability. The choice of method depends heavily on the
type of data source and the specific research question being addressed.
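As a minimal sketch of such distributed processing, the PySpark job below counts tweets per day over a large JSON collection; the HDFS path and the assumption that created_at holds an ISO-formatted timestamp are invented for illustration.

```python
# Minimal PySpark sketch: a daily tweet-volume aggregation that Spark
# distributes across the cluster's executors automatically.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tweet-volume").getOrCreate()

# Hypothetical folder of JSON tweets; created_at is assumed to be
# an ISO-formatted timestamp string.
tweets = spark.read.json("hdfs:///data/tweets/2024/*.json")
daily = (tweets
         .withColumn("day", F.to_date("created_at"))
         .groupBy("day")
         .count()
         .orderBy("day"))
daily.show()
spark.stop()
```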

The Big Data Lifecycle for Socio-Economic Analysis


To effectively manage Big Data within an organization, the paper emphasizes a data
lifecycle approach. It reviews existing models and proposes a comprehensive lifecycle
specifically tailored for socio-economic Big Data analysis, encompassing all stages from
data creation to its eventual disposal. This lifecycle provides the foundation for the
proposed architecture.
The key phases identified in this lifecycle include:
1. Data Creation/Acquisition: This initial phase involves generating or collecting data
from the diverse sources previously discussed (search engines, social media, sensors,
transactions, etc.).
2. Data Storage: Once acquired, data needs to be stored appropriately. Given the volume
and variety, this often involves distributed storage systems (like Hadoop Distributed
File System - HDFS) and databases capable of handling different structures (SQL for
structured, NoSQL for unstructured/semi-structured).
3. Data Processing & Cleaning: Raw data is often messy, incomplete, or inconsistent.
This phase involves cleaning the data (handling missing values, correcting errors),
transforming it into usable formats, and potentially reducing its dimensionality.
4. Data Integration: Combining data from multiple heterogeneous sources is crucial for
gaining a holistic view. This involves resolving differences in schemas, formats, and
identifiers (data matching or linkage) and aligning data across different time
frequencies or geographic levels (this phase and the previous one are sketched at the
end of this section).
5. Data Analysis/Modeling: This is where insights are extracted using the methods
described earlier (ML, NLP, network analysis, etc.). Models are built, trained, and
validated to explain phenomena, make predictions (forecasting), or classify data.
6. Data Visualization/Interpretation: The results of the analysis need to be presented
in a clear and understandable way, often through dashboards, reports, and various
visualization techniques, to support decision-making.
7. Data Archival/Disposal: Finally, policies are needed for long-term storage
(archiving) of data and results, as well as for securely disposing of data when it’s no
longer needed or legally permissible to keep, adhering to privacy regulations.
This lifecycle perspective ensures that all necessary steps and considerations, including
governance and storage, are integrated throughout the process of turning raw Big Data into
actionable socio-economic knowledge.
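As noted under phases 3 and 4, here is a minimal pandas sketch of cleaning and integration: a daily search-interest series has a missing value filled, is resampled to monthly frequency, and is merged with an official monthly indicator. All values are invented.

```python
# Minimal sketch of the cleaning and integration phases with pandas.
import pandas as pd

daily = pd.DataFrame(
    {"interest": [52.0, None, 49.0, 61.0]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-02-01"]),
)
daily["interest"] = daily["interest"].interpolate()  # clean: fill the missing value

monthly_interest = daily.resample("MS").mean()       # harmonize time frequency

official = pd.DataFrame(
    {"unemployment_rate": [3.7, 3.9]},               # invented official indicator
    index=pd.to_datetime(["2024-01-01", "2024-02-01"]),
)
merged = official.join(monthly_interest)             # integrate the two sources
print(merged)
```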

Proposed Big Data Architecture for Socio-Economic Forecasting


Building on the data lifecycle and the review of sources and methods, the paper proposes a
specific Big Data architecture designed for nowcasting and forecasting social and economic
variables. This architecture is structured into three main layers:
1. Data Analysis Layer: This is the core layer, where data flows through various
processing stages. It consists of several interconnected modules:
– Data Receiving Module: Connects to the diverse data sources (structured,
semi-structured, unstructured) identified earlier. It handles initial access,
whether in batch or stream mode, and aims to provide a homogenized access point
for subsequent modules. It needs to be aware of data structure (or the lack
thereof) and access conditions.
– Data Preprocessing Module: Takes data from the receiving module and
prepares it. Key steps include recording metadata about the acquisition
process, validating data quality (checking formats, handling errors and missing
values), and extracting initial features or derived data at the entity level
(e.g., word counts from text). Storage here is source-driven.
– Data Integration Module: Merges data from different preprocessed sources.
This is challenging due to heterogeneity in structure, format, time frequency,
and geographic scope. It involves defining schemas to link sources, using
linkage techniques to match entities across datasets, and harmonizing time
frequencies and geographic levels. Integrated data can be stored and fed back
as a new source.
– Data Preparation Module: Transforms the integrated data into the specific
formats required by the analytical tools in the next module. This might involve
grouping elements, joining tables, estimating missing values, or pivoting data
(transforming key-value pairs into rows with features as columns; see the
sketch after the layer descriptions). Storage here becomes analysis-driven.
– Data Analytics Module: Applies the statistical and machine learning methods
(descriptive and predictive) to the prepared data. Models are
estimated/trained, validated, and used to generate insights, classifications,
forecasts, or nowcasts. This module can operate in stream, scheduled, or on-
demand modes, depending on computational resources.
– Results Publishing Module: Makes the outputs of the analytics module
(insights, models, predictions) available to the organization. This includes
creating reports, tables, and visualizations for human decision-makers
(strategic/tactical level) and potentially offering models as services (SaaS)
for integration into other operational systems.
2. Governance Layer: This is a horizontal layer that applies organizational policies,
ethical principles, and legal regulations across the entire data lifecycle, ensuring
privacy, security, and compliance. Its modules include:
– Ingestion Management: Handles source licenses, access credentials, user
permissions, and metadata completeness.
– Processing Management: Manages privacy/anonymization policies, ethical checks,
transformation tracking, and access permissions for data and computing resources.
– Results Management: Ensures traceability of results, manages access permissions
for reports, and handles privacy aspects in outputs.
– Archival and Disposal Management: Implements policies for archiving and deleting
data, procedures, and reports.
– Auditing: Inspects compliance with regulations and policies, and potentially
monitors overall system performance.
3. Persistence Layer: This layer underpins the others by managing all storage needs. It
handles the storage of raw data, processed data, integrated data, metadata, analytical
models, and results. It typically involves distributed storage systems (local and/or cloud-
based) like HDFS and various database types (SQL, NoSQL) to accommodate the diverse
data and processing requirements throughout the architecture.
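The pivoting step mentioned in the Data Preparation Module can be illustrated with a minimal pandas sketch; the regions, feature names, and values are invented.

```python
# Minimal sketch of pivoting long key-value records into one row per
# entity with features as columns (the layout most modeling tools expect).
import pandas as pd

long_df = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "feature": ["searches", "sentiment", "searches", "sentiment"],
    "value":   [120.0, 0.4, 95.0, -0.1],
})
wide_df = long_df.pivot(index="region", columns="feature", values="value")
print(wide_df)
# feature  searches  sentiment
# region
# north       120.0        0.4
# south        95.0       -0.1
```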

Conclusions
The paper concludes by summarizing its main contributions. It addresses the need for a
structured approach to using Big Data for social and economic analysis, particularly for
forecasting and nowcasting. It provides a framework by reviewing and classifying non-
traditional data sources (like search queries, social media, sensor data) and relevant
analytical methods (like machine learning, NLP, network analysis). It proposes a detailed
data lifecycle model tailored for Big Data in this context.
The primary contribution is the proposed Big Data architecture specifically designed for
socio-economic forecasting. This architecture, based on the data lifecycle, integrates
diverse data sources and analytical techniques within a structured framework consisting of
data analysis, governance, and persistence layers. The goal is to enable organizations
(businesses, governments, statistical institutions) to systematically leverage the vast
amounts of available data to generate more accurate, timely, and granular predictions of
social and economic trends and behaviors.
Implementing such an architecture offers significant advantages, potentially leading to
better-informed decision-making and policy planning. However, the authors acknowledge
challenges, such as integrating the architecture with existing systems and choosing the
right technological implementation (e.g., cloud computing for scalability). As future work,
they plan to implement the proposed architecture to generate real-time socio-economic
forecasts using internet data.
