BDA QB Answers 8 To 15
Data Preprocessing
Pre-processing needs to ensure the relevancy, recency, range, robustness, and reliability of the data.
A data store first pre-processes data from machine and file data sources. Pre-processing transforms the data into a table or partition schema, or into supported data formats such as JSON, CSV, and AVRO. The data is then exported in compressed or uncompressed form.
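As a minimal illustration of this step, the Python sketch below (with hypothetical file names) converts records from one supported format (CSV) to another (JSON) and writes both an uncompressed and a compressed export:

    import csv
    import gzip
    import json

    # Hypothetical input file; each CSV row becomes one JSON object.
    with open("records.csv", newline="") as src:
        rows = list(csv.DictReader(src))

    # Uncompressed export.
    with open("records.json", "w") as dst:
        json.dump(rows, dst, indent=2)

    # Compressed export of the same data.
    with gzip.open("records.json.gz", "wt") as gz:
        json.dump(rows, gz)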
The BigQuery cloud service defines permissions and roles such as bigquery.tables.create, bigquery.tables.updateData, bigquery.dataEditor, bigquery.dataOwner, and bigquery.admin, among other service functions. For analytics it integrates with Google Analytics 360. BigQuery exports data to Google Cloud Storage or a cloud backup only.
Data store export to the cloud refers to transferring data from on-premises storage systems or local servers to a cloud environment. Cloud storage provides scalable, secure, and cost-effective options for storing large datasets, which is crucial for Big Data analytics.
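A minimal sketch of such a transfer, assuming the google-cloud-storage Python client and hypothetical bucket and path names:

    from google.cloud import storage

    client = storage.Client()  # uses default credentials

    # Hypothetical bucket and object names.
    bucket = client.bucket("my-analytics-bucket")
    blob = bucket.blob("raw/2024/records.csv")

    # Upload a local file from the on-premises server to cloud storage.
    blob.upload_from_filename("/data/local/records.csv")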
Apache Sqoop is a key tool for data transfer between relational databases and the Hadoop Distributed File System (HDFS). It facilitates exporting and importing data, particularly between RDBMS systems (like MySQL, PostgreSQL, and SQL Server) and cloud-based Hadoop clusters.
1. Establish Connection: Sqoop connects to the relational database via JDBC (Java Database
Connectivity). The system gathers metadata and examines the database for the data being transferred.
2. Data Export: Sqoop submits a map-only Hadoop job that divides the input dataset into splits and
transfers each split using individual map tasks. This ensures efficient use of resources and allows
scalable export to cloud-based HDFS.
3. Cloud Storage Destination: The exported data is usually placed in a directory in HDFS or a similar
cloud-based distributed file system. HDFS is designed to handle large datasets across multiple
servers, making it ideal for cloud environments.
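As a hedged sketch of steps 1-3, with hypothetical connection details, a Sqoop transfer can be launched from Python as below. Note that in Sqoop's own terminology, moving rows from an RDBMS into HDFS is an import.

    import subprocess

    # Step 1: --connect supplies the JDBC URL Sqoop uses to reach the database
    # and gather metadata. Steps 2-3: Sqoop runs a map-only job (-m 4 gives
    # four parallel map tasks) and writes the splits to the HDFS --target-dir.
    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl_user",
            "--table", "orders",
            "--target-dir", "/user/hadoop/orders",
            "-m", "4",
        ],
        check=True,
    )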
The key benefits of exporting data to the cloud are:
1. Scalability
2. Cost-effectiveness
3. Accessibility
4. Disaster recovery
5. Security
A Big Data platform supports large datasets and high volumes of data. The data is generated at high velocity, in many varieties, and with varying veracity. Managing Big Data requires large resources of MPPs (massively parallel processors), cloud infrastructure, parallel processing, and specialized tools. A Big Data platform should therefore provision tools and services for storing, processing, and analyzing data at scale.
11. How can a toy company optimize the benefits of using Big Data Analytics?
To leverage Big Data Analytics effectively, a toy company can follow several strategies, drawing insights from various industry practices:
1. Customer Acquisition and Retention
Personalized Experience: By analyzing customer data, the toy company can tailor marketing campaigns to individual preferences, similar to how Amazon does it. This can be based on past purchases, browsing behavior, and demographic information.
Loyalty Programs: Utilize data to identify patterns and trends that foster customer loyalty, ensuring that marketing efforts are directed toward retaining valuable customers.
2. Targeted and Personalized Advertisements
Segmented Marketing: Use Big Data to identify specific customer segments and target them with customized advertising campaigns. For instance, the company can deliver ads through SMS, e-mail, WhatsApp, LinkedIn, Facebook, and Twitter.
Ad Optimization: Real-time analytics can help in understanding the effectiveness of various advertising channels and campaigns, allowing the company to allocate resources efficiently.
3. Product Development
Product Insights: Analyze feedback from sources such as social media, customer reviews, and purchase data to innovate and improve products. This ensures the toys meet customer expectations and stay competitive.
Trend Analysis: Utilize Big Data to spot emerging trends in the market, enabling the company to develop products in line with current consumer interests.
4. Fraud Detection
Fraud Prevention: Big Data analytics can help in detecting and preventing marketing fraud by merging existing data with information from social media, websites, blogs, and e-mails. This enriched data set can identify suspicious activities and prevent fraudulent transactions.
5. Risk Management
Identify Risks: Implement Big Data solutions to improve risk management models, allowing the company to develop smarter strategies for navigating high-risk environments.
Predictive Analytics: Use predictive analytics to foresee potential issues in the supply chain, production, or market trends, enabling preemptive measures.
6. Customer Value Analysis (CVA)
Understand Customer Needs: Use CVA to analyze what customers really want from the products, ensuring that the toys deliver both perceived and desired value.
Consistent Customer Experience: Implement insights from CVA to provide consistent and delightful customer experiences, akin to leading marketers like Amazon.
i) In fraud detection
Big Data analytics enables fraud detection. Big Data usage has the following features for detecting and preventing fraud:
Fusing existing data at an enterprise data warehouse with data from sources such as social media, websites, blogs, and e-mails, thereby enriching the data available for analysis.
Providing high-volume data mining and new, innovative applications, thus leading to new business insights and knowledge discovery.
ii) In medicine
Big Data analytics deploys large volumes of data to identify and derive intelligence, using predictive models about individuals. Big Data driven approaches help research in medicine, which in turn helps patients. Some applications are:
Building health profiles of individual patients and predictive models for diagnosing better and offering better treatment.
Aggregating a large volume and variety of information from multiple sources, from DNA, proteins, and metabolites to cells, tissues, organs, organisms, and ecosystems, which can enhance the understanding of the biology of diseases.
Creating patterns and models by data mining, which help in better understanding and research.
Deploying data from wearable devices, which record during active as well as inactive periods, to provide a better understanding of patient health and better risk profiling of the user for certain diseases.
iii) In advertising
The impact of Big Data on the digital advertising industry is tremendous. The industry sends advertisements using SMS, e-mail, WhatsApp, LinkedIn, Facebook, Twitter, and other mediums. Big Data captures data from multiple sources in large volume, velocity, and variety, much of it unstructured, and enriches the structured data at the enterprise data warehouse.
Big Data real-time analytics reveal emerging trends and patterns and provide actionable insights for facing competition from similar products. The data helps digital advertisers discover new relationships and less competitive regions and areas. The success of advertisements depends on collection, analysis, and mining of data. The new insights enable personalizing and targeting online, social media, and mobile advertisements, an approach called hyper-localized advertising.
Advertising on digital media needs optimization; too much usage can have a negative effect. Phone calls, SMS, and e-mail-based advertisements can be a nuisance if sent without appropriate research on the potential targets. Analytics help in this direction. Using Big Data only after appropriate filtering and elimination, with appropriate data, data forms, and data handling, is a crucial enabler of Big Data Analytics.
The chocolate industry is undergoing significant transformation through digitalization and the incorporation
of Big Data Analytics. Here's how a chocolate company can optimize benefits using Big Data:
1. Digitalization of Chocolate Manufacturing
IoT and Smart Factories: Using Internet of Things (IoT) technologies, such as digital twins, data analytics, and AI, to create interconnected smart factories. This real-time data integration aids in decision-making and streamlines processes, leading to increased productivity and reduced operational costs.
Automated Processes: Machines can switch automatically between different production stages (dosing, mixing, refining, conching) and adapt to various recipes and product types, enhancing flexibility and efficiency.
Data-Driven Risk Assessment: By analyzing data related to raw materials, companies can identify and mitigate potential food safety risks. Big Data allows for the identification of prevalent risks associated with raw materials, comparative evaluation of risks based on origin, and critical evaluation of suppliers.
Real-Time Monitoring: Utilizing digital tools to monitor nearly 400 unique chocolate-related food safety incidents globally. This enables proactive measures and timely actions to prevent incidents.
Demand Forecasting: By analyzing consumer data and market trends, chocolate companies can predict demand more accurately, ensuring optimal inventory levels and reducing waste.
Supplier Evaluation: Big Data helps in selecting the best suppliers by evaluating their risk profiles and performance, ensuring consistent quality and reliability in raw materials.
Product Innovation: Using insights from customer feedback and market trends to innovate new products and improve existing ones, ensuring that offerings meet customer expectations.
Energy Management: Using data analytics to optimize energy consumption in the manufacturing process, leading to cost savings and sustainable practices.
Comprehensive Data Analysis: Merging data from various sources (social media, websites, e-mails) with enterprise data to detect and prevent marketing fraud. This enriched data set provides deeper insights into potential fraud risks.
References:
Digital Bytes Make Better Bites: The Digitalization of Chocolate Manufacturing | News & Insights | Gray
Chocolate and Big Data: The Recipe for Food Safety Is Changing | FoodSafetyTech
ii) Automobile industry
The automobile industry is leveraging Big Data to drive innovation, enhance customer experiences, and improve operational efficiencies. Here's how:
R&D Activities: Automobile companies use Big Data analytics to streamline research and development processes. By analyzing vast datasets, companies can identify emerging trends, customer preferences, and technological advancements.
Inventory Management: Predictive analytics ensures optimal inventory levels, reducing waste and cost.
Customer Retention: Using Big Data to understand customer behavior, preferences, and satisfaction levels. This helps in creating personalized marketing strategies and improving customer retention rates.
Customer Experience: Analyzing data from multiple customer interactions to enhance the overall customer experience, which is crucial for long-term loyalty and a competitive edge.
Predictive Maintenance: Analyzing data from vehicle sensors to predict potential issues before they occur, reducing downtime and improving customer satisfaction.
Targeted Marketing: Utilizing Big Data to design and execute highly targeted marketing campaigns that resonate with specific customer segments.
Sales Forecasting: Predicting sales trends based on historical data and market analysis to inform strategic decisions.
Risk Management
Fraud Detection: Combining data from various sources (social media, websites, blogs) with enterprise data to detect and prevent fraud.
Risk Assessment: Using Big Data to evaluate and manage risks associated with raw materials, suppliers, and market conditions.
Competitive Edge: Companies like IBM, Microsoft, and SAP are leading the charge, offering advanced Big Data solutions tailored to the automotive industry.
References: https://www2.deloitte.com/content/dam/Deloitte/ch/Documents/manufacturing/deloitte-ch-auto-automotive-news-supplement.pdf
Hadoop is a computing environment in which input data is stored and processed, and the results are stored back. The environment consists of clusters, distributed on the cloud or across a set of servers. Each cluster holds a set of data files constituted of data blocks. Hadoop took its name from a child's stuffed toy elephant; the Hadoop system likewise stuffs files into data blocks across the cluster.
The complete system consists of a scalable, distributed set of clusters. The infrastructure consists of a cloud for the clusters, and a cluster consists of sets of computers or PCs. Hadoop provides a low-cost Big Data platform, which is open source and uses cloud services. Processing terabytes of data takes just a few minutes. Hadoop enables distributed processing of very large datasets across clusters of computers using a programming model called MapReduce.
The system characteristics are: scalable, self-manageable, self-healing, and a distributed file system. Scalable means the system can be scaled up (enhanced) by adding storage and processing units as per requirements.
The core components of the Hadoop framework are:
Hadoop Common - The common module contains the libraries and utilities required by the other modules of Hadoop. For example, Hadoop Common provides various components and interfaces for distributed file systems and general input/output. This includes serialization, Java RPC (Remote Procedure Call), and file-based data structures.
Hadoop Distributed File System (HDFS) - A Java-based distributed file system which can store all kinds of data on the disks at the clusters.
MapReduce v1 - The software programming model in Hadoop 1 using Mapper and Reducer. v1 processes large sets of data in parallel and in batches.
YARN - Software for managing computing resources. User application tasks or sub-tasks run in parallel on Hadoop; YARN handles scheduling and the requests for resources in the distributed running of the tasks.
MapReduce v2 - The Hadoop 2 YARN-based system for parallel processing of large datasets and distributed processing of application tasks.
1. HDFS (Hadoop Distributed File System):
o A distributed file system designed to store large data sets across multiple nodes.
o HDFS stores data in large blocks (default is 128 MB), which are distributed across different nodes in a cluster, enabling parallel processing.
2. MapReduce:
o A programming model used for processing large datasets in parallel across a Hadoop cluster.
o It consists of two main functions: Map, which processes and filters data, and Reduce, which aggregates the output of the Map phase to provide final results (a word-count sketch follows this list).
3. YARN:
o It manages and schedules jobs by allocating system resources for the execution of tasks across the nodes in a cluster.
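The word-count sketch referenced above is a toy, single-process Python simulation of the Map, shuffle/sort, and Reduce phases; a real cluster runs the same logic distributed across nodes.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(line):
        # Map: emit an intermediate (word, 1) pair for every word.
        for word in line.lower().split():
            yield (word, 1)

    def reduce_phase(word, counts):
        # Reduce: aggregate all counts emitted for one key.
        return (word, sum(counts))

    lines = ["big data needs big tools", "hadoop processes big data"]

    # Shuffle/sort: group the intermediate pairs by key, as the framework would.
    pairs = sorted(kv for line in lines for kv in map_phase(line))
    results = [
        reduce_phase(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    ]
    print(results)  # [('big', 3), ('data', 2), ('hadoop', 1), ...]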
Scalability: The ecosystem allows for easy scaling by adding more nodes to handle increasing
amounts of data and processing.
Fault Tolerance: Data is replicated across multiple nodes, ensuring that it is available even if
hardware fails.
Cost-Effective: Built on open-source technologies and designed to run on commodity hardware,
Hadoop offers a cost-efficient solution for handling big data.
Flexibility: The ecosystem provides tools for various tasks, from data storage and processing to
machine learning and real-time analytics.
1. ZooKeeper:
o A coordination service for distributed applications, ensuring synchronization across clusters.
o ZooKeeper handles configuration management, name service, and failure recovery in a
distributed environment.
o It manages the distributed systems by controlling access to shared resources and resolving
issues like race conditions and deadlocks.
2. Oozie:
o A workflow scheduler system designed to manage and run Hadoop jobs.
o It can orchestrate complex job workflows by chaining multiple tasks and handling
dependencies between them.
o It uses Directed Acyclic Graphs (DAGs) to represent workflows and supports time-based
triggers for running recurrent jobs.
3. Sqoop:
o A tool used to transfer data between Hadoop and relational databases (such as MySQL,
Oracle, PostgreSQL).
o It supports both import and export functions, enabling data movement from databases to
HDFS, and from Hadoop back into relational systems.
4. Flume:
o A tool for collecting, aggregating, and transporting large volumes of streaming data to HDFS.
o Often used to ingest log data from various sources, such as social media, web servers, and
sensor networks, into Hadoop.
o It provides fault-tolerance and reliable data flow mechanisms, ensuring efficient handling of
large data streams.
5. HBase:
o A non-relational, distributed, column-oriented database that runs on top of HDFS.
o HBase is used for real-time read/write access to large datasets, offering random access to
billions of rows and millions of columns.
o It is modeled after Google’s Bigtable and provides scalability, fault tolerance, and
consistency.
6. Hive:
o A data warehouse software that facilitates querying and managing large datasets stored in
HDFS using a SQL-like query language called HiveQL.
o Hive translates SQL queries into MapReduce jobs, making it easier for users familiar with
SQL to work with large datasets in Hadoop.
o It supports batch processing and is optimized for read-heavy workloads (see the invocation sketch after this list).
7. Pig:
o A high-level platform for creating MapReduce programs using a scripting language called
Pig Latin.
o Pig is designed for processing large datasets and simplifies the writing of complex data
transformations compared to Java MapReduce.
o It allows users to focus on the data flow without worrying about the underlying MapReduce
details.
8. Mahout:
o A machine learning library that provides scalable algorithms for clustering, classification, and
collaborative filtering on large datasets.
o Mahout leverages Hadoop and MapReduce to handle data mining and machine learning
tasks, enabling pattern discovery in big data.
9. Ambari:
o A web-based management tool that simplifies the provisioning, monitoring, and management
of Hadoop clusters.
o Ambari provides an intuitive user interface and REST APIs for managing cluster health,
configuring security, and monitoring various Hadoop components.
o It is widely used to automate the management of Hadoop clusters, making it easier to
administer and maintain large distributed systems.
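The invocation sketch referenced in the Hive entry above, assuming the hive CLI is on the PATH and a hypothetical toy_sales table exists. Hive compiles the HiveQL text into MapReduce jobs behind the scenes, so the caller only writes SQL-like text.

    import subprocess

    # hive -e runs a HiveQL string non-interactively; Hive translates the
    # GROUP BY below into map and reduce stages over the table's HDFS files.
    query = """
    SELECT category, COUNT(*) AS n
    FROM toy_sales
    GROUP BY category
    """
    subprocess.run(["hive", "-e", query], check=True)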