BDA QB Answers 8 To 15
Uploaded by Raghu Nayak

8. Explain the following: 1. Data Sources 2. Data Quality 3. Data Preprocessing

8.1 Data Sources:

Applications, programs and tools use data. Sources can be external, such as sensors, trackers, web logs, computer system logs and feeds. Sources can be machines, which source data from data-creating programs. A source can also be internal. Sources can be data repositories, such as a database, relational database, flat file, spreadsheet, mail server, web server, directory services, or even text or files such as comma-separated values (CSV) files. A source may also be a data store for applications.
Types of Data Source: structured, semi-structured, multi-structured or unstructured
Structured Data Source
 Data source for ingestion, storage and processing can be a file, database or streaming data.
 The source may be on the same computer running a program or on a networked computer.
 Structured data sources include SQL Server and MySQL.

Unstructured Data Source

 Unstructured data sources are distributed over high-speed networks.
 The data needs high-velocity processing. Sources are from distributed file systems.
 The sources are of file types, such as .txt (text file) and .csv (comma-separated values).
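For instance, a .csv file source like the ones named above can be ingested with a few lines of Python; the sensor records here are hypothetical stand-ins for a real file:

```python
import csv
import io

# Hypothetical CSV content standing in for a file-type data source.
raw = "sensor_id,reading\ns1,21.5\ns2,19.8\n"

# DictReader parses the header row and yields one dict per record.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows)
```

In practice the `io.StringIO` wrapper would be replaced by `open("source.csv")`.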

8.2 Data Quality

High-quality data enables all the required operations, analysis, decisions, planning and knowledge discovery to be done correctly. Data quality is characterized by the five R's:

 Relevancy
 Recency
 Range
 Robustness
 Reliability

Factors Affecting Data Quality:

 Data Noise: Noise in data refers to data giving additional meaningless information besides the true (actual/required) information. Noise is random in character.
 Outlier: An outlier refers to data which appears not to belong to the dataset. For example, data that is outside an expected range.
 Missing Value: A missing value implies data not appearing in the dataset.
 Duplicate Value: A duplicate value implies the same data appearing two or more times in a dataset.
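These factors can be checked mechanically. A minimal sketch using only the standard library, where the expected range of 0 to 100 is an illustrative assumption:

```python
# Toy readings: None marks a missing value, 250.0 is outside the expected range.
readings = [21.0, 22.5, None, 21.0, 250.0]

missing = [i for i, v in enumerate(readings) if v is None]        # missing values
present = [v for v in readings if v is not None]
duplicates = {v for v in present if present.count(v) > 1}         # repeated values
outliers = [v for v in present if not (0 <= v <= 100)]            # out of range

print(missing, duplicates, outliers)
```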

8.3 Data Pre-processing:

Data pre-processing is an important step at the ingestion layer. Pre-processing is a must before data mining and analytics. Pre-processing is also a must before running a Machine Learning (ML) algorithm.

Pre-processing needs are:


 Dropping out of range, inconsistent and outlier values
 Filtering unreliable, irrelevant and redundant information
 Data cleaning, editing, reduction and/or wrangling
 Data validation, transformation or transcoding
 ELT processing
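A hedged sketch of the cleaning steps listed above, applied to hypothetical sensor records; the valid range of -40 to 60 is an assumption for illustration:

```python
records = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},    # missing value  -> dropped
    {"id": 3, "temp": 999.0},   # out of range   -> dropped
    {"id": 1, "temp": 21.5},    # duplicate      -> dropped
]

seen, clean = set(), []
for r in records:
    if r["temp"] is None or not (-40 <= r["temp"] <= 60):
        continue                              # drop missing / out-of-range values
    key = (r["id"], r["temp"])
    if key in seen:
        continue                              # drop duplicate records
    seen.add(key)
    r["temp_f"] = r["temp"] * 9 / 5 + 32      # transcoding: Celsius -> Fahrenheit
    clean.append(r)

print(clean)
```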

9. Discuss data store export to the cloud


The data pre-processing, data mining, analysis and visualization results go to the data store. The data then exports to cloud services, and the results integrate at the enterprise server or data warehouse.
Data store export to the cloud refers to transferring data from on-premise storage systems or local servers
to a cloud environment. Cloud storage provides scalable, secure, and cost-effective options for storing
large datasets.

The data store first pre-processes data from machine and file data sources. Pre-processing transforms the data into a table or partition schema or into supported data formats, for example JSON, CSV and AVRO. Data then exports in compressed or uncompressed data formats.
Cloud service BigQuery provides service functions such as bigquery.tables.create, bigquery.dataEditor, bigquery.dataOwner, bigquery.admin and bigquery.tables.updateData. Analytics uses Google Analytics 360. BigQuery exports data to Google Cloud or a cloud backup only.
Apache Sqoop is a key tool used for data transfers between relational databases and the Hadoop Distributed File System (HDFS). It facilitates exporting and importing data, particularly between RDBMS systems (like MySQL, PostgreSQL, and SQL Server) and cloud-based Hadoop clusters.

Steps for Exporting Data to the Cloud:

1. Establish Connection: Sqoop connects to the relational database via JDBC (Java Database
Connectivity). The system gathers metadata and examines the database for the data being transferred.
2. Data Export: Sqoop submits a map-only Hadoop job that divides the input dataset into splits and
transfers each split using individual map tasks. This ensures efficient use of resources and allows
scalable export to cloud-based HDFS.
3. Cloud Storage Destination: The exported data is usually placed in a directory in HDFS or a similar
cloud-based distributed file system. HDFS is designed to handle large datasets across multiple
servers, making it ideal for cloud environments.
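The steps above correspond to a single Sqoop invocation. Purely as an illustration, the command can be assembled in Python; the host, database, table and HDFS path are hypothetical, and in practice the command runs via a shell on a Hadoop node:

```python
# Sketch of a Sqoop import (RDBMS -> HDFS), matching the three steps above.
# All connection details below are made-up examples.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # step 1: JDBC connection
    "--table", "orders",                               # source RDBMS table
    "--target-dir", "/data/orders",                    # step 3: HDFS destination
    "--num-mappers", "4",                              # step 2: 4 parallel map tasks
]
print(" ".join(cmd))
```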

Benefits of Exporting Data to the Cloud

1. Scalability
2. Cost effectiveness
3. Accessibility
4. Disaster recovery
5. Security

10. List the characteristics of a Big Data platform

A Big Data platform supports large datasets and high volumes of data. The data generates at higher velocity, in more variety or with higher veracity. Managing Big Data requires large resources of MPPs, cloud, parallel processing and specialized tools. A Big Data platform should provision tools and services for:

 storage, processing and analytics,


 developing, deploying, operating and managing Big Data environment,
 reducing the complexity of multiple data sources and integration of applications into one cohesive
solution,
 custom development, querying and integration with other systems, and
 the traditional as well as Big Data techniques.

Characteristics of a Big Data Platform


 Innovative Non-Traditional Methods: Utilizes advanced techniques for storage, processing, and analytics that go beyond traditional approaches to handle complex data efficiently.
 Distributed Data Stores: Data is distributed across multiple nodes to ensure redundancy, fault
tolerance, and improved performance.
 Scalability and Elasticity: Cloud computing platforms allow seamless scalability and elasticity,
enabling the system to grow and shrink based on demand.
 High Volume Data Stores: Capable of storing and managing massive volumes of data, often running
into petabytes or more.
 Massive Parallelism: Executes multiple operations concurrently, leveraging parallel processing to
improve speed and efficiency.
 High-Speed Networks: Relies on high-speed networking infrastructure to facilitate quick data
transfer and low-latency communication between nodes.
 High-Performance Processing: Employs optimized and fine-tuned processing techniques to ensure
high performance for both batch and real-time data analytics.
 NoSQL Data Management Models: Utilizes NoSQL databases to handle unstructured and semi-
structured data efficiently, offering flexibility and scalability.
 In-Memory Processing: Uses in-memory data processing for faster transaction and query
performance, suitable for both OLAP (Online Analytical Processing) and OLTP (Online Transaction
Processing) systems.
 Comprehensive Data Analytics: Includes capabilities for data retrieval, mining, reporting,
visualization, and advanced analytics to extract insights from large datasets.
 Graph Databases: Supports graph databases for analyzing relationships and patterns within social
network data and other interconnected datasets.
 Machine Learning: Integrates machine learning algorithms and models to derive predictive and
prescriptive insights from data.
 Diverse Data Sources: Ingests data from a wide range of sources such as data storages, data
warehouses, big data solutions like Oracle Big Data, MongoDB NoSQL, Cassandra NoSQL, and more.
 Real-Time Data Sources: Captures and processes data from real-time sources including sensors,
financial transaction audit trails, web, social media, weather data, and health records.

11. How can a toy company optimize the benefits using Big Data Analytics?

Optimizing Benefits for a Toy Company Using Big Data Analytics

To leverage Big Data Analytics effectively, a toy company can follow several strategies, drawing insights from various industry practices:
1. Customer Acquisition and Retention

 Personalized Experience: By analyzing customer data, the toy company can tailor marketing campaigns to individual preferences, similar to how Amazon does it. This can be based on past purchases, browsing behavior, and demographic information.

 Loyalty Programs: Utilize data to identify patterns and trends that foster customer loyalty, ensuring that the marketing efforts are directed toward retaining valuable customers.

2. Focused and Targeted Campaigns

 Segmented Marketing: Use Big Data to identify specific customer segments and target them with customized advertising campaigns. For instance, the company can deliver ads through SMS, e-mails, WhatsApp, LinkedIn, Facebook, and Twitter.

 Ad Optimization: Real-time analytics can help in understanding the effectiveness of various advertising channels and campaigns, allowing the company to allocate resources efficiently.

3. Innovative Product Development

 Product Insights: Analyze feedback from various sources such as social media, customer reviews, and purchase data to innovate and improve products. This will ensure the toys meet customer expectations and stay competitive.

 Trend Analysis: Utilize Big Data to spot emerging trends in the market, enabling the company to develop products that are in line with current consumer interests.

4. Detection of Marketing Frauds

 Fraud Prevention: Big Data analytics can help in detecting and preventing marketing frauds by merging existing data with information from social media, websites, blogs, and emails. This enriched data set can identify suspicious activities and prevent fraudulent transactions.

5. Risk Management

 Identify Risks: Implement Big Data solutions to improve risk management models, allowing the company to develop smarter strategies for navigating high-risk environments.

 Predictive Analytics: Use predictive analytics to foresee potential issues in the supply chain, production, or market trends, enabling preemptive measures.

6. Supply Chain Optimization


 Enhanced Collaboration: By analyzing data across the supply chain, the company can foster high-level collaboration among suppliers, improving efficiency and reducing constraints.

 Performance Tracking: Track supplier performance and optimize inventory management, ensuring a smooth supply chain operation.

7. Customer Value Analytics (CVA)

 Understand Customer Needs: Use CVA to analyze what customers really want from the products, ensuring that the toys deliver both perceived and desired value.

 Consistent Customer Experience: Implement insights from CVA to provide consistent and delightful customer experiences, akin to leading marketers like Amazon.

12. Explain the usage of Big Data analytics: i) to detect marketing frauds ii) in medicine iii) in advertising

i) To detect marketing frauds


Big Data Analytics in Detection of Marketing Frauds

Big Data analytics enables fraud detection. Big Data usage has the following features for enabling detection and prevention of frauds:

 Fusing existing data at an enterprise data warehouse with data from sources such as social media, websites, blogs and e-mails, thus enriching the existing data
 Using multiple sources of data and connecting with many applications
 Providing greater insights using querying of the multiple-source data
 Analyzing data to enable structured reports and visualization
 Providing high-volume data mining and new innovative applications, leading to new business intelligence and knowledge discovery

ii) In medicine

Big Data analytics deploys large volumes of data to identify and derive intelligence, using predictive models about individuals. Big Data driven approaches help research in medicine, which can help patients. Some findings are:

 Building the health profiles of individual patients and predictive models for diagnosing better and offering better treatment.
 Aggregating a large volume and variety of information from multiple sources, from DNAs, proteins and metabolites to cells, tissues, organs, organisms and ecosystems, which can enhance the understanding of the biology of diseases.
 Creating patterns and models by data mining, which helps in better understanding and research.
 Deploying wearable device data, recorded during active as well as inactive periods, which provides better understanding of patient health and better risk profiling of the user for certain diseases.

iii)advertising

The impact of Big Data on the digital advertising industry is tremendous. The industry sends advertisements using SMS, e-mails, WhatsApp, LinkedIn, Facebook, Twitter and other media. Big Data captures data from multiple sources in large volume, velocity and variety, and the unstructured data enriches the structured data at the enterprise data warehouse.

Big Data real-time analytics provides emerging trends and patterns, and gains actionable insights for facing competition from similar products. The data helps digital advertisers discover new relationships and less competitive regions and areas. Success from advertisements depends on collection, analysis and mining of the data. The new insights enable personalization and targeting of online, social media and mobile advertisements, called hyper-localized advertising.

Advertising on digital media needs optimization. Too much usage can also have a negative effect. Phone calls, SMSs and e-mail-based advertisements can be a nuisance if sent without appropriate research on the potential targets. Analytics help in this direction. Using Big Data after appropriate filtering and elimination, with appropriate data, data forms and data handling in the right manner, is a crucial enabler of Big Data analytics.

13. How is Big Data used in i) a chocolate company ii) the automobile industry?

The chocolate industry is undergoing significant transformation through digitalization and the incorporation
of Big Data Analytics. Here's how a chocolate company can optimize benefits using Big Data:
1. Digitalization of Chocolate Manufacturing

 IoT and Smart Factories: Using Internet of Things (IoT) technologies, such as digital twins, data analytics, and AI, to create interconnected smart factories. This real-time data integration aids in decision-making and streamlines processes, leading to increased productivity and reduced operational costs.

 Automated Processes: Machines can switch automatically between different production stages (dosing, mixing, refining, conching) and adapt to various recipes and product types, enhancing flexibility and efficiency.

2. Monitoring and Improving Food Safety

 Data-Driven Risk Assessment: By analyzing data related to raw materials, companies can identify and mitigate potential food safety risks. Big Data allows for the identification of prevalent risks associated with raw materials, comparative evaluation of risks based on origin, and critical evaluation of suppliers.

 Real-Time Monitoring: Utilizing digital tools to monitor nearly 400 unique chocolate-related food safety incidents globally. This enables proactive measures and timely actions to prevent incidents.

3. Optimizing Supply Chain and Inventory Management

 Demand Forecasting: By analyzing consumer data and market trends, chocolate companies can predict demand more accurately, ensuring optimal inventory levels and reducing waste.

 Supplier Evaluation: Big Data helps in selecting the best suppliers by evaluating their risk profiles and performance, ensuring consistent quality and reliability in raw materials.

4. Enhancing Customer Experience

 Personalized Marketing: Leveraging data analytics to understand customer preferences and behavior, enabling targeted and personalized marketing campaigns. This helps in building stronger customer relationships and loyalty.

 Product Innovation: Using insights from customer feedback and market trends to innovate new products and improve existing ones, ensuring that offerings meet customer expectations.

5. Efficient Operations and Cost Reduction


 Automation and Robotics: Integrating robotic technology to handle various production tasks, which reduces labor costs and increases efficiency. Flexible configurations allow for quick adjustments to produce different products or cater to seasonal demands.

 Energy Management: Using data analytics to optimize energy consumption in the manufacturing process, leading to cost savings and sustainable practices.

6. Fraud Detection and Prevention

 Comprehensive Data Analysis: Merging data from various sources (social media, websites, emails) with enterprise data to detect and prevent marketing frauds. This enriched data set provides deeper insights into potential fraud risks.

References :
Digital Bytes Make Better Bites: The Digitalization of Chocolate Manufacturing | News & Insights | Gray

Chocolate and Big Data: The Recipe for Food Safety Is Changing - FoodSafetyTech

ii)Automobile industry

The automobile industry is leveraging Big Data to drive innovation, enhance customer experiences, and improve operational efficiencies. Here's how:

1. Product Development and Innovation

 R&D Activities: Automobile companies use Big Data analytics to streamline research and development processes. By analyzing vast datasets, companies can identify emerging trends, customer preferences, and technological advancements.

 Strategic Partnerships: Collaborations, such as National Instruments Corporation (NIC) acquiring Heinzinger GmbH's electronic vehicle systems division, enhance capabilities in electrification, battery testing, and sustainable energy.

2. Supply Chain and Manufacturing

 Optimized Manufacturing: Big Data helps in optimizing manufacturing processes by providing real-time insights into production lines, reducing downtime, and improving efficiency.

 Inventory Management: Predictive analytics ensures optimal inventory levels, reducing waste and cost.

3. Connected Vehicles and Intelligent Transportation


 Telematics Data: Collecting and analyzing data from connected vehicles to enhance safety, performance, and user experience. This includes monitoring vehicle health, driving patterns, and real-time navigation.

 Intelligent Transportation Systems: Improving traffic management and reducing congestion through data-driven insights.

4. Customer Behavior Analytics

 Customer Retention: Using Big Data to understand customer behavior, preferences, and satisfaction levels. This helps in creating personalized marketing strategies and improving customer retention rates.

 Customer Experience: Analyzing data from multiple customer interactions to enhance the overall customer experience, which is crucial for long-term loyalty and competitive edge.

5. OEM Warranty and Aftersales/Dealers

 Predictive Maintenance: Analyzing data from vehicle sensors to predict potential issues before they occur, reducing downtime and improving customer satisfaction.

 Aftermarket Services: Providing tailored services and solutions based on data-driven insights to enhance customer satisfaction and loyalty.

6. Sales, Marketing, and Other Applications

 Targeted Marketing: Utilizing Big Data to design and execute highly targeted marketing campaigns that resonate with specific customer segments.

 Sales Forecasting: Predicting sales trends based on historical data and market analysis to inform strategic decisions.

7. Risk Management

 Fraud Detection: Combining data from various sources (social media, websites, blogs) with enterprise data to detect and prevent fraud.

 Risk Assessment: Using Big Data to evaluate and manage risks associated with raw materials, suppliers, and market conditions.

8. Global and Regional Market Insights


 Market Segmentation: The Big Data market in the automotive industry is segmented by application and geography, allowing companies to tailor strategies based on regional and application-specific insights.

 Competitive Edge: Companies like IBM, Microsoft, and SAP are leading the charge, offering advanced Big Data solutions tailored to the automotive industry.
References : https://www2.deloitte.com/content/dam/Deloitte/ch/Documents/manufacturing/deloitte-ch-auto-
automotive-news-supplement.pdf

14. What is Hadoop? Explain the core components of Hadoop.

Hadoop is a computing environment in which input data is stored and processed, and the results are stored. The environment consists of clusters, which distribute at the cloud or across a set of servers. Each cluster consists of a string of data files constituting data blocks. The name came from a toy, a stuffed elephant; the Hadoop system cluster stuffs files into data blocks.

The complete system consists of a scalable distributed set of clusters. The infrastructure consists of cloud for the clusters. A cluster consists of sets of computers or PCs. The Hadoop platform provides a low-cost Big Data platform, which is open source and uses cloud services. Terabytes of data processing takes just a few minutes. Hadoop enables distributed processing of large datasets (above 10 million bytes) across clusters of computers using a programming model called MapReduce.

The system characteristics are: scalable, self-manageable, self-healing and a distributed file system. Scalable means it can be scaled up (enhanced) by adding storage and processing units as per requirements.
The Hadoop core components of the framework are:

 Hadoop Common - The common module contains the libraries and utilities that are required by the other modules of Hadoop. For example, Hadoop Common provides various components and interfaces for the distributed file system and general input/output. This includes serialization, Java RPC (Remote Procedure Call) and file-based data structures.
 Hadoop Distributed File System (HDFS) - A Java-based distributed file system which can store all kinds of data on the disks at the clusters.
 MapReduce v1 - Software programming model in Hadoop 1 using Mapper and Reducer. v1 processes large sets of data in parallel and in batches.
 YARN - Software for managing resources for computing. The user application tasks or sub-tasks run in parallel at Hadoop, using scheduling and handling the requests for the resources in distributed running of the tasks.
 MapReduce v2 - Hadoop 2 YARN-based system for parallel processing of large datasets and distributed processing of the application tasks.
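The Mapper/Reducer model named above can be sketched in a few lines of single-process Python: the map phase emits (word, 1) pairs, the framework sorts and shuffles them by key, and the reduce phase sums the counts per word. The input lines are illustrative; real MapReduce distributes these phases across the cluster.

```python
from itertools import groupby

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Aggregate all counts seen for one key.
    return word, sum(counts)

lines = ["big data big clusters", "data blocks"]
pairs = sorted(pair for line in lines for pair in map_phase(line))  # map + shuffle
result = dict(
    reduce_phase(word, [c for _, c in group])
    for word, group in groupby(pairs, key=lambda p: p[0])           # reduce
)
print(result)  # word counts, e.g. 'big' appears twice
```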

15. Explain the Hadoop Ecosystem with a neat diagram


The Hadoop Ecosystem is a collection of open-source components and tools that work
together to enable the storage, processing, and analysis of large datasets in a distributed
environment. This ecosystem is built around the Hadoop framework and provides a variety
of solutions for managing Big Data. Here’s a detailed explanation of the core components
and tools of the Hadoop ecosystem:

Core Components of Hadoop

1. Hadoop Distributed File System (HDFS):

o A distributed file system designed to store large data sets across multiple
nodes.

o It provides fault-tolerant storage by replicating data blocks across different nodes, ensuring data availability even in the event of hardware failures.

o HDFS stores data in large blocks (default is 128 MB), which are distributed
across different nodes in a cluster, enabling parallel processing.

2. MapReduce:
o A programming model used for processing large datasets in parallel across a
Hadoop cluster.

o It consists of two main functions: Map, which processes and filters data, and
Reduce, which aggregates the output of the Map phase to provide final results.

o It allows the distribution of tasks across many nodes, enabling efficient processing of large data.

3. YARN (Yet Another Resource Negotiator):

o YARN is the resource management layer of Hadoop.

o It manages and schedules jobs by allocating system resources for the execution
of tasks across the nodes in a cluster.

o It allows multiple applications to run simultaneously and handles large-scale distributed data processing.
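The HDFS block model above can be made concrete with a small calculation; the 1 GB file size is a hypothetical example, and 128 MB blocks with a replication factor of 3 are the usual defaults:

```python
import math

BLOCK_MB = 128                            # default HDFS block size
file_mb = 1024                            # hypothetical 1 GB file
blocks = math.ceil(file_mb / BLOCK_MB)    # blocks the file is split into
replicas = blocks * 3                     # stored copies at replication factor 3

print(blocks, replicas)
```

Each of the 8 blocks can be processed by a separate task, which is what enables the parallelism described above.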

Key Benefits of the Hadoop Ecosystem:

 Scalability: The ecosystem allows for easy scaling by adding more nodes to handle increasing
amounts of data and processing.
 Fault Tolerance: Data is replicated across multiple nodes, ensuring that it is available even if
hardware fails.
 Cost-Effective: Built on open-source technologies and designed to run on commodity hardware,
Hadoop offers a cost-efficient solution for handling big data.
 Flexibility: The ecosystem provides tools for various tasks, from data storage and processing to
machine learning and real-time analytics.

Hadoop Ecosystem Tools:

1. ZooKeeper:
o A coordination service for distributed applications, ensuring synchronization across clusters.
o ZooKeeper handles configuration management, name service, and failure recovery in a
distributed environment.
o It manages the distributed systems by controlling access to shared resources and resolving
issues like race conditions and deadlocks.
2. Oozie:
o A workflow scheduler system designed to manage and run Hadoop jobs.
o It can orchestrate complex job workflows by chaining multiple tasks and handling
dependencies between them.
o It uses Directed Acyclic Graphs (DAGs) to represent workflows and supports time-based
triggers for running recurrent jobs.
3. Sqoop:
o A tool used to transfer data between Hadoop and relational databases (such as MySQL,
Oracle, PostgreSQL).
o It supports both import and export functions, enabling data movement from databases to
HDFS, and from Hadoop back into relational systems.
4. Flume:
o A tool for collecting, aggregating, and transporting large volumes of streaming data to HDFS.
o Often used to ingest log data from various sources, such as social media, web servers, and
sensor networks, into Hadoop.
o It provides fault-tolerance and reliable data flow mechanisms, ensuring efficient handling of
large data streams.
5. HBase:
o A non-relational, distributed, column-oriented database that runs on top of HDFS.
o HBase is used for real-time read/write access to large datasets, offering random access to
billions of rows and millions of columns.
o It is modeled after Google’s Bigtable and provides scalability, fault tolerance, and
consistency.
6. Hive:
o A data warehouse software that facilitates querying and managing large datasets stored in
HDFS using a SQL-like query language called HiveQL.
o Hive translates SQL queries into MapReduce jobs, making it easier for users familiar with
SQL to work with large datasets in Hadoop.
o It supports batch processing and is optimized for read-heavy workloads.
7. Pig:
o A high-level platform for creating MapReduce programs using a scripting language called
Pig Latin.
o Pig is designed for processing large datasets and simplifies the writing of complex data
transformations compared to Java MapReduce.
o It allows users to focus on the data flow without worrying about the underlying MapReduce
details.
8. Mahout:
o A machine learning library that provides scalable algorithms for clustering, classification, and
collaborative filtering on large datasets.
o Mahout leverages Hadoop and MapReduce to handle data mining and machine learning
tasks, enabling pattern discovery in big data.
9. Ambari:
o A web-based management tool that simplifies the provisioning, monitoring, and management
of Hadoop clusters.
o Ambari provides an intuitive user interface and REST APIs for managing cluster health,
configuring security, and monitoring various Hadoop components.
o It is widely used to automate the management of Hadoop clusters, making it easier to
administer and maintain large distributed systems.
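As a loose, single-machine analogy to Hive's SQL-on-Hadoop model described above, SQL-style aggregation over raw records looks like this (Python's built-in sqlite3 is used purely for illustration; Hive itself would translate similar HiveQL into MapReduce jobs over HDFS tables, and the page-view data is hypothetical):

```python
import sqlite3

# In Hive, this table would be an HDFS-backed table defined in HiveQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (page TEXT, hits INTEGER)")
conn.executemany("INSERT INTO views VALUES (?, ?)",
                 [("home", 3), ("about", 1), ("home", 2)])

# A HiveQL-like aggregate query; Hive compiles such queries into MapReduce.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM views GROUP BY page ORDER BY page"
).fetchall()
print(rows)
```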
