Big Data R18 Manual
For
(DATA SCIENCE)
(R18 Regulations)
Accredited by NBA & NAAC with "A" Grade; An ISO 9001:2015 Certified Institution
Maisammaguda, Medchal (Dist), Hyderabad -500100, Telangana.
PROGRAM OUTCOMES (POs)
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
(DATA SCIENCE)
DEPARTMENT OF CSE-DATA SCIENCE
Things to Do:
1) Students should not bring any electronic gadgets into the lab.
2) Students should not come late.
3) Students should not create any disturbance to others.
Course Code:                                            L-T-P-C: 0 0 3 1.5
Course Objectives
1. The purpose of this course is to provide students with knowledge of Big Data Analytics principles and techniques.
2. This course is also designed to give exposure to the frontiers of Big Data Analytics.
Course Outcomes
1. Use Excel as an analytical and visualization tool.
2. Ability to program using Hadoop and MapReduce.
3. Ability to perform data analytics using ML in R.
4. Use Cassandra to perform social media analytics.
List of Experiments
1. Implement a simple map-reduce job that builds an inverted index on the set of input documents (Hadoop)
2. Process big data in HBase
3. Store and retrieve data in Pig
4. Perform Social media analysis using Cassandra
5. Buyer event analytics using Cassandra on suitable product sales data.
6. Using Power Pivot (Excel) Perform the following on any dataset
a) Big Data Analytics
b) Big Data Charting
7. Use R-Project to carry out statistical analysis of big data
8. Use R-Project for data visualization of social media data
TEXT BOOKS:
1. Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley 2015.
2. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses, Michael Minelli, Michele Chambers, Ambiga Dhiraj, 1st Edition, Wiley CIO Series, 2013.
3. Hadoop: The Definitive Guide, Tom White, 3rd Edition, O'Reilly Media, 2012.
4. Big Data Analytics: Disruptive Technologies for Changing the Game, Arvind Sathi, 1st Edition, IBM Corporation, 2012.
REFERENCES:
1. Big Data and Business Analytics, Jay Liebowitz, Auerbach Publications, CRC press (2013).
2. Using R to Unlock the Value of Big Data: Big Data Analytics with Oracle R Enterprise and
Oracle R Connector for Hadoop, Tom Plunkett, Mark Hornick, McGraw-Hill/Osborne Media
(2013), Oracle press.
3. Professional Hadoop Solutions, Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, Wiley, ISBN: 9788126551071, 2015.
4. Understanding Big Data, Chris Eaton, Dirk deRoos et al., McGraw-Hill, 2012.
5. Intelligent Data Analysis, Michael Berthold, David J. Hand, Springer, 2007.
CO - PO MAPPING
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 3 3 3 3 3 - 2 3 3 - - 3
CO2 3 3 3 2 2 2 - 3 3 - - 3
CO3 3 3 3 3 3 - - 3 3 - - 3
CO4 3 3 3 3 3 - - 3 3 - - 3
CO5 3 3 3 - 3 - - 3 3 - - 3
AVG 3 3 3 3 3 2 2 3 3 2 2 3
CO - PSO MAPPING:
PSO1 PSO2
CO1 - 2
CO2 - 1
CO3 - 1
CO4 - 2
CO5 - 1
AVG 0 2
Experiment 1: Implement a simple map-reduce job that builds an inverted index on the set of input documents (Hadoop)
Hadoop is an Apache open-source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides distributed
storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
MapReduce
The Algorithm
The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and Output types of a MapReduce job − (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
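As a concrete illustration for the inverted-index job (the file names and contents here are made up): suppose doc1.txt contains "hello world" and doc2.txt contains "hello hadoop". The map phase emits <hello@doc1, 1>, <world@doc1, 1>, <hello@doc2, 1> and <hadoop@doc2, 1>; after the shuffle, the reduce phase groups the pairs by word and produces an index such as <hello, doc1:1; doc2:1>, <world, doc1:1> and <hadoop, doc2:1>.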
Procedure:
exit
3. Install Hadoop by navigating to the following link and downloading the tar.gz
file for Hadoop version 3.3.0 (or a later version if you wish). (478 MB)
https://hadoop.apache.org/release/3.3.0.html
15. Before starting the Hadoop Distributed File System (HDFS), we need to make sure that the rcmd type is "ssh", not "rsh", when we type the following command
pdsh -q -w localhost
16. If the rcmd type is “rsh” as in the above figure, type the following commands:
export PDSH_RCMD_TYPE=ssh
20. Go to localhost:9870 from the browser. You should see the NameNode web interface.
2. Create a directory on the Desktop named Lab and inside it create two folders; one called "Input" and the other called "tutorial_classes".
[You can do this step using GUI normally or through terminal commands]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
4. Add the file attached with this document “input.txt” in the directory Lab/Input.
5. Type the following command to export the hadoop classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
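The compilation command is not reproduced in this copy of the manual. Assuming the four inverted-index source files listed below are in the current directory, a typical command would be:
javac -classpath ${HADOOP_CLASSPATH} -d tutorial_classes IndexDriver.java IndexMapper.java IndexReducer.java IndexCombiner.java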
Put the output class files in one jar file (note the dot at the end of the command):
jar -cvf WordCount.jar -C tutorial_classes .
IndexMapper.java
package mr03.inverted_index;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
import java.util.StringTokenizer;

public class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final Text ONE_STRING = new Text("1");
    private final Text wordAtFileNameKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input split tells us which source document this line came from.
        FileSplit split = (FileSplit) context.getInputSplit();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String fileName = split.getPath().getName().split("\\.")[0];
            // Remove special characters if required, e.g.
            // tokenizer.nextToken().replaceAll("[^a-zA-Z]", "").toLowerCase(),
            // and check for empty words.
            // Emit "word@fileName" -> "1".
            wordAtFileNameKey.set(tokenizer.nextToken() + "@" + fileName);
            context.write(wordAtFileNameKey, ONE_STRING);
        }
    }
}
IndexReducer.java
package mr03.inverted_index;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
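The body of the reducer is not reproduced in this copy of the manual. A minimal sketch that is consistent with the driver settings (Text keys and values) and with the "word@fileName" → count pairs produced by the mapper and combiner might look as follows; java.io.IOException must also be imported, and the original lab code may differ in detail:

public class IndexReducer extends Reducer<Text, Text, Text, Text> {

    private final Text postingsList = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The combiner has already re-keyed the pairs to "word" -> "fileName:count",
        // so here we simply concatenate the postings list for each word.
        StringBuilder postings = new StringBuilder();
        for (Text value : values) {
            if (postings.length() > 0) {
                postings.append("; ");
            }
            postings.append(value.toString());
        }
        postingsList.set(postings.toString());
        context.write(key, postingsList);
    }
}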
IndexDriver.java
package mr03.inverted_index;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;

public class IndexDriver {

    public static void main(String[] args) throws Exception {
        // Input and output paths are taken from the command line, e.g.
        // hadoop jar WordCount.jar mr03.inverted_index.IndexDriver <input> <output>
        String input = args[0];
        String output = args[1];

        Configuration conf = new Configuration();
        // Delete the output directory if it already exists so the job can be re-run.
        FileSystem fs = FileSystem.get(conf);
        boolean exists = fs.exists(new Path(output));
        if (exists) {
            fs.delete(new Path(output), true);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(IndexDriver.class);
        job.setMapperClass(IndexMapper.class);
        job.setCombinerClass(IndexCombiner.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
IndexCombiner.java
package mr03.inverted_index;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
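The combiner body is likewise not reproduced. In this inverted-index pattern the combiner typically sums the counts for each "word@fileName" key and re-keys the pair to "word" → "fileName:count" before it reaches the reducer; a sketch under that assumption (the original code may differ):

public class IndexCombiner extends Reducer<Text, Text, Text, Text> {

    private final Text wordKey = new Text();
    private final Text filenameAtCount = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the "1" counts emitted by the mapper for this "word@fileName" key.
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        // Split "word@fileName" apart and emit "word" -> "fileName:count".
        String[] parts = key.toString().split("@");
        wordKey.set(parts[0]);
        filenameAtCount.set(parts[1] + ":" + sum);
        context.write(wordKey, filenameAtCount);
    }
}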
Output:
Experiment 2. Process big data in HBase
Theory:
HBase is an open-source, sorted map datastore built on top of Hadoop. It is column-oriented and horizontally scalable.
It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
Limitations of a traditional RDBMS that HBase addresses:
• An RDBMS gets exponentially slower as the data becomes large.
• It expects data to be highly structured, i.e. to fit in a well-defined schema.
• Any change in schema might require downtime.
• For sparse datasets, there is too much overhead in maintaining NULL values.
Features of HBase
• Horizontally scalable: new nodes can be added to the cluster at any time, and new columns can be added to a column family at any time.
• Automatic failover: automatic failover allows the system to switch data handling to a standby node in the event of a node failure or compromise.
• Integration with the Map/Reduce framework: the shell commands and Java APIs internally use Map/Reduce to do their work, and HBase is built over the Hadoop Distributed File System.
• HBase is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key, and timestamp.
• It is often referred to as a key-value store or column-family-oriented database, or as storing versioned maps of maps.
• Fundamentally, it is a platform for storing and retrieving data with random access.
• It doesn't care about datatypes (you can store an integer in one row and a string in another for the same column).
• It doesn't enforce relationships within your data.
• It is designed to run on a cluster of computers, built using commodity hardware.
HBase commands
Step 1: First go to the terminal and type StartCDH.sh
Step 2: Next type the jps command in the terminal
Step 5: hbase(main):001:0> version
version gives you the version of HBase
Create Table Syntax
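The create statement itself is not reproduced in this copy. In the HBase shell a table is created by giving the table name and at least one column family, for example (the column family name here is only illustrative):
hbase(main):002:0> create 'newtbl', 'personaldata'
The table is then taken offline with disable 'newtbl', after which its state can be verified as shown below.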
Verification
After disabling the table, you can still sense its existence
through list and exists commands. You cannot scan it. It will give you the following error.
hbase(main):028:0> scan 'newtbl'
ROW COLUMN + CELL
ERROR: newtbl is disabled.
is_disabled
This command is used to find whether a table is disabled. Its syntax is as follows.
hbase> is_disabled 'table name'
disable_all
This command is used to disable all the tables matching the given regex. The syntax
for disable_all command is given below.
hbase> disable_all 'r.*'
Suppose there are 5 tables in HBase, namely raja, rajani, rajendra, rajesh, and raju. The
following code will disable all the tables starting with raj.
hbase(main):002:07> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled
Enabling a Table using HBase Shell
Syntax to enable a table:
enable 'newtbl'
Example
Given below is an example to enable a table.
Verification
After enabling the table, scan it. If you can see the schema, your table is successfully
enabled.
is_enabled
This command is used to find whether a table is enabled. Its syntax is as follows:
hbase> is_enabled 'table name'
The following code verifies whether the table named emp is enabled. If it is enabled, it
will return true and if not, it will return false.
hbase(main):031:0> is_enabled 'newtbl'
true
0 row(s) in 0.0440 seconds
describe
This command returns the description of the table. Its syntax is as follows:
hbase> describe 'table name'
hbase(main):006:0> describe 'newtbl'
DESCRIPTION                                          ENABLED
Experiment 3: Store and retrieve data in Pig
Aim: To perform storing and retrieval of big data using Apache Pig
Resources: Apache Pig
Theory:
Pig is a platform that works with large data sets for the purpose of analysis. The
Pig dialect is called Pig Latin, and the Pig Latin commands get compiled into
MapReduce jobs that can be run on a suitable platform, like Hadoop.
Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in turn enables them
to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces
sequences of Map-Reduce programs, for which large-scale parallel
implementations already exist (e.g., the Hadoop subproject). Pig's language layer
currently consists of a textual language called Pig Latin, which has the following
key properties:
Operator − Description
LOAD − Loads the data from the file system (local/HDFS) into a relation.
Sorting operators
Diagnostic operators (e.g., DUMP, EXPLAIN)
ID    Name      Age   City
001   Angelina  22    LosAngeles
002   Jackie    23    Beijing
003   Deepika   22    Mumbai
004   Pawan     24    Hyderabad
005   Rajani    21    Chennai
006   Amitabh   22    Mumbai
Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir:
$ hdfs dfs -mkdir /bdalab/pigdir
Step-2: The input file of Pig contains each tuple/record in individual lines with the entities
separated by a delimiter ( “,”).
In the local file system, create an input file student_data.txt containing the data shown below:
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3

In the local file system, create an input file employee_data.txt containing the data shown below:
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai
Step-3: Move the file from the local file system to HDFS using put (Or) copyFromLocal
command and verify using -cat command
To get the path of the file student_data.txt type the below command
readlink -f student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/student_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/employee_data.txt /bdalab/pigdir/
Step-4: Apply Relational Operator – LOAD to load the data from the file student_data.txt into
Pig by executing the following Pig Latin statement in the Grunt shell. Relational
Operators are NOT case sensitive.
$ pig => will direct to grunt> shell
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray, cgpa:double);
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
Step-5: Apply Relational Operator – STORE to Store the relation in the HDFS directory
“/pig_output/” as shown below.
grunt> STORE student INTO '/bdalab/pigdir/pig_output/student' USING PigStorage(',');
grunt> STORE employee INTO '/bdalab/pigdir/pig_output/employee' USING PigStorage(',');
Step-7: Apply the Diagnostic Operator − DUMP to print the contents of a relation.
Step-9: Apply the Diagnostic Operator − EXPLAIN to display the logical, physical, and MapReduce execution plans of a relation, as shown below.
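For reference, these two diagnostic commands on the student relation loaded above look like this in the Grunt shell:
grunt> DUMP student;
grunt> EXPLAIN student;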
Experiment 4: Perform Social media analysis using Cassandra
Aim: To perform social media analysis using Cassandra
Resources: Cassandra
Procedure:
Cassandra is a distributed database for low-latency, high-throughput services that handle real-time workloads comprising hundreds of updates per second and tens of thousands of reads per second.
When looking to replace a key-value store with something more capable on the real-time replication and data distribution front, research on Dynamo, the CAP theorem and the eventual consistency model shows that Cassandra fits this model quite well. As one learns more about its data modeling capabilities, one gradually moves towards decomposing (denormalizing) the data.
Understand Cassandra's architecture very well and what it does under the hood. Cassandra 2.0 provides lightweight transactions and triggers, but they are not the same as the traditional database transactions one might be familiar with. For example, there are no foreign key constraints available; they have to be handled by one's own application. Understanding one's use cases and data access patterns clearly before modeling data with Cassandra, and reading all the available documentation, is a must.
Capture
This command captures the output of a command and adds it to a file. For example,
take a look at the following code that captures the output to a file named Outputfile.
cqlsh> CAPTURE '/home/hadoop/CassandraProgs/Outputfile'
When we type any command in the terminal, the output will be captured by the file
given. Given below is the command used and the snapshot of the output file.
cqlsh:tutorialspoint> select * from emp;
Consistency
This command shows the current consistency level, or sets a new consistency level.
cqlsh:tutorialspoint> CONSISTENCY
Current consistency level is 1.
Copy
This command copies data to and from Cassandra to a file. Given below is an example
to copy the table named emp to the file myfile.
cqlsh:tutorialspoint> COPY emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) TO 'myfile';
4 rows exported in 0.034 seconds.
If you open and verify the given file, you can find the copied data as shown below.
Describe
This command describes the current cluster of Cassandra and its objects. The variants
of this command are explained below.
Describe cluster − This command provides information about the cluster, for example its range ownership:
-658380912249644557 [127.0.0.1]
-2833890865268921414 [127.0.0.1]
-6792159006375935836 [127.0.0.1]
Describe keyspaces − This command lists all the keyspaces in a cluster. Given below is the usage of this command.
cqlsh:tutorialspoint> describe keyspaces;
Describe table − This command gives the description of a table. The tail end of its output lists the table properties and any secondary indexes, for example:
  AND memtable_flush_period_in_ms = 0
  AND min_index_interval = 128
  AND read_repair_chance = 0.0
  AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX emp_emp_sal_idx ON tutorialspoint.emp (emp_sal);

Describe type − This command is used to describe a user-defined data type. Given below is the usage of this command.
cqlsh:tutorialspoint> describe type card_details;

Describe types − This command lists the user-defined data types, for example:
card_details card
Expand
This command is used to expand the output. Before using this command, you have to
turn the expand command on. Given below is the usage of this command.
cqlsh:tutorialspoint> expand on;
cqlsh:tutorialspoint> select * from emp;
@ Row 1
-----------+------------
emp_id | 1
emp_city | Hyderabad
emp_name | ram
emp_phone | 9848022338
emp_sal | 50000
@ Row 2
-----------+------------
emp_id | 2
emp_city | Delhi
emp_name | robin
emp_phone | 9848022339
emp_sal | 50000
@ Row 3
-----------+------------
emp_id | 4
emp_city | Pune
emp_name | rajeev
emp_phone | 9848022331
emp_sal | 30000
@ Row 4
-----------+------------
emp_id | 3
emp_city | Chennai
emp_name | rahman
emp_phone | 9848022330
emp_sal | 50000
(4 rows)
Note − You can turn the expand option off using the following command.
cqlsh:tutorialspoint> expand off;
Disabled Expanded output.
Exit
This command is used to terminate the cql shell.
Show
This command displays the details of current cqlsh session such as Cassandra
version, host, or data type assumptions. Given below is the usage of this command.
cqlsh:tutorialspoint> show host;
Connected to Test Cluster at 127.0.0.1:9042.
Source
Using this command, you can execute the commands in a file. Suppose our input file is as follows −
Then you can execute the file containing the commands as shown below.
cqlsh:tutorialspoint> source '/home/hadoop/CassandraProgs/inputfile';
Experiment 5: Buyer event analytics using Cassandra on suitable product sales data
Aim: To perform buyer event analysis using Cassandra on sales data
Theory:
Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL
treats the database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to
work with CQL or separate application language drivers.
Clients approach any of the nodes for their read-write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.
Write Operations
Every write activity of nodes is captured by the commit logs written in the nodes. Later the
data will be captured and stored in the mem-table. Whenever the mem-table is full, data will
be written into the SStable data file. All writes are automatically partitioned and replicated
throughout the cluster. Cassandra periodically consolidates the SSTables, discarding
unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom filter
to find the appropriate SSTable that holds the required data.
Apache is an open-source software foundation, best known for the Apache HTTP web server that delivers web content over the internet and has gained huge popularity over the last few years as the most used web server software. Cassandra is an open-source database management system with the capacity to handle a large amount of data across servers. It was first developed by Facebook for the inbox search feature and was released as an open-source project back in 2008.
The following year, Cassandra became a part of the Apache Incubator and, combined with Apache, it has reached new heights. To put it in simple terms, Apache Cassandra is a powerful open-source distributed database system that can work efficiently to handle a massive amount of data.
DATA MODELLING
The way data is modeled is a major difference between Cassandra and MySQL.
Let us consider a platform where users can post, and you have commented on a post of another user. The two databases store this information differently. In Cassandra, you can store the data in a single table: the comments for each user are stored in the form of a collection (such as a List or a Map) within that user's row. In MySQL, you have to create two tables with a one-to-many relationship between them, because MySQL does not permit unstructured data such as a List or a Map, so one-to-many relationships are modeled with a separate table and a foreign key.
READ PERFORMANCE
In MySQL, the query to retrieve the comments made by a user (for example, user 5) is a SELECT that joins the users and comments tables on the user id. When you utilize indexing in MySQL, it saves the data like a binary tree, so each read involves an index lookup; the corresponding single-partition read in Cassandra is shown below.
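For comparison, the Cassandra-side read for the same access pattern is a single lookup by partition key; a sketch with hypothetical table and column names:
cqlsh> SELECT * FROM comments_by_user WHERE user_id = 5;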
WRITE PERFORMANCE
In MySQL, an update of an existing row involves two steps:
1. First locate the existing row.
2. Then update it.
Cassandra leverages an append-only model; insert and update have no fundamental difference. If you insert a row that has the same primary key as an existing row, the row will be replaced. Or, if you update a row with a non-existent primary key, Cassandra will create the row. Cassandra is therefore very fast at writes and stores large swathes of data on commodity hardware without sacrificing performance.
TRANSACTIONS
MySQL facilitates ACID transactions like any other relational database management system:
• Atomicity
• Consistency
• Isolation
• Durability
On the other hand, Cassandra has certain limitations in providing ACID transactions. Cassandra can achieve strong consistency only by restricting data duplication, but that would hurt Cassandra's availability. So, systems that require strict ACID transactions are generally better served by a relational database than by a NoSQL database.
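The keyspace and table queried in the procedure below are not created anywhere in this manual; a minimal sketch of definitions that the later cqlsh statements would work against (the replication settings are illustrative) is:
cqlsh> CREATE KEYSPACE IF NOT EXISTS learn_cassandra
       WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE TABLE IF NOT EXISTS learn_cassandra.todo_by_user_email (
         user_email text,
         creation_date timestamp,
         name text,
         PRIMARY KEY ((user_email), creation_date)
       );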
Procedure:
cqlsh>
-- Illustrative CQL equivalent of the record being inserted
-- (assumes an employee table with these columns exists in the current keyspace):
INSERT INTO employee (empid, firstname, lastname, gender)
VALUES ('1', 'FN', 'LN', 'M');
cqlsh>
SELECT TTL(name) FROM learn_cassandra.todo_by_user_email WHERE
user_email='john@email.com';
ttl(name)
43
(1 rows)
cqlsh>
SELECT * FROM learn_cassandra.todo_by_user_email WHERE
user_email='john@email.com';
(0 rows)
cqlsh>
INSERT INTO learn_cassandra.todo_by_user_email (user_email, creation_date, name)
VALUES('john@email.com', '2021-03-14 16:07:19.622+0000', 'Insert query');
cqlsh>
UPDATE learn_cassandra.todo_by_user_email SET
name = 'Update query'
WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14
16:10:19.622+0000';
(2 rows)
Let’s only update if an entry already exists, by using IF EXISTS:
cqlsh>
UPDATE learn_cassandra.todo_by_user_email SET
name = 'Update query with LWT'
WHERE user_email = 'john@email.com' AND creation_date = '2021-03-14
16:07:19.622+0000' IF EXISTS;
[applied]
True
cqlsh>
INSERT INTO learn_cassandra.todo_by_user_email (user_email,creation_date,name)
VALUES('john@email.com', toTimestamp(now()), 'Yet another entry') IF NOT EXISTS;
[applied]
True
Experiment 6(a): Using Power Pivot (Excel), perform big data analytics on any data set
Aim: To perform the big data analytics using power pivot in Excel
Theory: Power Pivot is an Excel add-in you can use to perform powerful data analysis and create
sophisticated data models. With Power Pivot, you can mash up large volumes of data from various
sources, perform information analysis rapidly, and share insights easily.
In both Excel and in Power Pivot, you can create a Data Model, a collection of tables with
relationships. The data model you see in a workbook in Excel is the same data model you see in
the Power Pivot window. Any data you import into Excel is available in Power Pivot, and vice
versa.
Procedure:
1. Open Microsoft Excel, go to the Data menu and click Get Data.
2. Import the Twitter data set and click the Load To button.
3. Click Diagram View and define the relationships between the tables.
4. Go to the Insert menu and click PivotTable.
5. Select the columns; you can perform drill-down and roll-up operations using the pivot table.
Power Pivot can also load ten million or more rows of data from multiple sources.
Experiment 6(b): Using Power Pivot (Excel), perform big data charting on any data set
Aim: To create a variety of charts using Excel for the given data
Resources: Microsoft Excel
Theory:
When your data sets are big, you can use Excel Power Pivot, which can handle hundreds of millions of rows of data. The data can be in external data sources, and Excel Power Pivot builds a Data Model that works in a memory-optimized mode. You can perform calculations, analyze the data and arrive at a report to draw conclusions and decisions. The report can be either a Power PivotTable or a Power PivotChart or a combination of both.
You can utilize Power Pivot as an ad hoc reporting and analytics solution. Thus, a person with hands-on experience of Excel can perform high-end data analysis and decision making in a matter of a few minutes, and the results are a great asset to be included in dashboards.
Click the OK button. A new worksheet is created in the Excel window and an empty Power PivotTable appears. As you can observe, the layout of the Power PivotTable is similar to that of a PivotTable.
The PivotTable Fields List appears on the right side of the worksheet. Here, you will find some differences from a PivotTable. The Power PivotTable Fields list has two tabs − ACTIVE and ALL − that appear below the title and above the fields list. The ALL tab is highlighted. The ALL tab displays all the data tables in the Data Model and the ACTIVE tab displays all the data tables that are chosen for the Power PivotTable at hand.
• Click the table names in the PivotTable Fields list under ALL.
The corresponding fields with check boxes will appear.
• Each table name will have the symbol on the left side.
• If you place the cursor on this symbol, the Data Source and the Model Table Name of that data table
will be displayed.
Suppose you want to create a Power PivotChart based on the following Data Model.
• Click on the Home tab on the Ribbon in the Power Pivot window.
• Click on PivotTable.
• Click on PivotChart in the dropdown list.
As you can observe, all the tables in the data model are displayed in the PivotChart Fields list.
Note that display of Field Buttons and/or Legend depends on the context of the PivotChart. You need to
decide what is required to be displayed.
As in the case of Power PivotTable, Power PivotChart Fields list also contains two tabs − ACTIVE and ALL.
Further, there are 4 areas −
• AXIS (Categories)
• LEGEND (Series)
• ∑ VALUES
• FILTERS
As you can observe, Legend gets populated with ∑ Values. Further, Field Buttons get added to the
PivotChart for the ease of filtering the data that is being displayed. You can click on the arrow on a Field
Button and select/deselect values to be displayed in the Power PivotChart.
Consider the following Data Model in Power Pivot that we will use for illustrations −
You can have the following Table and Chart Combinations in Power Pivot.
• Chart and Table (Horizontal) - you can create a Power PivotChart and a Power PivotTable, one next
to another horizontally in the same worksheet.
• Chart and Table (Vertical) - you can create a Power PivotChart and a Power PivotTable, one below the other vertically in the same worksheet.
These combinations and some more are available in the dropdown list that appears when you click on
PivotTable on the Ribbon in the Power Pivot window.
Click on the PivotChart and you can develop a wide variety of charts.
Output:
Experiment 7: Using R-Project to carry out statistical analysis of big data
Procedure:
Step 1: Go to https://posit.co/download/rstudio-desktop/#download
Step 2: wget -c https://download1.rstudio.org/desktop/jammy/amd64/rstudio-2022.07.2-576-amd64.deb
Step 3: Install the downloaded package (for example, sudo dpkg -i rstudio-2022.07.2-576-amd64.deb)
Step 4: rstudio (launch RStudio)
procedure:
-->install.packages("gapminder")
-->library(gapminder)
-->data(gapminder)
output:
A tibble: 1,704 × 6
-->summary(gapminder)
summary(gapminder)
output:
(Other) 1632
-->x<-mean(gapminder$gdpPercap)
-->x
output:[1] 7215.327
-->attach(gapminder)
-->median(pop)
output:[1] 7023596
-->hist(lifeExp)
-->boxplot(lifeExp)
will plot the below images
-->plot(lifeExp ~ gdpPercap)
-->install.packages("dplyr")
-->gapminder %>%
+ filter(year == 2007) %>%
+ group_by(continent) %>%
+ summarise(lifeExp = median(lifeExp))
output:
# A tibble: 5 × 2
continent lifeExp
<fct> <dbl>
1 Africa 52.9
2 Americas 72.9
3 Asia 72.4
4 Europe 78.6
5 Oceania 80.7
-->install.packages("ggplot2")
--> library("ggplot2")
-->ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
output:
-->head(country_colors, 4)
output:
Nigeria Egypt Ethiopia
"#7F3B08" "#833D07" "#873F07"
Congo, Dem. Rep.
"#8B4107"
-->head(continent_colors)
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Data_Cars <- mtcars
median(Data_Cars$wt)
[1] 3.325
names(sort(-table(Data_Cars$wt)))[1]
A simple example of regression is predicting weight of a person when his height is known.
To do this we need to have the relationship between height and weight of a person.
The steps to create the relationship is −
• Carry out the experiment of gathering a sample of observed values of height
and corresponding weight.
• Create a relationship model using the lm() functions in R.
• Find the coefficients from the model created and create the mathematical equation using these coefficients.
• Get a summary of the relationship model to know the average error in prediction, also called the residuals.
• To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response vari-
able.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
• formula is a symbol presenting the relation between x and y.
• data is the vector on which the formula will be applied.
Create Relationship Model & get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(relation)
Result:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
To get the summary of the relationship:
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(summary(relation))
Result:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
• object is the formula which is already created using the lm() function.
• newdata is the vector containing the new value for predictor variable.
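predict() is not demonstrated in this manual; a minimal sketch that continues the height/weight model fitted above (the new height value of 170 is only illustrative):
# Refit the model on the sample data used above.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Predict the weight of a new person who is 170 cm tall.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
# The prediction is about 76.2, i.e. -38.4551 + 0.6746 * 170.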
Experiment 8: Use R-Project for data visualization of social media data
Theory:
Data visualization is the technique used to deliver insights from data using visual cues such as graphs, charts, maps, and many others. This is useful as it helps in an intuitive and easy understanding of large quantities of data, and thereby in making better decisions regarding it.
Data Visualization in R Programming Language
The popular data visualization tools that are available are Tableau, Plotly, R, Google Charts,
Infogram, and Kibana. The various data visualization platforms have different capabilities,
functionality, and use cases. They also require a different skill set. This article discusses the
use of R for data visualization.
R is a language that is designed for statistical computing, graphical data analysis, and scientific research. It is usually preferred for data visualization as it offers flexibility and requires minimal coding through its packages.
Types of Data Visualizations
Some of the various types of visualizations offered by R are:
Bar Plot
There are two types of bar plots, horizontal and vertical, which represent data points as horizontal or vertical bars of lengths proportional to the value of the data item. They are generally used for plotting continuous and categorical variables. By setting the horiz parameter to TRUE or FALSE, we get a horizontal or vertical bar plot respectively; see the sketch below.
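A minimal sketch of a bar plot (the category counts are made-up illustration data):
# Heights of the bars and their labels.
counts <- c(12, 7, 9, 4)
names(counts) <- c("A", "B", "C", "D")
# Vertical bar plot; pass horiz = TRUE for the horizontal version.
barplot(counts, main = "Sample Bar Plot", xlab = "Category", ylab = "Count", horiz = FALSE)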
Histogram
A histogram is like a bar chart as it uses bars of varying height to represent data distribution.
However, in a histogram values are grouped into consecutive intervals called bins. In a
Histogram, continuous values are grouped and displayed in these bins whose size can be varied.
For a histogram, the parameter xlim can be used to specify the interval within which all
values are to be displayed.
Another parameter, freq, when set to TRUE denotes the frequency of the various values in the histogram, and when set to FALSE represents probability densities on the y-axis such that the total area of the histogram adds up to one; a sketch follows the list below.
Histograms are used in the following scenarios:
• To verify an equal and symmetric distribution of the data.
• To identify deviations from expected values.
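A minimal sketch of a histogram (the values are made up); xlim limits the displayed interval and freq = FALSE switches the y-axis to probability densities:
values <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 9)
hist(values, xlim = c(0, 10), freq = FALSE, main = "Sample Histogram", xlab = "Value")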
Box Plot
The statistical summary of the given data is presented graphically using a boxplot. A boxplot
depicts information like the minimum and maximum data point, the median value, first and
third quartile, and interquartile range.
Box Plots are used for:
• To give a comprehensive statistical description of the data through a visual cue.
• To identify the outlier points that do not lie in the inter-quartile range of data.
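A minimal sketch using the built-in mtcars data set (also printed in Experiment 7):
# Box plot of miles-per-gallon grouped by number of cylinders.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon", main = "Sample Box Plot")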
Scatter Plot
A scatter plot is composed of many points on a Cartesian plane. Each point denotes the value
taken by two parameters and helps us easily identify the relationship between them.
Scatter Plots are used in the following scenarios:
• To show whether an association exists between bivariate data.
• To measure the strength and direction of such a relationship.
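A minimal sketch, again using the built-in mtcars data set:
# Scatter plot of car weight against miles-per-gallon.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon", main = "Sample Scatter Plot")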
Heat Map
A heat map is a graphical representation of data that uses colors to visualize the values of a matrix. The heatmap() function is used to plot a heat map.
Syntax: heatmap(data)
Parameters: data: It represent matrix data, such as values of rows and columns
Return: This function draws a heatmap.
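A minimal sketch using the built-in mtcars data set:
# Heat map of the mtcars matrix; scale = "column" normalizes each column
# so that variables measured on different scales are comparable.
heatmap(as.matrix(mtcars), scale = "column", main = "Sample Heat Map")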
Procedure:
Step 1: Facebook Developer Registration
Go to https://developers.facebook.com and register yourself by clicking on the Get Started button at the top right of the page (see the snapshot below). It will then open a registration form, which you need to fill in to get yourself registered.
Step 2: Click on Tools and install the required R packages:
install.packages("httpuv")
install.packages("Rfacebook")
install.packages("RcolorBrewer")
install.packages("Rcurl")
install.packages("rjson")
install.packages("httr")
library(Rfacebook)
library(httpuv)
library(RcolorBrewer)
acess_token="EAATgfMOrIRoBAOR9XUl3VGzbLMuWGb9FqGkTK3PFBuRyUVZA
WAL7ZBw0xN3AijCsPiZBylucovck4YUhUfkWLMZBo640k2ZAupKgsaKog9736lec
P8E52qkl5de8M963oKG8KOCVUXqqLiRcI7yIbEONeQt0eyLI6LdoeZA65Hyxf8so1
UMbywAdZCZAQBpNiZAPPj7G3UX5jZAvUpRLZCQ5SIG"
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
me <- getUsers("me", token = acess_token)
View(me)
myFriends <- getFriends(acess_token, simplify = FALSE)
table(myFriends$gender)
pie(table(myFriends$gender))
Output: