About this ebook
This book has three components:
1. An overview of what data science is and how it relates to other disciplines
2. Technical applications of machine learning algorithms for discovery and prediction
3. Practical R programming exercises for practicing and aspiring data scientists
What This Book Covers:
The book explains why data science is important, drawing on relevant examples from different domains, and introduces the underlying statistical and machine learning concepts. Building on basic statistical and mathematical concepts, it then walks through basic R commands so that the reader gains hands-on experience with R. Case studies are another important part of the book. Some have a statistical or machine learning flavor, some lean toward business, decision science or operations research, and some lean toward data engineering.
“The book serves as a good introductory framework for data science. It covers the basic concepts related to data science in a simple and lucid manner that will help the reader absorb the concepts easily. The reader can also practice the examples using R. Presentation of basic R commands will help the reader to start experimenting with R. Overall the book presents a good introduction to data science and its applications.”
Dr. D. V. Srinivas Kumar,
Assistant Professor,
School of Management Studies,
University of Hyderabad.
Contents:
1. Data Science: Key Concepts
2. Spotting Signals: An Overview
3. Problem based Analysis
4. Bivariate Analysis
5. Visual Constructs
6. Business Story Telling using R
7. Exploratory Data Analysis Case Study
8. Machine Learning in Action
9. Regression
10. Dimensionality Reduction Technique
About the Author:
Before taking on the assignment to write this book, Prema Alla trained professionals and undertook consultancy work, working closely with AR Solutions Inc, 3 Executive Drive, Suite 351, Somerset, NJ 08873. I wish to thank Derick Jose, who guided and mentored me through the whole process of writing this book.
Book preview
Introduction to Data Science Using R - Prema Alla
CHAPTER 1
Data Science: Key Concepts
In this chapter we look at five disruptions that data science has caused in the marketplace. Once this context and its importance are understood, it becomes easier to show what data science actually is. We also compare traditional architecture with data science architecture and explain the importance of signal detection, which is the subject of Chapter 2. The machine learning techniques that support signal detection are studied from Chapter 8 onwards, although a few machine learning concepts are introduced in this chapter. Finally, the chapter discusses solution architecture and the three critical components required for any solution.
FIVE DISRUPTIVE PRODUCTS
We begin with five disruptive products that have been launched in the marketplace:
1. A very simple Japanese App
2. Healthcare App
3. Coursera
4. Sensory device in Agriculture Sector
5. Autonomous Car
THE JAPANESE APP
The first is a very simple Japanese app that helps two people discover each other. Every user answers a set of questions. The answers produce characteristic scores that capture, for example, whether the person likes music or books, and their views on philosophy, religion and so on. Whatever the parameters are, each person receives a score for every question answered.
The other input the app uses is location. As you walk down the street carrying your device, the app tells you how many people with similar scores are within a 1 km radius. This lets strangers find one another, meet for coffee, chat, or get to know one another better. Similarity score and location together are what enable people to discover one another.
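The matching idea described here (similar scores plus physical proximity) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the user records, the cosine-similarity measure and the 0.9 similarity threshold are invented for the example and are not taken from the app itself.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def cosine_similarity(a, b):
    """Similarity of two answer-score vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearby_matches(me, others, radius_km=1.0, min_similarity=0.9):
    """Return names of users who are both close by and similar in their scores."""
    matches = []
    for other in others:
        distance = haversine_km(me["lat"], me["lon"], other["lat"], other["lon"])
        similarity = cosine_similarity(me["scores"], other["scores"])
        if distance <= radius_km and similarity >= min_similarity:
            matches.append(other["name"])
    return matches

# Hypothetical users: scores are answers about music, books, philosophy, religion.
me = {"name": "me", "lat": 35.6595, "lon": 139.7005, "scores": [5, 4, 1, 3]}
others = [
    {"name": "A", "lat": 35.6620, "lon": 139.7030, "scores": [5, 5, 1, 2]},  # near, similar
    {"name": "B", "lat": 35.6600, "lon": 139.7010, "scores": [1, 1, 5, 5]},  # near, dissimilar
    {"name": "C", "lat": 35.7000, "lon": 139.8000, "scores": [5, 4, 1, 3]},  # similar, too far
]
matches = nearby_matches(me, others)
print(matches)
```

Only user A passes both filters, which mirrors the point of the text: neither similarity nor location is enough on its own.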
Disruption: The app capitalized on today’s social norm of casual meetups and changed the way people find others with similar tastes and interests. It uses data to find patterns and clusters in a very large set of entries and presents them to users in a meaningful way, in this case as the ‘right match’. In short, it turns data into insights.
FIGURE 1.1 Japanese dating app
THE HEALTHCARE APP
The second is in the healthcare space. A heart implant communicates information such as heart rate and the condition of the heart to your mobile phone in real time. The mobile app also relays this information remotely to the doctor.
Disruption: Fewer visits to the clinic and lower non-medical costs. Organ health is monitored continuously rather than captured once during a visit to the physician. This makes it possible to track patterns, which raises the chance of identifying an anomaly and acting early.
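To make the idea of spotting an anomaly in a continuous stream concrete, here is a minimal Python sketch. It assumes a simple rule, flagging a reading that lies more than three standard deviations from the mean of the previous five readings. The readings, the window size and the threshold are all invented for illustration and are not clinical guidance.

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    """Flag readings that deviate strongly from the preceding `window` readings."""
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Skip perfectly flat baselines to avoid dividing by zero.
        if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
            anomalies.append((i, readings[i]))
    return anomalies

# Hypothetical heart-rate samples in beats per minute.
heart_rate = [72, 74, 71, 73, 72, 75, 73, 118, 74, 72]
anomalies = flag_anomalies(heart_rate)
print(anomalies)
```

A single clinic visit would capture one of these numbers. The continuous stream is what provides a baseline against which the spike at position 7 stands out.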
FIGURE 1.2 Heart implants
COURSERA
The third disruptive product is Coursera, an online educational platform where one can take many kinds of courses for free. The platform hosts a large number of educational videos and tutorials. When students watch these videos, it is possible to pinpoint the places where they pause or stop. Noting these jump and exit points helps course designers work out how to re-orchestrate the content to make it more engaging.
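Finding the places where students pause or leave amounts to counting events per stretch of video. The Python sketch below assumes a hypothetical event log of (event type, second in the video) pairs and a 30-second bucket size. Neither reflects Coursera’s actual data format.

```python
from collections import Counter

def drop_off_hotspots(events, bucket_seconds=30, top_n=2):
    """Count pause and exit events per time bucket and return the busiest buckets."""
    counts = Counter()
    for event_type, second in events:
        if event_type in ("pause", "exit"):
            bucket_start = (second // bucket_seconds) * bucket_seconds
            counts[bucket_start] += 1
    return counts.most_common(top_n)

# Hypothetical viewing events collected across many students.
events = [
    ("pause", 95), ("exit", 100), ("pause", 110), ("play", 112),
    ("pause", 305), ("exit", 310), ("pause", 40),
]
hotspots = drop_off_hotspots(events)
print(hotspots)
```

Here the stretch starting at second 90 attracts the most pauses and exits, so that is the segment a course designer would review first.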
Disruption: While MOOCs expanded access to education by overcoming the lack of infrastructure and resources, Coursera aimed to continuously improve the quality of its content by collecting data on the focus and topics of interest of thousands of students across the world. By redesigning the user experience and fine-tuning content, Coursera disrupted the way online education had been delivered by predecessors such as Khan Academy and MIT OCW.
FIGURE 1.3 MOOC
SENSORY DEVICE IN AGRICULTURE SECTOR
The fourth disruptive product is in the agriculture sector. Agriculture is a big part of the economy of the Netherlands, which is renowned for its cheese and butter. One of the problems farmers face there is understanding the health of cows that are carrying calves. They now attach a sensory device to the cow’s ear, through which they can monitor the cow’s health remotely via satellite.
Disruption: Sensors let livestock farmers monitor cattle health and act immediately if an animal is unwell. This helps detect disease in time and, through prediction, helps prevent it from spreading to other cows.
FIGURE 1.4 Sensor-equipped cows in the Netherlands
AUTONOMOUS CAR
Lastly, the autonomous car. An autonomous car is special in that it moves without a driver. Its sensing system tracks and scans the car’s surroundings at high speed. It has the intelligence to process all kinds of real-time information and communicate the result back to the steering wheel.
Disruption: By processing data from images and supplementary sensors, self-driving cars create a virtual world through which they navigate. By reacting much faster than a human driver can, they aim to eliminate accidents and traffic congestion caused by human error. The goal is a significant improvement in time and fuel efficiency while saving lives.
FIGURE 1.5 Google’s autonomous car
A look at these five examples shows one thing common to all of them: a data product working behind the scenes, humming away quietly. Creating a data product requires a data science process that unearths patterns from the data and builds them into a bigger product. The five examples come from everyday life: how our health is taken care of, how we learn, how we fall in love, how we farm and how we drive. All of these are increasingly touched by data products. Data science needs to be an integral part of any organization; otherwise there is a very high probability that the organization will lose its place in the market.
One of the biggest secrets of winners is that they see patterns faster. What most companies aim for today is a core team that uses data science techniques to process all their structured and unstructured data, looks for patterns in it, and acts on them in real time.
DATA SCIENCE Vs TRADITIONAL METHODS
The situation is similar to an iceberg floating on water: most organizations see only the tip. For example, they know how much they are selling but fail to understand what is driving sales. If promotions change by 5%, what is the expected growth in sales? There are many such unanswered questions.
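A question such as the promotion one is typically approached by fitting a model to past data. The Python sketch below fits a straight line by ordinary least squares and reads off the expected sales growth for a 5% change in promotions. The five data points are invented purely for illustration, and a straight line is only one possible modeling assumption.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: change in promotions (%) and observed sales growth (%).
promo_change_pct = [0, 2, 4, 6, 8]
sales_growth_pct = [1.0, 2.1, 2.9, 4.2, 5.0]

slope, intercept = fit_line(promo_change_pct, sales_growth_pct)
expected_growth = intercept + slope * 5  # expected growth for a 5% promotion change
print(round(slope, 3), round(intercept, 3), round(expected_growth, 3))
```

The fitted slope summarizes how sales growth has moved with promotions in this toy history. Whether promotions actually drive sales is a separate question that the fit alone cannot settle.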
Most organizations have tons of data on sales and finance, as well as call centre data and reports, typically delivered on Business Objects, Cognos, and Microsoft Analysis Services. These reports quickly answer a few important basic questions, such as which call centre agent has the best turnaround time. Data science inserts an analytical modeling process, with specific techniques such as segmentation, scoring models and text-mining models, that processes the data and provides a different lens. This enables one to see patterns in the data.
DIFFERENCES IN ARCHITECTURE
Figure 1.6 compares the architecture of traditional companies with that of new-age companies. Both have a data repository and a dashboard; they differ in four layers. New-age companies place a machine learning process (text mining, collaborative filtering) between the data repository and the dashboards, and this changes the game. This layer detects what is called a signal. A signal is nothing but a pattern; once a pattern is detected, an action follows, and a close watch is kept on that action. This is a simplified view of the data science architecture.
FIGURE 1.6 Four core differences between data science and dashboards
DEMYSTIFYING MACHINE LEARNING
The goal of a data scientist is to use data to discover signals that cause changes and ultimately have an impact on the revenue of the firm. Even for a data scientist, it is humanly impossible to analyze big data unaided. With the aid of a computer the task becomes feasible, yet a computer can only compute what has been programmed into it. So how do data scientists cope when the analysis requires the computer to pick up the ‘trends’ on its own? This is where machine learning comes in.
Machine learning is an application of artificial intelligence that enables computing systems to perform tasks through a process of self-learning, without being specifically programmed for them. As data scientists cannot pinpoint exactly what sorts of patterns the computer should recognize, machine learning comes in extremely handy. It allows the computer to adapt automatically to new patterns and signals in data while learning from previous trends and computations. When Google’s search bar offers “autocomplete” suggestions as you type your query, that is an example of machine learning: the Google server has learnt to give you ‘predictions’ of what you might want to search for, based on your previous search history.
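The autocomplete example can be imitated with a toy model that learns from past queries instead of following hand-written rules. The Python sketch below simply counts how often each past query occurred and suggests the most frequent ones that start with what has been typed. It is a deliberately simplified assumption for illustration and says nothing about how Google’s system actually works.

```python
from collections import Counter

def build_model(past_queries):
    """'Learn' from history by counting how often each query was issued."""
    return Counter(query.lower() for query in past_queries)

def suggest(model, prefix, k=3):
    """Return up to k past queries starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    candidates = [(query, count) for query, count in model.items() if query.startswith(prefix)]
    # Sort by descending frequency, then alphabetically to break ties.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in candidates[:k]]

# Hypothetical search history.
history = [
    "data science", "data science jobs", "data science",
    "database design", "r programming", "data science jobs", "data science",
]
model = build_model(history)
suggestions = suggest(model, "data")
print(suggestions)
```

Nothing in the code mentions any particular query: the suggestions change as the history changes, which is the sense in which the behaviour is learnt and not programmed.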
We will now familiarize ourselves with five techniques.
TECHNIQUE 1: SEGMENTATION
This process involves breaking data into chunks based on shared characteristics. The analyst then picks the clusters through an iterative process, looking for what makes each segment unique. We could segment customers by demographics, needs, behavior and so on. The statistical techniques used for segmentation are K-means, hierarchical clustering and discriminant analysis, as shown in Figure 1.7.
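As a rough illustration of the iterative process, here is a minimal K-means sketch in plain Python. The customer records (monthly spend, number of visits) are invented, and starting from the first k points as centroids is a simplifying assumption made so that the result is reproducible. Practical implementations usually choose starting points more carefully and scale the features first.

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [list(p) for p in points[:k]]  # simplistic deterministic start
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical customers: (monthly spend, visits per month).
customers = [(20, 2), (200, 15), (25, 3), (22, 2), (210, 14), (190, 16)]
assignments, centroids = kmeans(customers, k=2)
print(assignments)
print(centroids)
```

The two resulting groups, low-spend occasional customers and high-spend frequent ones, are the kind of segments an analyst would then inspect and label.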
Some business questions that are answered by segmentation are:
• Which behavioral personas lie buried in the raw customer transactions in my database? This is explained in Figure 1.8.
• Which specific customer behavior discriminates a high-value segment from a low-value segment? This is explained in Figure 1.9.
• How do customer behavior segments migrate over time, and what does that reveal to us? This is explained in Figures 1.10 and 1.11.
FIGURE 1.7 A real-life customer segmentation case study
FIGURE 1.8 Behavioral components considered for fleet card segmentation
FIGURE 1.9 Dimensions of fleet behavior measured and segmented
FIGURE 1.10 Cash cow - segment profile
FIGURE 1.11 Cash cow - behavior portrait and target action
Segmentation in the BANKING Industry
To give the right offer and product to the right customer, and to do so efficiently, you need a segmentation method. In banking, for example, we could segment customers into five clusters and study the line of credit, pricing and campaign intervention for each segment, as seen in Figure 1.12.
Clustering
Clustering is considered the most important unsupervised learning problem. In simple language, cluster analysis means dividing data into different clusters or groups.
FIGURE 1.12 Segmentation in banking industry
The greater the similarity within a group, the better the cluster. The greater the dissimilarity between groups, the more distinct the clusters. One clustering technique is the k-means technique. This technique