
UNIT-3

Data Science in Education, Social Media, Text Vectorization
Data Science in Education
•Data science has a massive impact across many
facets of education — from customized learning
experiences and more efficient administrative
programs to augmenting educators in their daily
teaching practices.
•Data, analytics, and artificial intelligence (AI)
have become increasingly linked to the impact of
the pandemic on the educational experience.
•Demand for virtual learning platforms and more
personalized, interactive, AI-fueled learning
tools is shaping the education market,
contributing to innovative learning and tutoring
techniques and to e-learning platform
development.
•The AI in education market already surpassed $1
billion in 2020 and is anticipated to grow at a
CAGR of more than 40 percent between 2021
and 2027, reaching approximately $20 billion.
Big data analytics and innovative teaching
process
• Big Data in education allows
you to rethink approaches, close
long-standing gaps, and adapt the
learning experience to improve the
efficiency of the system all around.
• It also remains relevant to describe
techniques for working with large data
arrays, aimed at identifying established
patterns in education and then modeling
how the system will develop.
e-Learning Analytics
E-learning analytics comprise the use of data science
techniques over data coming from multiple sources. In a
typical online education environment, the most common
data sources are:
•Learning Management System (LMS) activity
records describing the users’ interaction with the online
platform and training content.
•Learning Management System (LMS) performance
records describing the users’ results over the proposed
evaluation tests.
•User profile information, especially those
characteristics that could impact the way students learn.
There are two main types of analytics:
• Descriptive: Provide insights about the past and support
decisions aimed at impacting future learning processes.
• Predictive: Make predictions about elements and
variables that could impact ongoing learning processes.
This type of analytics allows educators to take proactive
actions.
Educational Data Mining (EDM)

This model describes the most relevant questions you should
ask yourself in order to set the ground for a successful project:
•What? What kind of data does the system gather and
manage, and what data is available for analysis?
•Who? Who is targeted by the
analysis?
•Why? What are our objectives
(educational and business)?
•How? How does the system perform
the analysis of the collected data?
Ideal data framework (diagram). An ideal framework draws on:
•administrative (system-wide) data;
•forecast input data;
•info about course content, course outcomes, and objectives;
•info about student interactions with learning systems (e-books, online courses, etc.);
•personal info.
Uses of Data Science in Education
• Modern education data science allows for
the usage of numerous techniques,
algorithms, and other applications of
theoretical knowledge into the practical
world.
• It can aid educators in providing the best
possible learning experience. So, let's take a
look at the use of data science in schools,
colleges, and universities.
Use Cases of Data Science in Education
•Improve Adaptive Learning
•Improved Performance of educators and
Students
•Improved Social Skills
•Improvement in the Curriculum
•Better Parent Involvement
•Better Organization
•Student Recruitment
Practical Applications of Data Science in Education
• Data science, data analytics, AI, cloud, and IoT are currently being
leveraged to improve upon and personalize the education
process and experience at both the K-12 and higher education
levels.

Enhance Knowledge Retention
• IoT is one of the most adaptable technologies for the
modern learning experience because IoT-enabled
devices provide an accessible vehicle for
communication, educational materials and resources,
and supportive visualizations that improve knowledge
retention and understanding.
Anticipate Student Graduation and Dropout Rates
• Higher education institutions are increasingly leveraging
data science in education and machine learning solutions
to predict scenarios such as which students are most
likely to enroll, graduate, and be ready for a career in their
chosen area of study.
• These capabilities also help educational providers track
patterns in student dropout rates and the corresponding
demographic and educational factors to predict potential
future dropouts so they can proactively intervene and
allocate resources to prevent it.
Deeper Understanding of Student Progress
• Advanced analytics, including AI, is also being used to gain
insights into academic performance so that teachers, faculty,
parents, and students can better understand how a student is
responding to certain tests.
Accessible Education for Dispersed Student Base
• Additionally, many students have been working remotely
during the pandemic, and virtual classrooms and cloud-based
e-learning platforms, along with customized applications, are
becoming the norm as educators strive to deliver high-quality
learning programs for a dispersed student base.
• Augmented and virtual reality are similarly providing exciting
simulations and gamification to promote a more immersive remote
learning experience.
Educational Chatbots
• Intelligent chatbots are being used by schools to help address
pervasive absenteeism.
• For example, an AI-driven two-way text messaging system was
developed to help children who frequently miss classes by
enabling teachers to touch base with the student’s family.
Molding the Future Workforce
• While education institutions are investing in new technologies and data
science in education to address current challenges and objectives,
they’re also keeping the future in mind, particularly the future of the
workforce.
• Data science disciplines, including data analytics, are vital to these
institutions as they respond to changing economic and social
conditions, evolving technologies and disruption, and a new era of work.
Data Science in Education Case Studies
1. Data Science Case Study – Georgia State University
• Georgia State University (GSU) in the United States, has used various Data Science and
Machine Learning tools for mining insights from the student data.
• Making the right decisions based on this analysis paid off remarkably: the
graduation rate grew from 32% to 54% between 2003 and 2014.
• They also utilized the student data for dealing with the issues of student retention and
course completion.

2. Data Science Case Study – Arizona State University
• Another case is Arizona State University (ASU), which is considered to be one of
the top American universities.
• The mathematics department of ASU has developed a system called "Adaptive
Learning" based on the analysis of student data, so that it can take suitable
actions to improve student performance.
• After the implementation of this system, there was a considerable improvement
in the success rate of students, and the dropout rate decreased by around 5.4%.
What is social media analytics?
• Social media analytics is
the ability to gather data
from social channels, find
meaning in it to support
business decisions, and
measure the performance
of actions based on those
decisions through social
media.
3 types of social media analytics
• There are three main types of social media analytics: social
listening analytics, sentiment analysis, and influencer analysis.
1.Social listening analytics is the process of tracking and
monitoring what is being said about a brand or product on social media.
2.Sentiment analysis goes a step further. It determines whether the
sentiment around a brand or product is positive, negative, or neutral.
3.Influencer analysis looks at which individuals have the most
influence over others when it comes to a particular topic or brand.
Social media analytics tools
• Social media analytics tools are
software designed to derive and
interpret the above-mentioned
data gathered from social
platforms to support your
business goals.
• With these tools, you can
monitor and track your
engagement levels, spot trends,
and see the ROI of your
social media efforts.
• Increased Reach
As a business owner, you are always looking for ways to increase your reach and social
media is a great way to do that.
• Competitive Edge
The ability to analyze competitors is a significant additional benefit of social media
analytics tools.
• Better Product Management
Customers frequently post reviews and comments about products on social media. Brands
can use media monitoring strategies to examine this input.
• Improved Engagement
Business models are becoming more experience-driven. Social media analytics can
enhance customer experience and provide new insights into what people want from a
company.
• Greater Insights
Social media analytics tools provide the greatest advantage when it comes to
optimizing your digital marketing initiatives.
• Competitor Analysis
The process of assessing your competitors on social media in order to identify opportunities and develop strategies for
brand expansion is known as social media competitor analysis.
• Follower Analysis
There are a number of reasons why it’s important to use social media monitoring to understand the difference between
active and inactive followers.

Active Followers:
• more likely to be engaged with the brand
• more likely to be potential customers or clients
• more likely to see your content in their social media feeds
• more likely to remember your brand, which means your brand is top of mind

Inactive Followers:
• less likely to be engaged with the brand
• less likely to be potential customers or clients
• less likely to see your content in their social media feeds
• less likely to remember your brand, which means your brand is not top of mind
• Content Analysis
One of the key aspects of social media engagement is the ability to track and measure the performance of your
content. This helps understand what is working well and what could be improved.
Influencer identification
• Influencer identification is an important part of
influencer marketing.
• It involves finding the right influencers to work with
to promote your brand or product.
• It is a great way to reach a larger audience and
build relationships with influencers who can help
you reach your goals.
Attrition Analysis
• Customer attrition is the loss of customers by a
business.
• Most customers of a given business will not
remain active customers indefinitely.
• Whether a one-time purchaser or a loyal
customer over many years, every customer will
eventually cease his or her relationship with the
business.
Text Vectorization
What is Text vectorization?
• Vectorization is jargon for a classic approach of
converting input data from its raw format (i.e.
text) into vectors of real numbers, which is the
format that ML models support. This approach
has been around since computers were first
built, has worked wonderfully across various
domains, and is now used in NLP.
• In Machine Learning, vectorization is a step in
feature extraction. The idea is to get some distinct
features out of the text for the model to train on,
by converting text to numerical vectors.
• NLP (Natural language processing)
is a branch of artificial intelligence
that helps machines to understand,
interpret and manipulate human
language.
• Machine learning algorithms most
often take numeric feature vectors
as input. Thus, when working with
text documents, we need a way to
convert each document into a
numeric vector. This process is
known as text vectorization. In
much simpler words, the process of
converting words into numbers is
called Vectorization.
What is a Vector?
• A vector is a mathematical or geometrical
representation of a quantity.
• Consider the vector P = [2, 3, 4].
• This vector represents the point P in 3-dimensional
space.
“Text2Vector” Conversion in Machine
Learning
• In Natural Language Processing (NLP) in particular,
before feeding our raw (text) data to a machine/deep
learning algorithm, we convert the text to vector form
so that it can be processed. This is called featurization.
• Featurization is the process of converting varied forms of
data into numerical data that can be used by basic ML
algorithms. The data can be text, images, videos,
graphs, various database tables, time series, categorical
features, etc.
• Feature Engineering : Feature engineering is simply the
process of changing numerical features in such a way that
machine learning models can work efficiently.
Bag of Words (BoW) model
The core idea behind the Bag of Words (BoW)
representation is that any given piece of text can be
represented by a list of all its unique words after
stopword removal. In the BoW approach, the order of
words does not matter. For example, a short
financial message can be represented as such a bag
of words.
• So, by creating the “bags” we can represent each of
the messages in our training and test set.
• But the main question still remains: how can we build
a financial-message sentiment analyzer from these
"bag" representations?
• Now, suppose the bags for most of the positive-sentiment
messages contain words such
as significant, increased, rose, appreciated, improved,
etc., whereas negative-sentiment bags contain
words such as decreased, declined, lost, liquidated, etc.
BoW Example
Consider the following product review as a sample message:
"Customers like the value, connectivity, and
appearance of the television. They mention
that it's a branded budget TV, and easy to
connect with any network. Customers also
like the clarity, quality, and performance.
That said, opinions are mixed on sound
quality, and performance."
• So, whenever we get a new message, we look at its
"bag-of-words" representation.
• Does the bag for this message resemble the positive-sentiment
bags or not?
• Based on the answer to that question, we can classify
the message into the appropriate sentiment.
• Now, the next obvious question that comes to our mind is
how the machine will create the bags of words automatically
for sentiment classification.
• The answer is that we represent all the
bags in a matrix format, after which we can apply different
machine learning algorithms like naive Bayes, logistic
regression, support vector machines, etc., to do the final
classification.
Bag of Words Example
Now, let’s understand the BoW approach by the
following example. Let’s assume we have three
documents as given below:
Document 1: Machine learning uses historical data
to predict output values.
Document 2: It is seen as a part of artificial
intelligence.
Document 3: Machine learning programs can
perform tasks without being explicitly programmed.
So, in the BoW approach, each document is represented on a
separate row and each word of the vocabulary (post
stopwords removal) has its own column. These vocabulary
words are also known as features of the text.
Vocabulary (columns, left to right): artificial, programs, output, data, machine, intelligence, perform, explicitly, historical, predict, seen, part, learning, tasks, values, programmed, without, uses

D1: 0 0 1 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1
D2: 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0
D3: 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 1 1 0

The size of the above matrix is (3,18) where 3 denotes the number of rows i.e., the
number of documents, and 18 represents the unique number of words over all the
three documents excluding stopwords. These unique words represent columns
also known as vocabulary and it is actually used as a feature of the text.
• The above matrix was created by filling each cell
with the frequency of each word in the
documents Document 1, Document 2, and
Document 3 represented by D1, D2, and D3
respectively.
• The bag of words representation is also known
as the bag of words model, but it shouldn't be
confused with a machine learning model. A bag
of words model is just the matrix representation
of the frequency of each vocabulary word per document.
• It is important to note that the values inside the cells
can be filled in two ways:
1.We can either fill the cell with the frequency of a word
(values >=0)
2.We can fill the cell with either 0, in case the word is not
present or 1, in case the word is present also known
as the binary bag of words model.
• Out of the above two methods, the frequency
approach is more commonly used in practice and the
NLTK library in Python also fills the BoW model with
word frequencies instead of binary 0 or 1 values.
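As a minimal sketch of the two filling strategies, the snippet below builds both a frequency and a binary BoW matrix by hand for the three example documents (the lowercased texts and the small stop-word set are assumed here for illustration):

```python
from collections import Counter

docs = [
    "machine learning uses historical data to predict output values",
    "it is seen as a part of artificial intelligence",
    "machine learning programs can perform tasks without being explicitly programmed",
]
stopwords = {"to", "it", "is", "as", "a", "of", "can", "being"}  # assumed stop-word list

# Tokenize and remove stopwords
tokenized = [[w for w in d.split() if w not in stopwords] for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))  # 18 unique words

# Frequency BoW: each cell holds how often the word occurs in the document
freq_bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
# Binary BoW: each cell holds 1 if the word occurs at all, else 0
bin_bow = [[int(w in doc) for w in vocab] for doc in tokenized]
```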
Stemming
Stemming, in the realm of Natural Language Processing
(NLP), is quite like getting to the root of a plant, but
language-wise. It's a technique used for trimming down a
word to its root form, known as the 'stem'. The purpose of
this linguistic "pruning" is to bring varying forms of a word
down to their common base form.
PORTER STEMMING ALGORITHM – BASIC INTRO

The Porter Stemming algorithm (or Porter Stemmer) is
used to remove the suffixes from an English word and
obtain its stem, which becomes very useful in the field
of Information Retrieval (IR).
• The Porter stemmer is a suffix-stripping
algorithm.
• It uses predefined rules to convert words
into their root forms.
• In the code below, we import PorterStemmer()
from the NLTK library in Python.
• This module will help in removing the
suffixes of known English words.
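A minimal sketch, assuming NLTK is installed:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["multidimensional", "characterization", "improved", "connecting"]:
    # stem() applies the Porter suffix-stripping rules step by step
    print(word, "->", stemmer.stem(word))
# multidimensional -> multidimension
# characterization -> character
```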
PORTER STEMMING ALGORITHM – BASIC INTRO

https://www.youtube.com/watch?v=GQ1sXx8hH4k
https://www.youtube.com/watch?v=HHAilAC3cXw
Example Inputs
Let’s consider a few example inputs and check what will be their
stem outputs.
Example 1
In the first example, we input the word MULTIDIMENSIONAL to the
Porter Stemming algorithm. Let’s see what happens as the word goes
through steps 1 to 5.
• The suffix will not match any of the cases
found in steps 1, 2, or 3.
• Then it comes to step 4.
• The stem of the word has m > 1 (since m = 5)
and ends with "AL".
• Hence in step 4, "AL" is deleted (replaced with
null).
• Calling step 5 will not change the stem
further.
• Finally the output will be MULTIDIMENSION.
MULTIDIMENSIONAL → MULTIDIMENSION
Example 2
In the second example, we input the word
CHARACTERIZATION to the Porter Stemming
algorithm. Let’s see what happens as the word goes through
steps 1 to 5.
The suffix will not match any of the cases found in step 1.
So it will move to step 2.
The stem of the word has m > 0 (since m = 3) and ends with
“IZATION”.
Hence in step 2, “IZATION” will be replaced with “IZE”.
Then the new stem will be CHARACTERIZE.
Step 3 will not match any of the suffixes and hence will move to
step 4.
Now m > 1 (since m = 3) and the stem ends with “IZE”.
So in step 4, “IZE” will be deleted (replaced with null).
No change will happen to the stem in other steps.
Finally the output will be CHARACTER.
CHARACTERIZATION → CHARACTERIZE → CHARACTER
Generating Unigram, Bigram, Trigram and
Ngrams
What is an n-gram Model?
In natural language processing, an n-gram is a contiguous sequence of n
items generated from a given sample of text, where the items can be
characters or words and n can be any number like 1, 2, 3, etc.
For example, consider the line "Either my way or no way". The possible
n-gram models we can generate are:
• Unigrams (n=1): Either, my, way, or, no, way
• Bigrams (n=2): Either my, my way, way or, or no, no way
• Trigrams (n=3): Either my way, my way or, way or no, or no way
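The same n-grams can be generated programmatically; a minimal sketch using NLTK's ngrams helper (a plain Python loop over token windows would work just as well):

```python
from nltk.util import ngrams

tokens = "Either my way or no way".split()
for n in (1, 2, 3):
    # ngrams() yields tuples of n consecutive tokens
    print(n, [" ".join(g) for g in ngrams(tokens, n)])
# 1 ['Either', 'my', 'way', 'or', 'no', 'way']
# 2 ['Either my', 'my way', 'way or', 'or no', 'no way']
# 3 ['Either my way', 'my way or', 'way or no', 'or no way']
```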
TF-IDF
• TF-IDF or Term Frequency–Inverse Document
Frequency, is a numerical statistic that’s intended to
reflect how important a word is to a
document. Although it’s another frequency-based
method, it’s not as naive as Bag of Words.
How does TF-IDF improve over Bag
of Words?
• In Bag of Words, we witnessed how vectorization was
just concerned with the frequency of vocabulary words
in a given document. As a result, articles, prepositions,
and conjunctions which don’t contribute a lot to the
meaning get as much importance as, say, adjectives.
• TF-IDF helps us to overcome this issue. Words that
get repeated too often don’t overpower less frequent
but important words.
EXAMPLE
The initial step is to make a vocabulary of unique words and calculate
TF for each document: TF(w, d) = (number of times w appears in d) /
(total number of terms in d). TF will be higher for words that frequently
appear in a document and lower for rare words in a document.
Inverse Document Frequency (IDF)
• It is the measure of the importance of a word. Term frequency (TF)
does not consider the importance of words. Some words such as’ of’,
‘and’, etc. can be most frequently present but are of little significance.
IDF provides weightage to each word based on its frequency in the
corpus D.
• IDF of a word (w) is defined as IDF(w) = log(N / n_w), where N is the
total number of documents in the corpus and n_w is the number of
documents containing w.
In our example, since we have two documents in the corpus,
N=2.
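As a quick worked instance (using the natural log): with N = 2, a word that appears in both documents gets IDF = log(2/2) = 0, while a word that appears in only one document gets IDF = log(2/1) ≈ 0.69. Words spread across the whole corpus therefore contribute nothing to the TF-IDF score.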
Term Frequency — Inverse Document Frequency
(TFIDF)
It is the product of TF and IDF.
•TFIDF gives more weight to a word that is rare in the
corpus (across all the documents).
•TFIDF gives more importance to a word that is more
frequent in a document.
After applying TFIDF, the text in documents A and B can be represented as a
TFIDF vector with dimension equal to the number of vocabulary words. The value
corresponding to each word represents the importance of that word in a
particular document.
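A minimal sketch with scikit-learn's TfidfVectorizer (the two sample documents are assumed here, since the original example text is not reproduced; note that scikit-learn uses a smoothed variant of the IDF formula, so the exact values differ slightly from the textbook definition above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car is driven on the road",      # document A (assumed sample)
    "the truck is driven on the highway", # document B (assumed sample)
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # one TF-IDF vector per document

print(tfidf.get_feature_names_out())  # the vocabulary (columns)
print(X.toarray().round(2))           # corpus-wide words score low, rare words high
```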
Word2Vec
• Word2Vec, short for "word to vector," is a technique
used to represent words as numerical vectors that
capture the relationships between different words. It is
widely used in machine learning for embedding and
text analysis.
• Google introduced Word2Vec for their search engine
and patented the algorithm, along with several
following updates, in 2013. This collection of
interconnected algorithms was developed by a team led
by Tomas Mikolov.
Word2Vec:
Word2Vec is a neural network-based language model
that learns distributed vector representations of words
from a large corpus of text. It is a popular technique
used for natural language processing (NLP) tasks such
as text classification, machine translation, and sentiment
analysis. The Word2Vec model generates vector
representations for each word in the vocabulary based
on the context in which the word appears in the training
data. It does so by training a neural network to predict
the likelihood of a word given its surrounding words in a sentence.
What is a word embedding?
• If you ask someone which word is
more similar to “king” – “ruler” or
“worker” – most people would say
“ruler” makes more sense, right? But
how do we teach this intuition to a
computer? That’s where word
embeddings come in handy.
• A word embedding is a
representation of a word used
in text analysis.
• It usually takes the form of a
vector, which encodes the
word’s meaning in such a way
that words closer in the vector
space are expected to be similar
in meaning.
• Language modeling and feature
learning techniques are typically
used to obtain word
embeddings, where words or
phrases from the vocabulary are
mapped to vectors of real
numbers.
• The meaning of a term is determined by its context: the
words that come before and after it, which is called the
context window. Typically, this window extends four words
to the left and four words to the right of the target
term. To create vector representations of words, we look
at how often they appear together.
• Word embeddings are one of the most fascinating
concepts in machine learning. If you’ve ever used virtual
assistants like Siri, Google Assistant, or Alexa, or even a
smartphone keyboard with predictive text, you’ve already
interacted with a natural language processing model based
on embeddings.
What’s the difference between word representation, word
vectors, and word embeddings?
• Word meanings and relations between them can be established through
semantics analysis. For this, we need to convert unstructured text data into a
structured format suitable for comparison.
Word representations are visualizations that can be depicted as
independent units (e.g. dots) or by vectors that measure the similarity
between words in multidimensional space.
Word vectors are multidimensional numerical representations where
words with similar meanings are mapped to nearby vectors in space.
Word embedding is a technique for representing words with
low-dimensional vectors, which makes it easier to understand
similarity between them.
How is Word2Vec trained?
Word2Vec is trained using a neural network that learns the relationships
between words in large databases of text. To represent a particular word as a
vector in multidimensional space, the algorithm uses one of two modes:
continuous bag of words (CBOW) or skip-gram.
Continuous bag-of-words
(CBOW)
The continuous bag-of-words
model predicts the central word
using the surrounding context
words, which comprise a few
words before and after the
current word.
Skip-gram
The skip-gram model architecture is
designed to achieve the opposite of the
CBOW model. Instead of predicting the
center word from the surrounding context
words, it aims to predict the surrounding
context words given the center word.
The choice between the two approaches
depends on the specific task at hand. The
skip-gram model performs well with
limited amounts of data and is particularly
effective at representing infrequent words.
In contrast, the CBOW model produces
better representations for more commonly
occurring words.
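A minimal training sketch with the gensim library (assuming gensim 4.x; the two-sentence toy corpus is made up for illustration, and a real model needs far more text):

```python
from gensim.models import Word2Vec

sentences = [  # one tokenized sentence per list (toy corpus)
    ["machine", "learning", "uses", "historical", "data"],
    ["machine", "learning", "predicts", "output", "values"],
]
# sg=0 selects CBOW, sg=1 selects skip-gram; window=4 mirrors the
# four-word context window described above
cbow = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=1)

vec = cbow.wv["machine"]  # the 50-dimensional vector learned for "machine"
print(cbow.wv.most_similar("machine", topn=2))
```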
• Word2Vec is an algorithm that uses a
shallow neural network model to learn
the meaning of words from a
large corpus of texts.
• Unlike deep neural networks (DNNs),
which have multiple hidden layers,
shallow neural networks only have one
or two hidden layers between the input
and output.
• This makes the processing prompt and
transparent.
• The shallow neural network of
Word2Vec can quickly recognize
semantic similarities and identify
synonymous words using logistic
regression methods, making it faster
than DNNs.
Avg Word2Vec:
=> Avg Word2Vec is an extension of the
Word2Vec model that generates vector
representations for sentences or documents
instead of individual words. It works by taking the
average of the vector representations of all the
words in a sentence or document to generate a
single vector representation for the entire text.
This approach can be useful in cases where we
want to classify or compare entire texts rather than individual words.
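A minimal sketch of the averaging step, assuming a trained gensim model such as the one above (wv is its KeyedVectors object):

```python
import numpy as np

def avg_word2vec(tokens, wv):
    """Average the Word2Vec vectors of all in-vocabulary tokens."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:  # no known words: fall back to the zero vector
        return np.zeros(wv.vector_size)
    return np.mean(vecs, axis=0)

# Usage with the model trained above:
# doc_vec = avg_word2vec(["machine", "learning", "data"], cbow.wv)
```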
TF-IDF weighted Word2Vec:

=> TF-IDF weighted Word2Vec is a hybrid approach that
combines the benefits of both the TF-IDF (Term
Frequency-Inverse Document Frequency) and Word2Vec
models. It first generates the vector representation of each
word in the vocabulary using the Word2Vec model, and then
multiplies it by the TF-IDF score of the word in the document.
This approach gives more weight to the important words in a
document while still capturing the semantic meaning of the
words.
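A minimal sketch of the weighting step, assuming a trained gensim model and an idf dictionary built from a fitted TfidfVectorizer (e.g. dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))); weighting by corpus-level IDF is a common simplification of per-document TF-IDF weighting:

```python
import numpy as np

def tfidf_weighted_word2vec(tokens, wv, idf):
    """Weight each word's Word2Vec vector by its IDF score, then average."""
    pairs = [(wv[t], idf[t]) for t in tokens if t in wv and t in idf]
    if not pairs:
        return np.zeros(wv.vector_size)
    vecs, weights = zip(*pairs)
    # Weighted average: important (high-IDF) words dominate the result
    return np.average(np.array(vecs), axis=0, weights=weights)

# Usage with the objects assumed above:
# doc_vec = tfidf_weighted_word2vec(["machine", "learning"], cbow.wv, idf)
```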
Uses:
Word2Vec, Avg Word2Vec, and TFIDF Word2Vec can be used in
various NLP applications such as:
=> Sentiment Analysis: These models can be used to classify the
sentiment of a piece of text as positive, negative, or neutral.
=> Document Similarity: These models can be used to compare the
similarity between two documents or to cluster similar documents
together.
=> Information Retrieval: These models can be used to retrieve
relevant information from a large corpus of text.
=> Language Translation: These models can be used to translate
text from one language to another.
=> Chatbots: These models can be used to build chatbots that can
understand natural language input from users and provide
appropriate responses.
Count-Vectorizer
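CountVectorizer is scikit-learn's implementation of the Bag of Words model discussed earlier: it builds the vocabulary and fills the document-term matrix with word frequencies (or with 0/1 values when binary=True). A minimal sketch on the three example documents (scikit-learn's built-in stop-word list may prune the vocabulary slightly differently from the 18-word table above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Machine learning uses historical data to predict output values.",
    "It is seen as a part of artificial intelligence.",
    "Machine learning programs can perform tasks without being explicitly programmed.",
]
cv = CountVectorizer(stop_words="english")  # binary=True gives the binary BoW model
X = cv.fit_transform(docs)

print(cv.get_feature_names_out())  # the vocabulary (columns)
print(X.toarray())                 # one frequency row per document
```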
