Semantic Analysis Theory
We will write a small program and explain its working in detail. We will write
some text and calculate the frequency distribution of each word in the text.
Internally, NLTK builds a FreqDist by counting each token as it is seen:
freq_dist = FreqDist()
for token in tokens:
    freq_dist[token] += 1
For any word, we can then check how many times it occurred in a particular document. E.g.
import nltk
a = "Guru99 is the site where you can find the best tutorials for Software Testing Tutorial, SAP Course for Beginners. Java Tutorial for Beginners and much more. Please visit the site Guru99.com and much more."
words = nltk.tokenize.word_tokenize(a)
fd = nltk.FreqDist(words)
fd.plot()
Explanation of code:
Run the code and examine the resulting graph, which plots the frequency
distribution of each word in the text, for a better understanding.
NOTE: You need to have matplotlib installed to see the above graph
Observe the graph above. It counts the occurrence of each word in the
text. This helps in the study of text and, further, in implementing text-based
sentiment analysis. In a nutshell, nltk has a module for counting the
occurrence of each word in a text, which helps in preparing the statistics of natural
language features. It plays a significant role in finding the keywords in the text. You
can also extract the text from a PDF using libraries such as textract or PyPDF2 and feed the
text to nltk.FreqDist.
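As the paragraph above notes, once the distribution is built you can query the count of any individual word. A minimal sketch (the sample sentence is our own, tokenized with split() so no extra data packages are needed):

```python
from nltk import FreqDist

# Count tokens from a simple whitespace tokenization;
# nltk.word_tokenize would also work but needs the punkt data package.
words = "Guru99 is the site where you can find tutorials Guru99 tutorials".split()
fd = FreqDist(words)

print(fd["Guru99"])       # how often a single word occurred
print(fd.most_common(2))  # the most frequent words with their counts
```

A FreqDist behaves like a dictionary from word to count, so looking up a word that never occurred simply returns 0.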
Counting each word on its own may not be very useful. Instead, one should focus on
collocations and bigrams, which deal with words in pairs. These pairs identify useful
keywords that make better natural language features, which can be fed to the machine.
Please look below for their details.
Collocations: Bigrams and Trigrams
What are Collocations?
Collocations are pairs of words that occur together many times in a document. They are
measured by the ratio of the number of times the pair occurs together to the overall
word count of the document.
Consider the electromagnetic spectrum, with phrases like ultraviolet rays and infrared rays.
The words ultraviolet and rays are not used individually and hence can be treated as a
collocation. Another example is CT scan: we don't say CT and scan separately,
and hence they are also treated as a collocation.
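NLTK also ships a collocations module that scores such candidate pairs directly. A small sketch, using a made-up toy corpus in which "ultraviolet rays" recurs as a pair:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus: "ultraviolet rays" occurs twice; every other pair occurs once.
tokens = ("the ultraviolet rays and infrared rays of the spectrum "
          "show that ultraviolet rays carry more energy").split()

finder = BigramCollocationFinder.from_words(tokens)
# Rank bigrams by raw frequency; other association measures such as
# PMI are also available on BigramAssocMeasures.
top = finder.nbest(BigramAssocMeasures.raw_freq, 3)
print(top)
```

The repeated pair ("ultraviolet", "rays") ranks first, which is exactly the collocation behaviour described above.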
Bigrams and Trigrams provide more meaningful and useful features for the feature
extraction stage. These are especially useful in text-based sentiment analysis.
Bigrams Example Code
import nltk
text = "Guru99 is totally new kind of learning experience."
tokens = nltk.word_tokenize(text)
output = list(nltk.bigrams(tokens))
print(output)
Output
[('Guru99', 'is'), ('is', 'totally'), ('totally', 'new'), ('new', 'kind'), ('kind', 'of'), ('of', 'learning'), ('learning', 'experience'), ('experience', '.')]
Sometimes it becomes important to look at a group of three words in a sentence for
statistical analysis and frequency counting. This again plays a crucial role in forming
NLP (natural language processing) features as well as in text-based sentiment
prediction.
Trigrams Example Code
import nltk
text = "Guru99 is totally new kind of learning experience."
tokens = nltk.word_tokenize(text)
output = list(nltk.trigrams(tokens))
print(output)
Output
[('Guru99', 'is', 'totally'), ('is', 'totally', 'new'), ('totally', 'new', 'kind'), ('new', 'kind', 'of'), ('kind', 'of', 'learning'), ('of', 'learning', 'experience'), ('learning', 'experience', '.')]
Semantic Analysis:
For humans, making sense of text is simple: we recognize individual words and the
context in which they’re used. If you read this tweet:
"Your customer service is a joke! I've been on hold for 30 minutes and
counting!"
You understand that a customer is frustrated because a customer service agent is taking
too long to respond.
However, machines first need to be trained to make sense of human language and
understand the context in which words are used; otherwise, they might misinterpret
the word “joke” as positive.
Semantic analysis draws on lexical relationships between words, such as:
Hyponyms: specific lexical items of a generic lexical item (hypernym), e.g., orange
is a hyponym of fruit (hypernym).
Meronomy: a logical arrangement of text and words that denotes a constituent
part of or member of something e.g., a segment of an orange
Polysemy: a relationship between words or phrases whose meanings, although
slightly different, share a common core, e.g., I read a paper, and I wrote a
paper.
Synonyms: words that have the same sense or nearly the same meaning as
another, e.g., happy, content, ecstatic, overjoyed
Antonyms: words that have close to opposite meanings e.g., happy, sad
Homonyms: two words that sound the same and are spelled alike but have
different meanings, e.g., orange (color) and orange (fruit).
Semantic analysis also takes into account signs and symbols (semiotics) and
collocations (words that often go together).
Automated semantic analysis works with the help of machine learning algorithms.
A core task is word sense disambiguation: the automated process of identifying the
sense in which a word is used, according to its context.
Natural language is ambiguous and polysemic; sometimes, the same word can have
different meanings depending on how it’s used.
The word “orange,” for example, can refer to a color, a fruit, or even a city in Florida!
The same happens with the word “date,” which can mean either a particular day of the
month, a fruit, or a meeting.
In semantic analysis with machine learning, computers use word sense disambiguation
to determine which meaning is correct in the given context.
Relationship Extraction
This task consists of detecting the semantic relationships present in a text. Relationships
usually involve two or more entities (which can be names of people, places, company
names, etc.). These entities are connected through a semantic category, such as “works
at,” “lives in,” “is the CEO of,” “headquartered at.”
For example, the phrase “Steve Jobs is one of the founders of Apple, which is headquartered in
California” contains two different relationships: (Steve Jobs, is a founder of, Apple) and
(Apple, is headquartered in, California).
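As an illustration only, here is a toy pattern-based extractor over that example phrase. The surface patterns and relation labels are our own; real systems use learned models and entity recognizers rather than hand-written regexes:

```python
import re

text = ("Steve Jobs is one of the founders of Apple, "
        "which is headquartered in California")

# Hypothetical surface patterns mapping phrases to relation labels.
patterns = [
    (r"(\w+(?: \w+)?) is one of the founders of (\w+)", "founder_of"),
    (r"(\w+), which is headquartered in (\w+)", "headquartered_in"),
]

relations = []
for pattern, label in patterns:
    for match in re.finditer(pattern, text):
        # Store each relation as an (entity, relation, entity) triple.
        relations.append((match.group(1), label, match.group(2)))

print(relations)
```

The (entity, relation, entity) triples produced here are the typical output shape of relationship extraction, whatever method is used to find them.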
Depending on the type of information you’d like to obtain from data, you can use one
of two semantic analysis techniques: a text classification model (which assigns
predefined categories to text) or a text extractor (which pulls out specific information
from the text).
Keyword extraction: finding relevant words and expressions in a text. For instance,
you could analyze the keywords in a bunch of tweets that have been categorized
as “negative” and detect which words or topics are mentioned most often.
Entity extraction: identifying named entities in text, like names of people,
companies, places, etc. A customer service team might find this useful to
automatically extract names of products, shipping numbers, emails, and any other
relevant data from customer support tickets.
Automatically classifying tickets using semantic analysis tools relieves agents of
repetitive tasks and allows them to focus on tasks that provide more value, while
improving the whole customer experience.
Tickets can be instantly routed to the right hands, and urgent issues can be easily
prioritized, shortening response times and keeping satisfaction levels high.
Conclusion
When combined with machine learning, semantic analysis allows you to delve into
your customer data by enabling machines to extract meaning from unstructured text at
scale and in real time.
Powerful semantic-enhanced machine learning tools will deliver valuable insights that
drive better decision-making and improve customer experience.