NLP Unit - 05
Consider a real-world scenario: building a ticketing system for an organization to track and route issues faced by employees to internal or external agents. The concepts discussed here apply directly to such a system. Figure 4-1 shows a representative screenshot of one such system, a corporate ticketing product called Spoke.
A sample pipeline summarizing these ideas may look like what’s shown in Figure 5-3. The process begins with no labeled data, using a public API, a model trained on a public dataset, or weak supervision as the baseline model. Once in production, signals are collected to refine the model and to select the best instances for labeling. Over time, enough data is gathered to build more sophisticated models.
Figure 5-3. A pipeline for building a classifier when there’s no training data
This section discusses the practical issue of insufficient training data for a custom text classifier, surveys various solutions to address it, and aims to help you prepare for data collection and creation in future text classification projects.
E-Commerce:
E-Commerce Catalog:
A product catalog is a database of the products that an enterprise deals in or that a user can purchase. It contains descriptive attributes as well as images for each product. Automatically extracting attribute information from product descriptions is called attribute extraction. It helps ensure that all relevant product information is properly indexed and displayed for each product, improving product discoverability.
Review Analysis:
The user reviews section of an e-commerce platform provides valuable insights into product quality, usability, and delivery. NLP techniques provide an overall perspective across all reviews by performing tasks such as sentiment analysis, review summarization, and so on.
Product Search:
E-commerce search systems are specialized for product-related queries, unlike general-purpose search engines like Google. They require customized information processing, extraction, and search pipelines tailored to the type of e-commerce business.
Product Recommendations:
Recommendation systems suggest products a user is likely to be interested in, based on their behavior on the platform; this is covered in detail later in this section.
Search in E-Commerce:
In e-commerce, an effective search feature is crucial for customers to quickly find their
desired products, positively impacting the conversion rate and revenue. Unlike general search
engines, e-commerce search engines provide faceted search, allowing users to navigate with
filters like brand, price, and product attributes. This streamlined search approach, as seen on
platforms like Amazon and Walmart, empowers users to refine their search results, improving
the overall shopping experience and increasing the likelihood of making a purchase. Faceted
search can be built with most popular search engine backends like Solr and Elasticsearch.
Filters such as color and size are the key elements that define faceted search. However, their values may not always be readily available for all products. Some reasons for this are:
• The seller didn’t upload all the required information while listing the product on the
e-commerce website.
• Some of the filters are difficult to obtain, or the seller may not have the complete
information to provide—for example, the caloric value of a food product.
NLP techniques such as the following can help fill in these missing filter values:
• Attribute extraction
• Product enrichment
Attribute Extraction:
Attributes are properties that define a product. They play a crucial role in providing a comprehensive overview of products on e-commerce platforms, impacting click-through rates and sales.
The algorithms that extract attribute information from product descriptions are generally called attribute extraction algorithms. There are two types of attribute extraction algorithms: direct and derived.
Direct attribute extraction algorithms assume the presence of the attribute value in the input text. For example, in the product title "Sony XBR49X900E 49-Inch 4K Ultra HD Smart LED TV," the size attribute ("49-Inch") appears directly in the text.
On the other hand, derived attribute extraction algorithms do not assume that the attribute of interest is present in the input text; they derive it from context. Gender is one such attribute: it is usually not stated in the product title, yet the algorithm should infer it from the input text. Consider the product description "YunJey Short Sleeve Round Neck Triple Color Block Stripe T-Shirt Casual Blouse." The product is for women, but the gender "female" is not explicitly mentioned in the product description or title. Sequence labeling models, such as LSTMs, are typically used for direct attribute extraction, while text classification is used for derived attribute extraction.
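As a rough illustration, derived attribute extraction can be framed as text classification. The following is a minimal sketch using scikit-learn; the tiny training set and the "gender" labels are hypothetical stand-ins for a real labeled catalog.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: product titles labeled with the derived
# attribute "gender", which never appears explicitly in the text.
titles = [
    "Round Neck Color Block Stripe T-Shirt Casual Blouse",
    "Floral Print Maxi Dress with Waist Belt",
    "Slim Fit Formal Shirt with Cufflinks",
    "Cargo Shorts with Utility Pockets",
]
labels = ["female", "female", "male", "male"]

# TF-IDF features plus logistic regression: a simple derived-attribute classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(titles, labels)

print(model.predict(["Short Sleeve Triple Color Block Stripe Blouse"]))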
Product Categorization and Taxonomy:
Product categorization is the process of dividing products into groups. These groups can be defined based on similarity, e.g., products of the same brand or of the same type can be grouped together. E-commerce platforms generally have pre-defined broad categories of products, such as electronics, personal care, and food. When a new product arrives, it should be categorized into this taxonomy before it’s put in the catalog.
Product Enrichment:
An important part of product enrichment is deduplication: preventing the same product from being listed multiple times due to variations in naming conventions. For example, "Garmin nuvi 2699LMTHD GPS Device" and "nuvi 2699LMTHD Automobile Portable GPS Navigator" refer to the same product. Deduplication can be achieved through attribute match, title match, and image match.
Attribute match involves comparing attribute values of similar products, using string
matching or similarity metrics to identify overlaps and discrepancies.
Title match identifies variants of the same product by comparing bigrams and trigrams among
titles, or by generating title-level features and calculating Euclidean distances between them.
Image match relies on computer vision techniques like pixel-to-pixel matching or Siamese
networks to match product images, reducing duplication and improving catalog accuracy.
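As a hedged sketch, title match via n-gram overlap might look like the following; the bigram choice and the Jaccard score are illustrative, and a real system would tune a decision threshold on labeled duplicate pairs.

def ngrams(text, n=2):
    # Lowercased word n-grams (bigrams by default) of a product title.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def title_similarity(title_a, title_b):
    # Jaccard similarity over bigram sets; higher means more likely duplicates.
    a, b = ngrams(title_a), ngrams(title_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(title_similarity("Garmin nuvi 2699LMTHD GPS Device",
                       "nuvi 2699LMTHD Automobile Portable GPS Navigator"))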
Review Analysis
Reviews are crucial for e-commerce portals as they provide direct customer feedback on
products. Leveraging this information can improve the customer experience and directly
affect product sales. This section delves into different aspects of review sentiment analysis,
ensuring a comprehensive understanding of customer feedback.
Sentiment Analysis
Sentiment analysis for e-commerce reviews varies based on aspects and attributes. A screenshot of aspect-level reviews of the iPhone X on Amazon illustrates how reviews can be sliced and diced. The data reveals that 67% of reviews have five stars, while 22% have one star. Understanding customer behavior is crucial for e-commerce companies, as illustrated by two extreme reviews of the same product.
Both positive and negative reviews give retailers insights into customer opinions. Negative reviews, such as those about defective screens, are the more crucial to understand: positive reviews tend to express generic positive sentiment, while negative reviews highlight specific user concerns. Reviews are also challenging to analyze because of their unstructured format, which includes errors such as misspelled and incomplete words.
Ratings are directly proportional to the overall sentiment of reviews, and understanding emotions directly from the text can help retailers rectify anomalies during analysis. Reviews often cover most aspects of a product, all of which are reflected in a single rating. Understanding customer emotions and feedback is crucial for retailers to improve their products, but a single high-level emotion index for an entire review is insufficient for deeper understanding. Aspect-level understanding of reviews is needed; aspects can be pre-defined or extracted from the review data, and depending on which, the approach will be supervised or unsupervised.
Supervised approach
A supervised approach identifies seed words in sentences and tags them with corresponding
aspects. Sentiment analysis is done at the sentence level, allowing for filtering and aggregated
sentiments to understand customer feedback. For example, review sentences related to screen
quality, touch, and response time can be grouped together. An example from a travel website
shows aspect-level sentiment analysis, extracting semantic concepts like location, check-in,
value, and cleanliness to provide a more detailed view of reviews.
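A minimal sketch of this seed-word tagging idea, assuming a hypothetical seed lexicon and using TextBlob (also used later in this unit) for sentence-level sentiment:

from textblob import TextBlob

# Hypothetical seed lexicon mapping aspects to indicative words.
ASPECT_SEEDS = {
    "screen": {"screen", "display", "resolution"},
    "battery": {"battery", "charge", "charging"},
}

def tag_aspects(review):
    # Tag each sentence with the aspects whose seed words it contains,
    # then attach that sentence's sentiment polarity.
    results = []
    for sentence in TextBlob(review).sentences:
        words = {w.lower() for w in sentence.words}
        aspects = [a for a, seeds in ASPECT_SEEDS.items() if words & seeds]
        if aspects:
            results.append((str(sentence), aspects, sentence.sentiment.polarity))
    return results

print(tag_aspects("The screen is gorgeous. Battery drains too fast."))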
Unsupervised approach
Topic modeling is an unsupervised method for identifying latent topics, such as aspects, in a collection of documents. It outputs the probability of each word belonging to each topic, which lets us group words with a high chance of belonging to a certain aspect, identify characteristic words for that aspect, and thereby group sentences that discuss the same aspect. Another unsupervised approach is to create sentence representations and perform clustering instead of LDA, which may yield better results when there are fewer review sentences. Either method can help predict ratings for all aspects and provide a more granular view of user preferences.
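A rough sketch of the LDA route using scikit-learn; the toy sentences and the choice of two topics are illustrative assumptions.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy review sentences; real data would number in the thousands.
sentences = [
    "the screen resolution is amazing",
    "display brightness could be better",
    "battery lasts all day",
    "charging is slow and battery drains fast",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# n_components = number of latent aspects; an illustrative choice here.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# The top words of each topic hint at the aspect it captures.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"topic {i}:", [terms[j] for j in topic.argsort()[-3:]])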
Latent rating regression analysis (LARA) connects the overall user rating to individual aspect-level sentiments; for example, such a system can generate aspect-level ratings for a hotel review. The final rating is modeled as a weighted combination of individual aspect-level sentiments, and the objective is to estimate the weights and the aspect-level sentiments together. The learned weights are crucial information for e-retailers before taking any action, as they indicate how much importance a reviewer places on a specific aspect. The estimation can be performed sequentially or jointly.
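In symbols, LARA's core assumption can be sketched as follows, where r is the overall rating, s_i the sentiment on aspect i, and w_i the reviewer-specific weight of that aspect (the normalization constraint is part of this sketch):

r = \sum_i w_i s_i, \qquad \sum_i w_i = 1, \quad w_i \ge 0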
Understanding Aspects
Retailers aim to analyze product aspects and sentiments in reviews to understand user
interest. However, due to the large volume of reviews, summarization algorithms like
LexRank can help. LexRank, similar to PageRank, connects sentences based on similarity
and selects the most central sentences for an extractive summary. An example pipeline for
review analysis covers both overall and aspect-level sentiments. After detecting review-level
aspects, sentiment analysis is conducted for every aspect and aggregated based on aspects.
Summarization algorithms like LexRank are then used to summarize these sentiments. This helps retailers obtain the overall sentiment for an aspect along with a summary of the opinions explaining that sentiment.
Figure 5-6. The complete flowchart of review analysis: overall sentiments, aspect-level
sentiments, and aspect-wise significant reviews
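A self-contained sketch of the LexRank idea: build a sentence-similarity graph from TF-IDF vectors and rank sentences with PageRank. networkx and scikit-learn are assumed available; a production system would use a dedicated LexRank implementation.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_summary(sentences, top_k=2):
    # Edge weights are pairwise TF-IDF cosine similarities; PageRank
    # centrality then picks the most "central" sentences (the LexRank idea).
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    scores = nx.pagerank(graph)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)]

reviews = [
    "The screen is bright and sharp.",
    "Display quality is excellent overall.",
    "Delivery was delayed by two days.",
    "Great screen, colors really pop.",
]
print(lexrank_summary(reviews))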
Product Recommendations:
E-commerce platforms recommend products based on a user's purchase profile, which can be inferred from their behavior on the platform. Neighborhood-based methods help determine the products the user will be interested in next. This involves analyzing similar products based on
attributes, purchase history, and customer interactions. In addition to numerical data,
e-commerce also has a large amount of textual data that can be used in product
recommendations. The recommendation algorithm can include product descriptions in text to
provide better understanding and match more granular attributes, such as clothing material.
Reviews provide valuable insights into product quality and user opinions, guiding product
recommendations. For example, user feedback on a product's screen size can filter related
products and make recommendations more useful. A case study demonstrates how
e-commerce can leverage product reviews to build a recommendation system. Reviews not
only help identify better products for recommendation but also reveal product
interrelationships through customer feedback.
Reviews often contain specific information about product attributes, but extracting those attributes explicitly is limited by the need for an explicit ontology. Instead, we can learn them through a latent vector representation (whose details are not covered here). Each product is associated with its reviews, and the distribution of aspects discussed in them can be obtained through latent factor models. This distribution can be modeled over all reviews related to a product using topic models like LDA, providing a vectorial representation, or "topic vector," of the discussion around the product.
Product linking
The task is to understand the link between two products using topic vectors, which capture the intrinsic properties of a product in a latent attribute space. A combined feature vector is created from the two products' topic vectors, and a model predicts whether there is any relationship between them. This process is called "link prediction." The two objectives, obtaining topic vectors and predicting links, can be solved jointly, learning both a topic vector for each product and the function that combines them for a product pair. The resulting representation becomes expressive enough to capture the intrinsic attributes of products and to reveal hierarchical dependencies, which depict the product taxonomy. This latent representation, which has more expressivity than exact attribute extraction, is efficient for link prediction and reveals meaningful notions of the product taxonomy, making it useful for better product recommendations and for finding more similar products.
Figure 5-8. Topic vector and topic hierarchy express how different taxonomic identities and
relations are captured in reviews
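A hedged sketch of link prediction over topic vectors: concatenate the two products' vectors and train a binary classifier. The topic vectors and relation labels below are random stand-ins for ones produced by a topic model over reviews and by, say, co-purchase data.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in topic vectors (e.g., from LDA over each product's reviews).
n_products, n_topics = 20, 5
topic_vectors = rng.dirichlet(np.ones(n_topics), size=n_products)

# Stand-in relation labels: 1 if a pair is related (e.g., co-purchased).
pairs = [(i, j) for i in range(n_products) for j in range(i + 1, n_products)]
X = np.array([np.concatenate([topic_vectors[i], topic_vectors[j]])
              for i, j in pairs])
y = rng.integers(0, 2, size=len(pairs))

# A simple combining function: logistic regression over concatenated vectors.
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # P(link) for the first few pairs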
Wrapping Up
The e-commerce industry's success is largely due to data collection and data-driven decisions.
Natural Language Processing (NLP) techniques have significantly improved user experience
and revenue in the retail and e-commerce sectors. This chapter covers various aspects of NLP
in e-commerce, including faceted search, product attributes, review analysis, and product
recommendations. While most examples come from product commerce, the same techniques can be applied to domains like travel and food.
Social media
Applications
Trending topic detection
Trending topics on social networks reveal popular and noteworthy content, which is crucial for media houses, retailers, first responders, and government entities. This information helps them refine their user engagement strategies and provides valuable insights at specific geolocations.
Opinion mining
Social media allows people to express opinions about products, services, or policies, making
it crucial for brands and organizations to gather and understand this information. Manually
analyzing thousands of posts is impossible, so summarizing and extracting key insights is
highly valuable.
Sentiment detection
Sentiment analysis of social media data is the most popular application of Natural Language
Processing (NLP) for brands to understand user sentiments about their products and
competitors. This helps identify customer cohorts to engage with and understand long-term
shifts in customer sentiment over time.
Fake news detection
Social networks are often misused to spread fake news, swaying public opinion through false propaganda. Understanding and identifying fake news and rumors is crucial for both preventive and corrective measures to control this menace.
Adult content filtering
Social media platforms are plagued by the spread of inappropriate content, and Natural
Language Processing (NLP) is widely utilized to identify and filter such content.
Customer support
Social media has become a crucial tool for brands to handle customer complaints and
concerns, with Natural Language Processing (NLP) being utilized to understand, categorize,
filter, prioritize, and, in some cases, automatically respond to these complaints.
Unique Challenges
Until now, we’ve (implicitly) assumed that the input text (most of the time, if not always)
follows the basic tenets of any language, namely:
• Single language
• Single script
• Formal
• Grammatically correct
• Few or no spelling errors
• Mostly text-like (very few non-textual elements, such as emoticons, images, smileys, etc.)
Standard Natural Language Processing (NLP) systems assume highly structured and formal language. Text data from social platforms challenges this assumption: the brevity of social posts creates an informal language with nonstandard spellings, hashtags, emoticons, new words, acronyms, code-mixing, and transliteration. This "language of social media" is distinct enough that NLP tools built for standard text struggle with social media text data (SMTD), since tweets differ substantially from other textual formats like newspapers, blog posts, emails, and book chapters. Let’s look at the key differences in detail:
No grammar
Social media conversations often do not follow grammar rules, causing difficulties in pre-processing steps like tokenization and sentence boundary identification. Modules specialized for SMTD are needed to handle these non-standard language conventions.
Nonstandard spelling
Languages typically have a single canonical way of writing each word, but in SMTD a word can have many spelling variations; "tomorrow", for instance, appears in forms such as "tmrw" and "2mrw". For an NLP system to function effectively, it must recognize that all these variants refer to the same word.
Multilingual
Traditional articles typically originate in a single language, with only a small portion written in multiple languages; on social media, languages are often mixed within a single post. Consider the following example from a social media website:
Yaar tu to, GOD hain. Tui
JU te ki korchis? Hail u man!
The text, a mix of English, Hindi, and Bengali, expresses gratitude and admiration for the individual's achievements at JU.
Transliteration
Languages have their own scripts, but on social media, people often use "transliteration" to
represent characters from different scripts. This is common in SMTD, where the typing
interface is in Roman script but the language is non-English.
Special characters
SMTD includes non-textual entities like special characters, emojis, hashtags, emoticons, images, and non-ASCII characters, which require dedicated modules in the NLP pre-processing pipeline.
Ever-evolving vocabulary
Social media vocabulary grows rapidly, causing NLP systems to run into the out-of-vocabulary (OOV) problem, which hurts their performance. One study showed that 10-15% new words are added to the vocabulary of social media every month, compared to the previous month's data.
Length of text
Social media texts are shorter than those from other channels like blogs, product reviews, and emails, originally due to Twitter's 140-character restriction. As Twitter's popularity and adoption have grown, this shorter writing style has become the norm, appearing in informal communication like messages and chats.
Noisy data
Social media posts contain large amounts of spam, ads, and irrelevant content, making raw data difficult to consume directly. Filtering out such noise is crucial for NLP tasks like sarcasm detection, to ensure no spam or irrelevant content pollutes the input.
We’ll now take a deep dive into applying NLP to SMTD to build some interesting applications.
Word Cloud
A word cloud is a pictorial way of capturing the most significant words in a given
document or corpus. It’s nothing but an image composed of words (in different sizes) from
the text under consideration, where the size of the word is proportional to its importance
(frequency) in the text corpus. It’s a quick way to understand the key terms in a corpus.
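A quick sketch using the wordcloud package (assumed installed via pip install wordcloud) together with matplotlib:

import matplotlib.pyplot as plt
from wordcloud import STOPWORDS, WordCloud

corpus = "NLP on social media: hashtags emojis slang NLP tweets NLP"

# Word size in the image is proportional to word frequency in the corpus.
cloud = WordCloud(stopwords=STOPWORDS, background_color="white").generate(corpus)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()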
One of the crucial steps in processing text data from social media, particularly from platforms like Twitter, is tokenization. Consider the tweet "Hey @NLPer! This is a #NLProc tweet :-D".
The ideal tokenization of this tweet is: ['Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc', 'tweet', ':-D'].
However, using NLTK's word_tokenize, we would get: ['Hey', '@', 'NLPer', '!', 'This', 'is', 'a',
'#', 'NLProc', 'tweet', ':-', 'D'].
Clearly, NLTK's default tokenizer fails to tokenize this tweet correctly, so it's important to use a tokenizer designed for SMTD. A number of specialized tokenizers are available; some popular ones are nltk.tokenize.TweetTokenizer, Twikenizer, and Twokenize, the last of which was built specifically for SMTD.
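For instance, NLTK's TweetTokenizer handles the same tweet correctly:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tweet = "Hey @NLPer! This is a #NLProc tweet :-D"
# Keeps @mentions, #hashtags, and emoticons intact as single tokens.
print(tokenizer.tokenize(tweet))
# ['Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc', 'tweet', ':-D']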
Trending Topics:
Given the volume of traffic, what is trending can (and often does) change within a few hours
in social media. Keeping track of what’s trending by the hour may not be so important for an
individual, but for a business entity, it can be very important. In the lingo of social media, any
conversation around a topic is often associated with a hashtag. Thus, finding trending topics
is all about finding the most popular hashtags in a given time window.
One of the simplest ways to do this is with a Python library called Tweepy. Tweepy wraps the Twitter trends endpoints: trends_available lists the locations for which trending information exists, and the place-trends call (trends_place in Tweepy v3, renamed get_place_trends in v4) takes a geolocation as a WOEID (Where On Earth IDentifier) and returns the trending topics for that location, provided trending information is available for the given WOEID. The response of this call is an array of objects that are "trending." Each object encodes the name of the trending topic, the corresponding query parameters that can be used to search for the topic using Twitter search, and the URL to that Twitter search.
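A minimal sketch, assuming valid Twitter API credentials and Tweepy v3 (under v4 the call is api.get_place_trends):

import tweepy

# Placeholder credentials; real values come from the Twitter developer portal.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

WOEID_INDIA = 23424848  # WOEID for India

# Returns a list with one result object holding the "trends" array.
result = api.trends_place(WOEID_INDIA)
for trend in result[0]["trends"][:10]:
    print(trend["name"], trend["url"])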
For businesses and brands across the globe, it’s important to know whether people’s opinion
is positive or negative and if this sentiment polarity is changing over time. In today’s world,
social media is a great way to understand people’s sentiment about a brand. For building a
system for sentiment analysis we’ll use TextBlob, which is a Python-based NLP toolkit built
on top of NLTK and Pattern.
This gives us polarity and subjectivity values for each tweet in the corpus. Polarity is a value in [-1.0, 1.0] that tells how negative or positive the text is.
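For example (the sample tweets are illustrative):

from textblob import TextBlob

tweets = [
    "Loving the new update, works great!",
    "Worst service ever, totally disappointed.",
]

for tweet in tweets:
    sentiment = TextBlob(tweet).sentiment
    # polarity in [-1.0, 1.0]; subjectivity in [0.0, 1.0]
    print(tweet, sentiment.polarity, sentiment.subjectivity)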
Pre-Processing SMTD:
SMTD often contains markup elements, and it's important to remove them. A great way to achieve this is to use a library called Beautiful Soup:
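A small sketch, with an illustrative snippet of marked-up text:

from bs4 import BeautifulSoup

html_text = "Check this out! <br/><a href='#'>link</a> &amp; more"
clean_text = BeautifulSoup(html_text, "html.parser").get_text()
print(clean_text)  # Check this out! link & more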
SMTD is often full of symbols, special characters, etc. In order to understand them, it’s
important to convert the symbols present in the data to simple and easier-to-understand
characters. This is often done by converting to a standard encoding format like UTF-8.
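A minimal sketch of encoding normalization, assuming the raw bytes arrived as UTF-8:

# Re-encode to UTF-8, silently dropping any bytes that can't be decoded.
raw_bytes = "I love pizza 🍕! Shall we book a cab 🚕?".encode("utf-8")
text = raw_bytes.decode("utf-8", errors="ignore")
print(text)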
Handling apostrophes:
The way to handle contracted forms is to expand them. This requires a dictionary that can map apostrophe forms to their full forms:
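A tiny illustrative mapping; real systems use much larger dictionaries:

# A small, illustrative contraction map.
CONTRACTION_MAP = {"don't": "do not", "can't": "cannot", "i'm": "i am"}

def expand_apostrophes(text):
    words = text.lower().split()
    return " ".join(CONTRACTION_MAP.get(w, w) for w in words)

print(expand_apostrophes("I'm happy but don't ask why"))
# i am happy but do not ask why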
Handling emojis:
A good way to handle emojis is to replace each emoji with text describing it; for example, replace "🔥" with "fire". To do so, we need a mapping between emojis and their corresponding elaboration in text. Demoji is a Python package that does exactly this. It has a function, findall(), that gives all the emojis in a text along with their corresponding meanings.
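For example (older demoji releases required an initial demoji.download_codes() call; recent ones bundle the emoji data):

import demoji

text = "The new album is 🔥🔥"
# Maps each emoji found in the text to its description.
print(demoji.findall(text))  # {'🔥': 'fire'}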
Split-joined words:
In SMTD, users sometimes combine multiple words into a single token, where the word boundaries are marked by capital letters, for example GoodMorning, RainyDay, PlayingInTheCold, etc.
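Such tokens can be split with a small regular expression over the capitalization pattern:

import re

def split_joined(word):
    # Insert a space wherever a lowercase letter is followed by an uppercase one.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word)

print(split_joined("PlayingInTheCold"))  # Playing In The Cold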
We might want to remove URLs altogether. The following snippet replaces all URLs with an empty string:

import re

def remove_urls(text, replacement=""):
    # Match http(s) URLs as well as bare www. links.
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Use the sub() method to replace URLs with the specified replacement.
    return url_pattern.sub(replacement, text)

# Example (illustrative input):
input_text = "Check https://example.com and www.example.org for details"
output_text = remove_urls(input_text)
print("Original Text:")
print(input_text)
print("Text without URLs:")
print(output_text)
Nonstandard spellings:
On social media, people often write words with nonstandard spellings that are, technically, spelling mistakes. We can use TextBlob, which has some built-in spelling-correction capabilities:
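For example:

from textblob import TextBlob

# correct() applies TextBlob's built-in spelling correction.
print(TextBlob("I havv goood speling!").correct())
# I have good spelling!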
As the volume of complaints and issues raised by users and customers grew, brands were prompted to create dedicated handles and pages to handle support traffic. Twitter and Facebook
have launched various features to support brands, and most customer relationship
management (CRM) tools support customer service on social channels. A brand can connect
their social channels to the CRM tool and use the tool to respond to inbound messages.
Owing to the public nature of conversations, brands are obligated to respond quickly.
However, brands’ support pages receive a lot of traffic. Some of it is genuine questions, grievances, and requests, popularly known as "actionable conversations," since customer support teams should act on them quickly. On the other hand, a large portion of the traffic, promos, coupons, offers, opinions, troll messages, etc., is popularly called "noise." Customer support teams don't want to respond to noise; ideally, they want only actionable messages to be converted into tickets in their CRM tools.
We can build a model that identifies noise versus actionable messages; a condensed sketch follows the list below. The pipeline will be very similar to ones we've seen before:
1. Get the data
2. Clean it
3. Pre-process it
4. Tokenize it
5. Represent it
6. Train a model
7. Test model
8. Put it in production
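A condensed sketch of steps 4-7 using scikit-learn; the labeled messages are hypothetical stand-ins for a real annotated dataset:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labels: 1 = actionable, 0 = noise.
messages = [
    "My order #123 arrived broken, please help",
    "Use coupon SAVE20 for 20% off!!!",
    "App crashes every time I open my account page",
    "Follow us for daily deals and offers",
]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.5, random_state=0, stratify=labels)

# Tokenize/represent (TF-IDF) and train a classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test the model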
Over time, some users have taken to behaving outside community norms; this is known as "trolling." A large portion of posts on social platforms is full of controversial content such as trolls, memes, internet slang, and fake news.
Identifying Memes:
It is important to identify content that could be a meme heckling others, or that is otherwise offensive or in violation of group or platform rules. There are two broad approaches: content-based and behavior-based identification. Content-based identification analyzes the content of the post itself. Behavior-based meme identification, on the other hand, relies on the activity surrounding a post: by analyzing metrics such as the number of shares, comments, and likes, viral content can be identified.
Fake News:
The number of incidents related to fake news has risen significantly along with the rise in
users on social platforms. Various media houses and content moderators are actively working
on detecting and weeding out such fake news. There are some principled approaches that can
be used to tackle this menace:
Fact verification deals with validating facts in a news article: given a claim and a set of facts, a system needs to determine whether the facts support the claim. Amazon Research at Cambridge created a curated dataset to deal with such cases of misinformation in natural text.
A simple setup for this problem is to build a parallel corpus with instances of fake and real news excerpts and classify them as real or fake. Researchers from Harvard recently developed a system [42] to identify which text is written by humans and which is generated by machines (and therefore could be fake). This system uses statistical methods to flag text that looks machine-generated.
Health care
Healthcare as an industry encompasses both goods (i.e., medicines and equipment)
and services (consultation or diagnostic testing) for curative, preventive, palliative, and
rehabilitative care.
Healthcare is a significant part of advanced economies' GDP, often exceeding 10%, so automating and optimizing its processes and systems can provide significant benefits. NLP can help in various applications, from clinical research to revenue cycle management. In the referenced figure of healthcare NLP applications, blue cells represent current applications, purple cells those that are emerging and being tested, and red cells next-generation applications.
Health Assistants
Health assistants and chatbots, like Woebot, can enhance patient and caregiver
experiences by utilizing expert systems and NLP. Woebot, for instance, helps patients with
mental illness and depression by combining NLP with cognitive therapy.
To handle the nuances and complexity of electronic health records (EHRs), the open Fast Healthcare Interoperability Resources (FHIR) standard was created; it uses a standardized format with unique locators for consistency and reliability. The data is fed into a model based on recurrent neural networks (RNNs), which predicts outcomes by reading the record from start to end. The model was evaluated on various health outcomes, achieving an AUC of 0.86 on predicting prolonged hospital stays, 0.77 on unexpected readmissions, and 0.95 on predicting patient mortality. Interpretability is essential in healthcare, as models should pinpoint why they suggest a particular outcome. Attention, a concept from deep learning, is used to surface the data points and incidents that matter most for an outcome.
The Google AI team has identified best practices for building ML models for healthcare, covering the entire machine learning life cycle from problem definition to data collection and validation. These suggestions apply to NLP, computer vision, and structured data problems alike. While such techniques focus on managing physical well-being, mental well-being remains much harder to quantify.
One study analyzes Twitter data to classify tweets by their content. In the example shown, the first two tweets reflect genuine suicide attempts, the bottom two are sarcastic or false statements, and the middle two mention an explicit suicide attempt date. The data is normalized and cleaned, and character-level models are used to classify the tweets. Emotional states are estimated using hashtags; for example, tweets containing #anger but not #sarcasm or #jk are classified as carrying that emotional content.
The models effectively identified 70% of individuals highly likely to attempt suicide, with only 10% false alarms.
Figure 5-4 shows a confusion matrix detailing misclassification of the various emotions that were modeled.
When we provide this as an input to Comprehend Medical, we get the output shown in Figure 5-15.
Figure 5-15. Comprehend Medical output for the FHIR record example
Comprehend Medical extracts medical information from various sources, including clinic and doctor details, diagnoses, medications, frequency, dosage, and route. Access to these features is provided through the AWS boto library. Cloud APIs and libraries are useful for building medical information extraction, but for more specific requirements, BioBERT is recommended. BioBERT is a bidirectional encoder model adapted to biomedical texts, addressing the differences in word distributions between regular English and medical records. The domain adaptation phase initializes the model weights from a standard BERT model and then pre-trains on biomedical texts, including texts from PubMed. Figure 5-16 shows the process of pre-training and fine-tuning BioBERT.
BioBERT, whose model and weights are open sourced (available on GitHub), is used for medical named entity recognition, relation extraction, and question answering on healthcare texts. It outperforms BERT and other state-of-the-art techniques and can be adapted to specific tasks and datasets. More broadly, NLP can be used in many healthcare applications, including health records and social media monitoring for mental health issues, and similar domain-specific approaches carry over to finance and law.
Figure 5-16. BioBERT pre-training and fine-tuning
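A hedged sketch of loading BioBERT through Hugging Face Transformers for a token-level task. The Hub identifier dmis-lab/biobert-base-cased-v1.1 is an assumption about where the released weights live, and the token classification head below is freshly initialized, so it must be fine-tuned on a labeled biomedical NER dataset before real use.

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed Hub identifier for the BioBERT weights released by DMIS Lab.
MODEL_ID = "dmis-lab/biobert-base-cased-v1.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=3)

text = "The patient was prescribed 50 mg of atenolol daily."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # per-token logits (untrained head: fine-tune first)
print(outputs.logits.shape)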