NLP Unit - 05

Ticketing

In a real-world scenario, consider building a ticketing system for an organization to track and
route issues faced by employees to internal or external agents. This is a useful setting in which
to apply the concepts discussed so far. Figure 5-1 shows a representative screenshot of such a
system, a corporate ticketing system called Spoke.

Figure 5-1. A corporate ticketing system


The company has hired a medical counsel and partnered with a hospital to create a system to
identify medical-related issues and assign them to relevant teams. However, past tickets are
not labeled as health-related. To build a classification system, several options are explored.

Use existing APIs or libraries


To organize content, start with a public API or library and map its classes to what's
relevant to your organization. Google APIs can classify content into over 700 categories,
including 82 related to medical or health issues. Map these categories to your organization's
needs, deciding, for instance, whether substance abuse and obesity issues should go to the
medical counsel, or whether insurance questions belong to HR or should be referred outside.

Use public datasets


Public datasets like 20 Newsgroups, which ships with the sklearn library, can be used for text
classification. It includes a sci.med category, so a basic classifier can be trained on it to
separate medical posts from posts on other topics.
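A minimal sketch of this idea follows, assuming scikit-learn is installed; the choice of non-medical contrast categories is an illustrative assumption, and fetch_20newsgroups downloads the data on first use.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A few categories to contrast against sci.med (illustrative choice).
categories = ["sci.med", "comp.graphics", "rec.autos", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=categories,
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=categories,
                          remove=("headers", "footers", "quotes"))

# Binary target: is the post from sci.med or not?
med_idx = train.target_names.index("sci.med")
y_train = (train.target == med_idx).astype(int)
y_test = (test.target == med_idx).astype(int)

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))
clf.fit(train.data, y_train)
print("Held-out accuracy:", clf.score(test.data, y_test))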

Utilize weak supervision


We have a history of past tickets, but they are not labeled. To create a dataset, we can
bootstrap labels using simple rules. For instance, we can use a rule that categorizes a past
ticket as health-related if it contains words like fever, diarrhea, headache, or nausea.
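A minimal sketch of such a weak-labeling rule follows; the keyword list and ticket texts are illustrative assumptions.

# Weak supervision: label past tickets with a simple keyword rule.
HEALTH_KEYWORDS = {"fever", "diarrhea", "headache", "nausea", "vomiting"}

def weak_label(ticket_text):
    """Return 1 (health-related) if any health keyword appears, else 0."""
    tokens = set(ticket_text.lower().split())
    return int(bool(tokens & HEALTH_KEYWORDS))

past_tickets = [
    "Laptop not booting after the latest update",
    "Running a fever since yesterday, need to see the company doctor",
]
bootstrapped_labels = [weak_label(t) for t in past_tickets]  # -> [0, 1]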
Active learning
Prodigy is a tool that can be used for data collection experiments by asking customer
service desk employees to tag ticket descriptions with a pre-set list of categories.

Figure 5-2. Active learning with Prodigy

Learning from implicit and explicit feedback


During the development, iteration, and deployment of a solution, we receive feedback
to improve our system. This feedback can be explicit or implicit, derived from dependent
variables like ticket response times and rates, which can be incorporated using active learning
techniques.

A sample pipeline summarizing these ideas may look like what’s shown in Figure 5-3.
The process begins with no labeled data and uses a public API, or a model trained on a public
dataset or with weak supervision, as the baseline model. Once the system is in production,
signals are obtained to refine the model and to select the best instances for labeling. Over
time, more data is collected to build more sophisticated models.
Figure 5-3. A pipeline for building a classifier when there’s no training data

This section discussed the practical issue of insufficient training data for a custom text
classifier and surveyed several ways to address it, which should help you prepare for data
collection and creation scenarios in future text classification projects.

E-Commerce:

E-Commerce Catalog:

A product catalog is a database of the products that the enterprise deals in or that a user can
purchase. It contains product description attributes as well as images for each product.
Automatically extracting attribute information from products is called attribute extraction.
Attribute extraction from product descriptions ensures that all the relevant product
information is properly indexed and displayed for each product, improving product
discoverability.

Review Analysis:

The user reviews section of an e-commerce platform provides valuable insights into product
quality, usability, and delivery. NLP techniques provide an overall perspective across all
reviews by performing tasks such as sentiment analysis and review summarization.

Product Search:
E-commerce search systems are specialized for product-related queries, unlike general
search engines like Google. These systems require customized information processing,
extraction, and search pipelines tailored to the type of e-commerce business.

Product Recommendations:

Recommendation engines are essential for e-commerce platforms, as they intelligently
understand customer preferences and suggest relevant products. By analyzing product content
and textual reviews using NLP algorithms, these systems engage customers with personalized
recommendations, increasing the likelihood of purchases.

Search in E-Commerce:

In e-commerce, an effective search feature is crucial for customers to quickly find their
desired products, positively impacting the conversion rate and revenue. Unlike general search
engines, e-commerce search engines provide faceted search, allowing users to navigate with
filters like brand, price, and product attributes. This streamlined search approach, as seen on
platforms like Amazon and Walmart, empowers users to refine their search results, improving
the overall shopping experience and increasing the likelihood of making a purchase. Faceted
search can be built with most popular search engine backends like Solr and Elasticsearch.
Filters such as color and size are the key elements that define faceted search. However, they
may not always be readily available for all products. Some reasons for this are:

• The seller didn’t upload all the required information while listing the product on the
e-commerce website.

• Some of the filters are difficult to obtain, or the seller may not have the complete
information to provide—for example, the caloric value of a food product.

Building an E-Commerce Catalog:

Building an informative catalog can be split into several subproblems:

• Attribute extraction

• Product categorization and taxonomy creation

• Product enrichment

• Product deduplication and matching

Attribute Extraction:

Attributes are properties that define a product. Attributes play a crucial role in providing a
comprehensive overview of products on e-commerce platforms, impacting click-through rates
and sales.

The algorithms that extract attribute information from various product descriptions are
generally called attribute extraction algorithms. There are two types of attribute extraction
algorithms: direct and derived.

Direct attribute extraction algorithms assume the presence of the attribute value in the
input text. For example, “Sony XBR49X900E 49-Inch 4K Ultra HD Smart LED TV
(2017 Model)” contains the brand “Sony.”

On the other hand, derived attribute extraction algorithms do not assume that the attribute of
interest is present in the input text; they derive that information from the context. Gender is
one such attribute: it is usually not present in the product title, yet the algorithm should
identify it from the input text. Consider the product description: “YunJey Short Sleeve Round
Neck Triple Color Block Stripe T-Shirt Casual Blouse.” The product is for women, but the
gender “female” is not explicitly mentioned in the product description or title. Sequence
labeling models, like LSTMs, and text classification are used for direct and derived attribute
extraction, respectively.
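As a toy illustration of the direct case, a brand can sometimes be pulled out with a simple gazetteer lookup; real systems use sequence labeling models, and the brand list below is an illustrative assumption.

# Direct attribute extraction with a brand gazetteer (toy example).
KNOWN_BRANDS = {"sony", "samsung", "garmin", "apple"}

def extract_brand(title):
    for token in title.split():
        if token.lower() in KNOWN_BRANDS:
            return token
    return None

print(extract_brand("Sony XBR49X900E 49-Inch 4K Ultra HD Smart LED TV (2017 Model)"))
# -> "Sony"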
Product Categorization and Taxonomy:

Product categorization is a process of dividing products into groups. These groups can be
defined based on similarity—e.g., products of the same brand or products of the same type
can be grouped together. Generally, e-commerce has pre-defined broad categories of
products, such as electronics, personal care products, and foods. When a new product arrives,
it should be categorized into the taxonomy before it’s put in the catalog.

Text classification techniques, including hierarchical classification, are employed to automate
product categorization, with rule-based methods used for high-level categories and machine
learning techniques for finer-grained categories. Image and text data are combined using
convolutional neural networks (CNNs) and LSTM models to improve categorization accuracy.
For new e-commerce platforms, APIs like Semantics3, eBay, and Lucidworks offer solutions
for product categorization.

Product Enrichment:

To improve search accuracy and recommendations, gathering rich product information is
crucial. Sources such as product titles, images, and descriptions often contain incorrect or
incomplete information, impacting search results and conversion rates. For instance, overly
long and misleading product titles can hinder faceted search. Product enrichment involves
refining product titles by removing irrelevant tokens and structuring them according to a
predefined template based on attributes from the taxonomy tree. It provides a framework for
systematically enhancing product information, ensuring consistency and accuracy across the
e-commerce platform.

Product Deduplication and Matching:

Product deduplication is crucial in e-commerce to prevent the same product from being listed
multiple times due to variations in naming conventions. For example, “Garmin nuvi
2699LMTHD GPS Device” and “nuvi 2699LMTHD Automobile Portable GPS Navigator”
refer to the same product. Deduplication can be achieved through attribute match, title match,
and image match.

Attribute match involves comparing attribute values of similar products, using string
matching or similarity metrics to identify overlaps and discrepancies.

Title match identifies variants of the same product by comparing bigrams and trigrams among
titles, or by generating title-level features and calculating Euclidean distances between them.
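A minimal sketch of title matching with bigram overlap (Jaccard similarity) follows; the titles come from the example above, and in practice this signal is combined with attribute and image matching rather than used alone.

# Title match via Jaccard similarity over word bigrams.
def bigrams(title):
    tokens = title.lower().split()
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}

def title_similarity(title_a, title_b):
    a, b = bigrams(title_a), bigrams(title_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

t1 = "Garmin nuvi 2699LMTHD GPS Device"
t2 = "nuvi 2699LMTHD Automobile Portable GPS Navigator"
score = title_similarity(t1, t2)
print(f"Bigram Jaccard similarity: {score:.2f}")
# The score stays low for heavily reworded titles, which is why duplicate
# detection also uses richer title-level features, attribute match, and image match.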

Image match relies on computer vision techniques like pixel-to-pixel matching or Siamese
networks to match product images, reducing duplication and improving catalog accuracy.

Review Analysis

Reviews are crucial for e-commerce portals as they provide direct customer feedback on
products. Leveraging this information can improve the customer experience and directly
affect product sales. This section delves into different aspects of review sentiment analysis,
ensuring a comprehensive understanding of customer feedback.

Sentiment Analysis

Sentiment analysis for e-commerce reviews varies based on aspects and attributes. A
screenshot of aspect-level reviews of iPhone X on Amazon illustrates how to slice and dice
reviews.

Figure 5- 4. Analysis of customer reviews: ratings, keywords, and sentiments

The data reveals that 67% of reviews have five stars, while 22% have one star. Understanding
customer behaviour is crucial for e-commerce companies, as illustrated by extreme reviews of
the same product.

Both positive and negative reviews provide retailers with insights into customer opinions.
Negative reviews, such as those about defective screens, are more crucial to understand.
Positive reviews express generic positive sentiment, while negative reviews highlight specific
user preferences. Understanding reviews is essential due to their unstructured format, which
includes errors like spelling and incomplete words, making analysis more challenging.
Ratings are directly proportional to the overall sentiment of reviews, and understanding
emotions directly from the text can help retailers rectify anomalies during analysis. Reviews
often cover most aspects of a product, reflecting everything in the rating. Understanding
customer emotions or feedback is crucial for retailers to improve their products. However,
providing a high-level emotion index for the entire review is insufficient for deeper
understanding. Aspect-level understanding of reviews is needed; the aspects can be pre-defined
or extracted from the review data, and the corresponding approaches are supervised or
unsupervised, respectively.

Aspect-Level Sentiment Analysis

An aspect is a semantically rich collection of words that indicates certain properties or
characteristics of a product, such as location, value, and cleanliness. These aspects can
include supply, presentation, delivery, return, and quality. Retailers can use supervised
algorithms to identify aspects, seeded with seed words or seed lexicons that hint at the crucial
tokens under a particular aspect. For example, screen resolution, touch, and response time
could be considered aspects for the iPhone X. The retailer can operate at their desired level of
granularity, such as screen quality alone. In the next sections, we will explore supervised and
unsupervised techniques for aspect-level sentiment analysis.

Supervised approach

A supervised approach identifies seed words in sentences and tags them with corresponding
aspects. Sentiment analysis is done at the sentence level, allowing for filtering and aggregated
sentiments to understand customer feedback. For example, review sentences related to screen
quality, touch, and response time can be grouped together. An example from a travel website
shows aspect-level sentiment analysis, extracting semantic concepts like location, check-in,
value, and cleanliness to provide a more detailed view of reviews.

Unsupervised approach

Topic modeling is an unsupervised method for identifying latent topics in a document, such
as aspects. It groups sentences discussing the same aspect and outputs the probability of each
word being in all topics. This allows for the grouping of words with a high chance of
belonging to a certain aspect and identifying characteristic words for that aspect. Another
unsupervised approach is to create sentence representations and perform clustering instead of
LDA, which may yield better results when there are fewer review sentences. This method can
help predict ratings for all aspects and provide a more granular view of user preferences.
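A minimal sketch of the unsupervised route with LDA follows, using scikit-learn; the review sentences and the choice of three topics are illustrative assumptions.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

review_sentences = [
    "The screen resolution is stunning and very sharp",
    "Battery drains quickly after the latest update",
    "Delivery was late and the box arrived damaged",
    "Touch response feels smooth and fast",
    "Battery life barely lasts half a day",
    "Packaging and delivery were handled poorly",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(review_sentences)

# Each LDA topic is a candidate aspect; inspect its most probable words.
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"Topic {topic_id}: {top_words}")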

Connecting Overall Ratings to Aspects

Latent rating regression analysis (LARA) can be used to connect overall user ratings to
individual aspect-level sentiments. For example, such a system can generate aspect-level
ratings for a hotel review, as shown in Figure 5-5.

Figure 5-5. Aspect-wise sentiment prediction using LARA

The final rating is a weighted combination of individual aspect-level sentiments, and the
objective is to estimate the weights and the aspect-level sentiments together. This information
is valuable for e-retailers before they take any action, as the weights indicate the importance a
reviewer places on a specific aspect. The estimation can be performed jointly or sequentially.

Understanding Aspects

Retailers aim to analyze product aspects and sentiments in reviews to understand user
interest. However, the volume of reviews is large, so summarization algorithms like LexRank
can help. LexRank, similar to PageRank, connects sentences based on their similarity and
selects the most central sentences as an extractive summary. An example pipeline for
review analysis covers both overall and aspect-level sentiments. After detecting review-level
aspects, sentiment analysis is conducted for every aspect and aggregated per aspect.
Summarization algorithms like LexRank are then used to summarize the sentences behind
these sentiments. This gives retailers the overall sentiment for an aspect along with a summary
of opinions explaining that sentiment.
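A minimal LexRank-style sketch follows: sentences are connected in a graph weighted by TF-IDF cosine similarity and ranked with PageRank. The review sentences are illustrative, and production systems typically rely on a library implementation of LexRank.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The screen is gorgeous but scratches easily.",
    "Screen quality is the best I have seen on a phone.",
    "Battery life is disappointing for the price.",
    "Customer support resolved my battery complaint quickly.",
]

# Build a sentence-similarity graph and rank sentences by centrality.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
similarity = cosine_similarity(tfidf)
graph = nx.from_numpy_array(similarity)   # nodes = sentences, edge weights = similarity
scores = nx.pagerank(graph, weight="weight")

# The most central sentence serves as a one-line extractive summary.
best = max(scores, key=scores.get)
print(sentences[best])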

Figure 5-6. The complete flowchart of review analysis: overall sentiments, aspect-level
sentiments, and aspect-wise significant reviews

Recommendations for E-Commerce

This section explores textual data-based recommendations in e-commerce, highlighting the
importance of product search and review analysis. It provides a comprehensive analysis of
algorithms and data utilization for recommendations.
Figure 5-7. Comprehensive study of techniques for various e-commerce recommendation
scenarios

E-commerce recommends products based on a user's purchase profile, which can be inferred
from their behavior on the platform. Neighborhood-based methods help determine the
products the user will be interested in next. This involves analyzing similar products based on
attributes, purchase history, and customer interactions. In addition to numerical data,
e-commerce also has a large amount of textual data that can be used in product
recommendations. The recommendation algorithm can include product descriptions in text to
provide better understanding and match more granular attributes, such as clothing material.

Reviews provide valuable insights into product quality and user opinions, guiding product
recommendations. For example, user feedback on a product's screen size can filter related
products and make recommendations more useful. A case study demonstrates how
e-commerce can leverage product reviews to build a recommendation system. Reviews not
only help identify better products for recommendation but also reveal product
interrelationships through customer feedback.

A Case Study: Substitutes and Complements

Recommender systems utilize similar products, either content-based or user-profile-based, to
identify item interrelationships in e-commerce settings.
Complements are products bought together, while substitute pairs are bought in lieu of each
other. These concepts capture the behavioral aspect of product purchase. Identifying
substitutes and complements using user interaction data is challenging due to individual
differences. However, aggregating user interactions can reveal properties about substitution
and complementarity between products. An approach primarily based on product reviews can
help identify these pairs.

Latent attribute extraction from reviews

Reviews often contain specific information about product attributes, but extracting these
attributes may have limitations due to the need for an explicit ontology. Instead, we learn
them through a latent vector representation, which is not covered in this book. Each product
is associated with a review, and discussions on various aspects related to the product can be
obtained through latent factor models. This distribution can be modeled on all reviews related
to a product using topic models like LDA, providing a vectorial representation or "topic
vector" of the product's discussion.

Product linking

The task involves understanding the link between two products using topic vectors, which
capture the intrinsic properties of the product in a latent attribute space. A combined feature
vector is created from the two products' topic vectors and used to predict whether there is
any relationship between them. This process is called "link prediction." The objectives of
obtaining topic vectors and link prediction can be solved jointly, learning topic vectors for
each product and the function to combine them for a product pair. This process becomes
expressive enough to capture the intrinsic attributes of the product and reveal hierarchical
dependence, which depicts the product's taxonomy. The latent representation, which has more
expressivity than exact attribute extraction, is efficient for link prediction and revealing
meaningful notions about the product taxonomy, making it useful for making better product
recommendations and obtaining more similar products.
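A minimal sketch of link prediction over topic vectors follows; the random topic vectors, the labeled product pairs, and the way the two vectors are combined are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_products, num_topics = 50, 10
# Stand-in for LDA-derived topic vectors (one distribution per product).
topic_vectors = rng.dirichlet(np.ones(num_topics), size=num_products)

def pair_features(i, j):
    # Concatenation plus elementwise product is one common way to combine vectors.
    return np.concatenate([topic_vectors[i], topic_vectors[j],
                           topic_vectors[i] * topic_vectors[j]])

# Hypothetical training pairs: (product_i, product_j, related?)
pairs = [(0, 1, 1), (2, 3, 1), (6, 7, 1), (4, 20, 0), (5, 33, 0), (8, 40, 0)]
X = np.stack([pair_features(i, j) for i, j, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LogisticRegression(max_iter=1000).fit(X, y)
prob = clf.predict_proba(pair_features(10, 11).reshape(1, -1))[0, 1]
print(f"P(link between product 10 and 11) = {prob:.2f}")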

Figure 5-8. Topic vector and topic hierarchy express how different taxonomic identities and
relations are captured in reviews
Wrapping Up

The e-commerce industry's success is largely due to data collection and data-driven decisions.
Natural Language Processing (NLP) techniques have significantly improved user experience
and revenue in the retail and e-commerce sectors. This chapter covers various aspects of NLP
in e-commerce, including faceted search, product attributes, review analysis, and product
recommendations. While most examples here come from product commerce, the techniques
can also be applied to other domains, such as travel and food.

Social media

Social platforms are arguably the largest generators of unstructured natural language
data. It’s not possible to manually analyze even a fraction of this data.

Applications

Trending topic detection

Trending topics on social networks reveal popular and noteworthy content, which is crucial for
media houses, retailers, first responders, and government entities. This information helps
them refine their user engagement strategies and provides valuable insights for specific
geolocations.

Opinion mining

Social media allows people to express opinions about products, services, or policies, making
it crucial for brands and organizations to gather and understand this information. Manually
analyzing thousands of posts is impossible, so summarizing and extracting key insights is
highly valuable.

Sentiment detection

Sentiment analysis of social media data is the most popular application of Natural Language
Processing (NLP) for brands to understand user sentiments about their products and
competitors. This helps identify customer cohorts to engage with and understand long-term
shifts in customer sentiment over time.

Rumor/fake news detection

Social networks are often misused to spread false news, swaying masses' opinions through
false propaganda. Understanding and identifying fake news and rumors is crucial for both
preventive and corrective measures to control this menace.
Adult content filtering

Social media platforms are plagued by the spread of inappropriate content, and Natural
Language Processing (NLP) is widely utilized to identify and filter such content.

Customer support

Social media has become a crucial tool for brands to handle customer complaints and
concerns, with Natural Language Processing (NLP) being utilized to understand, categorize,
filter, prioritize, and, in some cases, automatically respond to these complaints.

Unique Challenges
Until now, we’ve (implicitly) assumed that the input text (most of the time, if not always)
follows the basic tenets of any language, namely:
• Single language
• Single script
• Formal
• Grammatically correct
• Few or no spelling errors
• Mostly text-like (very few non-textual elements, such as emoticons, images, smileys, etc.)

Standard Natural Language Processing (NLP) systems assume highly structured and formal
language, an assumption that breaks down for text data from social platforms. Users' brevity
in social posts creates an informal language with nonstandard spellings, hashtags, emoticons,
new words, acronyms, code-mixing, and transliteration. This "language of social media" is
distinct enough to be treated as a language of its own.

NLP tools built for standard text data struggle with social media text data (SMTD) because
the language of tweets differs from that of other textual formats like newspapers, blog posts,
emails, and book chapters.

Figure 5-9. Examples of new words being introduced in vocabulary


Figure 5-10. New recipe for language: nonstandard spellings, emoticons, code-mixing,
transliteration

These differences pose challenges to standard NLP systems. Let’s look at the key differences
in detail:
No grammar
Social media conversations lack grammar rules, causing difficulties in pre-processing steps
like tokenization and sentence boundary identification. Modules specialized for SMTD are
needed to handle these non-standard language conventions.

Nonstandard spelling
Languages typically have a single way of writing words, but in SMTD, words can have
multiple spelling variations. For an NLP system to function effectively, it must recognize that
all these words refer to the same word, as seen in the example of "tomorrow."

Multilingual
Articles are typically written in a single language, but social media posts are often
multilingual, with several languages mixed within a single post. Consider the following
example from a social media website:
Yaar tu to, GOD hain. Tui
JU te ki korchis? Hail u man!
The text, a mix of English, Hindi, and Bengali, expresses gratitude and admiration for the
individual's achievements in JU.

Transliteration
Languages have their own scripts, but on social media, people often use "transliteration" to
represent characters from different scripts. This is common in SMTD, where the typing
interface is in Roman script but the language is non-English.
Special characters
SMTD includes non-textual entities like special characters, emojis, hashtags, emoticons,
images, and non-ASCII characters, which require dedicated pre-processing modules in the
NLP pipeline.

Ever-evolving vocabulary
Social media vocabulary grows rapidly every day, so NLP systems frequently run into the out
of vocabulary (OOV) problem, which hurts their performance. One study showed that roughly
10-15% of the words appearing on social media each month are new compared to the previous
month's data.

Length of text
Social media texts are shorter than texts from other channels like blogs, product reviews, and
emails, partly due to Twitter's 140-character restriction. As Twitter's popularity and adoption
have grown, this shorter writing style has become the norm and now appears in other informal
communication like messages and chats.

Noisy data
Social media posts contain a vast amount of spam, ads, and other irrelevant content, making it
impractical to consume raw data. Filtering out such noisy data is crucial for downstream NLP
tasks like sarcasm detection.

NLP for Social Data:

We’ll now take a deep dive into applying NLP to SMTD to build some interesting
applications that we can apply to a variety of problems.

Word Cloud

A word cloud is a pictorial way of capturing the most significant words in a given
document or corpus. It’s an image composed of words (in different sizes) from the text under
consideration, where the size of a word is proportional to its importance (frequency) in the
text corpus. It’s a quick way to understand the key terms in a corpus.

Here’s a step-by-step process for building a word cloud, followed by a minimal code sketch:

1. Tokenize a given corpus or document

2. Remove stop words

3. Sort the remaining words in descending order of frequency

4. Take the top k words and plot them “aesthetically”
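The sketch below implements the four steps with NLTK and the wordcloud package; the corpus is illustrative, and it assumes the NLTK punkt and stopwords data have already been downloaded.

from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

corpus = "NLP on social media text is fun. Social media text is noisy but fun."

tokens = [t.lower() for t in word_tokenize(corpus) if t.isalpha()]    # step 1
tokens = [t for t in tokens if t not in stopwords.words("english")]   # step 2
top_k = dict(Counter(tokens).most_common(20))                         # step 3

cloud = WordCloud(width=600, height=400).generate_from_frequencies(top_k)  # step 4
cloud.to_file("wordcloud.png")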


Tokenizer for SMTD:

One of the crucial steps in processing text data from social media, particularly from platforms
like Twitter, is tokenization.

Consider the tweet: "Hey @NLPer! This is a #NLProc tweet :-D".

The ideal tokenization of this tweet should be: ['Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc',
'tweet', ':-D'].

However, using NLTK's word_tokenize, we would get: ['Hey', '@', 'NLPer', '!', 'This', 'is', 'a',
'#', 'NLProc', 'tweet', ':-', 'D'].

Clearly, NLTK's word_tokenize fails to correctly tokenize this tweet, so it’s important to use a
tokenizer that produces the correct tokens. A number of specialized tokenizers are available
for SMTD; some of the popular ones are nltk.tokenize.TweetTokenizer, Twikenizer, and
Twokenizer, with Twokenize designed specifically for this kind of text.
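For example, NLTK's TweetTokenizer treats user handles, hashtags, and emoticons as single tokens:

from nltk.tokenize import TweetTokenizer

tweet = "Hey @NLPer! This is a #NLProc tweet :-D"
print(TweetTokenizer().tokenize(tweet))
# ['Hey', '@NLPer', '!', 'This', 'is', 'a', '#NLProc', 'tweet', ':-D']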

Trending Topics:

Given the volume of traffic, what is trending can (and often does) change within a few hours
in social media. Keeping track of what’s trending by the hour may not be so important for an
individual, but for a business entity, it can be very important. In the lingo of social media, any
conversation around a topic is often associated with a hashtag. Thus, finding trending topics
is all about finding the most popular hashtags in a given time window.

One of the simplest ways to do this is using a Python library called Tweepy. Tweepy wraps
Twitter's trends endpoints: trends_available lists the locations for which trending information
exists, and the trending topics for a given geolocation (identified by its WOEID, or Where On
Earth Identifier) can then be fetched, returning the top trending topics for that WOEID,
provided trending information is available for it. The response of such a call is an array of
objects that are “trending.” Each object encodes the name of the topic that’s trending, the
corresponding query parameters that can be used to search for the topic using Twitter search,
and the URL to Twitter search.
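A minimal sketch follows, assuming Twitter API credentials and Tweepy 3.x, where the relevant calls are api.trends_available() and api.trends_place(woeid); in Tweepy 4.x these were renamed to available_trends() and get_place_trends(). The placeholder keys are illustrative.

import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

WORLD_WOEID = 1  # WOEID 1 = worldwide
trends = api.trends_place(WORLD_WOEID)

for trend in trends[0]["trends"][:10]:
    print(trend["name"], trend["query"], trend["url"])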

Understanding Twitter Sentiment:

For businesses and brands across the globe, it’s important to know whether people’s opinion
is positive or negative and if this sentiment polarity is changing over time. In today’s world,
social media is a great way to understand people’s sentiment about a brand. For building a
system for sentiment analysis we’ll use TextBlob, which is a Python-based NLP toolkit built
on top of NLTK and Pattern.
This will give us polarity and subjectivity values of each of the tweets in the corpus.

Polarity is a value in the range [-1.0, 1.0] that tells how positive or negative the text is, while
subjectivity, in the range [0.0, 1.0], tells how subjective or objective it is.
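A minimal example with TextBlob; the tweets are illustrative.

from textblob import TextBlob

tweets = ["I absolutely love the new update!", "Worst customer service ever."]
for text in tweets:
    sentiment = TextBlob(text).sentiment
    print(text, "->", sentiment.polarity, sentiment.subjectivity)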

Pre-Processing SMTD:

Removing markup elements:

SMTD often contains markup elements (e.g., HTML tags), and it’s important to remove them.
A great way to achieve this is to use a library called Beautiful Soup:
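For example, assuming the bs4 package is installed and using an illustrative snippet of markup:

from bs4 import BeautifulSoup

raw_post = "<p>Loving the new phone!! <b>#happy</b> :)</p>"
clean_text = BeautifulSoup(raw_post, "html.parser").get_text()
print(clean_text)  # Loving the new phone!! #happy :)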

Handling non-text data:

SMTD is often full of symbols, special characters, etc. In order to understand them, it’s
important to convert the symbols present in the data to simple and easier-to-understand
characters. This is often done by converting to a standard encoding format like UTF-8.
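One possible sketch of this step: Unicode-normalize the text (NFKC folds many compatibility symbols into plain equivalents) and re-encode it as UTF-8, optionally dropping characters that cannot be kept. The sample text is illustrative.

import unicodedata

raw = "Soooo happy with my new phone\u2122 \u2764\ufe0f only \u00a3499"
normalized = unicodedata.normalize("NFKC", raw)

utf8_bytes = normalized.encode("utf-8")                         # standard encoding
ascii_fallback = normalized.encode("ascii", "ignore").decode()  # aggressive cleanup

print(normalized)
print(ascii_fallback)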

Handling apostrophes:

The way to handle apostrophes is to expand the contracted forms. This requires a dictionary
that maps contractions to their full forms:
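A minimal sketch with a tiny mapping dictionary; a production dictionary would be much larger.

import re

APOSTROPHE_MAP = {
    "don't": "do not", "can't": "cannot", "i'm": "i am",
    "it's": "it is", "won't": "will not",
}
PATTERN = re.compile("|".join(re.escape(k) for k in APOSTROPHE_MAP), re.IGNORECASE)

def expand_apostrophes(text):
    return PATTERN.sub(lambda m: APOSTROPHE_MAP[m.group(0).lower()], text)

print(expand_apostrophes("I'm sure it's fine, don't worry"))
# i am sure it is fine, do not worry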

Handling emojis:

A good way to handle emojis is to replace each emoji with corresponding text explaining the
emoji. For example, replace “🔥” with “fire”. To do so, we need a mapping between emojis and
their corresponding elaboration in text. Demoji is a Python package that does exactly this. It
has a function, findall(), that gives all the emojis in the text along with their
corresponding meanings.
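For example (the tweet is illustrative; older versions of demoji also require demoji.download_codes() to be run once):

import demoji

tweet = "This phone is 🔥🔥 but the battery is 😡"
emoji_map = demoji.findall(tweet)   # maps each emoji to its textual description

for emoji, description in emoji_map.items():
    tweet = tweet.replace(emoji, " " + description + " ")
print(tweet)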

Split-joined words:

In SMTD, users sometimes combine multiple words into a single word, where the word
disambiguation is done by using capital letters, for example GoodMorning, RainyDay,
PlayingInTheCold, etc.

The following code snippet does the job for us:

import re
processed_tweet_text = " ".join(re.findall('[A-Z][^A-Z]*', tweet_text))

For GoodMorning, this will return “Good Morning.”


Removal of URLs:

We might want to remove the URLs altogether. The following code snippet replaces all URLs
with a constant; in this case, [constant_url].

import re

def remove_urls(text, replacement_text="[constant_url]"):
    # Define a regex pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Use the sub() method to replace URLs with the specified replacement
    text_without_urls = url_pattern.sub(replacement_text, text)
    return text_without_urls

# Example:
input_text = "Visit on GeeksforGeeks Website: https://www.geeksforgeeks.org/"
output_text = remove_urls(input_text)

print("Original Text:")
print(input_text)
print("\nText with URLs Removed:")
print(output_text)

Nonstandard spellings:

On social media, people often write words with what are technically spelling mistakes. We can
use TextBlob, which has some built-in spelling-correction capabilities:
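For example (the exact corrections are whatever TextBlob's spell checker proposes):

from textblob import TextBlob

corrected = TextBlob("I havv a terible headachee").correct()
print(corrected)  # e.g., "I have a terrible headache"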

Customer Support on Social Channels:

As the volume of complaints and issues raised by users and customers grew, brands created
dedicated handles and pages to handle support traffic. Twitter and Facebook have launched
various features to support brands, and most customer relationship management (CRM) tools
support customer service on social channels. A brand can connect its social channels to the
CRM tool and use the tool to respond to inbound messages.

Owing to the public nature of these conversations, brands are obligated to respond quickly.
However, brands’ support pages receive a lot of traffic. Some of it consists of genuine
questions, grievances, and requests, popularly known as “actionable conversations,” which
customer support teams should act on quickly. On the other hand, a large portion of the traffic
is simply noise: promos, coupons, offers, opinions, troll messages, etc. Customer support
teams don’t want to respond to noise. Ideally, they want only actionable messages to be
converted into tickets in their CRM tools.

We can build a model that separates actionable messages from noise, as sketched after the list
below. The pipeline will be very similar to the ones we've seen before:

1. Collect a labeled dataset

2. Clean it

3. Pre-process it

4. Tokenize it

5. Represent it

6. Train a model

7. Test model

8. Put it in production
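A minimal sketch of such a classifier with scikit-learn follows; the labeled examples are illustrative assumptions, and a real system needs far more data plus the SMTD-specific cleaning and tokenization discussed earlier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "My order #1234 arrived broken, need a replacement",   # actionable
    "App keeps crashing after the update, please help",    # actionable
    "Use code SAVE20 for 20% off today only!!!",            # noise
    "This brand is the best, love you guys <3",             # noise
]
labels = ["actionable", "actionable", "noise", "noise"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["my refund still hasn't arrived, it's been 3 weeks"]))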

Memes and Fake News:

Over time, some users have evolved to behave outside community norms; this is known as
“trolling.” A large portion of posts on social platforms consists of controversial content such
as trolls, memes, internet slang, and fake news.

Identifying Memes:

It is important to identify content that could be a meme that’s heckling others, is otherwise
offensive, or violates other group or platform rules. There are two primary ways in which a
meme can be identified:


Content-based meme identification relies on the content of the meme to match it with other
memes of similar patterns. For example, if "This is Bill. Be like Bill" has emerged as a meme
in a community, we can extract the text and use similarity metrics like Jaccard distance to
identify posts with similar patterns, such as "This is PersonX. Be like PersonX."

Behavior-based meme identification, on the other hand, relies on the activity surrounding a
post. By analyzing metrics such as the number of shares, comments, and likes, viral content
can be identified.

Fake News:

The number of incidents related to fake news has risen significantly along with the rise in
users on social platforms. Various media houses and content moderators are actively working
on detecting and weeding out such fake news. There are some principled approaches that can
be used to tackle this menace:

Fact verification using external data sources:

Fact verification deals with validating various facts in a news article. Given a sentence and a
set of facts, a system needs to find out if the set of facts supports the claim or not. Amazon
Research at Cambridge created a curated dataset to deal with such cases of misinformation
present in natural text.

Classifying fake versus real:

A simple setup for this problem would be to build a parallel data corpus with instances of
fake and real news excerpts and classify them as real or fake. Researchers from Harvard
recently developed a system [42] to identify which text is written by humans and which text
is generated by machines (and therefore could be fake). This system relies on statistical
properties of the text to make that judgment.
Health care
Healthcare as an industry encompasses both goods (i.e., medicines and equipment)
and services (consultation or diagnostic testing) for curative, preventive, palliative, and
rehabilitative care.
Healthcare is a significant part of advanced economies' GDP, often exceeding 10%.
Automating and optimizing processes and systems can provide significant benefits. Natural
Language Processing (NLP) can help in various applications, such as clinical research and
revenue cycle management. In Figure 5-11, the blue cells represent current applications, the
purple cells are emerging applications being tested, and the red cells are next-generation
applications.

Figure 5-11. NLP in healthcare use cases by Chilmark Research

Healthcare utilizes Natural Language Processing (NLP) to enhance health outcomes by
analyzing medical records and billing and by ensuring drug safety, making use of large
amounts of unstructured text.

Health and Medical Records


A large proportion of health and medical data is collected and stored in
unstructured text formats. This includes medical notes, prescriptions, and audio transcripts,
as well as pathology and radiology reports.
Data storage lacks standardization, making the data difficult to search, organize, study, and
understand in its raw form. Natural Language Processing (NLP) can improve data analysis
and automate workflows, such as building automated question-answering systems, to reduce
the time spent retrieving patient information.

Patient Prioritization and Billing


NLP techniques can optimize physician notes by understanding urgency, prioritizing
health procedures, and automating processes, while also identifying medical codes and
facilitating billing by parsing and extracting information from unstructured notes.
Pharmacovigilance
Pharmacovigilance is the process of ensuring the safety of a drug by collecting,
detecting, and monitoring adverse reactions. It is crucial to prevent unintended or noxious
effects in medical procedures. With social media's increasing use, monitoring and identifying
side effects is essential. Natural Language Processing (NLP) techniques can also aid in
pharmacovigilance in medical records.

Clinical Decision Support Systems


Decision support systems aid medical workers in healthcare decisions like screening,
diagnosis, treatment, and monitoring. They use text data such as electronic health records,
laboratory results, and operative notes, and NLP is utilized to improve them.

Health Assistants
Health assistants and chatbots, like Woebot, can enhance patient and caregiver
experiences by utilizing expert systems and NLP. Woebot, for instance, helps patients with
mental illness and depression by combining NLP with cognitive therapy.

Figure 5-12. A Woebot conversation


Assistants can diagnose medical issues by assessing patients' symptoms, booking
appointments with doctors, and utilizing existing diagnostic frameworks to build systems
tailored to user needs.

Electronic Health Records


The growing electronic storage of clinical and healthcare data has resulted in a
massive explosion of medical records, making it difficult for doctors and staff to access,
leading to information overload, errors, delays, and patient safety concerns.

HARVEST: Longitudinal report understanding


HARVEST, a tool developed by Columbia University, is a solution to information
overload in clinical information systems. It parses medical data, making it easy to analyze
and can be integrated into any medical system. HARVEST provides a timeline of each visit to
the clinic or hospital, along with a word cloud of important medical conditions for the patient.
Users can drill down to detailed notes and history if needed, and summaries of each report
provide a quick overview of a patient's medical history. This tool is not just a reformatted
novelty; it is highly useful for providing real-time, informative snapshots of a patient's
medical status to doctors, general medical staff, and caregivers.
The HealthTermFinder system is a tool that uses a named entity recognizer to find
healthcare-related terms related to a patient's medical history. These terms are mapped to the
Unified Medical Language System (UMLS) semantic group and visualized in a word cloud.
The system helps medical professionals identify root issues and avoid biased misdiagnoses. A
study at New York Presbyterian Hospital found that over 75% of participants would use
HARVEST regularly in the future. HARVEST provides understandable summaries and
conclusions by collating a patient's history of healthcare issues across their lifetime. Its
unique selling point is its ability to mine, extract, and visually present content at a macro
level, regardless of where and by whom a patient might have been seen in the hospital. NLP
techniques play a key role in analytics and information visualization tools when the
underlying knowledge base is unstructured text, such as in EHRs.

Question answering for health


To enhance user experience in healthcare, a question-answering (QA) system can be
built on top of health records to address healthcare-specific questions. These questions can
concern medication dosage, test results, and lab test confirmations. The QA system in healthcare can
use a dataset called emrQA, created by IBM Research Center, MIT, and UIUC. This dataset
extracts correct answers from past health records, to answer questions like "Has the patient
ever had an abnormal BMI?". Building the right dataset is crucial for solving NLP problems
in healthcare.
Figure 5-13. Example of question-answer pair in emrQA

A general question-answering dataset creation framework involves collecting
domain-specific questions and normalizing them, mapping question templates with expert
domain knowledge, and assigning logical forms to them. Existing annotations and the
information collected from these steps are used to create a range of question-and-answer
pairs, reducing the manual effort needed to create a question-answering (QA) dataset. For
example, the Veterans Administration's emrQA process involved polling physicians to gather
prototypical questions, which were normalized to around 600. These prototypical questions
were then logically mapped to an i2b2 dataset, which is already expertly annotated with a
range of fine-grained information. The process is closely supervised by a set of physicians to
ensure the quality of the dataset. To build a baseline QA system, neural seq-to-seq models
and heuristic-based models were used. The emrQA team divided the dataset into two sets:
emrQL-1 and emrQL-2. Heuristic models performed better than neural models for emrQL-1,
while neural models performed better for emrQL-2. This is an interesting use case on how to
build complex datasets using heuristics, mapping, and other simpler annotated datasets,
which can be applied to other problems beyond processing health records that require
generating a QA-like dataset.

Outcome prediction and best practices


The study focuses on predicting health outcomes using electronic health records (EHRs), a
set of attributes that explain the consequences of a disease for a patient. Health outcomes are
crucial in measuring the efficacy of different treatments and are a key focus of scalable and
accurate deep learning with EHRs. Scalability is essential in healthcare, as data collected
from different hospitals or departments can vary. Accuracy is crucial to avoid false alarms,
since people's lives are on the line.

To handle the nuances and complexity of EHRs, an open Fast Healthcare Interoperability
Resources (FHIR) standard was created, which uses a standardized format with unique
locators for consistency and reliability. The data is fed into a model based on Recurrent
Neural Networks (RNNs), which predicts the outcome from the start of the record to its end.
The model was evaluated on various health outcomes and achieved an AUC of 0.86 for
predicting prolonged hospital stays, 0.77 for unexpected readmissions, and 0.95 for predicting
patient mortality. Interpretability is essential in healthcare, as models should pinpoint why
they suggest a particular outcome. Attention, a concept in deep learning, is used to
understand the most important data points and incidents for an outcome.

Google AI team has identified best practices for building ML models for healthcare, covering
the entire machine learning life cycle from problem definition to data collection and
validation. These suggestions are applicable to Natural Language Processing (NLP),
computer vision, and structured data problems. While these techniques focus on managing
physical well-being, mental well-being is much harder to quantify.

Mental Healthcare Monitoring


The rapid pace of economic and technological change has led to a significant number
of people experiencing mental health issues, with over 790 million people affected globally.
The National Institutes of Health estimates that one in four Americans is likely to be affected
by mental health conditions in a given year. In 2017, over 47,000 Americans committed
suicide, and this number has been increasing rapidly. With social media usage at an all-time
high, it is possible to use signals from social media to track the emotional state and mental
balance of individuals and across various demographic groups, including age and gender.
Glen Coppersmith et al.'s study focuses on using social media to identify individuals at risk
for suicide, aiming to develop an early warning system and identify the root causes of the
issues.
Each user’s tweets were analyzed with the following perspectives:
• Is the user’s statement about attempting to take their life apparently genuine?
• Is the user speaking about their own suicide attempt?
• Is the suicide attempt localizable in time?

These questions were annotated for a few example tweets.

The study analyzed Twitter data to classify tweets based on their content. Among the example
tweets, the first two describe genuine suicide attempts, the bottom two are sarcastic or false
statements, and the middle two mention an explicit date for the attempt. The data was
normalized and cleaned, and character-level models were used to classify tweets. Emotional
states were estimated using hashtags; for example, tweets containing #anger but not #sarcasm
or #jk were treated as carrying genuine emotional content.
The models effectively identified 70% of individuals highly likely to attempt suicide, with
only 10% false alarms.
Figure 5-14 shows a confusion matrix detailing the misclassification of the various emotions
that were modeled.

Figure 5-14. Confusion matrix for emotion classification

Medical Information Extraction and Analysis


Health records are essential for building applications, and one of the first steps is to
extract medical entities and relations from them. Medical information extraction (IE) helps
identify clinical syndromes, medical conditions, medication, dosage, strength, and common
biomedical concepts from health records, radiology reports, discharge summaries, nursing
documentation, and medical education documents.
Amazon Comprehend Medical, part of AWS's suite, supports popular NLP tasks like
key phrase extraction, sentiment and syntax analysis, and language and entity recognition in
the cloud. It helps process medical data, including medical named entity and relationship
extraction and medical ontology linking. To test Comprehend Medical, health records in the
FHIR format are taken as input; here, a sample electronic health record from a hypothetical
Good Health Clinic is used. As a starting input, let’s consider a small sequence of this medical
record:
Good Health Clinic Consultation Note Robert Dolin MD Robert Dolin MD Good
Health Clinic Henry Levin the 7th Robert Dolin MD History of Present Illness Henry Levin,
the 7th is a 67 year old male referred for further asthma management. Onset of asthma in his
twenties teens. He was hospitalized twice last year, and already twice this year. He has not
been able to be weaned off steroids for the past several months. Past Medical History Asthma
Hypertension (see HTN.cda for details) Osteoarthritis, right knee Medications Theodur
200mg BID Proventil inhaler 2puffs QID PRN Prednisone 20mg qd HCTZ 25mg qd Theodur
200mg BID Proventil inhaler 2puffs QID PRN Prednisone 20mg qd HCTZ 25mg qd

When we provide this as input to Comprehend Medical, we get the output shown in Figure 5-15.
Figure 5-15. Comprehend Medical output for the FHIR record example

The extracted output captures clinic and doctor details, diagnoses, medications, frequency,
dosage, and route. Access to these features is provided programmatically through the AWS
boto3 library, as sketched below.
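A minimal sketch, assuming AWS credentials are configured and the boto3 package is installed; the record text is abbreviated from the example above, and the region name is an illustrative assumption.

import boto3

client = boto3.client("comprehendmedical", region_name="us-east-1")

record_text = (
    "Henry Levin, the 7th is a 67 year old male referred for further asthma "
    "management. Medications: Theodur 200mg BID, Prednisone 20mg qd, HCTZ 25mg qd."
)

response = client.detect_entities_v2(Text=record_text)
for entity in response["Entities"]:
    # Each entity carries a category (e.g., MEDICATION, MEDICAL_CONDITION),
    # a type, a confidence score, and attributes such as dosage or frequency.
    print(entity["Category"], entity["Type"], entity["Text"], round(entity["Score"], 2))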
Cloud APIs and libraries can be useful for building medical information extraction, but for
more specialized requirements, BioBERT is recommended. BioBERT is a bidirectional encoder
model adapted to biomedical texts, addressing the differences in word distributions between
general English and medical records. The domain adaptation phase initializes the model
weights from a standard BERT model and continues pre-training on biomedical texts,
including articles from PubMed. Figure 5-16 shows the process of pre-training and fine-tuning
BioBERT.
BioBERT, whose model and weights are open sourced, is used for medical named entity
recognition, relation extraction, and question answering on healthcare texts. It outperforms
vanilla BERT and other state-of-the-art techniques on these tasks and can be adapted to
specific tasks and datasets. The model and weights can be found on GitHub. As we have seen,
NLP can be used in various healthcare applications, from health records to social media
monitoring for mental health issues, and similar techniques extend to other domains such as
finance and law.
Figure 5-16. BioBERT pre-training and fine-tuning
