
Business Analytics & Text Mining

Modeling Using Python


INTRODUCTION
Dr. GAURAV DIXIT
DEPARTMENT OF MANAGEMENT STUDIES

1
INTRODUCTION

• Understanding text characteristics


– For a large enough collection of documents, the tabular/matrix layout
would be too sparse
• Any individual document will use only a tiny subset of the potential set of words in
a dictionary
– Techniques used to process text expect sparse data
• Actual implementations store only the positive cell values
• Tabular/matrix/spreadsheet layout is used mainly for conceptual clarity
– All the values in a text mining spreadsheet are non-negative (zero or positive)
• Text mining programs use this characteristic to simplify processing
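
The sparse storage idea above can be sketched as follows. This is a minimal illustration in plain Python, not the way any particular text mining library does it; the three sample documents and the names `sparse_rows` and `dense_row` are assumptions made for the example.

```python
from collections import Counter

# Three made-up documents (illustrative only).
docs = [
    "the company reported a profit",
    "the team forwarded the mail",
    "spam mail detected",
]

# Vocabulary: every distinct word in the collection, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Sparse form: per document, keep only the words that occur (positive counts).
sparse_rows = [Counter(doc.split()) for doc in docs]

def dense_row(row):
    """Expand one sparse row into the full conceptual table row."""
    return [row.get(word, 0) for word in vocab]

# The dense table exists here only to compare sizes.
dense = [dense_row(row) for row in sparse_rows]

total_cells = len(docs) * len(vocab)
stored_cells = sum(len(row) for row in sparse_rows)
print(f"cells in dense table: {total_cells}, cells actually stored: {stored_cells}")
```

Even in this tiny collection, fewer than half of the table cells need to be stored, and the gap widens quickly as the vocabulary grows.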

2
INTRODUCTION

• Understanding text characteristics


– Missing values in datasets are a big problem in data mining
– For text, missing values are a nonissue
• All the cells are completely filled, since we only have to indicate the presence or
absence of words

• Numerous variations of the simple tabular (matrix or spreadsheet) layout for text
– Have been suggested by researchers

3
INTRODUCTION

• Predictive text analytics consists of


– Primarily two types of tasks
• Prediction
• Classification
– Another task of ‘clustering’ is required in certain scenarios
• For example, when the labels required for text categorization or document
classification are not already known
– Another task of ‘extraction’ of information is required in certain
scenarios

4
INTRODUCTION

• Text mining problems


– Document Classification or Text categorization
• Business Problem: Folder/File Management (online or device)
– Documents are organized into folders, one folder for each topic
– Analytics Component:
» When a new document is presented,
» Task is to place this document in the appropriate folders
– Treated as a set of binary classification problems, one per folder
» A document can belong to multiple categories, so each folder gets its own in/out decision
• Can be considered as a form of indexing
• Examples: automatically forwarding e-mail to the appropriate team/department,
detecting spam mail, predicting future stock movements based on pre- or post-event news
articles and financial data

5
[Figure: Document Classification or Text Categorization. A new document is passed to three binary classifiers (Home vs ~Home, Finance vs ~Finance, Office vs ~Office), each deciding whether the document goes into its folder (Home, Finance, Office).]
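
The one-classifier-per-folder setup can be sketched in a few lines of Python. This is only a toy illustration: the training documents, the stop-word list, and the word-weight scorer (the share of in-folder documents containing a word minus the share of out-of-folder documents containing it) are all assumptions made for the example, not a method prescribed by the lecture.

```python
STOP_WORDS = {"and", "for", "the"}  # tiny illustrative stop list (assumption)

# Made-up training documents; a document may carry several folder labels.
train = [
    ("grocery list and rent payment for the house", {"Home"}),
    ("family dinner plan for the weekend", {"Home"}),
    ("quarterly budget and loan interest statement", {"Finance"}),
    ("home loan interest payment statement", {"Home", "Finance"}),
    ("meeting agenda for the project team", {"Office"}),
    ("project budget review meeting", {"Office", "Finance"}),
]

def tokens(text):
    """Set of lower-cased words, minus stop words (presence/absence only)."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def train_binary(docs, category):
    """Word weights for one 'category vs ~category' classifier.

    Assumes the category has at least one positive and one negative example.
    """
    pos = [tokens(t) for t, labels in docs if category in labels]
    neg = [tokens(t) for t, labels in docs if category not in labels]
    vocab = set().union(*pos, *neg)
    return {
        w: sum(w in d for d in pos) / len(pos) - sum(w in d for d in neg) / len(neg)
        for w in vocab
    }

def classify(text, models):
    """Run every binary classifier; keep each folder whose score is positive."""
    words = tokens(text)
    return {
        category
        for category, weights in models.items()
        if sum(weights.get(w, 0.0) for w in words) > 0
    }

models = {c: train_binary(train, c) for c in ("Home", "Finance", "Office")}
assigned = classify("loan statement for the project meeting", models)
print(sorted(assigned))
```

Because each folder is decided independently, the new document can land in more than one folder, or in none.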

6
INTRODUCTION

• Text mining problems


– Information Retrieval
• Business Problem: Document matcher (online or device)
– Given a large collection of documents, finding relevant documents
– Analytics Component
» Task is to retrieve the relevant documents by finding the best matches between the input document
and the documents in the collection
» New document is compared to all the other rows (documents), and the most similar rows and
their associated documents are the answers
• Similar to a search engine function
– A few words are presented, and these words are matched to others
– Best matches are presented as the responses
• Based on measuring similarity as in nearest-neighbor methods
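
The matching step can be sketched with cosine similarity over word counts, one common similarity measure for this kind of nearest-neighbor retrieval. The lecture does not name a specific measure, so cosine similarity, the sample collection, and the function names below are assumptions for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, collection, top_k=2):
    """Compare the query to every document and return the best matches."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), doc) for doc in collection]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

# Made-up document collection (illustrative only).
collection = [
    "printer driver installation guide",
    "annual revenue and profit report",
    "guide to installing a network printer",
    "holiday schedule for the office",
]

matches = retrieve("printer installation", collection)
for doc in matches:
    print(doc)
```

Documents that share no words with the query score zero and are dropped, so the result list may be shorter than `top_k`.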

7
[Figure: Retrieving matched documents. An input document and the document collection are fed to the text mining document matcher, which returns the matched documents.]

8
INTRODUCTION

• Text mining problems


– Clustering and Organizing Documents
• Business Problem: Unknown document structure (online or device)
– Given a collection of documents with no known structure, find a set of folders such that each folder
holds similar documents
– Analytics Component
» Task is to cluster the similar documents in the collection and assign labels to each cluster
• Examples: learn about the categories and types of help-desk complaints
– Might lead to identification of complaints which have no existing solution
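
A very small sketch of grouping unlabeled documents is shown below. It uses a single-pass "leader" clustering with Jaccard similarity on word sets, chosen only because it fits in a few lines; the lecture does not specify a clustering algorithm, and the help-desk complaints, the threshold of 0.25, and the labeling rule (the two most frequent words in a cluster) are assumptions for the example.

```python
from collections import Counter

def words(text):
    return set(text.lower().split())

def jaccard(a, b):
    """Shared words divided by all words across the two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def leader_cluster(docs, threshold=0.25):
    """Single-pass clustering: join the most similar cluster leader or start a new cluster."""
    clusters = []  # each cluster: {"leader": word set of its first doc, "members": [...]}
    for doc in docs:
        w = words(doc)
        best = max(clusters, key=lambda c: jaccard(w, c["leader"]), default=None)
        if best is not None and jaccard(w, best["leader"]) >= threshold:
            best["members"].append(doc)
        else:
            clusters.append({"leader": w, "members": [doc]})
    return clusters

def label(cluster, n_words=2):
    """Label a cluster with its most frequent words."""
    counts = Counter(w for doc in cluster["members"] for w in doc.lower().split())
    return " ".join(word for word, _ in counts.most_common(n_words))

# Made-up help-desk complaints (illustrative only).
complaints = [
    "cannot reset my password",
    "password reset link not working",
    "invoice shows wrong amount",
    "wrong amount charged on invoice",
    "app crashes on startup",
]

clusters = leader_cluster(complaints)
for cluster in clusters:
    print(label(cluster), "->", cluster["members"])
```

A cluster with a single member, such as the startup crash here, is the kind of group that might point to a complaint with no existing solution.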

9
[Figure: Organizing documents into groups. A document collection is passed to the document organizer, which splits it into Group 1 through Group 5.]

10
INTRODUCTION

• Text mining problems


– Information Extraction
• Business Problem: Populating database from unstructured data
– Given a collection of documents, automatically filling the relevant values associated with certain defined
variables in a database
– Analytics Component
» Task is to extract data from an unstructured format based on words which can be higher-level
concepts or real-valued variables
» The variable that is being measured will not have a fixed position in the text and may not be
described in the same way in different documents
• Examples: extracting the sales volumes and industry codes from company
documents

11
[Figure: Extracting information from a document. Input document text: "… on revenues of twenty five crore rupees, the company reported a profit of 45 lakhs for the fiscal year". Resulting spreadsheet row: Revenue = 250000000, Profit = 4500000.]
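
The extraction shown in the figure can be approximated with a regular expression and a small number-word parser. This is a rough sketch under strong assumptions: it only handles phrases of the form "<field> of <amount> <lakh|crore>", amounts below one hundred, and the helper names `parse_amount` and `extract_field` are invented for the example.

```python
import re

UNITS = {"lakh": 100_000, "lakhs": 100_000, "crore": 10_000_000, "crores": 10_000_000}
SMALL = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve thirteen "
    "fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}

def parse_amount(phrase):
    """Turn 'twenty five' or '45' into an integer; return None if unparseable."""
    total = 0
    for token in phrase.lower().replace("-", " ").split():
        if token.isdigit():
            total += int(token)
        elif token in SMALL:
            total += SMALL[token]
        elif token in TENS:
            total += TENS[token]
        else:
            return None
    return total

def extract_field(text, keyword):
    """Find '<keyword>... of <amount> <unit>' and return the value in rupees."""
    pattern = rf"{re.escape(keyword)}\w*\s+of\s+([\w\s-]+?)\s+(lakhs?|crores?)\b"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None
    amount = parse_amount(match.group(1))
    if amount is None:
        return None
    return amount * UNITS[match.group(2).lower()]

text = ("on revenues of twenty five crore rupees, the company "
        "reported a profit of 45 lakhs for the fiscal year")

record = {field: extract_field(text, field) for field in ("revenue", "profit")}
print(record)
```

The pattern does not depend on where the value appears in the sentence, but it would miss differently worded statements such as "earned twenty five crore in revenue", which is exactly the difficulty the slide describes.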

12
INTRODUCTION

• Prediction and Evaluation


– Text mining modeling process is similar to data mining modeling
process
• Process is about building models based on prior cases (from training partition)
• Then the built model is used to predict the unseen cases (from test partition)
– Evaluation of the model success is
• Based on its performance on the test partition which is not part of the model
building process
– This mechanism works well for most of the text mining scenarios
• However, there might be a few special scenarios
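
The build-then-evaluate cycle described on this slide can be sketched as below. The labeled messages, the fixed 6/2 split, and the toy word-count scorer are assumptions chosen to keep the example short and reproducible; they stand in for whatever model and partitioning scheme is actually used.

```python
from collections import Counter

# Made-up labeled documents (illustrative only).
labeled = [
    ("win a free prize now", "spam"),
    ("meeting moved to monday", "ham"),
    ("free offer claim your prize", "spam"),
    ("project report due monday", "ham"),
    ("claim your free gift now", "spam"),
    ("monday meeting agenda attached", "ham"),
    ("free prize offer", "spam"),
    ("report for the project meeting", "ham"),
]

# Assumption: a fixed split keeps the example reproducible;
# in practice the partitions are usually drawn at random.
train, test = labeled[:6], labeled[6:]

def build_model(cases):
    """Count how often each word occurs under each label (training data only)."""
    model = {}
    for text, label in cases:
        model.setdefault(label, Counter()).update(text.split())
    return model

def predict(model, text):
    """Pick the label whose training word counts best cover the document."""
    return max(model, key=lambda label: sum(model[label][w] for w in text.split()))

model = build_model(train)  # the test partition plays no part in model building

correct = sum(predict(model, text) == label for text, label in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f} on {len(test)} unseen documents")
```

With only two test documents the accuracy figure means very little; the point is that the model is scored exclusively on cases it never saw during training.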

13
Key References

• Fundamentals of Predictive Text Mining


– By Sholom M. Weiss, Nitin Indurkhya, & Tong Zhang (2015)
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
– By Wes McKinney (2017)

14
Thanks…

15
