
International Journal of Applied Information Systems (IJAIS) – ISSN : 2249-0868

Foundation of Computer Science FCS, New York, USA


Volume 7– No. 2, April 2014 – www.ijais.org

Tokenization and Filtering Process in RapidMiner


Tanu Verma, Student, CSE, ITM University
Renu, Student, CSE, ITM University
Deepti Gaur, Associate Professor, CSE, ITM University

ABSTRACT
Text mining is defined as a knowledge-intensive process in which a user interacts with a document collection. As in data mining [2, 4, 9], text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns. A key element of text mining is its focus on the document collection. A document collection can be any grouping of text-based documents. Most text mining solutions are aimed at discovering patterns across very large document collections. The number of documents can range from many thousands to millions. In this paper, we show how text mining is implemented in RapidMiner.

Keywords
Text mining, Tokenize, Filtering, Stop words, Stemming.

1. INTRODUCTION
Text mining [11, 12] is the analysis of data contained in natural language text. Text mining can help an organization derive potentially valuable business insights from text-based content such as Word documents, electronic mail, and postings on social media streams. Mining unstructured data with natural language processing (NLP), statistical modeling, and machine learning techniques can be a challenge because natural language text is often inconsistent. It suffers from ambiguities caused by inconsistent syntax and semantics.

2. TEXT MINING
In this paper the Process Documents from Files operator is used. It generates word vectors from a text collection stored in multiple files. The parameters used in this operator are:
• text directories: In this list arbitrary directories can be specified. All files that match the given file ending will be loaded and assigned to the class value provided with the directory.
• file pattern: A pattern for the files to be read.
• extract text only: If checked, structural information like XML or HTML tags will be ignored and discarded.
• use file extension as type: If checked, the type of the files will be determined by their extensions. Unknown extensions will be treated as text files.
• content type: The content type of the input texts.
• encoding: The encoding used for reading or writing files.

The JISC and the National Centre for Text Mining explain how “text mining involves the application of techniques from areas such as information retrieval, data mining, information extraction and natural language processing. All of these various stages of a text-mining process can be combined into a single workflow”.
• Information retrieval (IR) systems match a user’s query to documents in a database or collection. The first step in the text mining process is to find the body of documents that are relevant to the research question(s).
• Natural language processing (NLP) analyzes the text in structures based on human speech and allows the computer to perform a grammatical analysis of a sentence to “read” the text.
• Information extraction (IE) [3, 5, 6, 7] involves structuring the data that the NLP system generates.
• Data mining (DM) [1, 8, 13] is the process of identifying patterns in large sets of data in order to find new knowledge.

Fig. 1. Processing document from files in RapidMiner

Figure 1 shows the ‘Process Documents from Files’ operator in RapidMiner. In the parameters panel on the right-hand side there is a field ‘text directories’, where we enter the location of the text files that we want to tokenize and filter. The text files should be in a folder. Fig. 2 shows the insertion of the text file that is to be tokenized. There are two columns: the first is the class name (any class name can be given) and the second is the directory (which is selected from its location on disk).

3. TOKENIZE
Tokenization is the process of breaking a stream of text up into phrases, words, symbols, or other meaningful elements called tokens. The goal of tokenization is the exploration of the words in a sentence. Textual data is at first only a block of characters. Information retrieval requires the words of the data set, so we need a parser that performs the tokenization of the documents. This may seem trivial, as the text is already stored in machine-readable formats, but some problems remain, e.g., the removal of punctuation marks as well as other characters like brackets and hyphens. The main use of tokenization is the identification of meaningful keywords. Another problem is abbreviations and acronyms, which need to be transformed into a standard form.
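As a rough illustration of the tokenization described in Section 3, the Python sketch below splits text on non-letter characters (which discards punctuation marks, brackets, and hyphens), splits on a set of specified characters, and filters tokens by length as done later in this paper. It is only a sketch under stated assumptions and not RapidMiner's implementation: the function names, the sample sentence, and the length bounds are hypothetical and chosen for illustration.

```python
import re


def tokenize_non_letters(text):
    """Split at every non-letter character and keep only the letter runs."""
    # [^\W\d_] matches a word character that is neither a digit nor "_",
    # i.e. a letter; punctuation, brackets and hyphens act as split points.
    return re.findall(r"[^\W\d_]+", text)


def tokenize_on_characters(text, characters=".:"):
    """Split at each of the given characters, dropping empty pieces."""
    # Assumes `characters` is a non-empty string of split characters.
    pattern = "[" + re.escape(characters) + "]"
    pieces = (piece.strip() for piece in re.split(pattern, text))
    return [piece for piece in pieces if piece]


def filter_by_length(tokens, min_chars, max_chars):
    """Keep tokens whose character count lies in [min_chars, max_chars]."""
    return [t for t in tokens if min_chars <= len(t) <= max_chars]


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the paper's data set.
    sample = "Text mining (in RapidMiner) is well-known. It finds patterns: fast!"
    words = tokenize_non_letters(sample)
    print(words)
    print(filter_by_length(words, 4, 25))
    print(tokenize_on_characters(sample))
```

Splitting on non-letters is the simplest choice for keyword extraction, but it also breaks hyphenated words and abbreviations apart, which is one reason such forms may need separate normalization.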


Fig. 2. Insertion of text file to process

• Tokenize: This operator splits the text of a document into a sequence of tokens. There are several options to define the splitting points:
  • mode: This selects the tokenization mode. Depending on the mode, split points are chosen differently. The range is non letters, specify characters, regular expression; the default value is non letters.
  • characters: The incoming document will be split into tokens on each of these characters. For example, enter a '.' for splitting into sentences. The range is string and the default value is '.:'.
  • expression: This regular expression defines the splitting point. The range is string.

Stopword Elimination: The most common words that are unlikely to help text mining, such as prepositions, articles, and pronouns, can be considered stopwords. Every text document contains these words, yet they are not necessary for text mining applications, so all of them are eliminated. Any group of words can be chosen for this purpose. Eliminating them also reduces the text data and helps improve system performance. Examples: “a”, “is”, “you”, “an”.

Stemming: Stemming, which is related to but cruder than lemmatisation, is a technique for reducing words to their stem, base, or root form. Many words in the English language can be reduced to their base form or stem, e.g., like, liking, likely, and unlike all belong to like. Moreover, names can be transformed into a root by removing the “s”; e.g., during the stemming process the variation “Stem’s” in a sentence is reduced to “Stem”, and this removal may lead to an incorrect stem or root. However, if these stems are not used for human interaction, they do not have to be a problem for the stemming process. The stem is still useful, because all other inflections of the root are transformed into the same root.

4. FILTERING
Filtering provides flexibility when designing data sources and mining structures, so that a single mining structure can be created based on a comprehensive data source view. For training and testing different models, filters can be created to use only a part of that data, with no need to build a different structure for each subset of data. Available filters include filter by length, content, English, dictionary, and region.
In this paper tokens are filtered by length. This operator filters tokens based on their length (i.e., the number of characters they contain). The parameters used in this operator are:
• min chars: The minimal number of characters that a token must contain to be considered.
• max chars: The maximal number of characters that a token may contain to be considered.

Fig. 3. Tokenization and Filtering in RapidMiner

Figure 3 shows the tokenization and filtering in RapidMiner. Here we use two operators, “Tokenize” and “Filter Tokens by Length”.

5. RESULT AND ANALYSIS
Now we run the process and get the output. Fig. 3 shows the output in RapidMiner. The list of tokens, i.e., words, phrases, symbols, or other meaningful elements, becomes the input for further processing such as parsing or text mining. The document occurrences and total occurrences of each token are given in the result summary. In this paper, we filter the tokens by length.

Fig. 4. Result of Text Mining

6. CONCLUSION AND FUTURE SCOPE
Text mining is a growing application field and an area of research, using techniques from well-established scientific fields such as data mining, natural language processing, case-based reasoning, statistics [10], machine learning [5, 8], information retrieval [3], and knowledge management. In this paper, we have presented an approach that uses an


automatically learned information extraction system to extract a structured database.

7. REFERENCES
[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, in Proceedings of the 20th International Conference on Very Large Databases (VLDB-94), Chile, Sept. 1994.
[2] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”.
[3] R. Baeza-Yates and B. Ribeiro-Neto, “Modern Information Retrieval”, ACM Press, New York, 1999.
[4] R. Agrawal, T. Imielinski and A. Swami, “Database mining: A performance perspective”, IEEE Transactions on Knowledge and Data Eng., vol. 5, no. 6.
[5] M. E. Califf, editor, Papers from the Sixteenth National Conference on Artificial Intelligence (AAAI-99) Workshop on Machine Learning for Information Extraction, Orlando, FL, 1999. AAAI Press.
[6] M. E. Califf and R. J. Mooney, “Relational learning of pattern-match rules for information extraction”, in Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), pages 328–334, Orlando, FL, July 1999.
[7] C. Cardie, “Empirical methods in information extraction”, AI Magazine, 18(4):65–79, 1997.
[8] C. Cardie and R. J. Mooney, “Machine learning and natural language (Introduction to special issue on natural language learning)”, Machine Learning, 34:5–9, 1999.
[9] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 722.
[10] Y. M. Yang, “An evaluation of statistical approach to text categorization”, Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997.
[11] C. Choi and Y. Park, “R&D proposal screening system based on text-mining approach”, Int. J. Technol. Intell. Plan., vol. 2, no. 1, pp. 61–72, 2006.
[12] H. C. Yang and C. H. Lee, “A text mining approach for automatic construction of hypertexts”, Expert Syst. Appl., vol. 29, no. 4, pp. 723–734, 2005.
[13] R. Agrawal, T. Imielinski and A. Swami, “Mining association rules between sets of items in large databases”, Washington, DC: SIGMOD, 1993, pp. 207–216.
