IRaMuTeQ Tutorial Translated To English - 17.03.2016
IRAMUTEQ is free software, licensed under the GNU GPL (v2), that performs statistical
analysis of text corpora and of tables of individuals by words. It is built on the R
software and the Python language.
To install it, first download and install R from www.r-project.org (without R, no
analysis can run); then download and install IRAMUTEQ.
The IRAMUTEQ kit is available at www.laccos.com.br. Under the “Novidades” link, click “Clique
AQUI” and download the kit, which includes the software, references and a tutorial.
2- Install the R software, which, alongside the Python language, is the base system
for IRAMUTEQ
3- Update R packages
Open R. Choose “Packages”, then “Update Packages” (Picture 2), and pick the
country or state nearest to you. Wait a few seconds (it can take longer depending on the
computer and the Internet connection) and select OK when asked to update the items
highlighted in blue. Once the update finishes, close R.
Click the IRAMUTEQ icon on your desktop. You must be connected to the Internet for
this step. Press OK on the message about the installation status (Picture 3)
and wait for the R software files to update.
Picture 5- Verification of libraries installation on IRAMUTEQ
Introduction
This software offers different kinds of text analysis, from simple ones, such
as basic lexicography (lemmatization and word frequency), to more
complex ones, such as descending hierarchical classification, post-hoc
correspondence factor analysis and similarity analysis. The vocabulary distribution is
presented in a clear and comprehensible way, with graphical representations derived
from the lexicographic analysis.
These analyses can be performed on texts about a given theme (a text
corpus) gathered in one text file, or on data from spreadsheets (matrices with
individuals in rows and words in columns), such as the datasets produced by free
evocation tests.
Text analysis makes it possible to explore transcribed oral material as well as texts,
interviews, documents, etc., produced individually or collectively. It is also useful for
comparing different productions according to variables that describe who
produced each text. To understand text analysis, some concepts need to be clarified
first:
The concepts of Corpus, text and Text segments
Corpus
The corpus is created by the researcher and is the set of units to be analyzed. For
instance, if a researcher wants to analyze beauty-related news items published in a
magazine over 5 years, the set of these items is the corpus.
Text
These units are defined by the researcher, depending on the research. In the previous
example, each beauty-related news item is a text. If the analysis is applied to a set
of interviews, each interview is a text. If you intend to analyze the answers of “n”
participants to an open question, each answer is a text and there will be “n” texts.
In research on documents, letters, etc., each document is a text.
A set of text units is a corpus of analysis. The ideal corpus for the Descending
Hierarchical Analysis is a group of texts focused on one theme
(monothematic), so that the analysis does not simply replicate the initial structure.
In interviews, which usually yield longer texts, 20 to 30 texts are sufficient when the
group is homogeneous (Ghiglione & Matalon, 1993). For a comparative design, it is
recommended to have at least 20 texts per group.
Command lines, also called “asterisk lines”, separate the texts. In interviews, for instance,
given that each interview is a text, each one starts with a command line that includes the
identification number of the interviewee and some features (variables) that matter for
the research design (such as sex, age, group affiliation, socio-cultural level, etc.).
These may vary according to the research. The number of levels of each variable also
depends on the research design and on the number of interviews conducted.
Text segments
Generally, text segments (TS) are about three lines long, sized automatically according to
the size of the corpus. The text segments are the contexts of the words. They can be
created by the researcher, or automatically by the software.
Although the texts are delimited by the researcher, the division of the corpus into text
segments (TS) is done automatically in a standard analysis. However, the researcher may
adapt the segmentation. For example, when the corpus consists mostly of short answers to
an open survey question, it is recommended to define each text as a single TS.
Text segments
1- Insert all texts (interviews, articles, texts, documents or answers to one question) in
one text file using OpenOffice.org (http://www.openoffice.org/) or LibreOffice
(http://pt-br.libreoffice.org/). Never open these files or any other output of
IRAMUTEQ with Microsoft applications (Word, Excel, WordPad or Notepad);
they cause problems with the Unicode encoding used by IRAMUTEQ (UTF-8).
2- Divide the text with command lines (with asterisks). For example, for each interview to
be recognized by the software as a text, it should start with a line like this (NOTE: leave
a blank line before the first command line):
of the second variable and so on. This example was taken from a study
with sex workers about the prevention of STDs and pregnancy. It indicates that the
following text (interview answers) refers to individual nº 01 (2 digits because the
sample has more than 10 and fewer than 100 individuals); aged between 19 and 26
(coded as 1= 19 to 26 years old; 2= 27 to 47 years old); who doesn’t
have a steady sexual partner, boyfriend or husband (coded as 1= has a partner
and 2= doesn’t have a partner); who has been a sex worker for between 13 and 36
months (1= 12 months, 2= from 13 to 36 months and 3= from 48 to 132 months);
and whose reason for being a sex worker is “family issues” (1= family issues,
2= financial need, 3= family provider, 4= frustrated love relationship and
5= didn’t answer). Immediately after this line, press ENTER and, without a tab or
blank line, enter the answer of subject nº 01.
3- You have 2 options for structuring the corpus. In the first, the original or monothematic
format, each command line is followed by one continuous text. In the second, the thematic
format, each text covers two or more themes, introduced by lines subordinate to the main
command line. The analysis of a corpus with thematic divisions (different themes) provides
information about the relations between themes, and can be used as a preliminary exploratory
analysis (to get a snapshot of the text material). Nonetheless, the monothematic analysis
is still needed because it provides a more in-depth understanding of the studied object.
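The coding carried by a command line can be read back mechanically. The following Python sketch is only an illustration, not part of IRAMUTEQ: the helper name `parse_command_line` is hypothetical, the sample line is assembled from the codes described in step 2 (with the variable names `ind`, `ida`, `par`, `temp` and `caus` borrowed from the example command lines in this tutorial), and the opening run of asterisks is matched loosely as three or more.

```python
import re

# A command line opens with a run of asterisks (assumption: three or more).
COMMAND_LINE = re.compile(r"^\*{3,}(\s|$)")
# Each variable looks like *name_code, e.g. *ida_1.
VARIABLE = re.compile(r"\*([^\s_*]+)_(\S+)")


def parse_command_line(line):
    """Return {variable: code} for a command line, or None for ordinary text."""
    if not COMMAND_LINE.match(line):
        return None
    return dict(VARIABLE.findall(line))


# Hypothetical line: individual 01, age group 1, no partner (2),
# 13 to 36 months (2), family issues (1).
example = "**** *ind_01 *ida_1 *par_2 *temp_2 *caus_1"
print(parse_command_line(example))
```

Checking every command line this way before importing the corpus is one way to spot a missing underscore or a mistyped code early.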
afterwards. Sometimes, many people can’t find a girl on the street or a girlfriend so they
come here and we are forced to accept them. The prostitute is considered vulgar. I don’t
think of myself as vulgar. I have my reasons and don’t accept that one comes here and
call me vulgar, I don’t accept, because I’m not, if I were I would be on the streets, doing
everything. It is really different to work on night clubs, street or cabaret. I already had
vaginal discharge, but not STD. Discharge is a natural thing that you can have from the
condom or anything else, like soap, clothes, but never have STD.
come here and we are forced to accept them. The prostitute is considered vulgar. I don’t
think of myself as vulgar. I have my reasons and don’t accept that one comes here and
call me vulgar, I don’t accept, because I’m not , if I were I would be on the streets, doing
everything. It is really different to work on night clubs, street or cabaret. I already had
vaginal discharge, but not STD. Discharge is a natural thing that you can have from the
condom or anything else, like soap, clothes, but never have STD.
**** *ind_02 *ida_2 *par_1 *fil_2 *temp_3 *caus_2
-*theme_prevention
I always use condoms, because besides preventing pregnancy it prevents aids and other
diseases, we need to use it. It’s good for everything
-*theme_std
I know many STD’s such as gonorrhea, cancer, genital louse, there are so many. I had
gonorrhea from a boyfriend, couldn’t imagine this. As the time went by, I felt pain, went
to a doctor, and, had to do a surgery and found out I had gonorrhea. I’m not afraid to talk
about this, everyone is at risk of getting these, all these diseases. To protect me from
STD’s, I do oral and regular sex with condom, and never do anal.
**** *ind_03 *ida_1 *par_2 *fil_1 *temp_1 *caus_2
I use contraceptives and condoms. Not the pill, but the injection, is easier to remember.
That’s why I take both, because if something happens (to be continued)
Note: After creating the corpus, a careful reading is recommended, paying special
attention to command lines. This must be verified by the researcher in order for the text
to be processed.
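Part of this check can be automated. The sketch below, in Python, splits a corpus into its texts and thematic parts so that their number can be compared with what the researcher expects. It is a rough illustration under stated assumptions: `split_corpus` is a hypothetical helper, a command line is taken to be any line opening with a run of three or more asterisks, and a thematic line is any line starting with `-*`.

```python
import re


def split_corpus(raw):
    """Split a corpus string into texts.

    Each text is a dict holding its command line and a list of
    (theme, lines) pairs; content outside any theme gets theme None.
    """
    texts = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if re.match(r"^\*{3,}(\s|$)", line):   # command line: a new text starts
            texts.append({"command": line, "parts": []})
        elif not texts:
            continue                           # ignore anything before the first text
        elif line.startswith("-*"):            # thematic line: a new theme starts
            texts[-1]["parts"].append((line[2:], []))
        else:                                  # ordinary content
            parts = texts[-1]["parts"]
            if not parts:
                parts.append((None, []))
            parts[-1][1].append(line)
    return texts


# Hypothetical miniature corpus in the thematic format.
sample = """
**** *ind_02 *ida_2
-*theme_prevention
I always use condoms
-*theme_std
I know many STDs
**** *ind_03 *ida_1
I use contraceptives and condoms
"""
for text in split_corpus(sample):
    print(text["command"], [theme for theme, _ in text["parts"]])
```

If the number of texts returned differs from the number of interviews, a command line is probably malformed.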
1- Correct and review the file, so that spelling mistakes or typing errors are not
counted as different words.
2- Pay attention to the punctuation. If in doubt about the proper way
to insert paragraphs, it is better to avoid them.
3- In interviews or surveys, the questions and other material from the
interviewer (interventions and notes) must be removed, and are thus excluded from
the analysis. Keep the referents.
4- Don’t justify the text, and don’t use bold, italics or similar formatting.
5- The use of acronyms and abbreviations must be consistent: either always use the
short form or always write the name in full, with underscores connecting the words.
For example: WHO or World_Health_Organization.
6- Hyphenated words are treated as two separate words (the hyphen becomes
a space). If you need to analyze them as one word, join them with an underscore. Ex:
“alto-mar” becomes “alto_mar”; “terça-feira” becomes “terça_feira”.
7- All verbs with pronouns must be in proclisis, because the dictionary doesn’t have
the verb-pronoun inflections. Ex: “tornei-me” becomes “me tornei”.
8- If possible, avoid diminutives, because of the way the dictionary is built.
9- When writing numbers, use digits, not words. Ex: use “2013” and not “two
thousand thirteen”; “70” and not “seventy”.
10- Don’t use the following signs anywhere in the file: quotes (“), apostrophe (‘),
hyphen (-), dollar sign ($), percentage (%), ellipsis (…). The asterisk
(*) can only be used in the command lines.
11- The corpus file created in OpenOffice.org or LibreOffice must be
saved in a new folder on the desktop, used only for the analysis, with a short name and
as encoded text (file_name.txt). In OpenOffice.org this option displays a
window where you choose “keep the present format”. A second window appears,
where the options “Character set” and “Paragraph break” should be set to
“Unicode (UTF-8)” and “LF” respectively.
12- Don’t reuse the txt file (encoded text) when running a new analysis on the
same corpus. Create a new one from the odt file (the proper format for keeping the corpus).
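Some of the rules above can be checked by a small script before the corpus is imported. The following Python sketch is an illustration only: `check_corpus` and `save_corpus` are hypothetical helpers, the set of forbidden signs is taken from rule 10 with typographic variants of the quotes added as an assumption, and command or thematic lines are recognized by their leading asterisks or `-*` marker.

```python
import re

# Signs listed in rule 10, plus typographic variants (assumption).
FORBIDDEN = set("\"'“”‘’-$%…")


def check_corpus(raw):
    """Return a list of (line_number, problem) pairs for rule 10."""
    problems = []
    for number, line in enumerate(raw.splitlines(), start=1):
        is_command = re.match(r"^\*{3,}(\s|$)", line) is not None
        is_theme = line.startswith("-*")
        if is_command or is_theme:
            continue  # asterisks and the theme marker are allowed here
        if "*" in line:
            problems.append((number, "asterisk outside a command line"))
        if "..." in line:
            problems.append((number, "forbidden sign '...'"))
        for char in sorted(FORBIDDEN & set(line)):
            problems.append((number, f"forbidden sign {char!r}"))
    return problems


def save_corpus(raw, path):
    """Write the corpus as UTF-8 text with LF line endings (rule 11)."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(raw)


print(check_corpus("**** *ind_01 *ida_1\nalto-mar costs 70%"))
```

The script only reports problems; how each sign is replaced (for example, a hyphen by an underscore, as in rule 6) is left to the researcher.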
IRAMUTEQ provides the users with different text analyses, from simple ones such as
basic lexicography (word frequency) to multivariate analysis (descending hierarchical
classification).
III) Descending Hierarchical Analysis (DHA) method - The TS are
clustered according to their vocabularies and distributed according to the
frequencies of the reduced forms.
Using matrices that cross reduced forms with TS (in repeated tests of the chi-square
(χ²) type), the DHA method produces a definitive classification. It
aims to obtain clusters of TS whose vocabulary is similar within each cluster but
different from that of the other clusters. A dendrogram is displayed showing the
relations between clusters.
The software calculates descriptive results for each cluster according to its
main vocabulary (lexicon) and its words with asterisks (variables). Furthermore, it
offers another way of presenting the data, derived from a
correspondence factor analysis. Based on the chosen clusters, the software
identifies the most typical TS of each cluster, which put the cluster's words in
context.
Each cluster of words and TS groups several segments according to the
distribution of the vocabulary. How the clusters are interpreted depends on the
theoretical scope of the research. Reinert (1990), when studying French
literature, considered each cluster a “world”, a cognitive-perceptive
framework with a certain temporal stability, related to a complex environment.
Research in linguistics treats these clusters as lexical fields (Cros, 1993)
or semantic contexts. For research in social psychology, especially research
interested in common sense knowledge, which takes linguistic expressions into
account, these clusters may indicate social representations,
images of a certain object or aspects of a certain social representation
(Veloz, Nascimento-Schulze & Camargo, 1999).
The number of clusters does not generally match the number of social representations
involved, as it did in the above-mentioned study. Whether the clusters
refer to different social representations or to a single social
representation is defined by their content and by its relation with the factors
considered in the research design, through a differentiated selection of the participants
according to their group affiliation, previous social practices, etc.
IV) Similarity analysis - This analysis, based on graph theory, is often used
by social representations researchers. It identifies the co-occurrences between
words, providing information on how the words are connected and thus helping to
identify the structure of the content of a text corpus. It also makes it possible to
identify the shared parts and the specificities according to the descriptive variables
identified in the analysis (Marchand and Ratinaud, 2012).
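The idea behind the similarity analysis can be sketched in a few lines of Python. This is a simplified illustration, not IRAMUTEQ's implementation: the function names are hypothetical, plain co-occurrence counts stand in for whichever index the software applies, and the tree is built with Kruskal's algorithm on the strongest links first.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(segments):
    """Count, for each pair of words, the segments in which both appear."""
    counts = Counter()
    for segment in segments:
        for pair in combinations(sorted(set(segment)), 2):
            counts[pair] += 1
    return counts


def maximum_tree(counts):
    """Keep the strongest links that connect the words without cycles."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    tree = []
    # Strongest co-occurrences first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for (a, b), weight in ordered:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical segments, already reduced to lists of words.
segments = [["condom", "std", "risk"], ["condom", "std"],
            ["condom", "risk"], ["std", "doctor"]]
print(maximum_tree(cooccurrences(segments)))
```

Each edge of the resulting tree links two words that appear together in the same segments, which is what the graph in the results interface displays.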
Open the software and import the corpus. Click File and Open a text corpus on the upper
toolbar (see Picture 7). Locate and select the corpus for analysis and click Open.
Once the software imports the corpus, a new window is displayed (Picture 8).
PICTURE 8- Analysis configuration- corpus coding
This window (Picture 8) presents the configuration for text analysis. Most of the
settings in the General tab may be kept at their default values, except
the following 2: the text encoding (define characters), for which you must choose the second
option, “utf-8-all languages”; and the language (Language). As shown in Picture 9, select
the language in which the text is written.
Press OK and wait a few seconds for the data import. A brief description of the corpus
will appear in the large window on the right (Picture 10), where you can check the number of
texts, text segments, identified forms, occurrences and hapax (forms that occur only once).
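To make these quantities concrete, the following Python sketch computes them for a tiny corpus. It is a rough approximation under stated assumptions: `describe` is a hypothetical helper, words are split with a crude regular expression and are not lemmatized, so the figures will not match IRAMUTEQ's own counts exactly.

```python
from collections import Counter
import re


def describe(raw):
    """Count texts, occurrences, distinct forms and hapax in a corpus string."""
    texts = 0
    counts = Counter()
    for line in raw.splitlines():
        if re.match(r"^\*{3,}(\s|$)", line):  # command line: one more text
            texts += 1
        elif line.startswith("-*"):           # thematic line: not counted as words
            continue
        else:
            counts.update(re.findall(r"\w+", line.lower()))
    return {
        "texts": texts,
        "occurrences": sum(counts.values()),  # total number of words
        "forms": len(counts),                 # distinct words
        "hapax": sum(1 for n in counts.values() if n == 1),
    }


# Hypothetical two-text corpus.
raw = "**** *ind_01\nI use condoms I use pills\n**** *ind_02\nI use condoms"
print(describe(raw))
```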
PICTURE 10- Preliminary results, corpus description
Once the corpus import is completed, you can run the analyses. Selecting “Text
Analysis” in the upper toolbar displays the available analyses (Picture 11).
Every time you pick an analysis, a new window appears asking if you want to keep the
lemmatization. Keep YES selected so the software can use the reduced forms dictionary
to run the analysis.
The window also allows you to edit the active and supplementary forms. If you wish to
do so, press “Properties”. It is recommended to select which grammatical classes will be
active in the analysis (0= eliminated words; 1= active words; 2= supplementary words).
Once this change is made, it remains valid for all further analyses on the same
corpus. The researcher may change the selection again at any time. After choosing
the grammatical classes, click OK twice.
For research in Psychology, it is suggested to follow the example of Picture 12. These
parameters are the most appropriate for research focused on the content of the texts. The idea
is to treat some language elements as active (adjectives, unrecognized forms,
nouns and verbs), to treat supplementary nouns and auxiliary verbs as supplementary, and to
eliminate the “tool words”. Moreover, in the similarity analysis and the word cloud, select
the words to include and disregard the high-frequency words associated with the questions.
The first option, “Statistics”, provides the number of texts and text segments, the
occurrences, the mean frequencies, as well as the total frequency of each form and its
grammatical class, according to the dictionary of reduced forms. In the results interface
it’s possible to view the Zipf diagram (Picture 13), which plots the frequency of the words
in the corpus against their frequency rank.
PICTURE 13- Zipf diagram
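The points of such a diagram are simply the word frequencies sorted from highest to lowest and numbered by rank. A minimal Python sketch, with the hypothetical helper `zipf_points` and made-up tokens:

```python
from collections import Counter


def zipf_points(tokens):
    """Return (rank, frequency) pairs, most frequent form first."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


# Hypothetical tokens: "a" occurs 3 times, "b" twice, "c" once.
print(zipf_points("a a a b b c".split()))
```

Plots of this kind are usually drawn with logarithmic axes, so that a few very frequent words and the many rare ones fit in the same picture.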
In the left column, this analysis is identified as CORPUS NAME_stat_1. Right-click
this name to select other options, such as exporting the reduced
forms dictionary, which will be saved in the same folder as the initial corpus, in a
subfolder called Corpus Name_stat_1.
The software classifies the words by grammatical class, with the following codes, which
are used in every analysis from here on:
Adj= adjective
Adv= adverb
Conj= conjunction
Nom= noun
Ono= onomatopoeia
Pre= preposition
Ver= verb
When selecting “Specificities and CA”, choose the categorical variable according to
which you want to conduct the analysis. Select it and click OK. Wait for a few seconds
for the results to appear in the main window (Picture 14).
Right-click any word in the table (Picture 14) and choose Concordance: a new
window appears showing the text segments that contain the word, which gives
back its context.
For Clustering (Descending hierarchical classification, Reinert method), you may choose
between 3 options in the window displayed on IRAMUTEQ’s interface.
SIMPLE ON TEXTS - performs the analysis on whole texts, without dividing them into TS.
Recommended for short answers (see footnote 1).
Choose one of the options. You don’t need to change any of the other parameters. Click
OK and wait a few seconds for the analysis to be completed. Some relevant data of the
DHA will be displayed (Picture 15), along with the dendrogram (Picture 16).
1 In this case, an earlier parameter must be set: after the corpus import, besides indicating
the encoding and language, select “paragraphs” as the method to create TS.
Picture 16- DHC dendrogram
In the DHC results tab, you can access the dendrogram with the divisions of the text and the
final clusters. It is to be read from left to right. In the following example (Picture 16), the
corpus “Body” was first divided (1st division, or iteration) into two sub-corpora, separating
cluster 4 from the rest. Next, the larger sub-corpus was divided, generating cluster 3
(2nd division or iteration). Finally, a third division generated clusters 1 and 2.
The DHA stopped here because the 4 clusters were stable, each being made up of text
segments with similar vocabulary.
PICTURE 17- DHC dendrogram
This interface also gives access to the lexical content of each
cluster (click Profiles) and to a factor representation of the DHC (click CA).
The Profiles tab shows data on the content of each cluster: n. (number that orders the
words in the table); eff. st (number of text segments in the cluster that contain the
word); eff. Total (number of text segments in the corpus that contain the word at least once);
pourcentage (percentage of the word’s occurrence in the text segments of this cluster
relative to its occurrence in the corpus); chi2 (chi-square (χ²) of association between word
and cluster); Type (grammatical class of the word in the forms dictionary);
Forme (the word itself) and P (the significance level of the association
between word and cluster). Picture 18 shows the “Profiles” tab.
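The relation between these columns can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than IRAMUTEQ's own code: `profile_row` is a hypothetical helper, the cluster size and corpus size are inputs that the Profiles table does not list in these columns, and the chi-square is the plain 2 x 2 statistic without continuity correction.

```python
def profile_row(eff_st, eff_total, cluster_size, corpus_size):
    """Percentage and chi-square for one word in one cluster.

    eff_st        segments of the cluster containing the word
    eff_total     segments of the corpus containing the word
    cluster_size  segments in the cluster (assumed input)
    corpus_size   classified segments in the corpus (assumed input)
    """
    a = eff_st                          # in the cluster, with the word
    b = eff_total - eff_st              # outside the cluster, with the word
    c = cluster_size - eff_st           # in the cluster, without the word
    d = corpus_size - cluster_size - b  # outside the cluster, without the word
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    percentage = 100 * eff_st / eff_total
    return percentage, chi2


# Hypothetical word: 10 of its 30 segments fall in a cluster of 40 segments,
# in a corpus of 100 classified segments.
print(profile_row(10, 30, 40, 100))
```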
PICTURE 18- Forms associated with cluster 1
For the descriptive analysis of each cluster, 2 criteria should be considered: 1) pay
attention to the non-instrumental words whose frequency is higher than the mean
frequency of all the words in the corpus (in this example, 35,959 occurrences
divided by 3,377 distinct forms, giving 10.65) and 2) consider the words whose
chi-square (χ²) of association with the cluster is ≥ 3.84 (hence p < 0.05).
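Both thresholds can be reproduced with Python's standard library. The sketch below uses the figures of this example; the function names are hypothetical, and the p-value assumes a chi-square distribution with 1 degree of freedom, as for a word-by-cluster 2 x 2 table.

```python
import math


def mean_frequency(occurrences, forms):
    """Mean number of occurrences per distinct form."""
    return occurrences / forms


def chi2_p_value(chi2):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


print(round(mean_frequency(35959, 3377), 2))  # 10.65
print(chi2_p_value(3.84))
```

The value 3.84 is the usual rounded critical value: it sits right at the 0.05 boundary, so words whose χ² is clearly above it are the safest to interpret.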
More results are available from the left column, by right-clicking
the analysis, Corpus name_alceste_1. The most important ones are:
So you need to have a presentable hair be well dressed fit It is also related to
identity question because is how you present yourself
Not only how you dress up but the way you express yourself the way you walk is
a question of how people perceive you and how it becomes consumerism
People want to look good thus they buy a lot invest on their bodies as an object
this identity question is a thing also socially imposed, you have to be pretty, you
have to be skin you also have to dress up.
The thing I thought was the attempt to create beauty patterns a thing patent in
the video is that 95% of the shots were of beautiful people slim fit men and only
three fatties, only three fatties
Exactly that also impressed me and also if I’m really pleased with my body or if I
try everyone is satisfied because everyone is like that
Do I really like to be like that or I’m like this because everyone is and is the pattern
doesn’t have to be slim I think I’m being influenced I want to be with my belly slim
Still in Profiles, the content of each cluster may be explored using other available
resources (see Picture 20), shown when you right-click
any word belonging to a cluster. At the top of the window you can consult
more data about the selected word. The lower part provides information
about the respective cluster.
The resources presented in this window (Picture 20) show the forms
associated with the word (from the reduced forms dictionary); the graphic visualization of
the frequency, association and co-occurrence of a specific word; and the text
segments of the cluster in which the word appears. It’s also possible to view a
graph of the cluster, repeated segments and typical text segments (see Picture 21), as
well as to export the segments related to the cluster.
PICTURE 21- Typical text segments of cluster 1
With many short answers to an open-ended question, it is necessary to adapt the DHC
(see Picture 22). When importing this type of corpus, indicate the encoding and
language and select “paragraphs” as the method to create text segments (TS). Then choose
the classification “simple on texts” to avoid segmenting each answer. The
text segment considered will then be the text itself, that is, the short answer to the
questionnaire.
Analysis: Similarity
For the similarity analysis, a new window is displayed (Picture 23), where you
can choose the criteria for the co-occurrence tree. In Graph Settings, you may
edit the analysis: change the co-occurrence index, choose whether or not to use a maximum
tree, and select a descriptive variable to be highlighted in the tree. Click
Communities + Halo to have the most closely related words
presented in a colored cloud. With Score on Edges, you can display the
co-occurrence values.
You can select words to integrate the analysis on the left column. Click on “Select
a variable” to choose a categorical variable for the similarity analysis, identifying
the differences between groups.
After choosing the criteria click OK and wait for the analysis to finish.
PICTURE 24- Similarity Analysis results
The tree is displayed in the results interface, which has 2 buttons in the upper left
corner (Picture 24). The first one (*), with red lines and black dots, allows you to
change the analysis parameters by reopening the settings window. The second
button (**), EXPORT, exports the image to the analysis folder, in a subfolder
called Corpus Name_simitxt_1.
A new window is displayed when you choose the word cloud. As in the similarity
analysis, you may choose some parameters, although they don’t necessarily need
to be edited.
This is a simpler analysis, which represents the word frequencies graphically. After
choosing the criteria, press OK in both windows and wait a few seconds.
PICTURE 25- Word cloud results
You can view the word cloud in the results interface (Picture 25) as well as
inside the analysis folder (subfolder “Corpus Name_wordcloud_1”, as an image
file called “nuage_1”).
All the results, including the pictures and graphics, are stored inside the folder
that contains the corpus, with one subfolder for each analysis run (statistics,
specificities, DHC, similarity and word cloud).
PICTURE 26- Database model for matrices analysis
The file format should be ods, csv or xls (don’t use the Excel xlsx format, because it is
incompatible with IRAMUTEQ). The encoding must be the same as the one used for text
analysis: utf-8-all languages.
Avoid the following characters : ; ‘ “
Don’t insert blank spaces (use underscores to connect the words of a multi-word expression).
The file name can’t include accents or special characters.
Numeric variables can be present in the file but can’t be used in the
analysis (except ranks in the prototypical analysis).
If you know the order or importance of the words, it may be added in a column
immediately after the word.
A thorough revision of the corpus is necessary, since this type of analysis doesn’t perform
lemmatization.
Save the database in a folder used only for the analysis, open IRAMUTEQ and
select File and Open a matrix. Locate the file containing the database and press
Open. Another window will appear for the data import (see Picture 27), where you
indicate some parameters of the database: the first line of the spreadsheet
contains the column names (checked); the first column is an identifier (checked); the column
delimiter (a comma in the case of CSV format); the text delimiter (“); and the character
encoding (utf-8-all languages).
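The same import parameters appear when such a file is read with Python's `csv` module. The sketch below is only an illustration of the expected layout: `read_matrix` is a hypothetical helper and the column names (`id`, `sex`, `evoc1`, `rang1`) and values are made up.

```python
import csv
import io


def read_matrix(handle):
    """Read a comma-delimited matrix whose first row holds the column names
    and whose first column identifies each participant."""
    reader = csv.reader(handle, delimiter=",", quotechar='"')
    header = next(reader)
    rows = {}
    for record in reader:
        if record:  # skip empty lines
            rows[record[0]] = dict(zip(header[1:], record[1:]))
    return header, rows


# Hypothetical data; a real file would be opened with
# open(path, encoding="utf-8", newline="").
sample = io.StringIO('id,sex,evoc1,rang1\n01,1,disease,1\n02,2,"fear",1\n')
header, rows = read_matrix(sample)
print(header)
print(rows["02"])
```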
PICTURE 27- Matrix database import
Select the parameters and press OK to access the imported matrix (Picture 28). The
available analyses are frequencies, descending hierarchical classification (recommended
only with a high number of participants), similarity analysis and prototypical analysis.
To run an analysis, click the Matrix analysis icon and select it (Picture 29).
PICTURE 29- Possible analysis for the matrices
Frequency analysis is the simplest one. Click Frequency analysis to access the frequencies
of the matrix’s categorical variables, and Multiple Frequencies analysis to obtain a report
of the absolute and relative frequencies of the words in the matrix. You need to
choose which variables are going to be included. In this case, there is no interest in
Rang (the evocation order), only in the words and possibly in the descriptive variables
inserted in the matrix.
Picture 30 shows a multiple frequency report for the words evoked in a free
association test.
As shown in Picture 30, the analysis provides a table with the words ordered by frequency,
with the absolute frequency in the second column, followed by its proportion
of the total number of evocations. It also includes the number of lines containing the word
and the proportion of the total number of lines. Each line represents a
participant.
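The columns of this report can be reproduced with a short Python sketch. It is an illustration under stated assumptions: `multiple_frequencies` is a hypothetical helper and the evoked words are made up.

```python
from collections import Counter


def multiple_frequencies(rows):
    """rows: one list of evoked words per participant (one spreadsheet line).

    Returns (word, frequency, share of evocations, lines, share of lines),
    most frequent word first.
    """
    total = Counter(word for row in rows for word in row)
    lines = Counter(word for row in rows for word in set(row))
    n_evocations = sum(total.values())
    report = []
    ordered = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    for word, freq in ordered:
        report.append((word, freq, freq / n_evocations,
                       lines[word], lines[word] / len(rows)))
    return report


# Hypothetical evocations of three participants.
rows = [["death", "fear"], ["death", "disease"], ["fear", "death"]]
for entry in multiple_frequencies(rows):
    print(entry)
```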
In the configuration window, select the variables related to the evocations (on the left side)
and the variables corresponding to RANG (on the right side), which, depending on your
criterion, hold the evocation order or the attributed importance. The other parameters are the
criteria for calculating the prototypical analysis; you can keep the automatic settings (see
Picture 31).
Once the settings are defined, press OK to access the prototypical analysis (Picture 32). This
four-quadrant diagram represents four dimensions of the structure of social representations. In
this example, from a free evocation task with the inductive term AIDS, the first quadrant
(upper left side) contains the words with high frequency (higher than the mean)
and low evocation order (those evoked most readily). These probably correspond to
the central nucleus of a representation.
PICTURE 32- four quadrants diagram- prototypical analysis
The second quadrant (upper right side) corresponds to the first periphery, with
words of high frequency but a higher mean evocation order, thus not so readily evoked. The
third quadrant (bottom left side) corresponds to the contrast zone, with elements readily
evoked but with a lower frequency. The second periphery, in the fourth quadrant (bottom
right side), contains the elements with lower frequency and higher evocation order.
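The logic of the four quadrants can be sketched in Python. This is a simplified illustration, not the software's procedure: `prototypical` is a hypothetical helper, the evocations are made up, and the cut-offs (the mean frequency and the mean of the words' mean ranks) are assumptions, since other cut-off criteria are possible.

```python
from collections import defaultdict


def prototypical(rows):
    """rows: per participant, the words in evocation order (rank 1 first).

    Returns {word: quadrant name}.
    """
    ranks = defaultdict(list)
    for row in rows:
        for rank, word in enumerate(row, start=1):
            ranks[word].append(rank)
    freq = {w: len(r) for w, r in ranks.items()}
    mean_rank = {w: sum(r) / len(r) for w, r in ranks.items()}
    freq_cut = sum(freq.values()) / len(freq)            # assumed cut-off
    rank_cut = sum(mean_rank.values()) / len(mean_rank)  # assumed cut-off
    names = {
        (True, True): "central nucleus",     # frequent, readily evoked
        (True, False): "first periphery",    # frequent, evoked later
        (False, True): "contrast zone",      # less frequent, readily evoked
        (False, False): "second periphery",  # less frequent, evoked later
    }
    return {w: names[(freq[w] >= freq_cut, mean_rank[w] <= rank_cut)]
            for w in freq}


# Hypothetical evocations of four participants.
rows = [["death", "fear", "disease"], ["death", "disease", "sex"],
        ["death", "fear", "prevention"], ["sex", "death", "fear"]]
print(prototypical(rows))
```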
Finally, the similarity analysis, also an indicator of the structure of social representations,
can be run from Matrix analysis and Similarity analysis.
The analysis is analogous to the one performed on text material (see Picture 33).
PICTURE 33- Similarity analysis definitions
A similarity analysis is presented in Picture 34, where the size of the colored vertices is
proportional to the frequency of the words and the edges indicate the strength of the
co-occurrence between words.
References
des Données Textuelles (835–844). Presented at the 11eme Journées
internationales d’Analyse statistique des Données Textuelles. JADT 2012, Liège.
Reinert, M. (1990). ALCESTE, une méthodologie d'analyse des données textuelles et
une application: Aurélia de G. de Nerval. Bulletin de méthodologie sociologique,
(28), 24-54.
Veloz, M. C. T.; Nascimento-Schulze, C. M.; Camargo, B. V. (1999). Representações
sociais do envelhecimento. Psicologia: Reflexão e Crítica, 12 (2), 479-501.
Wachelke, J. F. R. & Wolter, R. (2011). Critérios de construção e relato da análise
prototípica para representações sociais. Psicologia Teoria e Pesquisa, 27 (4),
521-526.