
RESEARCH METHODOLOGY AND IPR

UNIT – II DATA COLLECTION AND SOURCES

Measurement

Measurement is the process of observing and recording the observations that are collected as part of a
research effort. There are two major issues that will be considered here.

First, you have to understand the fundamental ideas involved in measuring. Here we consider two of the major measurement concepts. In Levels of Measurement, I explain the meaning of the four major levels
of measurement: nominal, ordinal, interval and ratio. Then we move on to the reliability of measurement,
including consideration of true score theory and a variety of reliability estimators.

Second, you have to understand the different types of measures that you might use in social research. We
consider four broad categories of measurements. Survey research includes the design and implementation
of interviews and questionnaires. Scaling involves consideration of the major methods of developing and
implementing a scale. Qualitative research provides an overview of the broad range of non-numerical
measurement approaches. And unobtrusive measures presents a variety of measurement methods that
don’t intrude on or interfere with the context of the research.

Why is Level of Measurement Important?

First, knowing the level of measurement helps you decide how to interpret the data from that variable. When you know that a measure is nominal, then you know that the numerical values are just short codes for the longer names. Second, knowing the level of measurement helps you
decide what statistical analysis is appropriate on the values that were assigned. If a measure is nominal,
then you know that you would never average the data values or do a t-test on the data.

There are typically four levels of measurement that are defined:

Nominal
Ordinal
Interval
Ratio

In nominal measurement the numerical values just “name” the attribute uniquely. No ordering of the cases
is implied. For example, jersey numbers in basketball are measures at the nominal level. A player with
number 30 is not more of anything than a player with number 15, and is certainly not twice whatever
number 15 is.

In ordinal measurement the attributes can be rank-ordered. Here, distances between attributes do not have
any meaning. For example, on a survey you might code Educational Attainment as 0=less than high school; 1=some high school; 2=high school degree; 3=some college; 4=college degree; 5=post college. In this measure, higher numbers mean more education. But is the distance from 0 to 1 the same as the distance from 3 to 4? Of course not. The interval between values is not interpretable in an ordinal measure.

In interval measurement the distance between attributes does have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable. Because of this, it makes sense to compute an average of an interval variable, whereas it doesn't make sense to do so for ordinal scales. But note that in interval measurement ratios don't make any sense: 80 degrees is not twice as hot as 40 degrees (although the attribute value is twice as large).

Finally, in ratio measurement there is always an absolute zero that is meaningful. This means that you can
construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social
research most “count” variables are ratio, for example, the number of clients in past six months. Why?
Because you can have zero clients and because it is meaningful to say that “…we had twice as many clients
in the past six months as we did in the previous six months.”

It’s important to recognize that there is a hierarchy implied in the level of measurement idea. At lower
levels of measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive.
At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds
something new. In general, it is desirable to have a higher level of measurement (e.g., interval or ratio)
rather than a lower one (nominal or ordinal).
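
To make the hierarchy concrete, the short sketch below shows which summary statistics are defensible at each level. It is illustrative only; the variables and values are invented, not taken from any dataset in this unit.

# Illustrative sketch: which summaries fit which level of measurement.
# All variables and values below are invented for demonstration.
import statistics

jersey_numbers = [30, 15, 7, 23, 30]       # nominal: numbers only label players
education_codes = [0, 1, 2, 2, 3, 4, 5]    # ordinal: order matters, distances do not
temperatures_f = [30.0, 40.0, 70.0, 80.0]  # interval: differences meaningful, ratios not
client_counts = [0, 4, 8, 12]              # ratio: true zero, ratios meaningful

# Nominal: only counts and the mode make sense (never an average).
print(statistics.mode(jersey_numbers))

# Ordinal: a rank-based summary such as the median is defensible; the mean generally is not.
print(statistics.median(education_codes))

# Interval: means and differences are meaningful, but 80F is not "twice as hot" as 40F.
print(statistics.mean(temperatures_f))

# Ratio: means and ratios are both meaningful (12 clients really is three times 4 clients).
print(statistics.mean(client_counts), client_counts[3] / client_counts[1])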

Questionnaire: Definition, Examples, Design and Types

A questionnaire is a research instrument consisting of a series of questions for the purpose of gathering
information from respondents. Questionnaires can be thought of as a kind of written interview. They can
be carried out face to face, by telephone, computer or post.

Questionnaires provide a relatively cheap, quick and efficient way of obtaining large amounts of
information from a large sample of people.

Data can be collected relatively quickly because the researcher would not need to be present when the
questionnaires were completed. This is useful for large populations when interviews would be impractical.

However, a problem with questionnaires is that respondents may lie due to social desirability. Most people
want to present a positive image of themselves and so may lie or bend the truth to look good, e.g., pupils
would exaggerate revision duration.

Questionnaires can be an effective means of measuring the behavior, attitudes, preferences, opinions, and intentions of relatively large numbers of subjects more cheaply and quickly than other methods.

Often a questionnaire uses both open and closed questions to collect data. This is beneficial as it means
both quantitative and qualitative data can be obtained.

Closed Questions

Closed questions structure the answer by only allowing responses which fit into pre-decided categories.

Data that can be placed into a category is called nominal data. The category can be restricted to as few as
two options, i.e., dichotomous (e.g., 'yes' or 'no,' 'male' or 'female'), or include quite complex lists of
alternatives from which the respondent can choose (e.g., polytomous).

Closed questions can also provide ordinal data (which can be ranked). This often involves using a
continuous rating scale to measure the strength of attitudes or emotions.
For example, strongly agree / agree / neutral / disagree / strongly disagree / unable to answer.
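
As a small illustration, closed Likert-style answers are typically coded into ordinal numbers before analysis. The item wording and responses below are hypothetical, not from any study cited here.

# Hypothetical sketch: coding a closed Likert-style item into ordinal values.
from collections import Counter

likert_codes = {
    "strongly agree": 5, "agree": 4, "neutral": 3,
    "disagree": 2, "strongly disagree": 1,
}
responses = ["agree", "strongly agree", "neutral", "agree", "disagree"]

coded = [likert_codes[r] for r in responses]   # ordinal codes that can be ranked
print(Counter(responses))                      # frequency of each category
# Note: the codes can be ranked, but the distance between 4 and 5 is not
# guaranteed to equal the distance between 2 and 3 (see the ordinal level above).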

Closed questions have been used to research type A personality (e.g., Friedman & Rosenman, 1974), and
also to assess life events which may cause stress (Holmes & Rahe, 1967), and attachment (Fraley, Waller, &
Brennan, 2000).

Strengths

They can be economical. This means they can provide large amounts of research data for relatively
low costs. Therefore, a large sample size can be obtained which should be representative of the
population, which a researcher can then generalize from.
The respondent provides information which can be easily converted into quantitative data (e.g., count
the number of 'yes' or 'no' answers), allowing statistical analysis of the responses.
The questions are standardized. All respondents are asked exactly the same questions in the same
order. This means a questionnaire can be replicated easily to check for reliability. Therefore, a second
researcher can use the questionnaire to check that the results are consistent.

Limitations

They lack detail. Because the responses are fixed, there is less scope for respondents to supply answers
which reflect their true feelings on a topic.

Open Questions

Open questions allow people to express what they think in their own words. Open-ended questions enable
the respondent to answer in as much detail as they like in their own words.

For example: “Can you tell me how happy you feel right now?”

If you want to gather more in-depth answers from your respondents, then open questions will work better.
These give no pre-set answer options and instead allow the respondents to put down exactly what they like
in their own words.

Open questions are often used for complex questions that cannot be answered in a few simple categories
but require more detail and discussion.

Lawrence Kohlberg presented his participants with moral dilemmas. One of the most famous concerns a
character called Heinz who is faced with the choice between watching his wife die of cancer or stealing the
only drug that could help her.

Participants were asked whether Heinz should steal the drug or not and, more importantly, for their
reasons why upholding or breaking the law is right.

Strengths

Rich qualitative data is obtained as open questions allow the respondent to elaborate on their answer.
This means the research can find out why a person holds a certain attitude.

Limitations

Time-consuming to collect the data. It takes longer for the respondent to complete open questions.
This is a problem as a smaller sample size may be obtained.
Time-consuming to analyze the data. It takes longer for the researcher to analyze qualitative data as
they have to read the answers and try to put them into categories by coding, which is often subjective
and difficult. However, Smith (1992) has devoted an entire book to the issues of thematic content analysis that includes 14 different scoring systems for open-ended questions.
Not suitable for less educated respondents as open questions require superior writing skills and a
better ability to express one's feelings verbally.

Questionnaire Design

With some questionnaires suffering from a response rate as low as 5%, it is essential that a questionnaire is
well designed.

There are a number of important factors in questionnaire design.

Aims

Make sure that all questions asked address the aims of the research. However, address only one feature of the construct you are investigating per item.

Length

The longer the questionnaire, the less likely people will complete it. Questions should be short, clear, and
be to the point; any unnecessary questions/items should be omitted.

Pilot Study

Run a small-scale practice study to ensure people understand the questions. People will also be able to give
detailed honest feedback on the questionnaire design.

Question Order

Questions should progress logically from the least sensitive to the most sensitive, from the factual and
behavioral to the cognitive, and from the more general to the more specific.

The researcher should ensure that the answer to a question is not influenced by previous questions.

Terminology

There should be a minimum of technical jargon. Questions should be simple, to the point and easy to
understand.
The language of a questionnaire should be appropriate to the vocabulary of the group of people being
studied. Use statements which are interpreted in the same way by members of different subpopulations of
the population of interest.

For example, the researcher must adapt the language of the questions to match the social background of respondents (age, educational level, social class, ethnicity, etc.).

Presentation

Make sure it looks professional, include clear and concise instructions. If sent through the post make sure
the envelope does not signify ‘junk mail.’

Ethical Issues

The researcher must ensure that the information provided by the respondent is kept confidential, e.g.,
name, address, etc.

This means questionnaires are good for researching sensitive topics as respondents will be more honest
when they cannot be identified.

Keeping the questionnaire confidential should also reduce the likelihood of any psychological harm, such
as embarrassment.

Participants must provide informed consent prior to completing the questionnaire, and must be aware that
they have the right to withdraw their information at any time during the survey/ study.

An introduction to sampling methods

When you conduct research about a group of people, it’s rarely possible to collect data from every person
in that group. Instead, you select a sample. The sample is the group of individuals who will actually
participate in the research.

To draw valid conclusions from your results, you have to carefully decide how you will select a sample that
is representative of the group as a whole. There are two types of sampling methods:

Probability sampling involves random selection, allowing you to make strong statistical inferences
about the whole group.
Non-probability sampling involves non-random selection based on convenience or other criteria,
allowing you to easily collect data.

You should clearly explain how you selected your sample in the methodology section of your paper or
thesis.

Population vs sample

First, you need to understand the difference between a population and a sample, and identify the target
population of your research.

The population is the entire group that you want to draw conclusions about.
The sample is the specific group of individuals that you will collect data from.

The population can be defined in terms of geographical location, age, income, and many other
characteristics.

It can be very broad or quite narrow: maybe you want to make inferences about the whole adult
population of your country; maybe your research focuses on customers of a certain company, patients
with a specific health condition, or students in a single school.

It is important to carefully define your target population according to the purpose and practicalities of
your project.

If the population is very large, demographically mixed, and geographically dispersed, it might be difficult
to gain access to a representative sample.

Sampling frame

The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally, it should
include the entire target population (and nobody who is not part of that population).

Example

You are doing research on working conditions at Company X. Your population is all 1000 employees of
the company. Your sampling frame is the company’s HR database which lists the names and contact
details of every employee.
Sample size

The number of individuals you should include in your sample depends on various factors, including the
size and variability of the population and your research design. There are different sample size
calculators and formulas depending on what you want to achieve with statistical analysis.
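
As one example of such a formula, Cochran's sample-size calculation for estimating a proportion, with a finite-population correction, is a commonly used starting point. The sketch below is illustrative only; the 95% confidence z-value, 5% margin of error, and assumed proportion are conventional defaults, not requirements from this text.

# Illustrative sketch: Cochran's sample-size formula for a proportion,
# with a finite-population correction. The default z, p, and margin values are assumptions.
import math

def cochran_sample_size(population_size, z=1.96, p=0.5, margin_of_error=0.05):
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)    # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population_size)                # finite-population correction
    return math.ceil(n)

print(cochran_sample_size(1000))   # about 278 respondents for a population of 1000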

Probability sampling methods

Probability sampling means that every member of the population has a chance of being selected. It is
mainly used in quantitative research. If you want to produce results that are representative of the whole
population, probability sampling techniques are the most valid choice.

There are four main types of probability sample.

1. Simple random sampling


In a simple random sample, every member of the population has an equal chance of being selected. Your
sampling frame should include the whole population.

To conduct this type of sampling, you can use tools like random number generators or other techniques
that are based entirely on chance.

Example

You want to select a simple random sample of 100 employees of Company X. You assign a number to
every employee in the company database from 1 to 1000, and use a random number generator to select 100
numbers.
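
A minimal sketch of this example in Python (the 1 to 1000 employee numbering mirrors the example above; nothing else is assumed):

# Simple random sampling: every employee has an equal chance of selection.
import random

employee_ids = list(range(1, 1001))        # the sampling frame, numbered 1 to 1000
sample = random.sample(employee_ids, 100)  # 100 employees drawn entirely by chance
print(sorted(sample)[:10])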

2. Systematic sampling

Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct.
Every member of the population is listed with a number, but instead of randomly generating numbers,
individuals are chosen at regular intervals.

Example

All employees of the company are listed in alphabetical order. From the first 10 numbers, you randomly
select a starting point: number 6. From number 6 onwards, every 10th person on the list is selected (6, 16,
26, 36, and so on), and you end up with a sample of 100 people.
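
A comparable sketch for systematic sampling (the 1000-name alphabetical list is assumed, as in the example):

# Systematic sampling: a random start, then every 10th person on the list.
import random

employees = [f"Employee {i}" for i in range(1, 1001)]  # placeholder names in list order
interval = len(employees) // 100                       # sampling interval of 10
start = random.randint(0, interval - 1)                # random starting point in the first interval
sample = employees[start::interval]
print(len(sample), sample[:3])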

If you use this technique, it is important to make sure that there is no hidden pattern in the list that might
skew the sample. For example, if the HR database groups employees by team, and team members are
listed in order of seniority, there is a risk that your interval might skip over people in junior roles, resulting
in a sample that is skewed towards senior employees.

3. Stratified sampling

Stratified sampling involves dividing the population into subpopulations that may differ in important
ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented
in the sample.

To use this sampling method, you divide the population into subgroups (called strata) based on the
relevant characteristic (e.g. gender, age range, income bracket, job role).

Based on the overall proportions of the population, you calculate how many people should be sampled
from each subgroup. Then you use random or systematic sampling to select a sample from each subgroup.

Example

The company has 800 female employees and 200 male employees. You want to ensure that the sample
reflects the gender balance of the company, so you sort the population into two strata based on gender.
Then you use random sampling on each group, selecting 80 women and 20 men, which gives you a
representative sample of 100 people.
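
A brief sketch of proportional stratified sampling for this example (the placeholder names are invented; the 800/200 split follows the example):

# Stratified sampling: sample each stratum in proportion to its share of the population.
import random

strata = {
    "female": [f"F{i}" for i in range(800)],
    "male": [f"M{i}" for i in range(200)],
}
population_size = sum(len(members) for members in strata.values())
sample_size = 100

sample = []
for group, members in strata.items():
    n = round(sample_size * len(members) / population_size)  # 80 women, 20 men
    sample.extend(random.sample(members, n))

print(len(sample))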

4. Cluster sampling

Cluster sampling also involves dividing the population into subgroups, but each subgroup should have
similar characteristics to the whole sample. Instead of sampling individuals from each subgroup, you
randomly select entire subgroups.

If it is practically possible, you might include every individual from each sampled cluster. If the clusters
themselves are large, you can also sample individuals from within each cluster using one of the techniques
above. This is called multistage sampling.

This method is good for dealing with large and dispersed populations, but there is more risk of error in the
sample, as there could be substantial differences between clusters. It’s difficult to guarantee that the
sampled clusters are really representative of the whole population.

Example

The company has offices in 10 cities across the country (all with roughly the same number of employees in
similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random
sampling to select 3 offices – these are your clusters.
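
A short sketch of the cluster selection step (the city names are placeholders):

# Cluster sampling: randomly select whole offices rather than individual employees.
import random

offices = ["City A", "City B", "City C", "City D", "City E",
           "City F", "City G", "City H", "City I", "City J"]
selected_clusters = random.sample(offices, 3)   # survey everyone in these 3 offices
print(selected_clusters)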

Non-probability sampling methods

In a non-probability sample, individuals are selected based on non-random criteria, and not every
individual has a chance of being included.

This type of sample is easier and cheaper to access, but it has a higher risk of sampling bias. That means
the inferences you can make about the population are weaker than with probability samples, and your
conclusions may be more limited. If you use a non-probability sample, you should still aim to make it as
representative of the population as possible.

Non-probability sampling techniques are often used in exploratory and qualitative research. In these types
of research, the aim is not to test a hypothesis about a broad population, but to develop an initial
understanding of a small or under-researched population.
1. Convenience sampling

A convenience sample simply includes the individuals who happen to be most accessible to the researcher.

This is an easy and inexpensive way to gather initial data, but there is no way to tell if the sample is
representative of the population, so it can’t produce generalizable results.

Example

You are researching opinions about student support services in your university, so after each of your
classes, you ask your fellow students to complete a survey on the topic. This is a convenient way to gather
data, but as you only surveyed students taking the same classes as you at the same level, the sample is not
representative of all the students at your university.

2. Voluntary response sampling

Similar to a convenience sample, a voluntary response sample is mainly based on ease of access. Instead
of the researcher choosing participants and directly contacting them, people volunteer themselves (e.g. by
responding to a public online survey).
Voluntary response samples are always at least somewhat biased, as some people will inherently be more
likely to volunteer than others.

Example

You send out the survey to all students at your university and a lot of students decide to complete it. This
can certainly give you some insight into the topic, but the people who responded are more likely to be
those who have strong opinions about the student support services, so you can’t be sure that their opinions
are representative of all students.

3. Purposive sampling

This type of sampling, also known as judgement sampling, involves the researcher using their expertise to
select a sample that is most useful to the purposes of the research.

It is often used in qualitative research, where the researcher wants to gain detailed knowledge about a
specific phenomenon rather than make statistical inferences, or where the population is very small and
specific. An effective purposive sample must have clear criteria and rationale for inclusion.

Example

You want to know more about the opinions and experiences of disabled students at your university, so you
purposefully select a number of students with different support needs in order to gather a varied range of
data on their experiences with student services.

4. Snowball sampling

If the population is hard to access, snowball sampling can be used to recruit participants via other
participants. The number of people you have access to “snowballs” as you get in contact with more people.

Example

You are researching experiences of homelessness in your city. Since there is no list of all homeless people
in the city, probability sampling isn’t possible. You meet one person who agrees to participate in the
research, and she puts you in contact with other homeless people that she knows in the area.

Data - Preparing, Exploring, examining and displaying

What is Data Preparation?

Good data preparation allows for efficient analysis, limits errors and inaccuracies that can occur in data during processing, and makes all processed data more accessible to users. It’s also gotten easier with new tools that enable any user to cleanse and qualify data on their own.

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It
is an important step prior to processing and often involves reformatting data, making corrections to data
and the combining of data sets to enrich data.

Data preparation is often a lengthy undertaking for data professionals or business users, but it is essential
as a prerequisite to put data in context in order to turn it into insights and eliminate bias resulting from
poor data quality.

For example, the data preparation process usually includes standardizing data formats, enriching source
data, and/or removing outliers.

Benefits of Data Preparation + The Cloud

76% of data scientists say that data preparation is the worst part of their job, but efficient, accurate business decisions can only be made with clean data. Data preparation helps:

Fix errors quickly — Data preparation helps catch errors before processing. After data has been
removed from its original source, these errors become more difficult to understand and correct.
Produce top-quality data — Cleaning and reformatting datasets ensures that all data used in analysis
will be high quality.
Make better business decisions — Higher quality data that can be processed and analyzed more
quickly and efficiently leads to more timely, efficient and high-quality business decisions.

Additionally, as data and data processes move to the cloud, data preparation moves with it for even
greater benefits, such as:

Superior scalability — Cloud data preparation can grow at the pace of the business. Enterprises don’t have to worry about the underlying infrastructure or try to anticipate its evolution.
Future proof — Cloud data preparation upgrades automatically so that new capabilities or problem
fixes can be turned on as soon as they are released. This allows organizations to stay ahead of the
innovation curve without delays and added costs.
Accelerated data usage and collaboration — Doing data prep in the cloud means it is always on,
doesn’t require any technical installation, and lets teams collaborate on the work for faster results.

Additionally, a good, cloud-native data preparation tool will offer other benefits (like an intuitive and
simple to use GUI) for easier and more efficient preparation.

Data Preparation Steps

The specifics of the data preparation process vary by industry, organization and need, but the framework
remains largely the same.
1. Gather data

The data preparation process begins with finding the right data. This can come from an existing data
catalog or can be added ad-hoc.

2. Discover and assess data

After collecting the data, it is important to discover each dataset. This step is about getting to know the
data and understanding what has to be done before the data becomes useful in a particular context.

Discovery is a big task, but Talend’s data preparation platform offers visualization tools which help users
profile and browse their data.

3. Cleanse and validate data

Cleaning up the data is traditionally the most time consuming part of the data preparation process, but it’s
crucial for removing faulty data and filling in gaps. Important tasks here include:

Removing extraneous data and outliers.


Filling in missing values.
Conforming data to a standardized pattern.
Masking private or sensitive data entries.

Once data has been cleansed, it must be validated by testing for errors in the data preparation process up
to this point. Oftentimes, an error in the system will become apparent during this step and will need to be
resolved before moving forward.
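
A hedged sketch of these cleansing and validation tasks using pandas (the column names, values, and thresholds are invented for illustration; the Talend platform mentioned above is not shown here):

# Illustrative cleansing and validation with pandas; all data and thresholds are assumptions.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, None, 41, 250],                          # one missing value, one implausible outlier
    "income": [52000, 61000, 58000, None, 60000],
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
})

df = df[df["age"].isna() | (df["age"] <= 120)]               # remove extraneous outliers
df["income"] = df["income"].fillna(df["income"].median())    # fill in missing values
df["email"] = "***masked***"                                 # mask private or sensitive entries

assert df["income"].notna().all()                            # simple validation of the step above
print(df)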

4. Transform and enrich data

Transforming data is the process of updating the format or value entries in order to reach a well-defined
outcome, or to make the data more easily understood by a wider audience. Enriching data refers to adding
and connecting data with other related information to provide deeper insights.
5. Store data

Once prepared, the data can be stored or channeled into a third party application—such as a business
intelligence tool—clearing the way for processing and analysis to take place.

Self-Service Data Preparation Tools

Data preparation is a very important process, but it also requires an intense investment of resources.
Data scientists and data analysts report that 80% of their time is spent doing data prep, rather than
analysis.

Does your data team have time for thorough data preparation? What about organizations that don’t have a
team of data scientists or data analysts at all?

That’s where self-service data preparation tools like Talend Data Preparation come in. Cloud-native
platforms with machine learning capabilities simplify the data preparation process. This means that data
scientists and business users can focus on analyzing data, instead of just cleaning it.

But it also allows business professionals, who may lack advanced IT skills, to run the process themselves.
This makes data preparation more of a team sport, rather than a task that consumes valuable resources and cycles from IT teams.

To get the best value out of a self-service data preparation tool, look for a platform with:

Data access and discovery from any datasets — from Excel and CSV files to data warehouses, data
lakes, and cloud apps such as Salesforce.com.
Cleansing and enrichment functions.
Auto-discovery, standardization, profiling, smart suggestions, and data visualization.
Export functions to files (Excel, Cloud, Tableau, etc.) together with controlled export to data
warehouses and enterprise applications.
Shareable data preparations and data sets.
Design and productivity features like automatic documentation, versioning, and operationalizing into
ETL processes.

The Future of Data Preparation

Initially focused on analytics, data preparation has evolved to address a much broader set of use cases
and can be used by a larger range of users.

Although it improves the personal productivity of whoever uses it, it has evolved into an enterprise tool
that fosters collaboration between IT professionals, data experts, and business users.

Getting Started with Data Preparation

Data preparation creates higher quality data for analysis and other data management related tasks by
eradicating errors and normalizing raw data before it is processed. It is critical, but takes a lot of time and
might require specific skills.

Now, however, with a smart data preparation tool, the process has become faster and more accessible to a
wider variety of users.

CASE STUDY

What is Data Exploration?

Data exploration definition: Data exploration refers to the initial step in data analysis in which data
analysts use data visualization and statistical techniques to describe dataset characterizations, such as size,
quantity, and accuracy, in order to better understand the nature of the data.

Data exploration techniques include both manual analysis and automated data exploration software
solutions that visually explore and identify relationships between different data variables, the structure of
the dataset, the presence of outliers, and the distribution of data values in order to reveal patterns and
points of interest, enabling data analysts to gain greater insight into the raw data.

Data is often gathered in large, unstructured volumes from various sources and data analysts must first
understand and develop a comprehensive view of the data before extracting relevant data for further
analysis, such as univariate, bivariate, multivariate, and principal components analysis.

Data Exploration Tools

Manual data exploration methods entail either writing scripts to analyze raw data or manually filtering
data into spreadsheets. Automated data exploration tools, such as data visualization software, help data
scientists easily monitor data sources and perform big data exploration on otherwise overwhelmingly large
datasets. Graphical displays of data, such as bar charts and scatter plots, are valuable tools in visual data
exploration.

A popular tool for manual data exploration is Microsoft Excel spreadsheets, which can be used to create
basic charts for data exploration, to view raw data, and to identify the correlation between variables. To
identify the correlation between two continuous variables in Excel, use the function CORREL() to return
the correlation. To identify the correlation between two categorical variables in Excel, the two-way table
method, the stacked column chart method, and the chi-square test are effective.
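
For readers working outside Excel, a comparable check can be sketched in Python with pandas and SciPy. This is an equivalent, not a prescribed tool, and the data below is made up.

# Continuous-continuous correlation and a categorical-categorical chi-square test in Python.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10],
    "exam_score": [55, 60, 68, 74, 83],
    "gender": ["F", "M", "F", "F", "M"],
    "passed": ["yes", "yes", "yes", "no", "yes"],
})

print(df["hours_studied"].corr(df["exam_score"]))   # analogous to Excel's CORREL()

table = pd.crosstab(df["gender"], df["passed"])     # the two-way table method
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(p_value)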

There is a wide variety of proprietary automated data exploration solutions, including business
intelligence tools, data visualization software, data preparation software vendors, and data exploration
platforms. There are also open source data exploration tools that include regression capabilities and
visualization features, which can help businesses integrate diverse data sources to enable faster data
exploration. Most data analytics software includes data visualization tools.


Why is Data Exploration Important?

Humans process visual data better than numerical data; therefore, it is extremely challenging for data
scientists and data analysts to assign meaning to thousands of rows and columns of data points and
communicate that meaning without any visual components.

Data visualization in data exploration leverages familiar visual cues such as shapes, dimensions, colors,
lines, points, and angles so that data analysts can effectively visualize and define the metadata, and then
perform data cleansing. Performing the initial step of data exploration enables data analysts to better
understand and visually identify anomalies and relationships that might otherwise go undetected.

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA), similar to data exploration, is a statistical technique to analyze data
sets for their broad characteristics. Visualization tools for exploratory data analysis such as
OmniSci’s Immerse platform enable interactivity with raw data sets, giving analysts increased visibility into
the patterns and relationships within the data.

Data Exploration in GIS

GIS (Geographic Information Systems) is a framework for gathering and analyzing data connected to
geographic locations and their relation to human or natural activity on Earth. With so much of the
world's data now being location-enriched, geospatial analysts are faced with a rapidly increasing volume
of geospatial data.

Advanced GIS software solutions and tools can facilitate the incorporation of spatio-temporal
analysis into existing big data analytics workflows, enabling data analysts to easily create and share
intuitive data visualizations that will aid in spatial data exploration. The ability to characterize and narrow
down raw data is an essential step for spatial data analysts who may be faced with millions of polygons
and billions of mapped points. For example, learn about the ways GIS technologies are improving disaster
response operations.

Data Exploration in Machine Learning

A machine learning project is only as good as the foundation of data on which it is built. In order to perform well, machine learning models must ingest large quantities of data, and model accuracy will suffer if that data is not thoroughly explored first. Data exploration steps to follow before building a machine learning model (illustrated in the sketch after this list) include:

Variable identification: define each variable and its role in the dataset
Univariate analysis: for continuous variables, build box plots or histograms for each variable
independently; for categorical variables, build bar charts to show the frequencies
Bivariate analysis: determine the interaction between variables by building visualizations
   Continuous and continuous: scatter plots
   Categorical and categorical: stacked column charts
   Categorical and continuous: box plots combined with swarm plots
Detect and treat missing values
Detect and treat outliers
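
A brief sketch of these steps with pandas and matplotlib (the dataset and column names are hypothetical):

# Univariate and bivariate exploration before modelling; all data here is invented.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 45, 29, 41, 52, 31, 38],              # continuous
    "income": [30, 48, 61, 39, 58, 72, 44, 50],           # continuous (thousands)
    "segment": ["A", "B", "A", "A", "B", "B", "A", "B"],  # categorical
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Univariate analysis
df["age"].plot(kind="hist", ax=axes[0, 0], title="age")                        # continuous: histogram
df["segment"].value_counts().plot(kind="bar", ax=axes[0, 1], title="segment")  # categorical: frequencies

# Bivariate analysis
df.plot(kind="scatter", x="age", y="income", ax=axes[1, 0])   # continuous vs. continuous
df.boxplot(column="income", by="segment", ax=axes[1, 1])      # categorical vs. continuous

# Detect missing values and outliers
print(df.isna().sum())
print(df.describe())

plt.tight_layout()
plt.show()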

The ultimate goal of data exploration in machine learning is to provide data insights that will inspire
subsequent feature engineering and the model-building process. Feature engineering facilitates the
machine learning process and increases the predictive power of machine learning algorithms by creating
features from raw data.

Interactive Data Exploration

Advanced visualization techniques are employed throughout a variety of disciplines to empower users to
visualize patterns and gain insight from complex data flows, and make subsequent data-driven decisions.
Industries from engineering to medicine to education are learning how to do data exploration.

In big data exploration tools, interactivity is an important component in the perception of data
exploration visual technologies and the dissemination of insights. The manner in which users perceive and
interact with visualizations can heavily influence their understanding of the data as well as the value they
place on the visualization system in general.

Interactive data exploration emphasizes the importance of collaborative work and facilitates human
interaction with the integration of advanced interaction and visualization technologies. Accelerated
multimodal interaction platforms equipped with graphical user interfaces that prioritize human-to-human
properties facilitate big data exploration through visual analytics, accelerate the sharing of opinions,
remove the data bottleneck of individual analysis, and reduce discovery time.

What is the Best Language for Data Exploration?

The most popular programming tools for data science are currently R and Python, both highly flexible,
open source data analytics languages. R is generally best suited for statistical learning as it was built as a
statistical language. Python is generally considered the best choice for machine learning with its flexibility
for production. The best language for data exploration depends entirely on the application at hand and
available tools and technologies.
Data Exploration in Python

Data exploration with Python has the advantages of ease of learning, production readiness, integration with common tools, abundant libraries, and support from a huge community. Nearly every toolkit and piece of functionality is packaged and can be executed by simply calling the name of a method.

Python data exploration is made easier with Pandas, the open source Python data analysis library; combined with profiling extensions (such as pandas-profiling), it can profile a dataframe and generate a complete HTML report on the dataset. Once Pandas is imported, it allows users to import files in a variety of formats, the most popular format being CSV. The
pandas data exploration library provides:

Efficient dataframe object for data manipulation with integrated indexing


Tools for reading and writing data between disparate formats
Integrated handling of missing data and intelligent data alignment
Flexible pivoting and reshaping of datasets
Time series functionality
Intelligent label-based slicing, fancy indexing, and subsetting of large datasets
Columns can be inserted and deleted from data structures for size mutability
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine
operations on datasets
High performance merging and joining of datasets
Hierarchical axis indexing

Techniques for how to improve data exploration using Pandas are discussed at length in expansive Python
community forums.
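
As a minimal first-pass sketch, a typical Pandas exploration session might look like the following. The file name survey.csv is a placeholder, not a dataset referenced in this unit.

# A first pass over a CSV file with pandas; "survey.csv" is a hypothetical file name.
import pandas as pd

df = pd.read_csv("survey.csv")      # CSV is the most popular import format, as noted above

print(df.head())                    # view the raw data
df.info()                           # column types and non-null counts
print(df.describe(include="all"))   # summary statistics for numeric and categorical columns
print(df.isna().sum())              # where the missing data is
print(df.corr(numeric_only=True))   # correlations between the numeric variables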

Data Exploration in R

The data exploration and visualization process with R looks like this:

Loading the data: Due to the availability of predefined libraries and simple syntax, loading data from
a variety of formats, such as .XLS, TXT, CSV, and JSON, is very straightforward
Converting variables: The process of converting a variable into a different data type in R entails, for example, adding a character string to a numeric vector, which converts all the elements in the vector to character type
Transpose a dataset: R provides code to transpose a dataset from a wide structure to a much
narrower structure
Sorting of dataframe: accomplished by using order as an index
Create plots or histograms
Generate frequency tables to best understand the distribution across categories
Generate a sample set with just a few random indices
Remove duplicate values of a variable
Find class-level counts, averages, and sums: R data exploration techniques include apply functions to accomplish this
Recognize and treat missing values and outliers, for example by imputing with the mean of the other values
Merge and join datasets: R includes an appending datasets function and a bind function

What is the Relationship Between Data Exploration and Data Mining?

There are two primary methods for retrieving relevant data from large, unorganized pools: data
exploration, which is the manual method, and data mining, which is the automatic method. Data mining,
a field of study within machine learning, refers to the process of extracting patterns from data with the
application of algorithms. Data exploration and visualization provide guidance in applying the most
effective further statistical and data mining treatment to the data.

Once the relationships between the different variables have been revealed, analysts can proceed with the
data mining process by building and deploying data models equipped with the new insights gained. Data
exploration and data mining are sometimes used interchangeably.

Data Discovery vs Data Exploration

Once data exploration has refined the data, data discovery can begin. Data discovery is the business-user-
oriented process for exploring data and answering highly specific business questions. This iterative process
seeks out patterns and looks at clusters, sequences of events, specific trends, and time-series analysis, and
plays an integral part in business intelligence systems, providing visual navigation of data and facilitating
the consolidation of all business information.

Most popular data discovery tools provide data exploration and preparation and modeling capabilities,
support visual and digestible data representations, allow interactive navigation and sharing options,
support access to data sources, and offer seamless integration of data preparation, analysis, and analytics.
Learn how OmniSci's converged analytics platform integrates these capabilities to derive insights from
your largest datasets at the speed of curiosity.

Data Examination vs Data Exploration

Data examination and data exploration are effectively the same process. Data examination assesses the
internal consistency of the data as a whole for the purpose of confirming the quality of the data for
subsequent analysis. Internal consistency reliability is an assessment based on the correlations between
different items on the same test. This assessment gauges the reliability of a test or survey that is designed
to measure the same construct for different items.
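
The text does not name a specific statistic, but Cronbach's alpha is one widely used internal-consistency measure; the sketch below computes it from made-up item scores purely for illustration.

# Cronbach's alpha from a respondents-by-items score matrix (invented data).
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)        # rows = respondents, columns = items
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

scores = [[4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2], [4, 4, 5]]
print(round(cronbach_alpha(scores), 2))   # values near 1 indicate high internal consistency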

Describing and Examining Data

Information is the currency of research. However, unlike real money, there is often too much information.
In most cases, data must be summarized to be useful.

The most common method of summarizing data is with descriptive statistics and graphs. Even if you're
planning to analyze your data using a statistical technique such as a t-test, analysis of variance, or logistic
regression, you should always begin by examining your data. This preliminary step helps you determine
which statistical analysis techniques should be used to answer your research questions.

In fact, this process of examining your data often reveals information that will surprise or inform you. You
may discover unusually high or low values in your data. Perhaps these “outliers” are caused by incorrectly
coded data, or they may reveal information about your data (or subjects) that you have not anticipated.
You might observe that your data are not normally distributed. You might notice that a histogram of your
data shows two distinct peaks, causing you to realize that your data show a difference between genders.
Insights such as these often result from the proper use of descriptive techniques.
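
A short sketch of this preliminary examination (the blood-pressure and gender values are hypothetical; pandas and matplotlib are assumed tools, not ones prescribed by the text):

# Describing quantitative and categorical data before formal analysis; the data is invented.
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "systolic_bp": [118, 126, 134, 121, 160, 115, 129],   # quantitative (note the possible outlier, 160)
    "gender": ["F", "M", "F", "F", "M", "M", "F"],          # categorical
})

print(data["systolic_bp"].describe())                      # quantitative: summary statistics
data["systolic_bp"].plot(kind="hist", title="systolic_bp")

print(data["gender"].value_counts())                       # categorical: frequency counts
plt.figure()
data["gender"].value_counts().plot(kind="bar", title="gender")
plt.show()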

The following sections of this chapter discuss the most commonly used tactics for understanding your
data for reporting information or preparing your data for further analysis. The two major topics discussed
in this chapter are as follows:
Describing quantitative data using statistics and graphs
Describing categorical data using statistics and graphs

How To Present Research Data?

The results section of an original research paper provides the answer to the question “What was found?” The amount of findings generated in a typical research project is often much more than a medical journal can accommodate in one article. So, the first thing the author needs to do is to make a selection of what is worth presenting. Having decided that, he/she will need to convey the message effectively using a mixture of text, tables and graphics. The level of detail required depends a great deal on the target audience of the paper. Hence it is important to check the requirements of the journal we intend to send the paper to (e.g. the Uniform Requirements for Manuscripts Submitted to Medical Journals). This article condenses some common general rules on the presentation of research data that we find useful.

SOME GENERAL RULES

Keep it simple. This golden rule seems obvious, but authors who have immersed themselves in their data sometimes fail to realise that readers are lost in the mass of data they are a little too keen to present. Presenting too much information tends to cloud the most pertinent facts that we wish to convey.
First general, then specific. Start with the response rate and a description of the research participants (this information gives the readers an idea of the representativeness of the research data), then the key findings and relevant statistical analyses.
Data should answer the research questions identified earlier.
Leave the process of data collection to the methods section. Do not include any discussion. These errors are surprisingly quite common.
Always use past tense in describing results.
Text, tables or graphics? These complement each other in providing clear reporting of research
findings. Do not repeat the same information in more than one format. Select the best method to
convey the message.
