LANGUAGE DETECTION & TRANSLATION
CONTENTS
Problem Statement
Dataset
Methodology
Results
Conclusion
PROBLEM STATEMENT
In the modern interconnected world, where information effortlessly crosses borders and language divides, the need for automatic language detection and translation systems has become pressing. Our task is to create a system that can reliably identify the language of any given text input and translate it into English. By handling a variety of languages, our system aims to bridge linguistic gaps, enabling effective communication and collaboration across diverse cultural landscapes.
DATASET
• The dataset contains a total of 22 languages, each represented by 1000 text samples.
• It covers a diverse range of languages from various regions and linguistic families, including English, Russian, Spanish, Hindi, Chinese, French, Portuguese, Urdu, Arabic, and others.
• Every language class has exactly 1000 samples, giving an equal, balanced distribution across classes.
• This balanced distribution facilitates unbiased model training and evaluation across the different languages (a quick balance check is sketched below).
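A minimal loading and balance-check sketch in Python, assuming the dataset is a single CSV file with "Text" and "Language" columns; the file name and column names are assumptions, not taken from the slides.

```python
# A sketch of loading the dataset and checking class balance.
# The file name and the "Text" / "Language" column names are assumptions.
import pandas as pd

df = pd.read_csv("language_detection.csv")   # hypothetical file name

# Each of the 22 languages should appear exactly 1000 times.
print(df["Language"].value_counts())
print("Number of languages:", df["Language"].nunique())
```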
METHODOLOGY
Vectorization
• TF-IDF (Term Frequency-Inverse Document Frequency).
• This technique considers the frequency of a word within a document as well as its frequency across the other documents in the corpus.
• It assigns higher weights to words that appear frequently in a document but rarely across the rest of the corpus.
• Text data → numerical feature vectors (see the sketch below).
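Continuing the loading sketch above, a minimal TF-IDF vectorization step using scikit-learn's TfidfVectorizer; the 80/20 split, random seed, and default vectorizer settings are assumptions.

```python
# A sketch of the TF-IDF vectorization step (scikit-learn).
# The 80/20 split and the random seed are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["Language"],
    test_size=0.2, random_state=42, stratify=df["Language"],
)

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights on the training text only,
# then map both splits to sparse numerical feature vectors.
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
```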
METHODOLOGY
Model 2 - RandomForestClassifier
• RandomForestClassifier builds a forest of decision trees using the TF-IDF feature vectors as input and the
language labels as targets.
• Each decision tree in the forest is trained on a bootstrap sample of the training data, drawn with replacement.
• At each node of the tree, a random subset of features is considered for splitting, helping to reduce correlation
between trees and improve model diversity.
• Given a new text document, after vectorization the RandomForestClassifier aggregates the predictions of all decision trees in the forest and assigns the final predicted language by majority voting or by averaging the class probabilities (see the sketch below).
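A minimal sketch of this model with scikit-learn's RandomForestClassifier, continuing from the TF-IDF features above; the number of trees and the example sentence are assumptions.

```python
# A sketch of the Random Forest model on the TF-IDF features.
# n_estimators and the sample sentence are assumptions.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)   # each tree is fit on a bootstrap sample

# Prediction aggregates the votes of all trees in the forest.
sample = vectorizer.transform(["Bonjour, comment allez-vous ?"])
print(rf.predict(sample))        # e.g. ['French']
```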
METHODOLOGY
Model 3 - LogisticRegression
• Logistic Regression calculates the probability that the text belongs to each language class using the learned
weights and bias terms.
• The final predicted language is typically the one with the highest predicted probability.
• While both MNB and Logistic Regression can perform well on text classification tasks, Multinomial Naive Bayes is well suited to large text datasets because it is fast and copes well with very large vocabularies. Logistic Regression, on the other hand, can be the better choice when more complex patterns must be captured, since it can handle different kinds of features and model the relationships between them (both models are sketched below).
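A minimal sketch of Logistic Regression alongside the Multinomial Naive Bayes model it is compared with, on the same TF-IDF features from the vectorization step; max_iter and the example text are assumptions.

```python
# A sketch of Logistic Regression (and the Multinomial Naive Bayes model
# it is compared with) on the TF-IDF features. max_iter is an assumption.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_tfidf, y_train)

mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)

# Logistic Regression yields a probability per language class;
# the predicted language is the class with the highest probability.
text = vectorizer.transform(["Hola, ¿cómo estás?"])
probs = log_reg.predict_proba(text)
print(log_reg.classes_[probs.argmax()])   # e.g. 'Spanish'
```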
RESULTS
RANDOM FOREST
• Accuracy score on training data: 1.0
• Accuracy score on testing data: 0.9225757575757576
LOGISTIC REGRESSION
• Accuracy score on training data: 0.9864935064935065
• Accuracy score on testing data: 0.9540909090909091
COMPARISON
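The slides compare the models' accuracy scores side by side; below is a minimal sketch of how such a comparison could be produced from the models trained above. The figures reported in the results slides come from the authors' own runs, not from this code.

```python
# A sketch of comparing training and testing accuracy for the three models.
from sklearn.metrics import accuracy_score

models = {
    "Multinomial Naive Bayes": mnb,
    "Random Forest": rf,
    "Logistic Regression": log_reg,
}
for name, model in models.items():
    train_acc = accuracy_score(y_train, model.predict(X_train_tfidf))
    test_acc = accuracy_score(y_test, model.predict(X_test_tfidf))
    print(f"{name}: train accuracy = {train_acc:.4f}, test accuracy = {test_acc:.4f}")
```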
CONCLUSION
In conclusion, our language detection and translation project aimed to automatically identify the language of a given text input and translate it
into English. We experimented with three different machine learning models: Multinomial Naive Bayes (MNB), Logistic Regression, and Random
Forest. However, despite our efforts, none of these models were able to provide accurate predictions for all languages. This indicates that the
problem at hand may require more sophisticated techniques beyond traditional machine learning approaches.
Considering the limitations encountered with the existing models, it is evident that Natural Language Processing (NLP) techniques could offer a
more effective solution. NLP methods, such as deep learning models like recurrent neural networks (RNNs) or transformer-based architectures
like BERT, have demonstrated superior performance in language-related tasks, including language identification and translation. These models
can learn complex linguistic patterns and relationships, capturing details that traditional machine learning models may struggle with.
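As an illustration of the transformer-based direction suggested above, a hedged sketch using the Hugging Face transformers pipeline; the specific pretrained checkpoint is an assumption chosen for illustration and was not used in this project.

```python
# A sketch of the transformer-based direction suggested above, using the
# Hugging Face `transformers` pipeline. The checkpoint name is an
# illustrative assumption and not part of this project.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)
print(detector("Bonjour tout le monde"))
# e.g. [{'label': 'fr', 'score': 0.99...}]
```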
THANK YOU
Nikita Chorge
PRN: 23070243006