
LANGUAGE TRANSLATION USING MACHINE LEARNING

A MINI PROJECT REPORT


18CSC305J - ARTIFICIAL INTELLIGENCE

Submitted by
Jaya Lohith(RA2111026010206)
S.D Azhar(RA2111026010220)
Aamir Mayan(RA2111026010257)

Under the guidance of

Dr. A Robert Singh


Assistant Professor, Department of Computational Intelligence
in partial fulfillment for the award of the
degree of

BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE & ENGINEERING


of

FACULTY OF ENGINEERING AND TECHNOLOGY

S.R.M. Nagar, Kattankulathur, Chengalpattu District


MAY 2024

SRM INSTITUTE OF SCIENCE AND
TECHNOLOGY
(Under Section 3 of UGC Act, 1956)

BONAFIDE CERTIFICATE

Certified that the mini project report titled "LANGUAGE TRANSLATION USING MACHINE LEARNING" is the bona fide work of Jaya Lohith (RA2111026010206), S.D Azhar (RA2111026010220), and Aamir Mayan (RA2111026010257), who carried out the mini project under my supervision. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other project report or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

SIGNATURE SIGNATURE

Dr. A Robert Singh Dr. R Annie Uthra


Assistant Professor Head of the Department
CINTEL CINTEL

ABSTRACT

 Language translation has undergone a significant transformation with the advent of machine learning techniques, particularly neural machine translation (NMT). Traditional rule-based and statistical approaches have been supplanted by deep learning architectures like recurrent neural networks (RNNs) and transformers. These advancements have vastly improved translation quality and efficiency by leveraging mechanisms such as attention and self-attention, which enable models to capture complex linguistic structures and nuances effectively.

 However, challenges persist, including data scarcity for low-resource languages, domain adaptation issues, and the need to mitigate bias and cultural nuances. Ongoing research efforts are addressing these challenges through techniques like data augmentation, multi-task learning, and adversarial training. Despite these challenges, machine learning-based translation systems have found wide-ranging applications in domains such as e-commerce, healthcare, legal, and diplomacy, facilitating cross-cultural communication and global collaboration.

 In conclusion, recent developments in machine learning have revolutionized language translation, empowering NMT systems to handle diverse linguistic contexts with remarkable accuracy. While challenges remain, ongoing research is driving innovation to overcome these hurdles and further enhance the capabilities and applicability of machine learning-based translation systems in diverse real-world scenarios.

TABLE OF CONTENTS

1. ABSTRACT

2. TABLE OF CONTENTS

3. INTRODUCTION

4. LITERATURE SURVEY

5. SYSTEM ARCHITECTURE AND DESIGN

6. METHODOLOGY

7. CODING AND TESTING

8. OUTPUT

9. CONCLUSION AND FUTURE ENHANCEMENT

10. REFERENCES

CHAPTER I
INTRODUCTION

• In this project, we build a deep neural network that functions as part of a machine translation pipeline. The pipeline accepts English text as input and returns the French translation. The goal is to achieve the highest translation accuracy possible.

• To translate a corpus of English text to French, we need to build a recurrent neural network (RNN). Before diving into the implementation, let's first build some intuition about RNNs and why they're useful for NLP tasks.

• Depending on the use case, you'll want to set up your RNN to handle inputs and outputs differently. For this project, we'll use a many-to-many process where the input is a sequence of English words and the output is a sequence of French words.

• Below is a summary of the various preprocessing and modeling steps. The high-level steps include:

1. Preprocessing: load and examine data, cleaning, tokenization, padding.

2. Modelling: build, train, and test the model.

3. Prediction: generate specific translations of English to French, and compare the output translations to the ground truth translations.

4. Iteration: iterate on the model, experimenting with different architectures.
• We use Keras for the frontend and TensorFlow for the backend in this
project. I prefer using Keras on top of TensorFlow because the syntax is
simpler, which makes building the model layers more intuitive.
However, there is a trade-off with Keras as you lose the ability to do
fine-grained customizations. But this won’t affect the models we’re
building in this project.

CHAPTER II
LITERATURE SURVEY

 Digitalization is not a new process, per se, but the speed and scope of digital processes, and their integration in companies and labour, have significantly increased, according to the latest research. The digital transformation is quickly changing the world of work in the developed societies of Europe and North America. Although the view that new technologies are not deterministic currently prevails, their implementation impacts both labour and employment. The debate on the future of work in recent years has been dominated by pessimistic scenarios of job extinction.
 The translating profession is considered creative [7] because translators solve communication problems in different cultural and communicational environments, which is a highly creative activity. In recent years, this particular profession has changed in a few ways. Firstly, the share of self-employed translators is increasing worldwide at the expense of permanently employed ones. Secondly, a decent share of job and task seeking is taking place through online work-finding platforms. According to a forecast, in 2025 online platforms are expected to be accountable for about one third of all labour relationships [8].
 Finally, interest in new technologies has been growing in translation work. This kind of digitalization started in the 1990s, with the implementation of machine translation technologies. This was most marked in agencies, government services, and multinational companies, where translations, primarily of technical documentation, were produced on a large scale. This was the major market for the mainframe systems Systran, Logos, METAL, and ATL. Currently, the opinion that the translating profession will be heavily impacted by digitalization, automation, and AI prevails in the literature. However, in practice, the data on the adoption of these technologies in translation does not support such a deterministic view.
 For example, a relatively recent survey shows that digital tools are widely used in Denmark, but machine translation tools are still less common. Further, relatively recent research in Spanish-speaking countries shows that 45% of surveyed companies offering language services in Spanish already use machine translation. There is a predominant view that these technologies will further penetrate translation work.
 This process opens up a number of debates. For example, some authors point out that it is necessary to re-think the freedom of translation, as translation technology is quickly ousting human translators. According to others, this process has to change the way translators are educated, and the way they work.
 Some publications even show that translators are leaving the profession because of the technological impact, but no large-scale research results have been published that would shed light on the numbers or reasons behind this process. In this respect, the present study of their motivation, views, and expectations contributes not only to the body of research on a specific creative profession, but to the future development of labour in general.

CHAPTER III
SYSTEM ARCHITECTURE AND DESIGN

 First, let's break down the architecture of an RNN at a high level.


o Inputs:
Input sequences are fed into the model with one word for
every time step. Each word is encoded as a unique integer or
one-hot encoded vector that maps to the English dataset
vocabulary.
o Embedding Layers:
Embeddings are used to convert each word to a vector. The size
of the vector depends on the complexity of the vocabulary.
o Recurrent Layers (Encoder):
This is where the context from word vectors in previous time
steps is applied to the current word vector.
o Dense Layers (Decoder):
These are typical fully connected layers used to decode
the encoded input into the correct translation sequence.
o Outputs:
The outputs are returned as a sequence of integers or one-hot
encoded vectors which can then be mapped to the French dataset
vocabulary.
 Embeddings allow us to capture more precise syntactic and semantic word relationships. This is achieved by projecting each word into n-dimensional space. Words with similar meanings occupy similar regions of this space; the closer two words are, the more similar they are. And often the vectors between words represent useful relationships, such as gender, verb tense, or even geopolitical relationships.

 Since our dataset for this project has a small vocabulary and low
syntactic variation, we’ll use Keras to train the embeddings ourselves.
 Our sequence-to-sequence model links two recurrent networks: an
encoder and decoder. The encoder summarizes the input into a context
variable, also called the state. This context is then decoded and the
output sequence is generated.

 Since both the encoder and decoder are recurrent, they have loops which process each part of the sequence at different time steps. To picture this, it's best to unroll the network so we can see what's happening at each time step.
 As an example, suppose it takes four time steps to encode the entire input sequence. At each time step, the encoder "reads" the input word and performs a transformation on its hidden state. Then it passes that hidden state to the next time step. The bigger the hidden state, the greater the learning capacity of the model, but also the greater the computation requirements.

 For now, notice that for each time step after the first word in the sequence
there are two inputs: the hidden state and a word from the sequence. For
the encoder, it’s the next word in the input sequence. For the decoder, it’s
the previous word from the output sequence.
 To implement bidirectionality, we train two RNN layers simultaneously: the first layer is fed the input sequence as-is and the second is fed a reversed copy. A minimal sketch of how these pieces fit together in Keras is given below.
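
The following is a minimal, illustrative sketch of how the components described above (embedding, bidirectional encoder, decoder, dense output) could be wired together with Keras. The layer sizes are assumptions chosen only for illustration; this is not the exact model trained in Chapter V.

# Illustrative encoder-decoder sketch; layer sizes are assumptions, not the trained model from Chapter V
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, GRU, RepeatVector, TimeDistributed, Dense

def encoder_decoder_sketch(input_length, output_length, english_vocab_size, french_vocab_size):
    model = Sequential()
    # Embedding: map each English word ID to a dense vector
    model.add(Embedding(english_vocab_size, 128, input_length=input_length))
    # Encoder: a bidirectional GRU summarizes the sentence into a single context vector
    model.add(Bidirectional(GRU(128)))
    # Repeat the context vector once for every output time step
    model.add(RepeatVector(output_length))
    # Decoder: a GRU unrolls the context into an output sequence
    model.add(GRU(128, return_sequences=True))
    # Softmax over the French vocabulary at every time step
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))
    return model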

CHAPTER IV
METHODOLOGY

 Load & Examine Data
o The inputs are sentences in English; the outputs are the corresponding translations in French.
o When we run a word count, we can see that the vocabulary for the dataset is quite small. This was by design for this project. This allows us to train the models in a reasonable time.

 Cleaning
o No additional cleaning needs to be done at this point. The data has already been converted to lowercase and split so that there are spaces between all words and punctuation.
o For other NLP projects you may need to perform additional steps such as: remove HTML tags, remove stop words, remove punctuation or convert to tag representations, label the parts of speech, or perform entity extraction.

 Tokenization
o Next, we need to tokenize the data, i.e., convert the text to numerical values. This allows the neural network to perform operations on the input data. For this project, each word and punctuation mark will be given a unique ID. (For other NLP projects, it might make sense to assign each character a unique ID.)
o When we run the tokenizer, it creates a word index, which is then used to convert each sentence to a vector.
 Padding
o When we feed our sequences of word IDs into the model, each sequence needs to be the same length. To achieve this, padding is added to any sequence that is shorter than the max length (i.e. shorter than the longest sentence). A short tokenization and padding example is sketched at the end of this chapter.

 Encoding and Decoding
o The encoder summarizes the input into a context variable, also called the state. This context is then decoded and the output sequence is generated.
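
As a small illustration of the tokenization and padding steps above, the sketch below runs the Keras Tokenizer and pad_sequences utilities on a couple of made-up sentences; the real tokenizers are fit on the project dataset in the next chapter.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

sample_sentences = ['the cat is small', 'the small cat likes warm milk']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sample_sentences)                    # build the word index
sequences = tokenizer.texts_to_sequences(sample_sentences)  # words -> integer IDs
padded = pad_sequences(sequences, padding='post')           # pad shorter sequences with trailing zeros
print(tokenizer.word_index)  # e.g. {'the': 1, 'cat': 2, 'small': 3, ...}
print(padded)                # both rows now have the same length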

CHAPTER V
CODING AND TESTING
import helper
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
import tensorflow as tf
import os

english_path = 'https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_en.txt'
french_path = 'https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_fr.txt'

def load_data(path):
    input_file = os.path.join(path)
    with open(input_file, "r") as f:
        data = f.read()
    return data.split('\n')

# Downloading the dataset files
english_data = tf.keras.utils.get_file('file1', english_path)
french_data = tf.keras.utils.get_file('file2', french_path)

# Now loading data
english_sentences = load_data(english_data)
french_sentences = load_data(french_data)

for i in range(5):
    print('Sample :', i)
    print(english_sentences[i])
    print(french_sentences[i])
    print('-' * 50)

import collections

english_words_counter = collections.Counter(
    [word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter(
    [word for sentence in french_sentences for word in sentence.split()])
print('English Vocab:', len(english_words_counter))
print('French Vocab:', len(french_words_counter))

def tokenize(x):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer

# Tokenize Sample output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']

text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()

for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

def pad(x, length=None):
    return pad_sequences(x, maxlen=length, padding='post')

def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    # Expanding dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer = \
    preprocess(english_sentences, french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
def logits_to_text(logits, tokenizer):
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = ''
    # For each time step, pick the highest-scoring word ID and map it back to a word
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build an RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Hyperparameters
    learning_rate = 0.005

    # Build the layers
    model = Sequential()
    model.add(Embedding(english_vocab_size, 256, input_length=input_shape[1],
                        input_shape=input_shape[1:]))
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    # Compile model
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index) + 1,
    len(french_tokenizer.word_index) + 1)

simple_rnn_model.summary()

history = simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024,
                               epochs=20, validation_split=0.2)
simple_rnn_model.save('model.h5')

def final_predictions(text):
    y_id_to_word = {value: key for key, value in french_tokenizer.word_index.items()}
    y_id_to_word[0] = ''
    sentence = [english_tokenizer.word_index[word] for word in text.split()]
    sentence = pad_sequences([sentence], maxlen=preproc_french_sentences.shape[-2],
                             padding='post')
    print(sentence.shape)
    print(logits_to_text(simple_rnn_model.predict(sentence[:1])[0], french_tokenizer))

import re
txt = input().lower()
final_predictions(re.sub(r'[^\w]', ' ', txt))
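
To compare a model translation against its ground-truth French sentence (the Prediction step described in Chapter I), one might add a short check along the lines of the sketch below; it assumes the variables defined above (tmp_x, simple_rnn_model, english_sentences, french_sentences) are still in scope.

# Sketch: compare the model's translation of a training sample to the reference French sentence
sample_index = 0  # any index into the dataset
prediction = simple_rnn_model.predict(tmp_x[sample_index:sample_index + 1])[0]
print('English input :', english_sentences[sample_index])
print('Model output  :', logits_to_text(prediction, french_tokenizer))
print('Ground truth  :', french_sentences[sample_index])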

CHAPTER VI
OUTPUT

CHAPTER VII
CONCLUSION AND FUTURE ENHANCEMENT

 We have successfully constructed a machine learning model that translates English to French using an RNN; it yielded about 79% accuracy after 27 training epochs.

 Future Improvements:
o Do proper data split (training, validation, test):
Currently, there is no test set, only training and validation; a small sketch of such a split is given after this list.
o LSTM + attention:
This has been the de facto architecture for RNNs over the past few
years, although there are some limitations.
o Train on a larger and more diverse text corpus:
The text corpus and vocabulary for this project are quite small with
little variation in syntax. As a result, the model is very brittle. To
create a model that generalizes better, you’ll need to train on a
larger dataset with more variability in grammar and sentence
structure.
o Residual layers:
You could add residual layers to a deep LSTM RNN, as described
in this paper. Or, use residual layers as an alternative to LSTM
and GRU, as described here.
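
As a rough sketch of the first improvement above (a proper training/validation/test split), one could hold out a test set before fitting and report final accuracy only on data the model has never seen. The 80/10/10 proportions and the use of scikit-learn's train_test_split here are illustrative assumptions, not part of the current implementation.

from sklearn.model_selection import train_test_split

# Hold out 10% of the data for testing, then 10% of the remainder for validation
x_train, x_test, y_train, y_test = train_test_split(
    tmp_x, preproc_french_sentences, test_size=0.1, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.1, random_state=42)

simple_rnn_model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     batch_size=1024, epochs=20)
print(simple_rnn_model.evaluate(x_test, y_test))  # loss and accuracy on unseen data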

REFERENCES
https://towardsdatascience.com/language-translation-with-rnns-d84d43b40571
https://tommytracey.github.io/AIND-Capstone/machine_translation.html
