Ai Final Print
Submitted by
Jaya Lohith (RA2111026010206)
S.D Azhar (RA2111026010220)
Aamir Mayan (RA2111026010257)
BACHELOR OF TECHNOLOGY
in
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
(Under Section 3 of UGC Act, 1956)
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
ABSTRACT
TABLE OF CONTENTS
1. ABSTRACT
2. TABLE OF CONTENTS
3. INTRODUCTION
4. LITERATURE SURVEY
5. SYSTEM ARCHITECTURE AND DESIGN
6. METHODOLOGY
7. CODING AND TESTING
8. OUTPUT
9. CONCLUSION AND FUTURE ENHANCEMENT
10. REFERENCES
CHAPTER I
INTRODUCTION
CHAPTER II
LITERATURE SURVEY
Digitalization is not a new process, per se, but the speed and scope
of digital processes, and their integration in companies and labour,
have significantly increased, according to the latest research. The
digital
transformation is quickly changing the world of work in the
developed societies of Europe and North America. Although the view
that new
technologies are not deterministic currently prevails, their
implementation impacts both labour and employment. The debate on the
future of work in recent years has been dominated by pessimistic
scenarios of job
extinction.
The translating profession is considered creative because translators solve communication problems in different cultural and communicative environments. In recent years, this profession has changed in several ways. Firstly, the share of self-employed translators is increasing worldwide at the expense of permanently employed ones. Secondly, a considerable share of job and task seeking now takes place through online work-finding platforms. According to one forecast, by 2025 online platforms are expected to account for about one third of all labour relationships.
Finally, interest in new technologies has been growing in translation work. This kind of digitalization started in the 1990s with the implementation of machine translation technologies. It was most marked in agencies, government services, and multinational companies, where translations, primarily of technical documentation, were produced on a large scale. This was the major market for the mainframe systems Systran, Logos, METAL, and ATL. Currently, the opinion that the translating profession will be heavily impacted by digitalization, automation, and AI prevails in the literature. In practice, however, the data on the implementation of these technologies in translation does not paint such a deterministic picture.
For example, a relatively recent survey shows that digital tools are
widely used in Denmark, but machine translation tools are still less
common.
Further, relatively recent research in Spanish-speaking countries shows that 45% of the surveyed companies offering language services in Spanish already use machine translation. The predominant view is that these technologies will penetrate translation work even further.
This process opens up a number of debates. For example, some authors point out that the freedom of translation needs to be rethought, as translation technology is quickly displacing human translators. According to others, this process will have to change the way translators are educated and the way they work.
Some publications even show that translators are leaving the profession because of the impact of technology, but no large-scale research results have been published that would shed light on the numbers or the reasons behind this process. In this respect, the present study of their motivation, views, and expectations contributes not only to the body of research on a specific creative profession, but also to the future development of labour in general.
CHAPTER III
SYSTEM ARCHITECTURE AND DESIGN
With word embeddings, each word is mapped to a dense vector, and often the vector differences between words represent useful relationships, such as gender, verb tense, or even geopolitical relationships. Since our dataset for this project has a small vocabulary and low syntactic variation, we'll use Keras to train the embeddings ourselves.
Our sequence-to-sequence model links two recurrent networks: an
encoder and decoder. The encoder summarizes the input into a context
variable, also called the state. This context is then decoded and the
output sequence is generated.
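To make the encoder-decoder structure concrete, the sketch below wires a GRU encoder to a GRU decoder in Keras. It is only an illustration of the idea, not the exact model trained later in this report; the vocabulary sizes, sequence lengths, and layer widths are placeholder values.

from keras.models import Model
from keras.layers import Input, GRU, Dense, Embedding, TimeDistributed, RepeatVector

# Placeholder sizes, for illustration only
src_vocab, tgt_vocab, src_len, tgt_len, units = 200, 350, 15, 21, 256

# Encoder: read the source sentence and summarize it into one context vector (the state)
encoder_inputs = Input(shape=(src_len,))
encoder_embed = Embedding(src_vocab, units)(encoder_inputs)
context = GRU(units)(encoder_embed)

# Decoder: repeat the context for every output time step and unroll a second GRU over it
repeated_context = RepeatVector(tgt_len)(context)
decoder_out = GRU(units, return_sequences=True)(repeated_context)
outputs = TimeDistributed(Dense(tgt_vocab, activation='softmax'))(decoder_out)

seq2seq = Model(encoder_inputs, outputs)
seq2seq.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
seq2seq.summary()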
Since both the encoder and decoder are recurrent, they have loops
which process each part of the sequence at different time steps. To
picture this, it’s best to unroll the network so we can see what’s
happening at each
time step.
For example, with a four-word input it takes four time steps to encode the entire sequence. At each time step, the encoder "reads" the input word and performs a transformation on its hidden state, then passes that hidden state on to the next time step. The bigger the hidden state, the greater the learning capacity of the model, but also the greater the computation requirements.
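The toy encoder below makes this visible. It is a stand-alone snippet with made-up dimensions (4 time steps, 16 hidden units) unrelated to the project's dataset; it only shows that a recurrent layer produces one hidden state per time step and that the final state is the context handed to the decoder.

import numpy as np
from keras.models import Model
from keras.layers import Input, GRU

# Toy encoder: 4 time steps, 8-dimensional inputs, 16-dimensional hidden state
inputs = Input(shape=(4, 8))
per_step_states, final_state = GRU(16, return_sequences=True, return_state=True)(inputs)
encoder = Model(inputs, [per_step_states, final_state])

x = np.random.rand(1, 4, 8).astype('float32')
states, context = encoder.predict(x)
print(states.shape)   # (1, 4, 16) - one hidden state per time step
print(context.shape)  # (1, 16)    - the final hidden state, i.e. the context
print(np.allclose(states[0, -1], context[0]))  # True: the context is the last step's state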
For now, notice that for each time step after the first word in the sequence
there are two inputs: the hidden state and a word from the sequence. For
the encoder, it’s the next word in the input sequence. For the decoder, it’s
the previous word from the output sequence.
To make the model bidirectional, we train two RNN layers simultaneously: the first layer is fed the input sequence as-is, and the second is fed a reversed copy.
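Keras provides a Bidirectional wrapper that handles the reversed copy and the concatenation of both directions for us. The snippet below is a minimal illustration with placeholder sizes, not the final model used in this project.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, GRU, TimeDistributed, Dense

model = Sequential()
model.add(Embedding(200, 256, input_length=15))  # placeholder vocabulary size and length
# Bidirectional runs one GRU over the sequence as-is and a second GRU over a
# reversed copy, then concatenates both hidden states at every time step
model.add(Bidirectional(GRU(256, return_sequences=True)))
model.add(TimeDistributed(Dense(350, activation='softmax')))  # placeholder output vocabulary
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.summary()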
CHAPTER IV
METHODOLOGY
Cleaning
o The sentences in this dataset are already lowercased and punctuation is separated from words by spaces, so little additional cleaning is needed before tokenization.
Tokenization
o Next, we need to tokenize the data — i.e., convert the text to
numerical values. This allows the neural network to perform
operations on the input data. For this project, each word and
punctuation mark will be given a unique ID. (For other NLP
projects, it might make sense to assign each character a unique
ID.)
o When we run the tokenizer, it creates a word index, which is
then used to convert each sentence to a vector.
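A minimal example of this step with Keras's Tokenizer (the sentences here are purely illustrative):

from keras.preprocessing.text import Tokenizer

sample_sentences = [
    'new jersey is sometimes quiet during autumn .',
    'the united states is usually chilly during july .']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sample_sentences)               # builds the word index
print(tokenizer.word_index)                            # e.g. {'is': 1, 'during': 2, ...}
print(tokenizer.texts_to_sequences(sample_sentences))  # each sentence becomes a list of word ids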
Padding
o When we feed our sequences of word IDs into the model, each
sequence needs to be the same length. To achieve this, padding
is added to any sequence that is shorter than the max length (i.e.
shorter than the longest sentence).
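A short illustration with Keras's pad_sequences (the id sequences are made up):

from keras.preprocessing.sequence import pad_sequences

sequences = [[17, 23, 1, 40], [5, 2], [9, 31, 6]]
padded = pad_sequences(sequences, maxlen=4, padding='post')
print(padded)
# [[17 23  1 40]
#  [ 5  2  0  0]
#  [ 9 31  6  0]]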
CHAPTER V
CODING AND TESTING
import helper
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
import tensorflow as tf
english_path = 'https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_en.txt'
french_path = 'https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_fr.txt'
import os

def load_data(path):
    input_file = os.path.join(path)
    with open(input_file, "r") as f:
        data = f.read()
    return data.split('\n')

# Download the corpus files to a local cache and load them line by line.
# (This loading step is assumed here; the corresponding cell is not shown in the original listing.)
english_sentences = load_data(tf.keras.utils.get_file('small_vocab_en.txt', english_path))
french_sentences = load_data(tf.keras.utils.get_file('small_vocab_fr.txt', french_path))

import collections

english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])
print('English Vocab:',len(english_words_counter))
print('French Vocab:',len(french_words_counter))
def tokenize(x):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer
# Tokenize Sample output
text_sentences = [
'The quick brown fox jumps over the lazy dog .',
'By Jove , my quick study of lexicography won a prize .',
'This is a short sentence .']
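# Tokenize the sample sentences above as a quick sanity check of the tokenizer
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenized)

# Preprocess both corpora: tokenize, pad to a common length, and reshape the
# French labels to 3-D because sparse_categorical_crossentropy expects a
# trailing axis on the targets. (This step is reconstructed here; the original
# preprocessing cell is not shown in the listing.)
preproc_english_sentences, english_tokenizer = tokenize(english_sentences)
preproc_french_sentences, french_tokenizer = tokenize(french_sentences)
preproc_english_sentences = pad_sequences(preproc_english_sentences, padding='post')
preproc_french_sentences = pad_sequences(preproc_french_sentences, padding='post')
preproc_french_sentences = preproc_french_sentences.reshape(*preproc_french_sentences.shape, 1)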
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)
print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
def logits_to_text(logits, tokenizer):
    """Turn the model's per-time-step logits back into a readable sentence."""
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = ''
    # For every time step, pick the most probable word id (argmax over the
    # vocabulary axis) and map it back to its word via the reversed word index
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and compile an RNN model that uses word embeddings on the input.
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Hyperparameters
    learning_rate = 0.005

    # Build the layers
    model = Sequential()
    # +1 because Keras Tokenizer ids start at 1 and id 0 is reserved for padding
    model.add(Embedding(english_vocab_size + 1, 256, input_length=input_shape[1],
                        input_shape=input_shape[1:]))
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size + 1, activation='softmax')))

    # Compile model
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model
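# Pad the English inputs to the French sequence length so input and output
# time steps line up one-to-one, then build the model. (This construction step
# is reconstructed; the original cell is not shown in the listing above.)
tmp_x = pad_sequences(preproc_english_sentences, maxlen=preproc_french_sentences.shape[1], padding='post')
simple_rnn_model = embed_model(tmp_x.shape, max_french_sequence_length, english_vocab_size, french_vocab_size)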
simple_rnn_model.summary()
history = simple_rnn_model.fit(tmp_x, preproc_french_sentences,
                               batch_size=1024, epochs=20, validation_split=0.2)
simple_rnn_model.save('model.h5')
def final_predictions(text):
    # Convert the input sentence to word ids; unknown words fall back to the padding id 0
    sentence = [english_tokenizer.word_index.get(word, 0) for word in text.split()]
    # Pad to the sequence length the model was trained on
    sentence = pad_sequences([sentence], maxlen=preproc_french_sentences.shape[-2], padding='post')
    print(sentence.shape)
    print(logits_to_text(simple_rnn_model.predict(sentence[:1])[0], french_tokenizer))

import re

# Read a sentence from the user, lowercase it, strip punctuation, and translate it
txt = input().lower()
final_predictions(re.sub(r'[^\w]', ' ', txt))
CHAPTER VI
OUTPUT
CHAPTER VII
CONCLUSION AND FUTURE ENHANCEMENT
Future Improvements:
o Do proper data split (training, validation, test):
Currently, there is no test set, only training and validation.
o LSTM + attention:
This has been the de facto architecture for sequence-to-sequence RNNs over the past few years, although it has some limitations; a rough sketch of what this could look like follows this list.
o Train on a larger and more diverse text corpus:
The text corpus and vocabulary for this project are quite small with
little variation in syntax. As a result, the model is very brittle. To
create a model that generalizes better, you’ll need to train on a
larger dataset with more variability in grammar and sentence
structure.
o Residual layers:
You could add residual layers to a deep LSTM RNN, as described
in this paper. Or, use residual layers as an alternative to LSTM
and GRU, as described here.
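As a rough sketch of the LSTM-plus-attention idea (placeholder sizes, and using tf.keras's built-in dot-product Attention layer rather than any particular paper's mechanism), an encoder-decoder with attention could look like this:

import tensorflow as tf
from tensorflow.keras import layers, Model

# Placeholder sizes, for illustration only
src_vocab, tgt_vocab, src_len, tgt_len, units = 200, 350, 15, 21, 256

# Encoder: bidirectional LSTM over the source sentence, keeping every time step
enc_in = layers.Input(shape=(src_len,))
enc_emb = layers.Embedding(src_vocab, units)(enc_in)
enc_seq = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(enc_emb)

# Decoder: an LSTM fed the repeated final encoder state; its outputs then
# query all encoder states through dot-product attention
enc_last = layers.Lambda(lambda t: t[:, -1, :])(enc_seq)
dec_seq = layers.LSTM(2 * units, return_sequences=True)(layers.RepeatVector(tgt_len)(enc_last))
context = layers.Attention()([dec_seq, enc_seq])   # query = decoder states, value = encoder states
combined = layers.Concatenate()([dec_seq, context])
out = layers.TimeDistributed(layers.Dense(tgt_vocab, activation='softmax'))(combined)

attention_model = Model(enc_in, out)
attention_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
attention_model.summary()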
REFERENCES
https://towardsdatascience.com/language-translation-with-rnns-d84d43b40571
https://tommytracey.github.io/AIND-Capstone/machine_translation.html