
How to produce data for a neural network

Ali Riza SARAL

A neural network accepts input and output values as its learning input. The aim of the learning
process is to teach the network to produce the given outputs when the corresponding inputs are
presented. This is not a one-to-one memorization process: the network learns not only how to react
to the given cases but also how to predict results for similar, unseen inputs.

The difference between neural network programming and traditional programming is this: traditional
programming requires human intervention to improve, develop and tune the program after the design
phase, whereas neural network programming requires no active human programming after the creation of
the network, apart from the tuning of its parameters.

The learning process uses a set of input/output tuples and divides these data into two groups, one
for training and one for validation:

import numpy as np
from sklearn.model_selection import train_test_split

# Define the ratios for training and validation
train_ratio = 0.8  # 80% of the data for training
val_ratio = 0.2    # 20% of the data for validation

# Split the data into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(
    x_values_array, y_values_array, test_size=val_ratio, random_state=42
)

# Save the arrays to a single .npz file
np.savez("saXtrainYtrainData.npz", x_train=x_train, x_val=x_val, y_train=y_train, y_val=y_val)

# Load the arrays back from the file (allow_pickle=True because the elements are Python lists)
loaded_data = np.load("saXtrainYtrainData.npz", allow_pickle=True)
x_train_loaded = loaded_data['x_train']
x_val_loaded = loaded_data['x_val']
y_train_loaded = loaded_data['y_train']
y_val_loaded = loaded_data['y_val']

STEP_0
I took the ‘airline twitter sentiment’ Excel data from Data World (https://data.world/datasets/sentiment). A quick way to check the downloaded file is sketched below.
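This check is not part of the original scripts; it is a minimal sketch that simply loads the spreadsheet (using the file path that appears in the later steps) and prints the two columns this note works with.

import pandas as pd

# Load the airline sentiment spreadsheet and peek at the two columns used in the steps below.
df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx')
print(df[['airline_sentiment', 'text']].head())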
STEP_1
I extracted the customer comments from the ‘text’ field, cleaned them of commas, dots, numbers, etc.,
and wrote them to a txt file. (The cleaning helper itself is sketched after the file list below.)
saPreprocessClean.py
import pandas as pd
from saPreprocessClean import clean_string

# Read the spreadsheet and write two files:
#   saPreprocess.txt          - one "word=... cleaned=..." line per word
#   saPreprocessSentences.txt - one cleaned comment per line
df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx')
column_name = 'text'

with open("saPreprocessSentences.txt", "w") as file2:
    with open("saPreprocess.txt", "w") as file:
        for value in df[column_name]:
            if isinstance(value, str):
                try:
                    cleaned_comment = ''
                    for word in value.split():
                        cleanString = clean_string(word)
                        cleaned_comment += ' ' + cleanString
                        print(f"word={word} cleaned={cleanString}", file=file)
                    print(f"{cleaned_comment}", file=file2)
                except UnicodeEncodeError as e:
                    print(f"UnicodeEncodeError: {e}")
Input: /Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx
Outputs: saPreprocess.txt, saPreprocessSentences.txt
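The cleaning helper itself (clean_string in saPreprocessClean.py) is not listed in this note. A minimal sketch of what such a function could look like, assuming it only strips punctuation from a word and lower-cases it (the real helper may do more or less than this):

import re

def clean_string(word):
    # Hypothetical stand-in for saPreprocessClean.clean_string:
    # keep letters, digits and apostrophes, drop other characters, lower-case the result.
    return re.sub(r"[^A-Za-z0-9']", "", word).lower()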
STEP_2
I took all the words that appear in the ‘text’ comments field, cleaned them of commas, dots, numbers,
etc., and put them into a file.
saPreprocessWords.py
import pandas as pd
from saPreprocessClean import clean_string

# Write "word=... cleaned=..." lines to saPreprocess2.txt and the cleaned words to saPreprocessWords.txt
df = pd.read_excel('/Users/ARS/ARStensorflow/sentimentAnalysis/Airline-SentimentARS1.xlsx')
column_name = 'text'

with open("saPreprocessWords.txt", "w") as file2:
    with open("saPreprocess2.txt", "w") as file:
        for value in df[column_name]:
            if isinstance(value, str):
                try:
                    for word in value.split():
                        cleanString = clean_string(word)
                        print(f"word={word} cleaned={cleanString}", file=file)
                        print(cleanString, file=file2)
                except UnicodeEncodeError as e:
                    print(f"UnicodeEncodeError: {e}")
Input: /Users/ARS/ARStensorflow/sentimentAnalysis/Airline-SentimentARS1.xlsx
Outputs: saPreprocess2.txt, saPreprocessWords.txt
STEP_3
I ran the removeDUP batch program to remove duplicate entries; a sketch of an equivalent de-duplication step follows the file list below.
Input: saRemoveDUPInput.txt (a copy of saPreprocessWords.txt, 1,629 KB)
Output: saRemoveDUPOutput.txt (283 KB)
saRemoveDUPOutput2.txt (260 KB): the output after cutting the leading numbers and similar junk by hand
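The removeDUP program itself is not listed here. A minimal Python sketch of the same idea, assuming the goal is simply to keep the first occurrence of each word and that the word file uses the same utf-16-le encoding as the file read in STEP_4:

# Hypothetical stand-in for the removeDUP batch program: keep only the first occurrence of each word.
seen = set()
with open("saRemoveDUPInput.txt", "r", encoding="utf-16-le") as fin, \
     open("saRemoveDUPOutput.txt", "w", encoding="utf-16-le") as fout:
    for line in fin:
        word = line.strip()
        if word and word not in seen:   # skip blanks and already-seen words
            seen.add(word)
            fout.write(word + "\n")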
STEP_4
I converted each word to a floating-point word2vec value and put them into a dictionary model.
I will use this dictionary to look up the word2vec values of words one by one.
saCreateW2VDictModel.py
with open("saRemoveDUPoutput2.txt", "r", encoding="utf-16-le") as file:
if any(char.isdigit() for char in new_word):
print(f"includes number-->{new_word}")
else:
model.build_vocab([new_words], update=True)
model.train([new_words], total_examples=1, epochs=1)
13154 words
model.save("word2vec_dict_model.model")
model.wv.save_word2vec_format("saW2Vdict_vectors.txt", binary=False)

model = Word2Vec.load("word2vec_dict_model.model")
input_word = "but"
if input_word in model.wv:
    word_vector = model.wv[input_word]
    print(f"Word vector for '{input_word}': {word_vector}")
else:
    print(f"'{input_word}' is not in the vocabulary.")
Input: saRemoveDUPOutput2.txt
Outputs: word2vec_dict_model.model, saW2Vdict_vectors.txt
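The creation of the model object is not shown in the excerpt above. Judging from the run output in STEP_7 (each word maps to a single float), one possible initialization, assuming one-dimensional vectors and min_count=1, would be:

from gensim.models import Word2Vec

# Hypothetical initialization of the dictionary model: 1-dimensional vectors
# (the run output shows one float per word) and min_count=1 so every word is kept.
model = Word2Vec(vector_size=1, min_count=1)
model.build_vocab([["seed"]])   # an initial vocabulary is required before build_vocab(..., update=True)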
STEP_7
This step reads the airline_sentiment and text values, converts each text to its word2vec
representation, and saves the result to saW2VXtrainYtrainData.txt. It then reads the data back from
this file into arrays, splits them into x_train/y_train and x_val/y_val, saves these arrays to
saXtrainYtrainData.npz, and finally loads that file again to retrieve x_train, y_train, x_val and y_val.
model = Word2Vec.load("word2vec_model_updated.model")
def get_w2v_sentence(sentence):
    words = sentence.split()
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    return word_vectors
df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx', ...)
target_column = 'airline_sentiment'
column_values = df[target_column]
sentiment = []
for j, value in enumerate(column_values):
    if value == "positive":
        sentiment.append(1)
    else:  # neutral and negative comments are both labelled 0, as the run output below shows
        sentiment.append(0)
    if j < 3: print(f"sentiment = {value} -->{sentiment[j]}")
================================
with open("saPreprocessSentences.txt", "r") as file:
with open("saW2VXtrainYtrainData.txt", "w") as file2:
i=0
w2v_Sentence = []
for line in file:
sentence = line.strip()
print(f"i={i} = {sentence}")
w2v_sentence_vectors = get_w2v_sentence(sentence)
w2v_sentence_lists = [vector.tolist() for vector in w2v_sentence_vectors]
print(f"i={i} w2v={w2v_sentence_lists} sentiment={sentiment[i]}")
print(f"w2v_sentence_lists={w2v_sentence_lists} sentiment={sentiment[i]}",
file=file2)

w2v_Sentence.append(w2v_sentence_vectors)
i += 1
if i > 3:
break
****************************************
with open("saW2VXtrainYtrainData.txt", "r") as file:
for line in file:
parts = line.strip().split("&")
if len(parts) == 2:
x_value = parts[0]
y_value = parts[1]
x_values_array = np.array(x_values)
y_values_array = np.array(y_values)
*****************************************
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(
    x_values_array, y_values_array, test_size=val_ratio, random_state=42
)
print("Shape of x_train:", x_train.shape)
print("Shape of y_train:", y_train.shape)
******************************************
np.savez("saXtrainYtrainData.npz", x_train=x_train, x_val=x_val, y_train=y_train, y_val=y_val)
loaded_data = np.load("saXtrainYtrainData.npz", allow_pickle=True)
x_train_loaded = loaded_data['x_train']
print("Shape of x_train_loaded:", x_train_loaded.shape)
Inputs: Airline-SentimentARS1.xlsx, word2vec_model_updated.model, saPreprocessSentences.txt
Outputs: saW2VXtrainYtrainData.txt, saXtrainYtrainData.npz

This short note explains how I prepared the airline sentiment data. I will use this data with a transformer neural network
to predict whether new comments are positive or negative. I used ChatGPT intensively during the programming process.

runfile('C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7/saProduceXtrainYtrainDataNEW.py',
wdir='C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7')
sentiment = neutral -->0
sentiment = positive -->1
sentiment = neutral -->0
sentiment = negative -->0
sentiment = negative -->0
sentiment = negative -->0
sentiment = positive -->1
sentiment = neutral -->0
sentiment = positive -->1
sentiment = positive -->1
sentiment = neutral -->0
i=0 =
i=0 w2v=[] sentiment=0
i=1 = virginamerica plus you've added commercials to the experience tacky
i=1 w2v=[[0.0012580156326293945], [-0.6498260498046875], [-0.5153262615203857], [-0.020553112030029297], [-0.6892403364181519], [0.3554692268371582], [-0.8850120306015015], [-0.2642437219619751], [0.10036587715148926]] sentiment=1
i=2 = virginamerica i didn't today must mean i need to take another trip
i=2 w2v=[[0.0012580156326293945], [-0.932395339012146], [0.2726396322250366], [-0.6015644073486328], [-0.01267862319946289], [-0.4461408853530884], [-0.932395339012146], [-0.946296215057373], [0.3554692268371582], [0.02413642406463623], [0.9904971122741699], [0.6381527185440063]] sentiment=0
i=3 = virginamerica it's really aggressive to blast obnoxious entertainment in your guests' faces amp they have little recourse
i=3 w2v=[[0.0012580156326293945], [-0.7756330966949463], [-0.5128778219223022], [-0.5935169458389282], [0.3554692268371582], [-0.942463755607605], [0.03211188316345215], [-0.1338428258895874], [0.7456883192062378], [0.9946664571762085], [-0.31516337394714355], [-0.22687816619873047], [0.2923257350921631], [0.6583085060119629], [0.6221826076507568], [-0.5303181409835815], [0.7077651023864746]] sentiment=0
i=4 = virginamerica and it's a really big bad thing about it
i=4 w2v=[[0.0012580156326293945], [-0.6498693227767944], [-0.7756330966949463], [-0.5128778219223022], [0.98853600025177], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [0.3834136724472046]] sentiment=0
i=5 = virginamerica seriously would pay 30 a flight for seats that didn't have this playing it's really the only bad thing about flying va
i=5 w2v=[[0.0012580156326293945], [0.48972034454345703], [0.81987464427948], [-0.03148186206817627], [-0.7918610572814941], [-0.5839154720306396], [-0.767822265625], [0.7113058567047119], [0.2726396322250366], [0.6221826076507568], [-0.5667037963867188], [0.7418577671051025], [-0.7756330966949463], [-0.5128778219223022], [-0.8850120306015015], [0.8083392381668091], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [-0.3030076026916504], [-0.29737353324890137]] sentiment=0
i=6 = virginamerica really missed a prime opportunity for men without hats parody there
i=6 w2v=[[0.0012580156326293945], [-0.5128778219223022], [-0.9224623441696167], [-0.2657853364944458], [-0.10833430290222168], [-0.5839154720306396], [0.42132532596588135], [0.7914493083953857], [0.4627121686935425], [-0.3580136299133301], [-0.7653474807739258]] sentiment=1
i=7 = virginamerica well i didn'tûbut now i do d
i=7 w2v=[[0.0012580156326293945], [0.28537607192993164], [-0.932395339012146], [-0.4359729290008545], [0.019797325134277344], [-0.932395339012146], [-0.1655644178390503], [0.6342606544494629]] sentiment=0
i=8 = virginamerica it was amazing and arrived an hour early you're too good to me
i=8 w2v=[[0.0012580156326293945], [0.3834136724472046], [-0.9266262054443359], [-0.0772627592086792], [-0.6498693227767944], [0.427449107170105], [0.07871925830841064], [0.5342621803283691], [-0.6033754348754883], [-0.6512038707733154], [-0.5308046340942383], [-0.4651916027069092], [0.3554692268371582], [0.500235915184021]] sentiment=1
i=9 = virginamerica did you know that suicide is the second leading cause of death among teens 1024
i=9 w2v=[[0.0012580156326293945], [-0.12914776802062988], [0.36882483959198], [0.20193088054656982], [0.7113058567047119], [-0.4647252559661865], [0.43231725692749023], [-0.8850120306015015], [0.8175731897354126], [0.1814650297164917], [0.9735549688339233], [0.8894059658050537], [-0.048635125160217285], [0.7589428424835205], [-0.8305487632751465]] sentiment=1
i=10 = virginamerica i lt3 pretty graphics so much better than minimal iconography d
i=10 w2v=[[0.0012580156326293945], [-0.932395339012146], [-0.6855369806289673], [0.13977575302124023], [-0.11634397506713867], [0.8503241539001465], [0.973355770111084], [-0.6604783535003662], [-0.5532848834991455], [0.3296027183532715], [0.6342606544494629]] sentiment=0

['[]'
 '[[0.0012580156326293945], [-0.6498260498046875], [-0.5153262615203857], [-0.020553112030029297], [-0.6892403364181519], [0.3554692268371582], [-0.8850120306015015], [-0.2642437219619751], [0.10036587715148926]]'
 '[[0.0012580156326293945], [-0.932395339012146], [0.2726396322250366], [-0.6015644073486328], [-0.01267862319946289], [-0.4461408853530884], [-0.932395339012146], [-0.946296215057373], [0.3554692268371582], [0.02413642406463623], [0.9904971122741699], [0.6381527185440063]]'
 '[[0.0012580156326293945], [-0.7756330966949463], [-0.5128778219223022], [-0.5935169458389282], [0.3554692268371582], [-0.942463755607605], [0.03211188316345215], [-0.1338428258895874], [0.7456883192062378], [0.9946664571762085], [-0.31516337394714355], [-0.22687816619873047], [0.2923257350921631], [0.6583085060119629], [0.6221826076507568], [-0.5303181409835815], [0.7077651023864746]]'
 '[[0.0012580156326293945], [-0.6498693227767944], [-0.7756330966949463], [-0.5128778219223022], [0.98853600025177], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [0.3834136724472046]]'
 '[[0.0012580156326293945], [0.48972034454345703], [0.81987464427948], [-0.03148186206817627], [-0.7918610572814941], [-0.5839154720306396], [-0.767822265625], [0.7113058567047119], [0.2726396322250366], [0.6221826076507568], [-0.5667037963867188], [0.7418577671051025], [-0.7756330966949463], [-0.5128778219223022], [-0.8850120306015015], [0.8083392381668091], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [-0.3030076026916504], [-0.29737353324890137]]'
 '[[0.0012580156326293945], [-0.5128778219223022], [-0.9224623441696167], [-0.2657853364944458], [-0.10833430290222168], [-0.5839154720306396], [0.42132532596588135], [0.7914493083953857], [0.4627121686935425], [-0.3580136299133301], [-0.7653474807739258]]'
 '[[0.0012580156326293945], [0.28537607192993164], [-0.932395339012146], [-0.4359729290008545], [0.019797325134277344], [-0.932395339012146], [-0.1655644178390503], [0.6342606544494629]]'
 '[[0.0012580156326293945], [0.3834136724472046], [-0.9266262054443359], [-0.0772627592086792], [-0.6498693227767944], [0.427449107170105], [0.07871925830841064], [0.5342621803283691], [-0.6033754348754883], [-0.6512038707733154], [-0.5308046340942383], [-0.4651916027069092], [0.3554692268371582], [0.500235915184021]]'
 '[[0.0012580156326293945], [-0.12914776802062988], [0.36882483959198], [0.20193088054656982], [0.7113058567047119], [-0.4647252559661865], [0.43231725692749023], [-0.8850120306015015], [0.8175731897354126], [0.1814650297164917], [0.9735549688339233], [0.8894059658050537], [-0.048635125160217285], [0.7589428424835205], [-0.8305487632751465]]'
 '[[0.0012580156326293945], [-0.932395339012146], [-0.6855369806289673], [0.13977575302124023], [-0.11634397506713867], [0.8503241539001465], [0.973355770111084], [-0.6604783535003662], [-0.5532848834991455], [0.3296027183532715], [0.6342606544494629]]']
['0' '1' '0' '0' '0' '0' '1' '0' '1' '1' '0']
------------

Shape of x_train: (8,)
Shape of y_train: (8,)
Shape of x_val: (3,)
Shape of y_val: (3,)
Shape of x_train_loaded: (8,)
Shape of x_val_loaded: (3,)
Shape of y_train_loaded: (8,)
Shape of y_val_loaded: (3,)
