2023 Aug How To Produce Data For A Neural Network
A neural network accepts input and output values as its learning input. The aim of the learning
process is to teach the neural network to produce the given outputs when the corresponding
inputs are presented. This is not a one-to-one memorization process: the network learns not only how to react to
the given cases but also how to predict results for similar, unseen inputs.
The difference between neural network programming and traditional programming is this: traditional
programming requires human intervention to improve, develop, and tune the program after the design
phase, whereas neural network programming requires no active human programming after the creation of
the network, apart from the tuning of parameters.
The learning process uses a set of input/output tuples. It also divides these data into two groups,
one for learning and one for evaluation, as the small sketch below illustrates.
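For example (a hypothetical miniature, not yet the airline data), each tuple pairs an input with its expected output, and part of the tuples is held back for evaluation:

# Hypothetical input/output tuples: (comment, sentiment label)
data = [
    ("great flight", 1),
    ("lost my luggage", 0),
    ("on time and friendly crew", 1),
    ("two hour delay", 0),
]

# Hold back the last quarter of the tuples for evaluation.
split = int(0.75 * len(data))
learning_set, evaluation_set = data[:split], data[split:]

The steps below show how I produced exactly this kind of data for airline customer comments.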
STEP_0
I took ‘airline twitter sentiment’ Excel data from Data World (https://data.world/datasets/sentiment).
STEP_1
I extracted the customer comments from the 'text' field, cleaned them of commas, dots, numbers, etc.,
and wrote them to a txt file.
saPreprocessClean.py
import pandas as pd
from saPreprocessClean import clean_string

df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx')
column_name = 'text'

# Write every word/cleaned-word pair to saPreprocess.txt and every
# cleaned comment (one per line) to saPreprocessSentences.txt.
with open("saPreprocessSentences.txt", "w") as file2:
    with open("saPreprocess.txt", "w") as file:
        for value in df[column_name]:
            if isinstance(value, str):
                cleaned_comment = ''
                for word in value.split():
                    try:
                        cleanString = clean_string(word)
                        cleaned_comment += ' ' + cleanString
                        print(f"word={word} cleaned={cleanString}", file=file)
                    except UnicodeEncodeError as e:
                        print(f"UnicodeEncodeError: {e}")
                print(f"{cleaned_comment}", file=file2)
/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx
saPreprocess.txt
saPreprocessSentences.txt
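saPreprocessClean.py itself is not listed in this note. A minimal clean_string consistent with the description above (drop commas, dots, digits and other non-letter characters, lowercase the rest, keep apostrophes as the cleaned sentences below do) might look like this; the real helper may differ:

# saPreprocessClean.py -- hypothetical sketch of the helper imported above
import re

def clean_string(word):
    # Lowercase, then keep only letters and apostrophes.
    return re.sub(r"[^a-z']", "", word.lower())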
STEP_2
I took all the words in the 'text' comments field, cleaned them of commas, dots, numbers,
etc., and put them into a file.
saPreprocessWords.py
import pandas as pd
from saPreprocessClean import clean_string

df = pd.read_excel('/Users/ARS/ARStensorflow/sentimentAnalysis/Airline-SentimentARS1.xlsx')

# Write word/cleaned-word pairs to saPreprocess2.txt and the cleaned
# words (one per line) to saPreprocessWords.txt.
with open("saPreprocessWords.txt", "w") as file2:
    with open("saPreprocess2.txt", "w") as file:
        for value in df['text']:
            if isinstance(value, str):
                for word in value.split():
                    try:
                        cleanString = clean_string(word)
                        print(f"word={word} cleaned={cleanString}", file=file)
                        print(cleanString, file=file2)
                    except UnicodeEncodeError as e:
                        print(f"UnicodeEncodeError: {e}")
/Users/ARS/ARStensorflow/sentimentAnalysis/Airline-SentimentARS1.xlsx
saPreprocess2.txt
saPreprocessWords.txt
STEP_3
I ran a removeDUP batch program to remove duplicate entries:
saPreprocessWords.txt = saRemoveDUPInput.txt (1,629 KB)
saRemoveDUPOutput.txt (283 KB)
saRemoveDUPOutput2.txt (260 KB; leading numbers and similar leftovers cut by hand)
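The removeDUP program itself is a batch file and is not listed here. A Python equivalent of the duplicate removal (a sketch; the real program may differ, e.g. in encoding handling) could be:

# Hypothetical Python equivalent of the removeDUP batch program
seen = set()
with open("saRemoveDUPInput.txt", "r") as fin, open("saRemoveDUPOutput.txt", "w") as fout:
    for line in fin:
        word = line.strip()
        if word and word not in seen:  # keep only the first occurrence
            seen.add(word)
            print(word, file=fout)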
STEP_4
I converted each word to a float number (its word2vec value) and put the words into a dictionary model.
I will use this dictionary to get the word2vec values of words one by one.
saCreateW2VDictModel.py
with open("saRemoveDUPoutput2.txt", "r", encoding="utf-16-le") as file:
if any(char.isdigit() for char in new_word):
print(f"includes number-->{new_word}")
else:
model.build_vocab([new_words], update=True)
model.train([new_words], total_examples=1, epochs=1)
13154 words
model.save("word2vec_dict_model.model")
model.wv.save_word2vec_format("saW2Vdict_vectors.txt", binary=False)
model = Word2Vec.load("word2vec_dict_model.model")
input_word="but"
if input_word in model.wv:
word_vector = model.wv[input_word]
print(f"Word vector for '{input_word}': {word_vector}")
else:
print(f"'{input_word}' is not in the vocabulary.")
saRemoveDUPOutput2.txt
word2vec_dict_model.model
saW2Vdict_vectors.txt
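save_word2vec_format produces a plain-text export: a header line with the vocabulary size and vector size, then one word per line followed by its vector components (a single float here). When only lookups are needed, the export can be reloaded without the full model; a small sketch using gensim's KeyedVectors:

from gensim.models import KeyedVectors

# binary=False matches how the file was saved above.
wv = KeyedVectors.load_word2vec_format("saW2Vdict_vectors.txt", binary=False)
print(wv["but"])  # the one-dimensional vector stored for 'but'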
STEP_7
This step gets the airline_sentiment and text values, converts each text to its word2vec vectors and saves
them into saW2VXtrainYtrainData.txt. It then reads the data back from this file into arrays, splits them
into x_train/y_train and x_val/y_val, saves these arrays into saXtrainYtrainData.npz, and finally loads
that file to retrieve x_train, y_train, x_val and y_val.
import numpy as np
import pandas as pd
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec_model_updated.model")

def get_w2v_sentence(sentence):
    words = sentence.split()
    return [model.wv[word] for word in words if word in model.wv]

df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx', ...)  # further arguments elided in the original
target_column = 'airline_sentiment'
column_values = df[target_column]

sentiment = []  # positive -> 1, neutral and negative -> 0
for j, value in enumerate(column_values):
    sentiment.append(1 if value == "positive" else 0)
    if j < 3: print(f"sentiment = {value} -->{sentiment[j]}")
================================
with open("saPreprocessSentences.txt", "r") as file:
with open("saW2VXtrainYtrainData.txt", "w") as file2:
i=0
w2v_Sentence = []
for line in file:
sentence = line.strip()
print(f"i={i} = {sentence}")
w2v_sentence_vectors = get_w2v_sentence(sentence)
w2v_sentence_lists = [vector.tolist() for vector in w2v_sentence_vectors]
print(f"i={i} w2v={w2v_sentence_lists} sentiment={sentiment[i]}")
print(f"w2v_sentence_lists={w2v_sentence_lists} sentiment={sentiment[i]}",
file=file2)
w2v_Sentence.append(w2v_sentence_vectors)
i += 1
if i > 3:
break
****************************************
with open("saW2VXtrainYtrainData.txt", "r") as file:
for line in file:
parts = line.strip().split("&")
if len(parts) == 2:
x_value = parts[0]
y_value = parts[1]
x_values_array = np.array(x_values)
y_values_array = np.array(y_values)
*****************************************
from sklearn.model_selection import train_test_split

val_ratio = 0.2  # assumed split ratio; the actual value is not shown in the original
x_train, x_val, y_train, y_val = train_test_split(
    x_values_array, y_values_array, test_size=val_ratio, random_state=42)
print("Shape of x_train:", x_train.shape)
print("Shape of y_train:", y_train.shape)
******************************************
np.savez("saXtrainYtrainData.npz", x_train=x_train, x_val=x_val, y_train=y_train, y_val=y_val)
loaded_data = np.load("saXtrainYtrainData.npz", allow_pickle=True)
x_train_loaded = loaded_data['x_train']
print("Shape of x_train_loaded:", x_train_loaded.shape)
Airline-SentimentARS1.xlsx
word2vec_model_updated.model
saPreprocessSentences.txt
saW2VXtrainYtrainData.txt
saXtrainYtrainData.npz
This short note explains how I prepared the airline sentiment data. I will use this data with a transformer neural network
to predict whether new comments are positive or negative. I used ChatGPT intensively during my programming process.
runfile('C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7/saProduceXtrainYtrainDataNEW.py',
wdir='C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7')
sentiment = neutral -->0
sentiment = positive -->1
sentiment = neutral -->0
sentiment = negative -->0
sentiment = negative -->0
sentiment = negative -->0
sentiment = positive -->1
sentiment = neutral -->0
sentiment = positive -->1
sentiment = positive -->1
sentiment = neutral -->0
i=0 =
i=0 w2v=[] sentiment=0
i=1 = virginamerica plus you've added commercials to the experience tacky
i=1 w2v=[[0.0012580156326293945], [-0.6498260498046875], [-0.5153262615203857], [-0.020553112030029297], [-0.6892403364181519], [0.3554692268371582], [-0.8850120306015015], [-0.2642437219619751], [0.10036587715148926]] sentiment=1
i=2 = virginamerica i didn't today must mean i need to take another trip
i=2 w2v=[[0.0012580156326293945], [-0.932395339012146], [0.2726396322250366], [-0.6015644073486328], [-0.01267862319946289], [-0.4461408853530884], [-0.932395339012146], [-0.946296215057373], [0.3554692268371582], [0.02413642406463623], [0.9904971122741699], [0.6381527185440063]] sentiment=0
i=3 = virginamerica it's really aggressive to blast obnoxious entertainment in your guests' faces amp they have little recourse
i=3 w2v=[[0.0012580156326293945], [-0.7756330966949463], [-0.5128778219223022], [-0.5935169458389282], [0.3554692268371582], [-0.942463755607605], [0.03211188316345215], [-0.1338428258895874], [0.7456883192062378], [0.9946664571762085], [-0.31516337394714355], [-0.22687816619873047], [0.2923257350921631], [0.6583085060119629], [0.6221826076507568], [-0.5303181409835815], [0.7077651023864746]] sentiment=0
i=4 = virginamerica and it's a really big bad thing about it
i=4 w2v=[[0.0012580156326293945], [-0.6498693227767944], [-0.7756330966949463], [-0.5128778219223022], [0.98853600025177], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [0.3834136724472046]] sentiment=0
i=5 = virginamerica seriously would pay 30 a flight for seats that didn't have this playing it's really the only bad thing about flying va
i=5 w2v=[[0.0012580156326293945], [0.48972034454345703], [0.81987464427948], [-0.03148186206817627], [-0.7918610572814941], [-0.5839154720306396], [-0.767822265625], [0.7113058567047119], [0.2726396322250366], [0.6221826076507568], [-0.5667037963867188], [0.7418577671051025], [-0.7756330966949463], [-0.5128778219223022], [-0.8850120306015015], [0.8083392381668091], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [-0.3030076026916504], [-0.29737353324890137]] sentiment=0
i=6 = virginamerica really missed a prime opportunity for men without hats parody there
i=6 w2v=[[0.0012580156326293945], [-0.5128778219223022], [-0.9224623441696167], [-0.2657853364944458], [-0.10833430290222168], [-0.5839154720306396], [0.42132532596588135], [0.7914493083953857], [0.4627121686935425], [-0.3580136299133301], [-0.7653474807739258]] sentiment=1
i=7 = virginamerica well i didn'tûbut now i do d
i=7 w2v=[[0.0012580156326293945], [0.28537607192993164], [-0.932395339012146], [-0.4359729290008545], [0.019797325134277344], [-0.932395339012146], [-0.1655644178390503], [0.6342606544494629]] sentiment=0
i=8 = virginamerica it was amazing and arrived an hour early you're too good to me
i=8 w2v=[[0.0012580156326293945], [0.3834136724472046], [-0.9266262054443359], [-0.0772627592086792], [-0.6498693227767944], [0.427449107170105], [0.07871925830841064], [0.5342621803283691], [-0.6033754348754883], [-0.6512038707733154], [-0.5308046340942383], [-0.4651916027069092], [0.3554692268371582], [0.500235915184021]] sentiment=1
i=9 = virginamerica did you know that suicide is the second leading cause of death among teens 1024
i=9 w2v=[[0.0012580156326293945], [-0.12914776802062988], [0.36882483959198], [0.20193088054656982], [0.7113058567047119], [-0.4647252559661865], [0.43231725692749023], [-0.8850120306015015], [0.8175731897354126], [0.1814650297164917], [0.9735549688339233], [0.8894059658050537], [-0.048635125160217285], [0.7589428424835205], [-0.8305487632751465]] sentiment=1
i=10 = virginamerica i lt3 pretty graphics so much better than minimal iconography d
i=10 w2v=[[0.0012580156326293945], [-0.932395339012146], [-0.6855369806289673], [0.13977575302124023], [-0.11634397506713867], [0.8503241539001465], [0.973355770111084], [-0.6604783535003662], [-0.5532848834991455], [0.3296027183532715], [0.6342606544494629]] sentiment=0
x_values_array: the same eleven sentence vectors printed above, stored as strings.
y_values_array: ['0' '1' '0' '0' '0' '0' '1' '0' '1' '1' '0']
------------