
TECHNICAL UNIVERSITY OF MOLDOVA

FACULTY OF COMPUTERS, INFORMATICS AND MICROELECTRONICS


SPECIALITY SOFTWARE ENGINEERING

Report
on
Probability & Statistics

Laboratory Work nr.3

Performed: st. gr. FAF-212
Maia Zaica

Verified: Cristofor Fiștic

Chişinău 2021
LABORATORY WORK Nr.3
Problem condition:
1. Important Distributions
Let U be a uniformly distributed random variable on [0, 1].

What is the probability that the equation x² + 4Ux + 1 = 0 has two distinct real roots x₁ and x₂?
Solution:
To have 2 distinct roots we need ∆ > 0:
∆ = b² − 4ac = (4U)² − 4 = 16U² − 4
16U² − 4 > 0  ⇒  U² > 1/4  ⇒  U > 1/2 on the given interval [0, 1].
Since U is uniform on [0, 1], the probability is P(U > 1/2) = 1/2.
import random

cases = 100000

favorable = 0

for i in range(cases):
    u = random.uniform(0, 1)
    if 16 * pow(u, 2) - 4 > 0:
        favorable += 1

print("Probability that equation has two distinct roots: ", favorable / cases)
print("------------------------------------------------------")
print("Discriminant should be bigger than 0 hence probability: ", 1/2)

Problem condition:
2. Continuous Conditional probability
Suppose you toss a dart at a circular target of radius 10 inches. Given that the dart lands in the upper half of the target,
find the probability that
- it lands in the right half of the target
- its distance from the center is less than 5 inches
- its distance from the center is greater than 5 inches
- it lands within 5 inches of the point (0, 5)

Solution:
[Sketch: circular dart target of radius 10 centered at the origin, with the radius-5 circle and axes from -10 to 10 marked.]
it lands in the right half of the target: x > 0 and x² + y² ≤ 10²
its distance from the center is less than 5 inches: x² + y² ≤ 5²
its distance from the center is greater than 5 inches: 5² ≤ x² + y² ≤ 10²
it lands within 5 inches of the point (0, 5): x² + (y − 5)² ≤ 5²
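Analytically, assuming the dart lands uniformly over the target (as the simulation below does), each conditional probability is the area of the region inside the upper half divided by the area of the upper half, which is (1/2)·π·10² = 50π. The right half of the upper half has area 25π, giving 1/2; the upper half of the radius-5 disc has area 12.5π, giving 1/4 for distance less than 5 and therefore 3/4 for distance greater than 5; the disc of radius 5 around (0, 5) lies entirely in the upper half and has area 25π, giving 1/2.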

import random
import math

throws = 1000000  # number of simulated dart throws, all conditioned on the upper half

a, b, c = 0, 0, 0
for i in range(throws):
    # random angle in (0, pi): upper half only
    alpha = math.pi * random.random()
    # sample the radius with a square-root transform so points are uniform over the disc area
    r = 10 * math.sqrt(random.random())
    x = r * math.cos(alpha)
    y = r * math.sin(alpha)

    if x > 0:                        # right half of the target (y >= 0 already)
        a += 1
    if r < 5:                        # distance from the center less than 5
        b += 1
    if x**2 + (y - 5)**2 <= 25:      # within 5 inches of the point (0, 5)
        c += 1

print(str(a/throws) + "\n" + str(b/throws) + "\n" + str(1 - b/throws) + "\n" + str(c/throws))

print("-----------------------------------------------------------")
print("It lands in the right half of the target: ", 1/2)
print("Its distance from the center is less than 5 inches: ", 1/4)
print("Its distance from the center is greater than 5 inches: ", 3/4)
print("It lands within 5 inches of the point (0, 5): ", 1/2)

Problem condition:
3. Counting

100 people line up to take their seats in a 100 seat theater. The 1st in line lost her ticket and decided to sit in a random
seat. Each remaining theatergoer sits in their assigned seat unless it’s occupied and then sits in a random seat. What’s
the probability the last person takes their assigned seat?

Solution:
In the 2-seat scenario, person A randomly takes either seat 1 or seat 2, and then person B must take the remaining seat.
The possible outcomes are then:

2-seat theater

It's easy to see that the probability of person B sitting in their assigned seat is then 1/2.
Approaching this problem inductively, it turns out this pattern continues to hold.
When the last person enters, the only possible empty seats are their own seat or the seat assigned to the first person.
Each person who entered the theater and had to make a random choice was equally likely to choose the first person's seat or the last person's seat; the random chooser exhibits no preference towards a particular seat. This means that the probability that one seat is taken before the other must be 1/2.

Let (A) be the probability that the first person's seat is taken before the last person's seat, and (B) the probability that the last person's seat is taken before the first person's seat. These two probabilities must be identical, since every time a random choice is made, the first person's seat and the last person's seat are equally likely to be chosen. Since A = B, and these two events cover all possibilities (by the key observation above), they must both be equal to 1/2.

Below is an attempt to simulate the situation: the first person chooses a random seat, and so on. A full simulation would need a loop or a recursive function over all 100 people; the partial attempt is followed by a more complete sketch.


print("Probability the last person takes their assigned seat: ", 1/2)
print("------------------------------------------------------------")
# 100 people first person chooses at random
import random

n = random.randint(2, 100)

list = [n]
for _ in range(2, 101):
    list.append(_)
    if _ == n:
        list.pop()
lista = [1]
for element in range(n+1, 101):
    lista.append(element)
m = random.choice(lista)
list.append(m)
for i in range(len(list)+1, 101):
    list.append(i)
    if i == m:
        list.pop()
if m == 1:
    listb = [i]
elif m != 1:
    listb = [1, i]
for number in range(len(list)+1, 101):
    listb.append(number)
k = random.choice(listb)
list.append(k)

for j in range(len(list) + 1, 101):
    list.append(j)
    if j == k:
        list.pop()
if k == 1:
    listc = [i, j]
elif k == i:
    listc = [1, j]
elif k == j:
    listc = [1, i]
else:
    listc = [1, i, j]
for number in range(len(list) + 1, 101):
    listc.append(number)
l = random.choice(listc)
list.append(l)

print(list)
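For comparison, here is a minimal Monte-Carlo sketch of the full seating process. It completes the attempt above in one possible way (the helper function last_person_gets_seat is introduced here only for illustration), and the estimate should come out close to 1/2.

import random

def last_person_gets_seat():
    seats = [False] * 101                      # seats 1..100, True = occupied (index 0 unused)
    seats[random.randint(1, 100)] = True       # person 1 lost the ticket and sits at random
    for person in range(2, 100):               # persons 2..99
        if seats[person]:                      # own seat taken -> pick a random free seat
            free = [s for s in range(1, 101) if not seats[s]]
            seats[random.choice(free)] = True
        else:
            seats[person] = True
    return not seats[100]                      # person 100 gets their seat only if it is still free

trials = 100000
hits = sum(last_person_gets_seat() for _ in range(trials))
print("Estimated probability:", hits / trials)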

Problem condition:
4. Networking
In this task you have the opportunity to solve problems closer to real life. This implies that you have to combine your probability skills with the natural language processing field, aka NLP. The dataset you are going to use is a list of tweets, found here: tweets.json

4.1 Popular
Write a program that prints the first 10 most frequently used words, and the number of times each was mentioned.
Ex:
the 352
a 235
at 120

Solution:
most_common():
Counter.most_common(n) is used to produce a list of the n most frequently encountered input values and their respective counts.
from nltk.tokenize import RegexpTokenizer
import json
from collections import Counter
from nltk.corpus import stopwords
from nltk import ngrams
import numpy as np

with open("tweets.json", "rb") as file:


data = json.loads(file.read())
list = []
for i in data:
# # remove user, https, and RT
# data['clean_tweet'] = np.vectorize(remove_pattern)(data['text'], "https|RT|
@[\w]*")
# # remove punctuations
# data['clean_tweet'] = data['clean_tweet'].str.replace("[^a-zA-Z#]", " ")
# # lowering string
# data['clean_tweet'] = data['clean_tweet'].str.lower()
# # remove stop words
# stop_words = set(stopwords.words('english'))
#
# data['clean_tweet'] = [' '.join([w for w in x.lower().split() if w not in
stop_words])
# for x in data['clean_tweet'].tolist()]
# # remove words with len < 2
# data['clean_tweet'] = data['clean_tweet'].apply(lambda x: ' '.join([w for w
in x.split() if len(w) > 2]))
# # tokenization
# tokenized_tweet = data['clean_tweet'].apply(lambda x: list(ngrams(x.split(),
2)))

tokenizer = RegexpTokenizer(r'\w+')
text = i['text']
tokens = tokenizer.tokenize(text)
list.extend(tokens)

print(Counter(list).most_common(10))
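The cleaning steps in the commented-out block above are never actually executed. As a rough sketch, stop-word filtering could be applied to the same token list before counting; this assumes NLTK's stopwords corpus has been downloaded.

from collections import Counter
from nltk.corpus import stopwords
# import nltk; nltk.download('stopwords')   # needed once if the corpus is not already present

stop_words = set(stopwords.words('english'))
# keep only tokens that are not stop words and are longer than 2 characters
filtered = [w for w in list if w.lower() not in stop_words and len(w) > 2]
print(Counter(filtered).most_common(10))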

import json

with open("tweets.json", "rb") as f:


data = json.load(f)

import pandas as pd
df = pd.DataFrame(columns=["Tweet"])

for i in range(len(data)):
item = data[i]
df.loc[i] = [data[i]['text']]

print(df.Tweet.str.split(expand=True).stack().value_counts())

4.2 Nouns
Write a program that prints the first 10 most frequently used nouns, and the number of times each was mentioned.

Solution:
POS tagging (parts-of-speech tagging) is a process to mark up the words in a text for a particular part of speech based on their definition and context. It is responsible for reading text in a language and assigning a specific token (part of speech) to each word. It is also called grammatical tagging.
Steps involved in the POS tagging example:
- Tokenize the text (word_tokenize)
- Apply pos_tag to the result of the previous step, i.e. nltk.pos_tag(tokenized_text)

NN    noun, singular
NNS   noun, plural
NNP   proper noun, singular
NNPS  proper noun, plural
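As a small illustration of pos_tag output on a made-up sentence (not taken from the dataset; it requires the punkt and averaged_perceptron_tagger resources):

from nltk import word_tokenize, pos_tag

# The result is a list of (token, tag) pairs, typically something like
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
print(pos_tag(word_tokenize("The cat sat on the mat")))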

We open the JSON file and build a list containing the tokenized form of the tweets.
With the help of the str() function we turn the list into a string and exclude the punctuation. For the most popular words (4.1) we include everything.
from nltk.tokenize import RegexpTokenizer
import json
from collections import Counter
import nltk
nltk.download('averaged_perceptron_tagger')

with open("tweets.json", "rb") as file:


data = json.loads(file.read())
list = []
for i in data:
tokenizer = RegexpTokenizer(r'\w+')
text = i['text']
tokens = tokenizer.tokenize(text)
list.extend(tokens)

import re
from nltk import word_tokenize, pos_tag

sentence = str(list)
tweet = re.sub(r'[^\w\s]', '', sentence)
tweet = re.sub(r"http\S+|www\S+|https\S+|RT\S+", '', tweet, flags=re.MULTILINE)

nouns = [token for token, pos in pos_tag(word_tokenize(tweet)) if pos.startswith('NN'


or 'NNS' or 'NNP' or 'NNPS')]
print(Counter(nouns).most_common(10))

4.3 Proper nouns


Write a program that prints the first 10 most frequently used proper nouns, and the number of times each was mentioned.

# pos.startswith('NNP') covers both NNP and NNPS
nouns = [token for token, pos in pos_tag(word_tokenize(tweet)) if pos.startswith('NNP')]
NNP proper noun, singular
NNPS proper noun, plural
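Only the filter changes compared to 4.2. A self-contained sketch of the whole proper-noun pipeline, assuming the same tweets.json structure with a 'text' field, could look like this:

import json
import re
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from nltk import word_tokenize, pos_tag

with open("tweets.json", "rb") as file:
    data = json.loads(file.read())

tokenizer = RegexpTokenizer(r'\w+')
tokens = []
for item in data:
    tokens.extend(tokenizer.tokenize(item['text']))

# flatten to one string and drop links / retweet markers, as in 4.2
text = re.sub(r"http\S+|www\S+|https\S+|RT\S+", '', ' '.join(tokens))

# keep only proper nouns: NNP and NNPS tags
proper_nouns = [tok for tok, pos in pos_tag(word_tokenize(text)) if pos.startswith('NNP')]
print(Counter(proper_nouns).most_common(10))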

4.4 Frequency
Write a program that receives a word as an input and draws a frequency bar chart. Every bar should represent the
period of 1 month.

Solution:
The user is asked to enter a word.
We tokenize each tweet and create an empty list. For each word in a tweet, if it matches what the user entered, the date it was tweeted (year-month-day) is appended to that list.
To create a Pandas DataFrame from the dates list we build a dictionary; the .dt accessor is used on the datetime-like values.
Then a plot is made that shows the frequency by month.
import nltk
import json
import pandas as pd
import pylab as pl

user_input = input()
with open("tweets.json", "rb") as file:
jsonList = json.loads(file.read())
dates = []
for line in jsonList:
tokens = nltk.word_tokenize(line['text'])
for word in tokens:
if word == user_input:
dates.append(line['created_at'][:10])

dict = {'date': dates}


df = pd.DataFrame(dict)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df.groupby(df['date'].dt.month).count().plot(kind="bar")
pl.show()
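One thing to note: grouping by .dt.month merges the same calendar month of different years. If the dataset spans more than one year, grouping by a monthly period keeps them separate; a small optional variation (my suggestion, not part of the original solution):

# group by year-month periods instead of month numbers only
df.groupby(df['date'].dt.to_period('M')).count().plot(kind="bar")
pl.show()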
4.5 Popularity
In our dataset we also have the number of likes and retweets for every message. This can give us some insight about
the tweet popularity. Hence we can compute some sort of rating. The popularity of nouns is computed by the
following formula frequency * (1.4 + normRetweet) * (1.2 + normLikes). The values normRetweet and normLikes are
the normalized values of retweets and likes for every word. To compute the number of likes and retweets for every
word you just cumulatively collect the numbers from every tweet that the word was mentioned. Ex: There are 2 tweets
that mention the noun program. First tweet has 32 retweets and 87 likes. The second tweet has 42 retweets and 103
likes. The number of retweets of the word program is 32 + 42 and the number of likes is 87 + 103. Write a program
that prints the first 10 most popular nouns. The popularity is defined by the computed rating discussed above.

Solution:
Attempt to solve
import nltk
import re
import json
from nltk import word_tokenize, pos_tag

def getNouns(tweet):
    tweet = word_tokenize(tweet)  # convert string to tokens
    tweet = [word for (word, tag) in pos_tag(tweet)
             if tag == "NN" or tag == "NNS" or tag == "NNP" or tag == "NNPS"]  # pos_tag module in NLTK library
    return " ".join(tweet)  # join words with a space in between them

with open("tweets.json", "rb") as file:
    jsonList = json.loads(file.read())

nouns = []
for i in jsonList:
    line = i['text']
    tweet = re.sub(r'[^\w\s]', '', line)
    tweet = re.sub(r"http\S+|www\S+|https\S+|RT\S+", '', tweet, flags=re.MULTILINE)
    nouns.append(getNouns(tweet))

likes = []
retweets = []
subs = []
for line in jsonList:
    tokens = nltk.word_tokenize(line['text'])
    count = 0
    for word in tokens:
        if word in nouns:
            subs.append(word)
            likes.append(line['likes'])
            retweets.append(line['retweets'])

from collections import Counter

print(subs)
print("\n", likes)
print("\n", retweets)

print(Counter(nouns).most_common(10))

import json

with open("tweets.json", "rb") as f:


data = json.load(f)

import pandas as pd
df = pd.DataFrame(columns=["Tweet", "likes", "retweets"])

for i in range(len(data)):
item = data[i]
df.loc[i] = [data[i]['text'], data[i]['likes'], data[i]['retweets']]

print(df.Tweet.str.split(expand=True).stack().value_counts())
from collections import Counter
result = Counter(" ".join(df['Tweet'].values.tolist()).split(" ")).items()
print(result)
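Neither attempt above computes the rating itself. Below is a hedged sketch of the formula frequency * (1.4 + normRetweet) * (1.2 + normLikes), assuming each tweet has 'text', 'likes' and 'retweets' fields and normalizing each word's cumulative count by the maximum cumulative count (one possible reading of "normalized"):

import json
import re
from collections import Counter, defaultdict
from nltk import word_tokenize, pos_tag

with open("tweets.json", "rb") as file:
    tweets = json.loads(file.read())

freq = Counter()
likes = defaultdict(int)
retweets = defaultdict(int)

for t in tweets:
    text = re.sub(r"http\S+|www\S+|https\S+|RT\S+", '', t['text'])
    nouns_in_tweet = [w for w, tag in pos_tag(word_tokenize(text)) if tag.startswith('NN')]
    freq.update(nouns_in_tweet)
    # cumulatively collect the likes and retweets of every tweet the noun appears in
    for noun in set(nouns_in_tweet):
        likes[noun] += t['likes']
        retweets[noun] += t['retweets']

max_likes = max(likes.values(), default=0) or 1
max_retweets = max(retweets.values(), default=0) or 1

rating = {
    noun: freq[noun] * (1.4 + retweets[noun] / max_retweets) * (1.2 + likes[noun] / max_likes)
    for noun in freq
}
for noun, score in sorted(rating.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(noun, round(score, 2))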

4.6 Suggestion
Write a program that receives as input an uncompleted word and prints 3 word suggestions, followed by their
frequency. The suggestions should be based on the initial dataset and sorted by the word frequency, computed in the
first problem. The input can be any uncompleted word.
Ex. Input: app, Output: application (324), apple (164), appreciate (53). Where application has the highest frequency,
apple the second highest etc.
Ex. Input: pro, Output: programming (196), product (176), program (103). Again programming has the highest
frequency.

Solution:
Reading JSON from a file
Python has a built-in package called json, which can be used to work with JSON data. Among the many methods this module provides, loads() and load() are the ones that help us read the JSON file.
The mapping between dictionary contents and a JSON string is straightforward, so it is easy to convert between the two.

The json.loads() function accepts a JSON string and converts it into a Python dictionary (or a list, as in our dataset).


The tweets are tokenized and put into a list; then we look for suggestions, i.e. tokens that start with the user's input, and store them in a second list.
With the help of Counter's most_common() method we print the ones with the highest frequency.
from nltk.tokenize import RegexpTokenizer
import json
from collections import Counter
import nltk
# nltk.download('averaged_perceptron_tagger')

with open("tweets.json", "rb") as file:


data = json.loads(file.read())
list = []
for i in data:
tokenizer = RegexpTokenizer(r'\w+')
text = i['text']
tokens = tokenizer.tokenize(text)
list.extend(tokens)

input = input()
suggestions = [
word
for word in list
if word.lower().startswith(input.lower())
]

print(Counter(suggestions).most_common(5))

4.7 Suggestion occurrences


Write a program that receives as input a word and prints 3 word suggestions, followed by the suggestion occurrences.
The suggestions should be selected in the following way. You have to go through your tweets dataset and identify
every occurrence of the input word. At every occurrence collect the word that follows the input word. That is the
suggestion you are looking for. And also don't forget to count the number of times you get the same suggestions. Ex:
input like and you find 5 occurrences of beer and 2 occurrences of labs following it. Your suggestion words would be beer
and labs. But beer has a priority because it occurred more times in your dataset. Your task is to select the most
relevant suggestions as in the one that occurred the most. The input can be any completed word.
Ex. Input: love, Output: programming (5), cars (2), beer (2)
Ex. Input: awesome, Output: party (10), language (4), framework (2)

Solution:
Attempt to solve
from nltk.tokenize import RegexpTokenizer
import json
from collections import Counter

user_input = input("> ")


with open("tweets.json", "rb") as file:
data = json.loads(file.read())
list = []
for i in data:
tokenizer = RegexpTokenizer(r'\w+')
text = i['text']
tokens = tokenizer.tokenize(text)
list.extend(tokens)
s = []
for i, word in enumerate(list):
if word == user_input:
s.append(list[i+1])

print(user_input, Counter(s).most_common(3))

import json
import numpy as np

lexicon = {}

def update_lexicon(current: str, next_word: str) -> None:
    # Add the input word to the lexicon if it is not in there yet.
    if current not in lexicon:
        lexicon.update({current: {next_word: 1}})
        return

    # Retrieve the counts recorded so far for the input word.
    options = lexicon[current]

    # Check if the output word is in the probability list.
    if next_word not in options:
        options.update({next_word: 1})
    else:
        options.update({next_word: options[next_word] + 1})

    # Update the lexicon
    lexicon[current] = options

# Populate lexicon
with open('tweets.txt', 'r') as dataset:
    for line in dataset:
        words = line.strip().split(' ')
        for i in range(len(words) - 1):
            update_lexicon(words[i], words[i + 1])

# Adjust probability
for word, transition in lexicon.items():
    transition = dict((key, value / sum(transition.values())) for key, value in transition.items())
    lexicon[word] = transition

# Predict next word
line = input('> ')
word = line.strip().split(' ')[-1]
if word not in lexicon:
    print('Word not found')
else:
    options = lexicon[word]
    predicted = np.random.choice(list(options.keys()), p=list(options.values()))
    print(line + ' ' + predicted)
