PSA Lab 3
Report on Probability & Statistics
Chişinău 2021
LABORATORY WORK Nr.3
Problem condition:
1. Important Distributions
Let U be a uniformly distributed random variable on [0, 1].
What is the probability that the equation x² + 4Ux + 1 = 0 has two distinct real roots x₁ and x₂?
Solution:
To have 2 distinct roots, we need ∆ > 0:
∆ = b² − 4ac = (4U)² − 4 = 16U² − 4
16U² − 4 > 0 ⟹ U² > 1/4 ⟹ U > 1/2, given the interval [0, 1].
Since U is uniform on [0, 1], P(U > 1/2) = 1/2.
import random

# Monte Carlo estimate: draw U uniformly and check the discriminant condition
cases = 100000
favorable = 0
for i in range(cases):
    u = random.uniform(0, 1)
    if 16 * pow(u, 2) - 4 > 0:  # two distinct real roots
        favorable += 1
print("Probability that equation has two distinct roots: ", favorable / cases)
print("------------------------------------------------------")
print("Discriminant should be bigger than 0 hence probability: ", 1/2)
Problem condition:
2. Continuous Conditional probability
Suppose you toss a dart at a circular target of radius 10 inches. Given that the dart lands in the upper half of the target,
find the probability that
it lands in the right half of the target
its distance from the center is less than 5 inches
its distance from the center is greater than 5 inches
it lands within 5 inches of the point (0, 5)
Solution:
(Figure: a circular dartboard of radius 10 inches, centered at the origin.)
it lands in the right half of the target: x > 0 and x² + y² ≤ 10²
its distance from the center is less than 5 inches: x² + y² ≤ 5²
its distance from the center is greater than 5 inches: 5² ≤ x² + y² ≤ 10²
it lands within 5 inches of the point (0, 5): x² + (y − 5)² ≤ 5²
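Each answer is an area ratio over the conditioning region, the upper half-disk, whose area is π·10²/2 = 50π:
P(right half | upper half) = 25π / 50π = 1/2
P(distance < 5 | upper half) = (π·5²/2) / 50π = 1/4
P(distance > 5 | upper half) = 1 − 1/4 = 3/4
P(within 5 of (0, 5) | upper half) = π·5² / 50π = 1/2, since the circle of radius 5 around (0, 5) lies entirely in the upper half of the target.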
import random

# rejection-sample uniform points in the upper half of the disk of radius 10
throws, a, b, c, hits = 100000, 0, 0, 0, 0
while hits < throws:
    x, y = random.uniform(-10, 10), random.uniform(0, 10)
    if x * x + y * y > 100:
        continue
    hits += 1
    a += x > 0                       # right half
    b += x * x + y * y < 25          # within 5 inches of the center
    c += x * x + (y - 5) ** 2 <= 25  # within 5 inches of (0, 5)
print(str(a/throws) + "\n" + str(b/throws) + "\n" + str(1 - b/throws) + "\n" + str(c/throws))
print("-----------------------------------------------------------")
print("It lands in the right half of the target: ", 1/2)
print("Its distance from the center is less than 5 inches: ", 1/4)
print("Its distance from the center is greater than 5 inches: ", 3/4)
print("It lands within 5 inches of the point (0, 5): ", 1/2)
Problem condition:
3. Counting
100 people line up to take their seats in a 100-seat theater. The 1st in line lost her ticket and decided to sit in a random seat. Each remaining theatergoer sits in their assigned seat unless it's occupied, in which case they sit in a random seat. What's the probability the last person takes their assigned seat?
Solution:
In the 2-seat scenario, person A randomly takes either seat 1 or seat 2, and then person B must take the remaining seat.
The possible outcomes are then:
2-seat theater:
A takes seat 1 → B takes seat 2 (their assigned seat)
A takes seat 2 → B takes seat 1 (not their assigned seat)
It's easy to see that the probability of person B sitting in their assigned seat is then 1/2.
Approaching this problem inductively, it turns out this pattern continues to hold.
When the last person enters, the only possible empty seats are their own seat or the seat assigned to the first person.
Each person who entered the theater and had to make a random choice was equally likely to choose the first person's seat or the last person's seat; the random chooser exhibits no preference toward a particular seat. This means the probability that one of those two seats is taken before the other must be 1/2.
Let A be the probability that the first person's seat is taken before the last person's seat, and B the probability that the last person's seat is taken before the first person's seat. These two must be identical, since every time a random choice is made, the first person's seat and the last person's seat are equally likely to be chosen. Since A = B, and these two cases cover all possibilities (by the key observation above), both must equal 1/2.
An attempt to simulate the situation: the first person chooses a random seat, and each following person takes their own seat if it is free, otherwise a random free one.
import random

def last_gets_own_seat(n=100):
    free = set(range(1, n + 1))
    # the first person picks a seat uniformly at random
    free.remove(random.choice(list(free)))
    # persons 2..n-1 take their own seat if free, otherwise a random free one
    for person in range(2, n):
        if person in free:
            free.remove(person)
        else:
            free.remove(random.choice(list(free)))
    # the last person gets their assigned seat iff seat n is still free
    return n in free

trials = 10000
hits = sum(last_gets_own_seat() for _ in range(trials))
print("Estimated probability:", hits / trials)
Problem condition:
4. Networking
In this task you have the opportunity to solve problems closer to real life. This implies that you have to combine your probability skills with the natural language processing field, aka NLP. The dataset you are going to use is a list of tweets, found here: tweets.json
4.1 Popular
Write a program that prints the 10 most frequently used words, and the number of times each was mentioned.
Ex:
the 352
a 235
at 120
Solution:
most_common():
most_common() produces a sequence of the n most frequently encountered input values and their respective counts.
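A minimal illustration:
from collections import Counter
print(Counter("mississippi").most_common(2))  # [('i', 4), ('s', 4)]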
from nltk.tokenize import RegexpTokenizer
import json
from collections import Counter

with open("tweets.json", "rb") as file:
    data = json.loads(file.read())

# collect every word token from every tweet
tokenizer = RegexpTokenizer(r'\w+')
words = []
for i in data:
    text = i['text']
    tokens = tokenizer.tokenize(text)
    words.extend(tokens)
print(Counter(words).most_common(10))
An alternative approach with pandas:
import json
import pandas as pd

with open("tweets.json", "rb") as file:
    data = json.loads(file.read())

df = pd.DataFrame(columns=["Tweet"])
for i in range(len(data)):
    df.loc[i] = [data[i]['text']]
print(df.Tweet.str.split(expand=True).stack().value_counts())
4.2 Nouns
Write a program that prints the 10 most frequently used nouns, and the number of times each was mentioned.
Solution:
POS tagging (part-of-speech tagging) is the process of marking up the words in a text as a particular part of speech based on their definition and context. It reads text in a language and assigns a specific token (part of speech) to each word. It is also called grammatical tagging.
Steps involved in the POS tagging example (see the short sketch after the tag list):
Tokenize the text (word_tokenize)
Apply pos_tag to the result: nltk.pos_tag(tokenized_text)
NN: noun, singular
NNS: noun, plural
NNP: proper noun, singular
NNPS: proper noun, plural
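For example, a minimal sketch (the exact tags depend on the trained tagger):
from nltk import word_tokenize, pos_tag
print(pos_tag(word_tokenize("The dart hits the target")))
# e.g. [('The', 'DT'), ('dart', 'NN'), ('hits', 'VBZ'), ('the', 'DT'), ('target', 'NN')]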
We open the JSON file and make an empty list where we store the tweets' tokenized form. With the help of the str() function we transform our list into a string, then exclude the punctuation. (For the most popular words in 4.1 we included everything.)
from nltk.tokenize import RegexpTokenizer
import json
from collections import Counter
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import re
from nltk import word_tokenize, pos_tag

with open("tweets.json", "rb") as file:
    data = json.loads(file.read())
tokenizer = RegexpTokenizer(r'\w+')
words = [t for i in data for t in tokenizer.tokenize(i['text'])]
sentence = str(words)  # turn the token list into one string
tweet = re.sub(r'[^\w\s]', '', sentence)  # exclude the punctuation
tweet = re.sub(r"http\S+|www\S+|https\S+|RT\S+", '', tweet, flags=re.MULTILINE)  # drop URLs and RT markers
# tag every token and keep only the noun tags
nouns = [word for (word, tag) in pos_tag(word_tokenize(tweet))
         if tag in ("NN", "NNS", "NNP", "NNPS")]
print(Counter(nouns).most_common(10))
4.4 Frequency
Write a program that receives a word as an input and draws a frequency bar chart. Each bar should represent a period of one month.
Solution:
The user is asked to enter a word.
We tokenize each tweet and create an empty list; for each word in a tweet that matches the user's input, the date it was tweeted (year-month-day) is appended to that list.
To create a pandas DataFrame from the dates list we build a dictionary; .dt is used with datetime-like values.
Then we draw a plot that shows the frequency by month.
import nltk
import json
import pandas as pd
import pylab as pl

user_input = input()
with open("tweets.json", "rb") as file:
    jsonList = json.loads(file.read())
dates = []
for line in jsonList:
    tokens = nltk.word_tokenize(line['text'])
    for word in tokens:
        if word == user_input:
            dates.append(line['created_at'][:10])

# build a DataFrame of datetimes, count occurrences per month and plot
df = pd.DataFrame({'date': pd.to_datetime(dates)})
df.groupby(df['date'].dt.month).size().plot(kind='bar')
pl.show()
Solution:
Attempt to solve
The fragment below is completed under the assumption that the likes and retweets of tweets containing a user-supplied noun are being collected:
import json
from collections import Counter
from nltk import word_tokenize, pos_tag

def getNouns(tweet):
    tweet = word_tokenize(tweet)  # convert string to tokens
    # keep the words that the pos_tag module in the NLTK library tags as nouns
    tweet = [word for (word, tag) in pos_tag(tweet)
             if tag == "NN" or tag == "NNS" or tag == "NNP" or tag == "NNPS"]
    return " ".join(tweet)  # join words with a space in between them

with open("tweets.json", "rb") as file:
    data = json.loads(file.read())
# assumption: the word whose likes/retweets we track is read from the user
word = input()
subs, likes, retweets, nouns = [], [], [], []
for line in data:
    tweet_nouns = getNouns(line['text']).split()
    nouns.extend(tweet_nouns)
    if word in tweet_nouns:
        subs.append(word)
        likes.append(line['likes'])
        retweets.append(line['retweets'])
print(subs)
print("\n", likes)
print("\n", retweets)
print(Counter(nouns).most_common(10))
An alternative approach with pandas:
import json
import pandas as pd
from collections import Counter

with open("tweets.json", "rb") as file:
    data = json.loads(file.read())
df = pd.DataFrame(columns=["Tweet", "likes", "retweets"])
for i in range(len(data)):
    df.loc[i] = [data[i]['text'], data[i]['likes'], data[i]['retweets']]
print(df.Tweet.str.split(expand=True).stack().value_counts())

result = Counter(" ".join(df['Tweet'].values.tolist()).split(" ")).items()
print(result)
4.6 Suggestion
Write a program that receives an incomplete word as input and prints 3 word suggestions, followed by their frequency. The suggestions should be based on the initial dataset and sorted by the word frequency computed in the first problem. The input can be any incomplete word.
Ex. Input: app, Output: application (324), apple (164), appreciate (53), where application has the highest frequency, apple the second highest, etc.
Ex. Input: pro, Output: programming (196), product (176), program (103). Again, programming has the highest frequency.
Solution:
Reading JSON from a file:
Python has a built-in package called json which can be used to work with JSON data. Among the methods it provides, loads() and load() are the ones that help us read the JSON file. The mapping between dictionary contents and a JSON string is straightforward, so it's easy to convert between the two.
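A minimal sketch of loading the dataset, using the same pattern as in problem 4.4:
import json
with open("tweets.json", "rb") as file:
    data = json.loads(file.read())
print(len(data), "tweets loaded")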
# rebuild the word list from problem 4.1 over the loaded data
from collections import Counter
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
words = [t for i in data for t in tokenizer.tokenize(i['text'])]
prefix = input()  # renamed so the built-in input() is not shadowed
suggestions = [
    word
    for word in words
    if word.lower().startswith(prefix.lower())
]
print(Counter(suggestions).most_common(3))  # the task asks for 3 suggestions
Solution:
Attempt to solve
from collections import Counter

# reuse the `words` token list built above
user_input = input()
s = [w for w in words if w.lower().startswith(user_input.lower())]
print(user_input, Counter(s).most_common(3))
Another approach builds a lexicon of word-to-next-word transition counts and normalizes them into probabilities (a simple Markov chain):
lexicon = {}

def update_lexicon(current: str, next_word: str) -> None:
    # Add the input word to the lexicon if it isn't in there yet.
    if current not in lexicon:
        lexicon.update({current: {next_word: 1}})
        return
    # Otherwise increment the count of this transition.
    lexicon[current][next_word] = lexicon[current].get(next_word, 0) + 1

# Populate lexicon
with open('tweets.txt', 'r') as dataset:
    for line in dataset:
        words = line.strip().split(' ')
        for i in range(len(words) - 1):
            update_lexicon(words[i], words[i + 1])

# Adjust probability: turn each word's transition counts into fractions
for word, transition in lexicon.items():
    total = sum(transition.values())
    lexicon[word] = {key: value / total for key, value in transition.items()}
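A small usage sketch under the same assumptions; suggest_next is a hypothetical helper, not part of the original attempt:
def suggest_next(word: str) -> str:
    # return the most probable next word, or "" if the word was never seen
    transitions = lexicon.get(word, {})
    return max(transitions, key=transitions.get) if transitions else ""

print(suggest_next("the"))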