Important Steps: Gensim: A Python Library For NLP and Word Embeddings
Soln:
Important Steps
# Example legal corpus
legal_corpus = [
"The court ruled in favor of the plaintiff.",
"The defendant was found guilty of negligence.",
"A breach of contract case was filed.",
"The agreement between parties must be honored.",
"The lawyer presented compelling evidence.",
"Legal documents must be drafted carefully.",
"The jury deliberated for several hours.",
"A settlement was reached between the parties.",
"The plaintiff claimed damages for losses incurred.",
"The contract outlined the obligations of both parties."
]
Output:
0.00374052 0.00413726]
word_vectors
Output:
# Dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(word_vectors)
reduced_vectors
Output:
array([[ 0.02688162, -0.00792018],
[ 0.00493226, -0.04934309],
[-0.00377306, -0.04936944],
[ 0.02256997, 0.03808062],
[-0.0355795 , -0.01066101],
[ 0.02682294, -0.01050709],
[ 0.01486912, 0.0443972 ],
[ 0.04605154, 0.01166099],
[-0.0482769 , -0.0079725 ],
[-0.05449799, 0.0416345 ]])
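Each column of the array above sums to roughly zero, because PCA centers the data before projecting it. The sketch below illustrates what `fit_transform` computes by comparing scikit-learn's result with a manual center, SVD, and project calculation. The random `stand_in_vectors` matrix is a placeholder for real word vectors, and its shape (10 words, 100 dimensions) is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for 10 word vectors (shape is an assumption)
rng = np.random.default_rng(0)
stand_in_vectors = rng.normal(size=(10, 100))

# scikit-learn: project onto the two directions of largest variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(stand_in_vectors)

# Manual equivalent: center the data, take the SVD, project onto the top two right singular vectors
centered = stand_in_vectors - stand_in_vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
manual = centered @ vt[:2].T

print(reduced.shape)
# Fraction of the total variance retained by each of the two components
print(pca.explained_variance_ratio_)
```

The two results can differ in sign per component, since the direction of a principal axis is arbitrary. Checking `explained_variance_ratio_` is a quick way to see how much structure a 2-D plot can actually show.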
# Plot embeddings
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
for i, word in enumerate(words_to_visualize):
    # One point per word, with its label placed slightly to the right
    plt.scatter(reduced_vectors[i, 0], reduced_vectors[i, 1])
    plt.text(reduced_vectors[i, 0] + 0.002, reduced_vectors[i, 1],
             word, fontsize=12)
plt.title("PCA Visualization of Legal Word Embeddings")
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.show()
Output:
# Medical domain
"The patient was admitted to the emergency department with severe chest pain.",
"The surgeon successfully performed a minimally invasive procedure to remove the tumor.",
"Clinical trials showed significant improvement in patients treated with the experimental drug.",
"Regular screening is essential for early detection of chronic illnesses such as diabetes.",
"The doctor recommended physical therapy to improve mobility after surgery.",
"The hospital implemented stringent protocols to prevent the spread of infectious diseases.",
"The nurse monitored the patient's vital signs hourly to ensure stability.",
"Vaccination campaigns have drastically reduced the prevalence of polio worldwide.",
"The radiologist identified a small abnormality in the CT scan requiring further investigation.",
"Proper nutrition and exercise are vital components of a healthy lifestyle."
]
Output:
[['the', 'court', 'ordered', 'the', 'immediate', 'release', 'of', 'the', 'detained', 'individual', 'due', 'to', 'lack', 'of', 'evidence'],
['new', 'amendment', 'was', 'introduced', 'to', 'ensure', 'the', 'protection', 'of', 'intellectual', 'property', 'rights'],
['the', 'defendant', 'pleaded', 'not', 'guilty', 'citing', 'an', 'alibi', 'supported', 'by', 'credible', 'witnesses'],
['the', 'plaintiff', 'accused', 'the', 'company', 'of', 'violating', 'environmental', 'regulations'],
['settlement', 'agreement', 'was', 'reached', 'through', 'arbitration', 'avoiding', 'lengthy', 'trial'],
['the', 'legal', 'team', 'presented', 'compelling', 'argument', 'to', 'overturn', 'the', 'previous', 'judgment'],
['contractual', 'obligations', 'must', 'be', 'fulfilled', 'unless', 'waived', 'by', 'mutual', 'consent'],
['the', 'jury', 'found', 'the', 'accused', 'guilty', 'of', 'fraud', 'and', 'embezzlement'],
['the', 'appeal', 'was', 'dismissed', 'as', 'the', 'evidence', 'presented', 'was', 'deemed', 'inadmissible'],
['the', 'attorney', 'emphasized', 'the', 'importance', 'of', 'adhering', 'to', 'constitutional', 'rights'],
['the', 'patient', 'was', 'admitted', 'to', 'the', 'emergency', 'department', 'with', 'severe', 'chest', 'pain'],
['the', 'surgeon', 'successfully', 'performed', 'minimally', 'invasive', 'procedure', 'to', 'remove', 'the', 'tumor'],
['clinical', 'trials', 'showed', 'significant', 'improvement', 'in', 'patients', 'treated', 'with', 'the', 'experimental', 'drug'],
['regular', 'screening', 'is', 'essential', 'for', 'early', 'detection', 'of', 'chronic', 'illnesses', 'such', 'as', 'diabetes'],
['the', 'doctor', 'recommended', 'physical', 'therapy', 'to', 'improve', 'mobility', 'after', 'surgery'],
['the', 'hospital', 'implemented', 'stringent', 'protocols', 'to', 'prevent', 'the', 'spread', 'of', 'infectious', 'diseases'],
['the', 'nurse', 'monitored', 'the', 'patient', 'vital', 'signs', 'hourly', 'to', 'ensure', 'stability'],
['vaccination', 'campaigns', 'have', 'drastically', 'reduced', 'the', 'prevalence', 'of', 'polio', 'worldwide'],
['the', 'radiologist', 'identified', 'small', 'abnormality', 'in', 'the', 'ct', 'scan', 'requiring', 'further', 'investigation'],
['proper', 'nutrition', 'and', 'exercise', 'are', 'vital', 'components', 'of', 'healthy', 'lifestyle']]
# Train Word2Vec
from gensim.models import Word2Vec

domain_word2vec = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,  # Higher embedding dimension for better representation
    window=5,         # Wider context window
    min_count=1,      # Include all words
    sg=1,             # Skip-gram model
    epochs=150        # More training iterations
)
word_vectors
Output:
(a list of 10 arrays, each with dtype=float32; values omitted)
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(word_vectors)
reduced_vectors
plt.figure(figsize=(12, 8))
for i, word in enumerate(selected_words):
    plt.scatter(reduced_vectors[i, 0], reduced_vectors[i, 1])
    plt.text(reduced_vectors[i, 0] + 0.002, reduced_vectors[i, 1],
             word, fontsize=12)
plt.title("PCA Visualization of Legal and Medical Word Embeddings")
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.show()
Output: