
ASSIGNMENT 3: NAMED ENTITY RECOGNITION

Motivation: The motivation of this assignment is to get practice with sequence labeling tasks such as
Named Entity Recognition. More precisely, you will experiment with HMM and/or CRF models and
various features on a subset of a medical corpus, using a natural language processing package called
MALLET.

Problem Statement: The goal of the assignment is to build an NER system for diseases and treatments.
The input of the code will be a set of tokenized sentences and the output will be a label for each token
in the sentence. Labels can be D, T or O signifying disease, treatment or other.

Training Data: We are sharing a training dataset of labeled sentences. The format of each line in the
training dataset is “token label”. There is one token per line followed by a space and its label. Blank
lines indicate the end of a sentence. It has a total of 3655 sentences.
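
As an illustration of the format described above, here is a minimal Python sketch of a reader that groups "token label" lines into sentences. The function name read_labeled_sentences and the sample lines are assumptions made for this example; they are not part of the assignment files.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # one (token, label) pair per line


def read_labeled_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token label' lines into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # The token is the first field and the label is the last one.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the same layout as training.txt.
sample = ["chronic D", "pain D", "", "aspirin T", "helps O", ""]
parsed = read_labeled_sentences(sample)
print(len(parsed), parsed[0])
```

To read the real data, the same function can be passed an open file, for example read_labeled_sentences(open("training.txt", encoding="utf-8")).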

The Task: You need to write a sequence tagger that labels the given sentences in a tokenized test file.
The tokenized test file follows the same format as training except that it does not have the final label in
the input. Your output should label the test file in the same format as the training data.

To accomplish this, first download and install MALLET. Get familiar with the "sequence tagging" part of
MALLET by reading about the command line interface for sequence tagging and the developer's guide for
sequence tagging. This documentation is very short and incomplete, but that's all there is.

Run Mallet's SimpleTagger on training.txt with the following command:

 java -cp "/home/username/mallet-2.0.7/class:/home/username/mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --test lab --threads 2 training.txt

The above command is meant for Linux, so you may need to adapt the syntax for other operating
systems. Also, make sure to correct the path. MALLET seems to be buggy if you use a single thread, so
make sure to set the --threads option to at least 2. MALLET will optimize a CRF model on half of the data
and test it on the other half. If MALLET takes too long, increase the number of threads based on the
number of cores that your computer has. Also, use the --iterations option to reduce the number of
iterations from the default of 500 to something smaller, like 50.
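
If you drive MALLET from a script, the command above can be assembled programmatically. The following is only a sketch: the helper name simple_tagger_command is hypothetical, the installation path is the placeholder from the command above, and only options mentioned in this handout (--train, --test, --threads, --iterations) are used.

```python
import os


def simple_tagger_command(mallet_home, data_file, iterations=50, threads=None):
    """Build the SimpleTagger training command as an argument list."""
    if threads is None:
        # Use one thread per core, but never fewer than 2 (see the note above).
        threads = max(2, os.cpu_count() or 2)
    # os.pathsep is ':' on Linux and ';' on Windows.
    classpath = os.pathsep.join([
        os.path.join(mallet_home, "class"),
        os.path.join(mallet_home, "lib", "mallet-deps.jar"),
    ])
    return [
        "java", "-cp", classpath, "cc.mallet.fst.SimpleTagger",
        "--train", "true", "--test", "lab",
        "--threads", str(threads),
        "--iterations", str(iterations),
        data_file,
    ]


# Placeholder path; replace it with your own MALLET directory.
cmd = simple_tagger_command("/home/username/mallet-2.0.7", "training.txt", threads=4)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the path points at a real installation.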

To improve the tagging accuracy, create additional features that might be useful for the task. For
example, if you wish to add features for capitalization and for whether the current token is a body word, it
may look like this for the disease “chronic coronary Edwards complex”:

chronic D
coronary BODYWORD D
Edwards CAPITAL D
complex D

You may have multiple space-separated features before the final label of the token. Insert only the
features that are on (active) for that token before the label. The word itself is treated as a feature. The
order of the features does not matter.
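
The example above could be generated with a small Python helper like the one below. This is a sketch under stated assumptions: the BODY_WORDS set is a tiny made-up list for illustration (a real system would need a proper lexicon), and feature_line is a hypothetical name.

```python
# Hypothetical, illustrative word list; not supplied with the assignment.
BODY_WORDS = {"coronary", "cardiac", "renal", "hepatic"}


def feature_line(token, label=None):
    """Return 'token [features...] [label]' with only the active features."""
    features = []
    if token.lower() in BODY_WORDS:
        features.append("BODYWORD")
    if token[:1].isupper():
        features.append("CAPITAL")
    fields = [token] + features
    if label is not None:
        # Training lines end with the label; test lines have none.
        fields.append(label)
    return " ".join(fields)


tokens = [("chronic", "D"), ("coronary", "D"), ("Edwards", "D"), ("complex", "D")]
lines = [feature_line(tok, lab) for tok, lab in tokens]
print("\n".join(lines))
```

Calling feature_line(token) without a label produces the corresponding line for a test file.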

Here are some suggestions on features:

1. Try features from lower level syntactic processing like POS tagging or shallow chunking.
2. Try features that assign a semantic label to the current token. A well-known generic
ontology is WordNet. A well-known medical ontology is MeSH. Features based on these
ontologies are likely to help.
3. You may use existing word embeddings as features. One possibility is to use word2vec
embeddings that are trained on a general corpus.
4. You may also train your own word embeddings using unlabeled data. I have collected some
sample in-domain unlabeled data here.
5. You may define word shape features.
6. Your idea here…
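
As one possible reading of suggestion 5, a word shape feature maps each character to a class and collapses repeated classes. The sketch below is an assumption about what such a feature could look like; the function name word_shape and the SHAPE_ prefix are made up for this example.

```python
import re


def word_shape(token):
    """Map letters and digits to classes, then collapse repeated characters."""
    shape = re.sub(r"[A-Z]", "X", token)   # uppercase letters -> X
    shape = re.sub(r"[a-z]", "x", shape)   # lowercase letters -> x
    shape = re.sub(r"[0-9]", "d", shape)   # digits -> d
    # Collapse runs such as 'xxxx' into a single 'x' to keep the feature set small.
    return re.sub(r"(.)\1+", r"\1", shape)


for tok in ["Edwards", "IL-2", "p53"]:
    print(tok, "SHAPE_" + word_shape(tok))
```

The resulting string (for example SHAPE_Xx) can be written as one more space-separated feature before the label.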

You may also experiment with the order of the Markov chain in the CRF model by using the --orders option.
You may also modify the source code of SimpleTagger to experiment with an HMM instead of a CRF.

Methodology and Experiments:

Option 1: Remove 20% of data randomly as test set and 10% as development set. Train on 70% of the
data. If needed, do parameter fitting on the devset and finally test on the test set.
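
A minimal sketch of the Option 1 split, assuming the sentences are already loaded into a Python list. The function name split_sentences and the fixed seed are choices made for this example only.

```python
import random


def split_sentences(sentences, test_pct=20, dev_pct=10, seed=0):
    """Randomly split sentences into train, dev and test portions."""
    shuffled = list(sentences)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) * test_pct // 100
    n_dev = len(shuffled) * dev_pct // 100
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Integers stand in for the 3655 training sentences.
train, dev, test = split_sentences(range(3655))
print(len(train), len(dev), len(test))
```

Splitting whole sentences rather than individual token lines keeps every sentence intact within one portion.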

Option 2: Do a 10-fold cross validation. Train on 8 folds, use the 9th as the devset and the 10th as the test set.
Repeat this 10 times, with a different fold as the test set each time. This is a more robust option.
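
The fold rotation in Option 2 can be sketched as follows. Which fold serves as the devset in each round is not specified above; using the fold just before the test fold is an assumption of this example, as is the name fold_rotations.

```python
def fold_rotations(sentences, k=10):
    """Yield (train, dev, test) for each of k rounds of cross validation."""
    # Fold i takes every k-th sentence starting at position i.
    folds = [sentences[i::k] for i in range(k)]
    for test_idx in range(k):
        dev_idx = (test_idx - 1) % k  # assumed choice of dev fold
        train = [
            s
            for i, fold in enumerate(folds)
            if i not in (test_idx, dev_idx)
            for s in fold
        ]
        yield train, folds[dev_idx], folds[test_idx]


# Integers stand in for sentences; shuffle real data before folding.
rotations = list(fold_rotations(list(range(100))))
print(len(rotations), [len(part) for part in rotations[0]])
```

Each sentence appears in the test set of exactly one round, so averaging the ten test scores uses all of the data.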

As you work on improving your baseline system, document its performance. Perform error analysis on a
subset of the data and think about what additional knowledge could help the classifier the most. That will
guide you in picking the next feature to add.
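
One way to document performance is per-label precision, recall and F1 at the token level. The handout does not state which metric is used for grading, so the choice of metric here is an assumption, and token_prf and the tiny label sequences are made up for illustration.

```python
def token_prf(gold, predicted, label):
    """Token-level precision, recall and F1 for a single label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up gold and predicted labels for five tokens.
gold = ["D", "D", "O", "T", "O"]
pred = ["D", "O", "O", "T", "T"]
for lab in ("D", "T"):
    print(lab, token_prf(gold, pred, lab))
```

Reporting D and T separately shows which entity type a new feature actually helps, which overall accuracy can hide because most tokens are O.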

Perform an ablation study by switching off subsets of your features and observing the degradation in
performance. You can perform an alternative experiment by incrementally adding sets of features.
Either way, the goal is to identify the most useful features (and the value of each feature) for this task. In
addition to quantitative results, also look at specific examples and try to qualitatively understand the value
of each feature by noticing which examples each feature helps with.
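
The ablation loop described above can be organized as a small harness. In this sketch, evaluate is a hypothetical callback that would train and score a model with the given feature groups switched on; toy_evaluate and its numbers are placeholders so that the example runs, not experimental results.

```python
def ablation_table(feature_groups, evaluate):
    """Score the full system, then the system with each group switched off."""
    all_groups = set(feature_groups)
    full = evaluate(all_groups)
    rows = []
    for group in feature_groups:
        score = evaluate(all_groups - {group})
        rows.append((group, score, full - score))  # (group, score, drop)
    # Largest drop first: those are the most valuable groups.
    return full, sorted(rows, key=lambda row: row[2], reverse=True)


# Placeholder scorer with made-up integer scores, for demonstration only.
TOY_GAIN = {"CAPITAL": 1, "BODYWORD": 4, "SHAPE": 2}


def toy_evaluate(active_groups):
    return 60 + sum(TOY_GAIN[g] for g in active_groups)


full, rows = ablation_table(["CAPITAL", "BODYWORD", "SHAPE"], toy_evaluate)
print(full, rows)
```

In a real run, evaluate would regenerate the feature files with only the active groups, retrain, and return the devset score.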

What to submit?

1. Submit your best code (best if trained on all training data and not just on a subset) by Tuesday,
4th November 2014, 11:55 PM. The code should not need to train again. You should submit only
the testing code, after the models have been trained. That is, you should not need to access the
training data anymore.

Submit your code in a .zip file named in the format <EntryNo>.zip. Make sure that when we
run “unzip yourfile.zip” the following files are produced in the present working directory:

compile.sh
run.sh
Writeup.pdf (and not writeup.pdf, Writeup.doc, etc)

You will be penalized if your submission does not conform to this requirement.

Your code will be run as ./run.sh inputfile.txt outputfile.txt. The outputfile.txt should have the
same number of lines as inputfile.txt, and each token line should have two additional characters
(a space and the label). Here is a format checker. Make sure your code passes the format checker
before final submission.
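
Before running the official format checker, you can sanity-check your output yourself. The sketch below is not the course's checker; check_format is a hypothetical helper that only tests the two rules stated above (same number of lines, and each token line extended by a space and a label).

```python
def check_format(input_lines, output_lines, labels=("D", "T", "O")):
    """Return a list of problems; an empty list means the format looks fine."""
    if len(input_lines) != len(output_lines):
        return [f"line count differs: {len(input_lines)} vs {len(output_lines)}"]
    problems = []
    for n, (src, out) in enumerate(zip(input_lines, output_lines), start=1):
        src, out = src.rstrip("\r\n"), out.rstrip("\r\n")
        if not src.strip():
            # Sentence boundaries must stay blank in the output.
            if out.strip():
                problems.append(f"line {n}: expected a blank line")
        elif out not in {f"{src} {lab}" for lab in labels}:
            problems.append(f"line {n}: expected '{src} <label>'")
    return problems


good = check_format(["aspirin", "helps", ""], ["aspirin T", "helps O", ""])
bad = check_format(["aspirin", "helps", ""], ["aspirin T", "helps", ""])
print(good, bad)
```

With real files, pass open("inputfile.txt").readlines() and open("outputfile.txt").readlines() as the two arguments.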

Your code should run on Baadal machines with 2 GB RAM.

2. Your writeup (at most 2 pages, 10 pt font) should describe how you created your best NER
system (about 1 page). Explain any interesting ideas that you used. Describe your ablation study,
detailing the value of each feature quantitatively. Give specific examples to describe the value of
each feature qualitatively. Also mention the names of people you discussed the assignment with.
The writeup will be judged on clarity and innovation as well as experimental results.

Evaluation Criteria

(1) 30 points are for description of the system. Anything innovative may yield bonus points.
(2) 60 points for performance of your code.
(3) 10 points bonus given to outstanding write-ups and best performing systems.

What is allowed? What is not?

1. The assignment is to be done individually.

2. You may use Java or Python for this assignment.
3. You must not discuss this assignment with anyone outside the class. Make sure you mention
the names in your write-up in case you discuss with anyone from within the class outside your
team. Please read academic integrity guidelines on the course home page and follow them
carefully.
4. Feel free to search the Web for papers or other websites describing how to build named entity
recognizers. Cite the references in your writeup. However, you should not use (or read) other
people’s NER code.
5. We will run plagiarism detection software. Any team found guilty will be awarded a suitable
penalty as per IIT rules.
6. Your code will be automatically evaluated. You get a zero if it does not conform to the output
guidelines. Make sure it satisfies the format checker before you submit.
