Nutch Installation Guide
NUTCH INSTALLATION & CONFIGURATION
GUIDE FOR USE IN THE NTER SYSTEM
Prepared By:
Leigh Moulder, SRI International
leigh.moulder@sri.com
TABLE OF CONTENTS
Document Change Log
Nutch Server Information
    Account Information
    Installation Locations
    Resources
Master Nutch Installation
    Gather Software
    Nutch Home Directory
    Configure NTER
Upgrading Nutch to Release 1.4
Appendix A – Deployed Configuration
    Account Information
    Installation Locations
DOCUMENT CHANGE LOG
Release Date   Document Version   Notes
8/1/2011       1.0                Initial Release
10/1/2011      1.1                Updated document formatting
1/17/2012      1.2                Updated documentation for Nutch 1.4 Release
2/17/2012      1.3                Simplified installation steps
NUTCH SERVER INFORMATION
ACCOUNT INFORMATION
INSTALLATION LOCATIONS
RESOURCES
Nutch download page: http://www.apache.org/dist/nutch/apache-nutch-1.4-bin.tar.gz
MASTER NUTCH INSTALLATION
The Master Nutch installation only needs to be performed once per NTER deployment. It is designed to run on the
‘Master’ NTER node and provides full‐text crawling for all other NTER instances.
GATHER SOFTWARE
The majority of Nutch is included with the NTER course‐portlet webapp. As such, these instructions assume NTER
has successfully been deployed.
1. Download and extract the Nutch binary file to the /tmp directory.
cd /tmp
wget http://www.apache.org/dist/nutch/apache-nutch-1.4-bin.tar.gz
tar xzf apache-nutch-1.4-bin.tar.gz
NUTCH HOME DIRECTORY
1. Create the Nutch home and data directories.
cd /
mkdir -p ${nutch.home}
mkdir -p ${nutch.home}/data
mkdir -p ${nutch.home}/urls
2. Copy the Nutch plugins to the Nutch home directory.
cd ${nutch.home}
cp -r /tmp/apache-nutch-1.4/runtime/local/plugins .
3. Make the Tomcat user the owner of everything under the Nutch home directory.
cd ${nutch.home}
chown -R ${tomcat.user}:${tomcat.user} *
4. Once NTER is configured with the correct Nutch home properties (below), all necessary data directories will
automatically be created.
5. Due to the tight integration between Nutch and the course‐portlet, no other configuration or binary files are
needed.
6. During various crawl stages, Nutch needs to create temporary directories. These are created automatically
under the working directory of the calling service, in this case ${tomcat.base}. Ensure that ${tomcat.user}
owns this directory and all of its subdirectories.
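Steps 1 through 3 can be collected into a single shell function. The following is only a sketch: setup_nutch_home is a hypothetical helper, and its three arguments stand in for ${nutch.home}, the extracted archive directory, and ${tomcat.user}, which are placeholders in this guide rather than real shell variables. Unlike step 3, it changes ownership of the Nutch home directory itself as well as its contents.

```shell
#!/bin/sh
# Sketch of steps 1-3 above. setup_nutch_home is a hypothetical helper,
# not part of Nutch or NTER.

setup_nutch_home() {
    nutch_home=$1   # stands in for ${nutch.home}
    nutch_src=$2    # extracted archive, e.g. /tmp/apache-nutch-1.4
    owner=$3        # stands in for ${tomcat.user}; empty skips chown

    # Step 1: create the home, data, and urls directories.
    mkdir -p "$nutch_home/data" "$nutch_home/urls" || return 1

    # Step 2: copy the plugins from the extracted distribution.
    if [ ! -d "$nutch_src/runtime/local/plugins" ]; then
        echo "plugins directory not found under $nutch_src" >&2
        return 1
    fi
    cp -r "$nutch_src/runtime/local/plugins" "$nutch_home/" || return 1

    # Step 3: hand ownership to the Tomcat account (usually needs root).
    if [ -n "$owner" ]; then
        chown -R "$owner:$owner" "$nutch_home" || return 1
    fi
}

# Example invocation; the home path and account name are assumptions:
# setup_nutch_home /path/to/nutch-home /tmp/apache-nutch-1.4 tomcat
```

Passing an empty owner makes it possible to try the function as an unprivileged user before running it as root.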
CONFIGURE NTER
1. Make the following updates to NTER’s portal-ext.properties file.
a. nter.nutch.role : Should only be set if this is the master Nutch node. If so, set to “master”.
b. nter.nutch.home.dir : Set to the Nutch home directory created above.
c. nter.nutch.indexer.type : Determines the type of indexer used by Nutch. Currently, the only valid option
is “solr”.
d. nter.nutch.solr.url : The URL of the Solr index server.
e. nter.nutch.solr.user : The user account used to connect to the Solr index. This is only needed if security
has been configured on the Solr server.
f. nter.nutch.solr.password : The password for the user account used to connect to the Solr index. This is
only needed if security has been configured on the Solr server.
##
## Nutch Settings
##
nter.nutch.role=master
nter.nutch.home.dir=${nutch.home}
nter.nutch.indexer.type=solr
nter.nutch.solr.url=${solr.url}/solr/${solr.core}
nter.nutch.solr.user=${solr.user}
nter.nutch.solr.password=${solr.password}
2. Optionally, update any additional Nutch configuration settings. The following performance-related
settings can be changed in the portlet.xml file, located at
${catalina.base}/webapps/course-portlet/WEB-INF/portlet.xml.
Property       Default Value   Description
crawlTimer     30              The interval (in minutes) between Nutch crawls.
threadsLimit   5               The maximum number of concurrent threads used to fetch web
                               pages. Increasing this value can improve crawl speed since more
                               threads are used concurrently. However, too high of a value can
                               cause server performance issues.
depthLimit     10              The maximum URL depth to traverse. Decreasing this value will
                               speed up crawling and indexing, but reduce the number of pages
                               crawled. Increasing this value will increase index time, and increase
                               the depth of information.
3. Restart Tomcat to have the changes take effect.
/etc/init.d/tomcat6 restart
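Before restarting, it can be useful to confirm that the properties from step 1 actually made it into portal-ext.properties. The sketch below uses a hypothetical helper, check_nutch_props, which looks for uncommented entries for three keys. Treating nter.nutch.home.dir, nter.nutch.indexer.type, and nter.nutch.solr.url as required is an assumption drawn from the descriptions above; the role, user, and password keys are skipped because the guide describes them as conditional.

```shell
#!/bin/sh
# Sketch: verify that the Nutch keys from step 1 are present in a
# properties file. check_nutch_props is a hypothetical helper, and the
# list of "required" keys is an assumption, not an NTER rule.

check_nutch_props() {
    props_file=$1
    missing=0
    for key in nter.nutch.home.dir nter.nutch.indexer.type nter.nutch.solr.url; do
        # Compare the text before the first "=" exactly, so commented-out
        # lines such as "#nter.nutch.home.dir=..." do not count.
        if ! awk -F= -v k="$key" '
                { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
                name == k { found = 1 }
                END { exit !found }' "$props_file"; then
            echo "missing property: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example invocation; the path to the file is an assumption:
# check_nutch_props /path/to/portal-ext.properties && /etc/init.d/tomcat6 restart
```

The check only confirms that the keys exist. It does not validate their values or test that the Solr server is reachable.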
UPGRADING NUTCH TO RELEASE 1.4
NTER’s implementation of Nutch does not store any data for use in future crawls, so nothing needs to be
migrated. The simplest way to upgrade a previous Nutch installation is therefore to remove the existing Nutch
directory and perform a clean installation.
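The removal step can be sketched as a small guarded function. remove_nutch_home is a hypothetical helper and its argument stands in for ${nutch.home}. The guard against an empty path or the filesystem root is an added precaution, not something NTER requires.

```shell
#!/bin/sh
# Sketch: remove an existing Nutch home before a clean installation.
# remove_nutch_home is a hypothetical helper, not part of Nutch or NTER.

remove_nutch_home() {
    nutch_home=$1   # stands in for ${nutch.home}

    # Refuse obviously dangerous targets.
    case "$nutch_home" in
        ""|"/")
            echo "refusing to remove '$nutch_home'" >&2
            return 1
            ;;
    esac

    # Nothing to do if no previous installation exists.
    [ -d "$nutch_home" ] || return 0

    rm -rf "$nutch_home"
}

# Example invocation; the path is an assumption:
# remove_nutch_home /path/to/nutch-home
```

Once the directory is gone, the steps under Master Nutch Installation can be repeated from the beginning.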
APPENDIX A – DEPLOYED CONFIGURATION
The following configuration was used for www.nterlearning.org.
ACCOUNT INFORMATION
INSTALLATION LOCATIONS