Apache Nutch - Step by Step
Search is one of the most fascinating areas of the technology industry, and it has been tackled many times with different algorithms and varying degrees of success. We get so used to it that I often wish I had Cmd-F while reading a physical book.
Recently we had our Quarterly Hack Week at Marqeta, and one of the ideas was to build search around our public pages. These pages would include the public website assets as well as the API developer guides and documentation. This post is a quick summary of the infrastructure, setup, and gotchas of using Nutch 2.3.1 to build a site search - essentially my notes from this hack week project.
If you are not familiar with the Apache Nutch crawler, please visit the project website. Nutch 2.x and Nutch 1.x are fairly different in terms of setup, execution, and architecture. Nutch 2.x uses Apache Gora to manage NoSQL persistence over many database stores. Nutch 1.x, however, has been around much longer, has more features, and has many more bug fixes than Nutch 2.x. If your search needs are more advanced, consider Nutch 1.x; if flexibility of database stores is important, pick Nutch 2.x.
Versions
We will use Apache Nutch 2.3.1, MongoDB 3.4.7, and Solr 6.5.1. I tried using Elasticsearch, but as a simple Google search will reveal, the Nutch Elasticsearch indexing plugins depend on fairly old Elasticsearch versions. Even Solr 6.6.0 did not work due to a field deprecation, so we will stick with the previous release, 6.5.1. And yes, there are a few hacks we will need to get Solr 6.5.1 working as well.
Operating System
I’ve used Ubuntu 16.04 LTS on Amazon Web Services and also Debian 8 on Vagrant, with only minor differences. Your flavor of Linux may vary; as long as you have the correct versions of the main components (MongoDB, Nutch, and Solr), you should be good. I did not try setting this up on a Mac, though. We will stick to Ubuntu 16.04 LTS for the rest of this tutorial.
ubuntu@ip-1*2-3*-**-**:~$ uname -a
Linux ip-1*2-3*-**-** 4.4.0-1022-aws #31-Ubuntu SMP Tue Jun 27 11:27:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Java
We will use OpenJDK8, but you can also use Oracle JDK 8.
$ sudo apt-get install openjdk-8-jdk
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
MongoDB
$ wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu1604-3.4.7.tgz
$ mkdir data
$ mkdir logs
$ tar xvfz mongodb-linux-x86_64-ubuntu1604-3.4.7.tgz
$ cd mongodb-linux-x86_64-ubuntu1604-3.4.7/bin
$ ./mongod --dbpath ~/data/ --logpath ~/logs/mongodb.log --fork
Nutch
Nutch 2.x is only available as a source bundle, so it will need to be built using ant after configuring.
$ wget http://www-eu.apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
$ tar xvfz apache-nutch-2.3.1-src.tar.gz
$ cd apache-nutch-2.3.1/conf
Next, we configure Nutch by editing $NUTCH_HOME/conf/nutch-site.xml. This is where we define the crawl database store, enable plugins, and restrict the crawl to only the domain(s) we define.
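As a rough sketch, a minimal nutch-site.xml could look like the one below. The agent name and plugin list are illustrative values rather than the exact configuration from this project, so adapt them to your site.
<?xml version="1.0"?>
<configuration>
  <!-- Identify the crawler to the sites it visits (Nutch requires this). -->
  <property>
    <name>http.agent.name</name>
    <value>my-site-search-crawler</value>
  </property>
  <!-- Persist crawl data in MongoDB via Gora. -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
  </property>
  <!-- Only follow links within the seed domain(s). -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <!-- Example plugin set: HTTP fetching, HTML/Tika parsing, basic indexing, Solr indexer. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>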
We then instruct Nutch to use MongoDB via the $NUTCH_HOME/conf/gora.properties file. Nutch 2.x uses Apache Gora to manage persistence.
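For reference, the MongoDB-related entries in gora.properties look roughly like the following, assuming MongoDB is running locally on the default port and the crawl database is named nutchdb (the name we will see later in the mongo shell):
# Use MongoDB as the default Gora data store.
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
# Connection settings for the local MongoDB instance.
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutchdb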
We also edit $NUTCH_HOME/conf/ivy/ivy.xml to enable the MongoDB driver that Apache Gora will use. This is done by uncommenting the MongoDB line in the file.
<!-- Uncomment this to use MongoDB as Gora backend. -->
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
Here is the gist for ivy.xml
Now we build Nutch. Install ant if it is not already installed.
$ sudo apt-get install ant
And we build Nutch from the $NUTCH_HOME folder.
$ pwd
/home/ubuntu/apache-nutch-2.3.1
$ ant runtime
This will take a while (about 4-5 mins).
Solr
Let us get Solr 6.5.1 set up.
We will download and install Solr, and create a core named nutch to index the crawled pages. Then, we will copy the schema.xml from the Nutch configuration into this newly created core.
$ wget http://archive.apache.org/dist/lucene/solr/6.5.1/solr-6.5.1.tgz
$ tar xvfz solr-6.5.1.tgz
$ cd solr-6.5.1/bin
$ ./solr start
$ ./solr create_core -c nutch -d basic_configs
$ ./solr stop
$ cd ../server/solr/nutch/conf
$ cp ~/apache-nutch-2.3.1/conf/schema.xml .
Here comes the skullduggery. We will need to “fix” both schema.xml and solrconfig.xml in this folder. We will also remove the managed-schema file, as we are providing the schema configuration externally.
$ rm managed-schema
$ vi schema.xml
It is important to remove all instances of enablePositionIncrements="true" from every <filter class="solr.StopFilterFactory" ...> declaration. If they are not removed, the core will fail to initialize.
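As an illustration, a typical declaration in the Nutch-supplied schema.xml changes from the first line below to the second (the other attributes may differ in your copy):
<!-- Before: Solr 6.x rejects this attribute and the core fails to load. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<!-- After: the same filter with the offending attribute removed. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>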
Here is the gist for schema.xml
Next, we have to fix solrconfig.xml.
$ vi solrconfig.xml
Locate the section for AddSchemaFieldsUpdateProcessorFactory and comment out the <lst> elements, like so:
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">strings</str>
<!--
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">tdates</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Long</str>
<str name="valueClass">java.lang.Integer</str>
<str name="fieldType">tlongs</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Number</str>
<str name="fieldType">tdoubles</str>
</lst> -->
</processor>
Now, we start solr.
$ cd ~/solr-6.5.1/bin
$ ./solr start
Crawl and Index
Now that we have everything set up, we are ready to put Nutch in action.
First, we tell Nutch which URL(s) to crawl. We do this by creating a simple text file and pointing Nutch to it.
$ cd ~/apache-nutch-2.3.1/
$ mkdir urls
$ vi urls/seeds.text
Enter the URL(s) in this file, one per line, for example https://www.wikipedia.org.
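For this walkthrough, a minimal seeds.text contains just the single seed URL, which matches the one injected URL reported in the output below:
https://www.wikipedia.org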
Once the seed file is set up, run the following -
$ runtime/local/bin/nutch inject urls/
InjectorJob: starting at 2017-08-14 07:43:22
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2017-08-14 07:43:25, elapsed: 00:00:03
This has initialized the crawl database - we can use the MongoDB CLI to check out the resulting database and collection.
> show dbs
local 0.000GB
nutchdb 0.005GB
> use nutchdb
switched to db nutchdb
> show collections
webpage
Next, we generate a fetch list of up to the top 50 URLs. Do not worry if you see a different number, like the 20 below.
$ runtime/local/bin/nutch generate -topN 50
GeneratorJob: starting at 2017-08-14 08:56:36
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 50
GeneratorJob: finished at 2017-08-14 08:56:38, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1502528196-1091715892 containing 20 URLs
Now that Nutch has selected N URLs, we go ahead and fetch them.
$ runtime/local/bin/nutch fetch -all
This will fetch the N URLs, and we’ll see a ton of output.
Once fetched, the content needs to be parsed.
$ runtime/local/bin/nutch parse -all
Next, we update the DB with the current status.
$ runtime/local/bin/nutch updatedb -all
Finally, we index these pages in Solr:
$ runtime/local/bin/nutch solrindex http://localhost:8983/solr/nutch -all
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
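To quickly verify that documents actually landed in the index, a match-all query with rows=0 returns just the document count; this assumes the core is named nutch, as created earlier.
$ curl "http://localhost:8983/solr/nutch/select?q=*:*&rows=0&wt=json"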
If you have access to the Solr console (http://host:8983), fire it up in a browser.
If this is an AWS EC2 instance, you’ll need to ensure HTTP access is allowed on port 8983 via the Security Groups and NACLs; if this is in Vagrant, use port forwarding in the Vagrantfile to map guest port 8983 to any host port you’d like to use.
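Here is a minimal sketch of that forwarding rule in the Vagrantfile, with host port 8983 chosen arbitrarily:
Vagrant.configure("2") do |config|
  # Forward Solr's guest port 8983 to port 8983 on the host.
  config.vm.network "forwarded_port", guest: 8983, host: 8983
end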
Once in the Solr admin console, we can try firing off queries. If you cannot access the admin UI, the same can be done via curl, like so:
$ curl "http://localhost:8983/solr/nutch/select?fl=url,%20meta_description,%20anchor,%20title&indent=on&q=content:test&wt=json"
Here, we’re querying Solr for any content that matches test (hence q=content:test) and returning only the url, meta_description, anchor, and title fields (hence fl=url,%20meta_description,%20anchor,%20title). We will get a list of at most 10 results in JSON format. You may want to play with different values and fields, either via the Solr console or curl. Refer to the Solr query syntax here.
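For instance, to search the title field instead and cap the response at 5 results, a query along these lines works (the field and search term are purely illustrative):
$ curl "http://localhost:8983/solr/nutch/select?fl=url,title&indent=on&q=title:wikipedia&rows=5&wt=json"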
There we have it - a fully functional, end to end crawler and indexer setup!
Please note that the generate-fetch-parse-updatedb-index steps will need to be run regularly. It is a good idea to set up a cron job to execute these steps at a desired interval, based on how frequently the site being indexed changes.
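As a sketch, a crontab entry that re-runs the whole cycle every six hours might look like the following; the paths, -topN value, and schedule are assumptions to adjust for your install.
0 */6 * * * cd /home/ubuntu/apache-nutch-2.3.1 && runtime/local/bin/nutch generate -topN 50 && runtime/local/bin/nutch fetch -all && runtime/local/bin/nutch parse -all && runtime/local/bin/nutch updatedb -all && runtime/local/bin/nutch solrindex http://localhost:8983/solr/nutch -all >> /home/ubuntu/logs/crawl.log 2>&1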
In this post, I did not cover alternatives like Scrapy, Beautiful Soup, crawler4j, etc., but I would encourage you to check them out if you are still in the discovery/research phase before deciding on Nutch.
Thoughts, feedback, ideas? Please let me know in the comments below.