Before you start

Prerequisites

  • You need to have Apache Ant installed and configured on your system.

  • Install the newest version of Eclipse.

  • All of the plugins below should be available from the Eclipse Marketplace. If not, you can install them through Eclipse as follows.

  • Once you've set up Eclipse, install Subclipse. N.B. If you experience an error with the 1.8.x release, try 1.6.x. This tends to solve compatibility problems.

  • Install the IvyDE plugin for Eclipse.

  • Install the m2e plugin for Eclipse.

Steps

Checkout and Build Nutch

  • Get the latest source code from SVN using a terminal. For Nutch 1.x (i.e. trunk) run this:

      svn co https://svn.apache.org/repos/asf/nutch/trunk
      cd trunk

    For Nutch 2.x run this:

      svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
      cd 2.x

  • At this point you should decide which data store you want to use (this applies to Nutch 2.x, which stores its data through Apache Gora). See the Apache Gora documentation for more information. Here are a few of the available storage classes:

      org.apache.gora.hbase.store.HBaseStore
      org.apache.gora.cassandra.store.CassandraStore
      org.apache.gora.accumulo.store.AccumuloStore
      org.apache.gora.avro.store.AvroStore
      org.apache.gora.avro.store.DataFileAvroStore

    Specify the chosen class in conf/nutch-site.xml:

      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
      </property>

  • In ivy/ivy.xml, uncomment the dependency for the data store that you selected, e.g. if you plan to use HBase, uncomment the gora-hbase dependency.
  • Set the default datastore in conf/gora.properties, e.g. for HBase as the datastore, put this in conf/gora.properties:

      gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

  • Add “http.agent.name” and “http.robots.agents” with appropriate values in “conf/nutch-site.xml”. See conf/nutch-default.xml for the description of these properties. Also, add “plugin.folders” and set it to {PATH_TO_NUTCH_CHECKOUT}/build/plugins, e.g. if Nutch is present at "/home/tejas/Desktop/2.x", set the property to:

      <property>
        <name>plugin.folders</name>
        <value>/home/tejas/Desktop/2.x/build/plugins</value>
      </property>

  • Run this command from the checkout directory: ant eclipse
  • Load the project in Eclipse:

    1. In Eclipse, click on “File” -> “Import...”

    2. Select “Existing Projects into Workspace”
    3. In the next window, set the root directory to the location where you checked out Nutch 2.x (or trunk). Click “Finish”.
    4. You will now see a new project named 2.x (or trunk) being added in the workspace. Wait for a moment until Eclipse refreshes its SVN cache and builds its workspace. You can see the status at the bottom right corner of Eclipse.
    5. In Package Explorer, right click on the project “2.x” (or trunk), select “Build Path” -> “Configure Build Path”

    6. In the “Order and Export” tab, scroll down and select “2.x/conf” (or trunk/conf). Click on the “Top” button. Eclipse will rebuild the workspace again, but this time it won’t take long.
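
The configuration steps above (agent name, plugin folder, data store) can also be scripted. What follows is a minimal sketch and not part of Nutch: the function names property, write_nutch_site and set_gora_default, and the agent name MyTestCrawler, are invented for this example. Note that write_nutch_site overwrites any existing conf/nutch-site.xml, so try it on a scratch copy first. The storage class is only written when a third argument is given, since it applies to 2.x.

```shell
# Sketch only: all function names here are hypothetical helpers, not Nutch tools.

# Emit one <property> element with the given name and value.
property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Write a minimal conf/nutch-site.xml (overwrites an existing file).
# $1 = Nutch checkout, $2 = agent name, $3 = Gora store class (2.x only, optional)
write_nutch_site() {
  checkout=$1
  agent=$2
  store=${3:-}
  mkdir -p "$checkout/conf"
  {
    printf '<?xml version="1.0"?>\n<configuration>\n'
    property http.agent.name "$agent"
    property http.robots.agents "$agent,*"
    property plugin.folders "$checkout/build/plugins"
    if [ -n "$store" ]; then
      property storage.data.store.class "$store"
    fi
    printf '</configuration>\n'
  } > "$checkout/conf/nutch-site.xml"
}

# Make the given store the default in conf/gora.properties.
# $1 = Nutch checkout, $2 = Gora store class
set_gora_default() {
  props=$1/conf/gora.properties
  mkdir -p "$1/conf"
  touch "$props"
  # Keep every line except an existing active default, then append the new one.
  grep -v '^[[:space:]]*gora\.datastore\.default=' "$props" > "$props.new" || true
  printf 'gora.datastore.default=%s\n' "$2" >> "$props.new"
  mv "$props.new" "$props"
}

# Example calls (paths taken from the text above):
#   write_nutch_site /home/tejas/Desktop/2.x MyTestCrawler org.apache.gora.hbase.store.HBaseStore
#   set_gora_default /home/tejas/Desktop/2.x org.apache.gora.hbase.store.HBaseStore
```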

    Create Eclipse launcher

    • For 1.x (i.e. trunk), set the main class to: org.apache.nutch.crawl.Injector
    • For 2.x, set the main class to: org.apache.nutch.crawl.InjectorJob

    In the “Arguments” tab, under program arguments, provide the path of the input directory that holds the seed URLs. Set the VM arguments to “-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log”.

    Click "Apply" and then click "Run". If everything is set up correctly, you should see the inject operation progressing in the console.
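
The input directory passed as a program argument is just a directory of text files with one URL per line. A minimal sketch for creating one is shown below; the helper name make_seed_dir and the file name seed.txt are arbitrary choices for this example.

```shell
# Create a seed directory holding one URL per line.
# $1 = directory to create, remaining arguments = seed URLs
make_seed_dir() {
  dir=$1
  shift
  mkdir -p "$dir"
  printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example call:
#   make_seed_dir urls http://nutch.apache.org/
```

With the directory created as above, "urls" is the value to enter as the program argument. As far as I know, the 1.x Injector additionally expects the crawldb path as its first argument, so check the usage message it prints if the run fails immediately.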

    If you want to find out the Java class corresponding to any command, look inside the "src/bin/nutch" script: near the bottom there is a conditional block with one branch per command. Here are the important classes corresponding to the crawl cycle:

    Operation   Class in Nutch 1.x (i.e. trunk)        Class in Nutch 2.x
    ---------   ------------------------------------   -----------------------------------
    inject      org.apache.nutch.crawl.Injector        org.apache.nutch.crawl.InjectorJob
    generate    org.apache.nutch.crawl.Generator       org.apache.nutch.crawl.GeneratorJob
    fetch       org.apache.nutch.fetcher.Fetcher       org.apache.nutch.fetcher.FetcherJob
    parse       org.apache.nutch.parse.ParseSegment    org.apache.nutch.parse.ParserJob
    updatedb    org.apache.nutch.crawl.CrawlDb         org.apache.nutch.crawl.DbUpdaterJob
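
For commands that are not listed here, the lookup in the launcher script can be automated. The sketch below assumes the script selects the class in blocks of the form shown in the comment (a test on "$COMMAND" followed by a CLASS= assignment); if your copy of src/bin/nutch is laid out differently, the pattern needs adjusting. The function name nutch_class_for is made up for this example.

```shell
# Print the Java class that a bin/nutch command maps to.
# Assumes blocks of the form:
#   elif [ "$COMMAND" = "inject" ] ; then
#     CLASS=org.apache.nutch.crawl.Injector
# $1 = path to the launcher script, $2 = command name
nutch_class_for() {
  awk -v cmd="$2" '
    # Line that tests for the requested command: start looking for CLASS=.
    index($0, "\"$COMMAND\" = \"" cmd "\"") { found = 1; next }
    # Another command test came first: this command sets no class.
    found && /COMMAND/ { exit }
    found && /CLASS=/ { sub(/.*CLASS=/, ""); print; exit }
  ' "$1"
}

# Example call:
#   nutch_class_for src/bin/nutch updatedb
```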

    Debug Nutch in Eclipse

    • Set breakpoints and debug a crawl
    • It can be tricky to find out where to set a breakpoint, because the work happens inside Hadoop jobs.
    • Here are a few good places to set breakpoints in the 1.x codebase (line numbers shift between revisions, so treat them as approximate):

    Fetcher [line: 1115] - run
    Fetcher [line: 530] - fetch
    Fetcher$FetcherThread [line: 560] - run()
    Generator [line: 443] - generate
    Generator$Selector [line: 108] - map
    OutlinkExtractor [line: 71 & 74] - getOutlinks

    • Here are a few good places to set breakpoints in the 2.x codebase:

                                       : line 519 : final ProtocolStatus status = output.getStatus();
    GeneratorMapper : map() : line 53
    GeneratorReducer : reduce() : line 53
    OutlinkExtractor : getOutlinks() : line 84
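
Because the line numbers in these lists depend on the revision you checked out, searching for the statement itself is usually more reliable. A minimal sketch follows; the helper name find_statement is invented here, and it relies on the -r and -H options that GNU and BSD grep provide.

```shell
# Print "file:line:text" for every line containing the given statement.
# -F treats the pattern as a fixed string, so parentheses and dots need no escaping.
# $1 = statement to look for, $2 = file or directory to search
find_statement() {
  grep -rnHF -- "$1" "$2"
}

# Example call:
#   find_statement 'final ProtocolStatus status = output.getStatus();' src/java/org/apache/nutch/fetcher
```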

    Remote Debugging in Eclipse

  • Create a new Debug Configuration of type "Remote Java Application" and note the port (here: 37649).

  • Launch Nutch from the command line, but add options to load the Java Debugger JDWP agent library, e.g. from bash:

      % $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/

  • The application will be suspended just after launch.
  • Now go to Eclipse, set appropriate breakpoints, and run the previously created Debug Configuration.
  • Instead of creating an extra launch configuration for every tool you want to debug, a single configuration is enough to debug any tool (parsechecker, indexchecker, URL filter, etc.), even remotely (crawler/tool running on a server, Eclipse debugger running locally).
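
The JDWP options mentioned in the steps above could look like the following sketch. It assumes that bin/nutch passes the NUTCH_OPTS environment variable on to the JVM; the helper name jdwp_opts is made up for this example, and the port matches the one used in the Debug Configuration.

```shell
# Build the JVM option that loads the JDWP agent in server mode.
# suspend=y makes the JVM wait for the debugger before running the tool.
# $1 = port the debugger will connect to
jdwp_opts() {
  printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$1"
}

# Assumption: bin/nutch appends NUTCH_OPTS to the java command line.
NUTCH_OPTS="$(jdwp_opts 37649)"
export NUTCH_OPTS

# Then start the tool as usual, e.g.:
#   "$NUTCH_HOME"/bin/nutch parsechecker http://myurl.com/
```

On JDK 9 and later the agent listens on localhost only by default, so for a debugger on another machine the address would need to be written as *:37649.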

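    Debugging and Timeouts

    Stepping through code takes time, and a configured timeout may expire and abort the task while you are still inspecting it. In the nutch-site.xml used for debugging, set the relevant timeout to a high value, or to -1 to disable it, e.g. for the parser: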

    <property>
      <name>parser.timeout</name>
      <value>-1</value>
    </property>

    Display Javadoc for Dependent Libraries

    Connect a Library to the Javadoc URL

    IvyDE

    The repository hosting a library often also provides packages containing Javadoc and sources, e.g. the JUnit repository:

    junit-4.11-javadoc.jar                             14-Nov-2012 19:21              379344
    junit-4.11-sources.jar                             14-Nov-2012 19:21              151329
    junit-4.11.jar                                     14-Nov-2012 19:21              245039
    junit-4.11.pom                                     14-Nov-2012 19:21                2344

    Troubleshooting

    eclipse: Cannot create project content in workspace

    Plugin directory not found

    <property>
      <name>plugin.folders</name>
      <value>/home/....../trunk/src/plugin</value>
    </property>
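
A quick way to check whether the configured plugin directory actually exists is sketched below. It uses naive text matching rather than a real XML parser, assumes the value element directly follows the name element, and only makes sense for a single absolute path; both function names are invented for this example.

```shell
# Print the value of plugin.folders from a Nutch configuration file.
# Naive text matching: assumes <value> directly follows <name>.
plugin_folders_of() {
  tr '\n' ' ' < "$1" |
    sed -n 's:.*<name>plugin\.folders</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p'
}

# Report whether the configured plugin directory exists.
# $1 = path to nutch-site.xml
check_plugin_folders() {
  dir=$(plugin_folders_of "$1")
  if [ -d "$dir" ]; then
    echo "ok: $dir"
  else
    echo "missing: $dir"
    return 1
  fi
}

# Example call:
#   check_plugin_folders conf/nutch-site.xml
```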

    No plugins loaded during unit tests in Eclipse

    Debugging Hadoop classes

    • Check out the Hadoop version that is used by Nutch trunk.
    • Configure a Hadoop project in your Eclipse IDE, in the same way as the Nutch project.

    • Add the Hadoop project as a dependent project of the Nutch project.
    • You can now also set breakpoints within Hadoop classes, such as InputFormat implementations.

    Non-ported Plugins to 2.x
