To build Heritrix in Eclipse
This guide uses Heritrix 1.14.4 (the latest version as of May 10, 2010).
1. First, download the following two files from
http://sourceforge.net/projects/archive-crawler/
heritrix-1.14.4.zip
heritrix-1.14.4-src.zip
2. In Eclipse, create a Java project. Then, in the workspace, extract
heritrix-1.14.4.zip and heritrix-1.14.4-src.zip, each into its own folder.
3. From the extracted heritrix-1.14.4-src.zip, copy the three folders com,
org, and st under src/java into the project's src folder.
4. From the extracted heritrix-1.14.4-src.zip, copy the conf folder under
src to the project root directory.
5. From the extracted heritrix-1.14.4-src.zip, copy the lib folder to the
project root directory.
6. From the extracted heritrix-1.14.4-src.zip, copy the file
src/resources/org/archive/util/tlds-alpha-by-domain.txt into the project's
org.archive.util package.
7. From the extracted heritrix-1.14.4.zip, copy the webapps folder to the
project root directory.
If the folder is not named webapps, make the corresponding change to the
following method in Heritrix.java:
    /**
     * @throws IOException
     * @return Returns the directory under which reside the WAR files
     * we're to load into the servlet container.
     */
    public static File getWarsdir()
    throws IOException {
        return getSubDir("webapps");
    }
8. Edit the configuration file: find heritrix.properties in the conf folder
and set the following entries.
# Set the admin user name and password
heritrix.cmdline.admin = admin:admin
# Set the port
heritrix.cmdline.port = 8080
9. Add all the jar files in the lib folder to the project's build path.
10. Right-click org.archive.crawler.Heritrix.java in the project and open
its run configuration. On the Classpath tab, select User Entries, click
Advanced, select Add Folders, and add the conf folder.
Then click Run to start Heritrix. The console shows output like this:
05:22:32.875 EVENT Starting Jetty/4.2.23
05:22:32.937 WARN!! Delete existing temp dir C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\Jetty_127_0_0_1_8080__ for WebApplicationContext[/,jar:file:/D:/workspace/jcjcd/heritrixDemo/webapps/admin.war!/]
05:22:33.062 EVENT Started WebApplicationContext[/,Heritrix Console]
05:22:33.156 EVENT Started SocketListener on 127.0.0.1:8080
05:22:33.156 EVENT Started [email protected]
Heritrix version: @[email protected]
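If the launch fails instead of printing output like the above, one thing worth checking is the project layout from steps 3 to 7. The following is a minimal sketch of such a check, not part of Heritrix: the class name HeritrixLayoutCheck is hypothetical, and the paths are simply the ones named in this article (admin.war is taken from the log line above), relative to the Eclipse project root.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Heritrix: checks the layout from steps 3 to 7. */
class HeritrixLayoutCheck {

    // Paths relative to the Eclipse project root, as described in this article.
    static final String[] EXPECTED = {
        "src/org/archive/crawler/Heritrix.java",          // step 3
        "conf/heritrix.properties",                       // step 4
        "lib",                                            // step 5
        "src/org/archive/util/tlds-alpha-by-domain.txt",  // step 6
        "webapps/admin.war"                               // step 7
    };

    /** Returns the expected paths that do not exist under projectRoot. */
    static List<String> missing(Path projectRoot) {
        List<String> result = new ArrayList<>();
        for (String relative : EXPECTED) {
            if (!Files.exists(projectRoot.resolve(relative))) {
                result.add(relative);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pass the project root as the first argument, or run from inside it.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missing(root);
        if (missing.isEmpty()) {
            System.out.println("Layout looks complete under " + root.toAbsolutePath());
        } else {
            for (String path : missing) {
                System.out.println("Missing: " + path);
            }
        }
    }
}
```

Run it with the project root as the argument. Any path it reports as missing points back to the step that copies that file or folder.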
This completes the Heritrix configuration in Eclipse.
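The user name, password, and port set in step 8 are what the web console uses, so it can help to confirm that heritrix.properties parses as intended. Below is a small sketch using the standard java.util.Properties class. The class name HeritrixPropsCheck is hypothetical, and the user:password reading of heritrix.cmdline.admin is an assumption based on the admin:admin value in step 8.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper, not part of Heritrix: reads the two settings from step 8. */
class HeritrixPropsCheck {

    /** Returns {user, password, port} parsed from the text of heritrix.properties. */
    static String[] adminAndPort(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));

        // Assumption: the value has the form user:password, as in admin:admin.
        String admin = props.getProperty("heritrix.cmdline.admin", "").trim();
        int colon = admin.indexOf(':');
        if (colon < 1 || colon == admin.length() - 1) {
            throw new IllegalArgumentException(
                "heritrix.cmdline.admin should look like user:password");
        }

        String port = props.getProperty("heritrix.cmdline.port", "").trim();
        Integer.parseInt(port); // throws NumberFormatException if the port is not a number

        return new String[] {
            admin.substring(0, colon), admin.substring(colon + 1), port
        };
    }

    public static void main(String[] args) throws IOException {
        // Same two entries as in step 8; in practice, read conf/heritrix.properties.
        String sample = "heritrix.cmdline.admin = admin:admin\n"
                      + "heritrix.cmdline.port = 8080\n";
        String[] parsed = adminAndPort(sample);
        System.out.println("Log in at http://127.0.0.1:" + parsed[2] + " as " + parsed[0]);
    }
}
```

The sample text is inlined so the sketch runs on its own. To check a real project, the contents of conf/heritrix.properties could be read into a string and passed to adminAndPort.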
Now we can create a job to test it.
1. Open http://127.0.0.1:8080 in your browser and log in with the user name
and password set in the configuration file.
2. Next, create a job: select Jobs in the navigation menu, then under
Create New Job select With defaults.
3. Fill in the name, the description, and the URL to be crawled.
4. Select Modules. Here we save the crawl results as a mirror of the site
(the default is to write them in compressed form). Under Select Writers,
remove org.archive.crawler.writer.ARCWriterProcessor and add
org.archive.crawler.writer.MirrorWriterProcessor in its place.
5. Select Settings. Many items can be set on this page, such as the maximum
number of threads, timeouts, and so on.
Two items must be set, both under http-headers:
user-agent: Mozilla/5.0 (compatible; heritrix/@VERSION@ +PROJECT_URL_HERE)
from: CONTACT_EMAIL_ADDRESS_HERE
Here I simply replaced @VERSION@ with the Heritrix version, changed
PROJECT_URL_HERE to the local IP prefixed with http://, and replaced
CONTACT_EMAIL_ADDRESS_HERE with an arbitrary email address.
When the above configuration is complete, select Submit job.
6. Go to the Console and click Start to begin the crawl job.
After the crawl completes, the results can be found in the jobs folder
under the project.
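The placeholder substitution from step 5 can be sketched in code as well. This is only an illustration of how the two header values are filled in, not Heritrix's own validation: the class name HttpHeaderTemplate, the sample URL http://127.0.0.1, and the address crawler@example.com are all placeholders chosen for this example, and the email check is deliberately crude.

```java
/** Hypothetical helper, not part of Heritrix: fills the placeholders from step 5. */
class HttpHeaderTemplate {

    // Default user-agent value shown on the settings page in step 5.
    static final String USER_AGENT_TEMPLATE =
        "Mozilla/5.0 (compatible; heritrix/@VERSION@ +PROJECT_URL_HERE)";

    /** Replaces @VERSION@ and PROJECT_URL_HERE in the template. */
    static String userAgent(String version, String projectUrl) {
        if (!projectUrl.startsWith("http://") && !projectUrl.startsWith("https://")) {
            throw new IllegalArgumentException(
                "project URL should start with http:// or https://");
        }
        return USER_AGENT_TEMPLATE
            .replace("@VERSION@", version)
            .replace("PROJECT_URL_HERE", projectUrl);
    }

    /** Crude check (an assumption, not Heritrix's rule): text, '@', then a dotted domain. */
    static boolean looksLikeEmail(String from) {
        int at = from.indexOf('@');
        int dot = from.lastIndexOf('.');
        return at > 0 && dot > at + 1 && dot < from.length() - 1
            && from.indexOf(' ') < 0;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        String from = "crawler@example.com";
        System.out.println("user-agent: " + userAgent("1.14.4", "http://127.0.0.1"));
        System.out.println("from: " + from + " (looks valid: " + looksLikeEmail(from) + ")");
    }
}
```

The point of the sketch is that both placeholders in user-agent must be replaced and that from must hold something shaped like an email address. Leaving either default in place is the most likely reason the settings in step 5 are not accepted.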
Source: http://www.codeweblog.com/to-build-heritrix-in-eclipse/