http://www.micmiu.com/opensource/nutch/nutch2x-crawl-first-website/?utm_source=tuicool&utm_medium=referral

The walkthrough below is based on a self-compiled and self-configured build of Nutch 2.2.1.

After the build, the bin directory contains two scripts: nutch and crawl. Running each of them on the command line prints its usage:

 
 
[The usage output of the nutch script did not survive extraction. The remaining fragments mention COMMAND, generate, fetch, parsing, hostDB, indexer, solr and CLASSNAME.]

 

 
 
 
 
 
[The usage output of the crawl script did not survive extraction.]

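In Nutch 2.x, the commands involved in the crawl workflow have been consolidated into the crawl command. A single crawl invocation runs the whole workflow, so you no longer have to run inject, generate, fetch, parse and so on one after another, as older versions required. Beginners can still run the individual commands in sequence and watch how the data changes after each step. The rest of this post walks through that process in detail, using my own blog as the crawl target.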
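As a rough sketch of the step-by-step sequence described above, the script below only prints the five commands in order instead of running them. The `-crawlId` flag, the argument order and the `bin/nutch` path are assumptions based on typical Nutch 2.x usage, so compare them with the usage text printed by your own build. The crawl id `micmiublog` is inferred from the table name 'micmiublog_webpage' that appears in the HBase output in this post.

```shell
#!/usr/bin/env bash
# Dry run: print the step-by-step crawl commands without executing them.

NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"   # assumed location of the nutch script
CRAWL_ID="${CRAWL_ID:-micmiublog}"    # assumed crawl id (table micmiublog_webpage)
SEED_DIR="${SEED_DIR:-urls}"          # directory that holds seed.txt

# The batch id only exists after generate has run, so it is passed in as an
# argument; the default is a literal placeholder, not a real id.
crawl_steps() {
  local batch_id="${1:-<batchId>}"
  echo "$NUTCH_BIN inject $SEED_DIR -crawlId $CRAWL_ID"
  echo "$NUTCH_BIN generate -crawlId $CRAWL_ID"
  echo "$NUTCH_BIN fetch $batch_id -crawlId $CRAWL_ID"
  echo "$NUTCH_BIN parse $batch_id -crawlId $CRAWL_ID"
  echo "$NUTCH_BIN updatedb -crawlId $CRAWL_ID"
}

crawl_steps
```

Printing the commands first makes it easy to review them, then paste them one at a time and check HBase between steps.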

[Preparation]: Create the URLs to crawl

  • First, start HBase (this post uses standalone mode)
  • mkdir -p urls
  • cd urls
  • touch seed.txt
  • echo 'http://micmiu.com' > seed.txt
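The preparation steps can be condensed into a short script. This sketch also checks that every seed line starts with http:// or https://, which guards against typographic quotes pasted from a web page ending up in the file. That check is an addition for illustration, not part of the original steps.

```shell
#!/usr/bin/env bash
# Write the seed list and verify that each line looks like an http(s) URL.

SEED_DIR="urls"                  # same directory name as in the steps above
SEED_FILE="$SEED_DIR/seed.txt"

mkdir -p "$SEED_DIR"
printf '%s\n' 'http://micmiu.com' > "$SEED_FILE"

# grep -v selects lines that do NOT match, so a zero exit status here
# means at least one malformed seed line was found.
if grep -Evq '^https?://' "$SEED_FILE"; then
  echo "malformed seed line in $SEED_FILE" >&2
else
  echo "seed file ok: $(wc -l < "$SEED_FILE" | tr -d ' ') URL(s)"
fi
```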

After each step below, you can inspect how the data in HBase has changed.

[Step 1]: inject

 
 
[The inject command and its log output did not survive extraction. The remaining fragments include the names micmiublog and urls.]

Inspect the data in HBase:

 
 
[The HBase scan output for the table 'micmiublog_webpage' after inject did not survive extraction.]
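One way to look at the table after each step is to pipe a scan statement into the HBase shell. The sketch below assumes the `hbase` launcher is on PATH and that the table is named after the crawl id plus `_webpage`, as 'micmiublog_webpage' suggests. It prints the statement in any case and only runs it when `hbase` is available.

```shell
#!/usr/bin/env bash
# Build (and optionally run) the HBase shell statement for a crawl's table.

CRAWL_ID="micmiublog"   # assumed crawl id, taken from the table name

# Print the scan statement for the given crawl id.
scan_statement() {
  printf "scan '%s_webpage'\n" "$1"
}

scan_statement "$CRAWL_ID"

# Run the scan only when the hbase launcher can be found.
if command -v hbase >/dev/null 2>&1; then
  scan_statement "$CRAWL_ID" | hbase shell
fi
```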

[Step 2]: generate

 
 
[The generate command and its log output did not survive extraction. The remaining fragments include micmiublog and the number 1374349927, which also appears in the batch id.]

Inspect the data in HBase:

 
 
[The HBase scan output for the table 'micmiublog_webpage' after generate did not survive extraction. The remaining fragments include the number 1374349927.]

[Step 3]: fetch

P.S. The batch id that GeneratorJob reports in the log of the previous step is used as the batchId argument of the command below.

It can also be looked up in HBase:

 
 
[The HBase query and its output did not survive extraction.]
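Since the batch id is stored in the f:bid column, it can be pulled out of scan output with a small filter. The sketch below is illustrative: the function name `extract_batch_id` is made up for this post, and the sample row is copied from the scan output shown after the updatedb step.

```shell
#!/usr/bin/env bash
# Extract the f:bid value from HBase shell scan output read on stdin.

extract_batch_id() {
  # Keep only rows for column f:bid and print what follows "value=".
  sed -n 's/.*column=f:bid,.*value=\([^[:space:]]*\).*/\1/p' | head -n 1
}

# Sample row taken from the scan output in this post.
sample=' com.micmiu:http/      column=f:bid, timestamp=1421027232815, value=1421027229-1374349927'

batch_id="$(printf '%s\n' "$sample" | extract_batch_id)"
echo "$batch_id"   # prints 1421027229-1374349927
```

In practice the function would read the real scan output from a pipe instead of the sample string.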

Now run the fetch command:

 
 
[The fetch command and its log output did not survive extraction. The remaining fragments mention 1374349927, byHost, a queue for micmiu.com with crawl delay=5000ms, and a final "done".]

Inspect the data in HBase:

 
 
[The HBase scan output for the table 'micmiublog_webpage' after fetch did not survive extraction. The remaining fragments show entries for micmiu.com and www.micmiu.com with timestamp=1421027385487 and pieces of the HTTP response headers (html, gzip, LiteSpeed, Cookie).]

[Step 4]: parse

 
 
[The parse command and its log output did not survive extraction. The remaining fragments include micmiublog, 1374349927, the message "//micmiu.com/ skipped. Content of size 20 was truncated to 0", and a final "success".]

Inspect the data in HBase:

 
 
[The HBase scan output for the table 'micmiublog_webpage' after parse did not survive extraction. The remaining fragments match those left from the scan after fetch.]

[Step 5]: updatedb

 
 
[The updatedb command and its log output did not survive extraction. The remaining fragments include micmiublog, "starting" and "done".]

Inspect the data in HBase:

 
 
[Most of the HBase scan output for the table 'micmiublog_webpage' after updatedb did not survive extraction. The rows below are the ones that can still be read; the value of f:cnt and all later rows are lost.]

com.micmiu.www:http/  column=f:st, timestamp=1421027452042, value=\x00\x00\x00\x01
com.micmiu.www:http/  column=f:ts, timestamp=1421027452042, value=\x00\x00\x01J\xDB\xD6$f
com.micmiu.www:http/  column=mk:dist, timestamp=1421027452042, value=1
com.micmiu.www:http/  column=mtdt:_csh_, timestamp=1421027452042, value=?\x80\x00\x00
com.micmiu.www:http/  column=s:s, timestamp=1421027452042, value=?\x80\x00\x00
com.micmiu:http/      column=f:bas, timestamp=1421027385487, value=http://micmiu.com/
com.micmiu:http/      column=f:bid, timestamp=1421027232815, value=1421027229-1374349927
com.micmiu:http/      column=f:cnt, timestamp=1421027385487, value=
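The row keys in the scan output are not plain URLs: http://micmiu.com/ is stored under com.micmiu:http/, with the host name reversed and the scheme moved behind a colon. The function below is a simplified illustration of that mapping written for this post. It is not Nutch's own code, and it ignores explicit ports and user info.

```shell
#!/usr/bin/env bash
# Simplified sketch: turn a URL into a reversed-host row key.

reverse_url_key() {
  local url="$1"
  local scheme="${url%%://*}"   # e.g. http
  local rest="${url#*://}"      # e.g. www.micmiu.com/xmlrpc.php
  local host="${rest%%/*}"      # e.g. www.micmiu.com
  local path="/"
  case "$rest" in
    */*) path="/${rest#*/}" ;;  # keep everything after the host
  esac

  # Split the host on dots and rebuild it in reverse order.
  local reversed="" part
  local IFS='.'
  for part in $host; do
    reversed="${part}${reversed:+.$reversed}"
  done

  printf '%s:%s%s\n' "$reversed" "$scheme" "$path"
}

reverse_url_key 'http://micmiu.com/'       # prints com.micmiu:http/
reverse_url_key 'http://www.micmiu.com/'   # prints com.micmiu.www:http/
```

Because keys that share a reversed-host prefix sort next to each other, pages from the same domain end up adjacent in the table.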

—————–  EOF @Michael Sun —————–
