A quick play with Crawlzilla and Hadoop, ZooKeeper, Pig

Crawlzilla is a simple, easy-to-pick-up search engine package, mainly because the installation is almost fully automatic.
The project site explains it clearly: https://code.google.com/p/crawlzilla/
Quick-install guide (PDF): http://crawlzilla.googlecode.com/svn-history/r334/trunk/docs/crawlzilla_Usage_zhtw.pdf

--------

So I started playing with Crawlzilla.
But during installation I found that the fully automatic install did not succeed for me.
The cause was not hard to find...
my Java directory differs from the one assumed by the Crawlzilla install scripts.

Crawlzilla hard-codes the path in conf/nutch_conf/hadoop-env.sh, around line 9:

#export JAVA_HOME=/usr/lib/jvm/java-6-sun    <== the original line
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45    <== changed to my Java JDK path

hadoop-env.sh also defines several fixed paths:

export NUTCH_HOME=/opt/crawlzilla/nutch
export HADOOP_HOME=/opt/crawlzilla/nutch
export NUTCH_CONF_DIR=/opt/crawlzilla/nutch/conf
export HADOOP_CONF_DIR=/opt/crawlzilla/nutch/conf
export NUTCH_LOG_DIR=/var/log/crawlzilla/hadoop-logs
export HADOOP_LOG_DIR=/var/log/crawlzilla/hadoop-logs
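
If you'd rather not edit the file by hand, a sed one-liner can patch it. This is just a sketch, assuming your JDK really lives at /usr/lib/jvm/jdk1.6.0_45 and that you run it from the directory containing conf/nutch_conf/hadoop-env.sh:

sudo sed -i 's|^#\?export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45|' conf/nutch_conf/hadoop-env.sh   # rewrite the hard-coded JAVA_HOME line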

Once these are changed, the install can proceed...
Oh, and first make sure your host can reach the internet~

The installation transcript follows:

jamie@jamie-pc:~/share/hadoop/Crawlzilla_Install$ sudo ./install    <== remember sudo, so it has enough privileges to do its work
sudo: unable to resolve host jamie-pc

 Identify is root 
 Your system information is: 
 Ubuntu , 12.04 

It will install some packages (expect, ssh, and dialog).

Ign http://tw.archive.ubuntu.com precise InRelease                                        
Ign http://tw.archive.ubuntu.com precise-updates InRelease                                
Ign http://tw.archive.ubuntu.com precise-backports InRelease                              
Hit http://tw.archive.ubuntu.com precise Release.gpg                                      
Hit http://tw.archive.ubuntu.com precise-updates Release.gpg                              
Hit http://tw.archive.ubuntu.com precise-backports Release.gpg                            
Hit http://tw.archive.ubuntu.com precise Release                     
Ign http://ppa.launchpad.net precise InRelease                                            
Ign http://ppa.launchpad.net precise InRelease                                            
Hit http://tw.archive.ubuntu.com precise-updates Release                                  
Hit http://tw.archive.ubuntu.com precise-backports Release                                
Ign http://security.ubuntu.com precise-security InRelease                                 
Hit http://tw.archive.ubuntu.com precise/main Sources                                     
Ign http://extras.ubuntu.com precise InRelease                                            
Hit http://tw.archive.ubuntu.com precise/restricted Sources                               
Hit http://tw.archive.ubuntu.com precise/universe Sources                                 
Hit http://tw.archive.ubuntu.com precise/multiverse Sources                               
Hit http://tw.archive.ubuntu.com precise/main amd64 Packages                              
Hit http://tw.archive.ubuntu.com precise/restricted amd64 Packages                        
Hit http://tw.archive.ubuntu.com precise/universe amd64 Packages                          
Hit http://tw.archive.ubuntu.com precise/multiverse amd64 Packages                        
Hit http://tw.archive.ubuntu.com precise/main i386 Packages                               
Hit http://tw.archive.ubuntu.com precise/restricted i386 Packages                         
Hit http://tw.archive.ubuntu.com precise/universe i386 Packages                           
Hit http://tw.archive.ubuntu.com precise/multiverse i386 Packages                         
Hit http://tw.archive.ubuntu.com precise/main TranslationIndex                            
Hit http://tw.archive.ubuntu.com precise/multiverse TranslationIndex                      
Hit http://tw.archive.ubuntu.com precise/restricted TranslationIndex                      
Hit http://tw.archive.ubuntu.com precise/universe TranslationIndex                        
Hit http://security.ubuntu.com precise-security Release.gpg                               
Hit http://tw.archive.ubuntu.com precise-updates/main Sources                             
Hit http://ppa.launchpad.net precise Release.gpg                     
Hit http://tw.archive.ubuntu.com precise-updates/restricted Sources  
Hit http://tw.archive.ubuntu.com precise-updates/universe Sources                         
Hit http://tw.archive.ubuntu.com precise-updates/multiverse Sources                       
Hit http://tw.archive.ubuntu.com precise-updates/main amd64 Packages                      
Hit http://tw.archive.ubuntu.com precise-updates/restricted amd64 Packages                
Hit http://tw.archive.ubuntu.com precise-updates/universe amd64 Packages                  
Hit http://tw.archive.ubuntu.com precise-updates/multiverse amd64 Packages                
Hit http://tw.archive.ubuntu.com precise-updates/main i386 Packages                       
Hit http://tw.archive.ubuntu.com precise-updates/restricted i386 Packages                 
Hit http://tw.archive.ubuntu.com precise-updates/universe i386 Packages                   
Hit http://tw.archive.ubuntu.com precise-updates/multiverse i386 Packages                 
Hit http://tw.archive.ubuntu.com precise-updates/main TranslationIndex                    
Hit http://tw.archive.ubuntu.com precise-updates/multiverse TranslationIndex              
Hit http://tw.archive.ubuntu.com precise-updates/restricted TranslationIndex              
Hit http://tw.archive.ubuntu.com precise-updates/universe TranslationIndex                
Hit http://extras.ubuntu.com precise Release.gpg                                          
Hit http://tw.archive.ubuntu.com precise-backports/main Sources                           
Hit http://tw.archive.ubuntu.com precise-backports/restricted Sources                     
Hit http://tw.archive.ubuntu.com precise-backports/universe Sources                       
Hit http://tw.archive.ubuntu.com precise-backports/multiverse Sources                     
Hit http://tw.archive.ubuntu.com precise-backports/main amd64 Packages                    
Hit http://tw.archive.ubuntu.com precise-backports/restricted amd64 Packages              
Hit http://tw.archive.ubuntu.com precise-backports/universe amd64 Packages                
Hit http://tw.archive.ubuntu.com precise-backports/multiverse amd64 Packages              
Hit http://tw.archive.ubuntu.com precise-backports/main i386 Packages                     
Hit http://tw.archive.ubuntu.com precise-backports/restricted i386 Packages               
Hit http://tw.archive.ubuntu.com precise-backports/universe i386 Packages                 
Hit http://tw.archive.ubuntu.com precise-backports/multiverse i386 Packages               
Hit http://tw.archive.ubuntu.com precise-backports/main TranslationIndex                  
Hit http://tw.archive.ubuntu.com precise-backports/multiverse TranslationIndex            
Hit http://tw.archive.ubuntu.com precise-backports/restricted TranslationIndex            
Hit http://security.ubuntu.com precise-security Release                                   
Hit http://tw.archive.ubuntu.com precise-backports/universe TranslationIndex              
Hit http://tw.archive.ubuntu.com precise/main Translation-en                              
Hit http://tw.archive.ubuntu.com precise/multiverse Translation-en                        
Hit http://tw.archive.ubuntu.com precise/restricted Translation-en                        
Hit http://tw.archive.ubuntu.com precise/universe Translation-en     
Hit http://tw.archive.ubuntu.com precise-updates/main Translation-en                      
Hit http://tw.archive.ubuntu.com precise-updates/multiverse Translation-en                
Hit http://tw.archive.ubuntu.com precise-updates/restricted Translation-en                
Hit http://tw.archive.ubuntu.com precise-updates/universe Translation-en                  
Hit http://tw.archive.ubuntu.com precise-backports/main Translation-en                    
Hit http://ppa.launchpad.net precise Release.gpg                                          
Hit http://tw.archive.ubuntu.com precise-backports/multiverse Translation-en              
Hit http://tw.archive.ubuntu.com precise-backports/restricted Translation-en              
Hit http://tw.archive.ubuntu.com precise-backports/universe Translation-en                
Hit http://extras.ubuntu.com precise Release                                              
Hit http://security.ubuntu.com precise-security/main Sources          
Hit http://ppa.launchpad.net precise Release   
Hit http://security.ubuntu.com precise-security/restricted Sources    
Hit http://security.ubuntu.com precise-security/universe Sources     
Hit http://security.ubuntu.com precise-security/multiverse Sources   
Hit http://security.ubuntu.com precise-security/main amd64 Packages
Hit http://security.ubuntu.com precise-security/restricted amd64 Packages
Hit http://security.ubuntu.com precise-security/universe amd64 Packages
Hit http://security.ubuntu.com precise-security/multiverse amd64 Packages
Hit http://security.ubuntu.com precise-security/main i386 Packages   
Hit http://security.ubuntu.com precise-security/restricted i386 Packages
Hit http://security.ubuntu.com precise-security/universe i386 Packages
Hit http://extras.ubuntu.com precise/main Sources                    
Hit http://ppa.launchpad.net precise Release   
Hit http://security.ubuntu.com precise-security/multiverse i386 Packages
Hit http://extras.ubuntu.com precise/main amd64 Packages             
Hit http://extras.ubuntu.com precise/main i386 Packages              
Ign http://extras.ubuntu.com precise/main TranslationIndex           
Hit http://security.ubuntu.com precise-security/main TranslationIndex
Hit http://security.ubuntu.com precise-security/multiverse TranslationIndex
Hit http://security.ubuntu.com precise-security/restricted TranslationIndex
Hit http://security.ubuntu.com precise-security/universe TranslationIndex
Hit http://ppa.launchpad.net precise/main Sources                    
Hit http://ppa.launchpad.net precise/main amd64 Packages             
Hit http://ppa.launchpad.net precise/main i386 Packages              
Hit http://ppa.launchpad.net precise/main TranslationIndex           
Hit http://security.ubuntu.com precise-security/main Translation-en  
Hit http://security.ubuntu.com precise-security/multiverse Translation-en
Hit http://security.ubuntu.com precise-security/restricted Translation-en
Hit http://ppa.launchpad.net precise/main Sources
Hit http://ppa.launchpad.net precise/main amd64 Packages             
Hit http://ppa.launchpad.net precise/main i386 Packages              
Ign http://ppa.launchpad.net precise/main TranslationIndex           
Hit http://security.ubuntu.com precise-security/universe Translation-en
Hit http://ppa.launchpad.net precise/main Translation-en             
Ign http://extras.ubuntu.com precise/main Translation-en_US
Ign http://extras.ubuntu.com precise/main Translation-en
Ign http://ppa.launchpad.net precise/main Translation-en_US
Ign http://ppa.launchpad.net precise/main Translation-en
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
expect is already the newest version.
dialog is already the newest version.
ssh is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 605 not upgraded.
 check_sunJava                                         <=== starts checking that the required software packages are installed
 Crawlzilla need Sun Java JDK 1.6.x or above version 
 System has Sun Java 1.6 above version. 
 System has ssh. 
 System has ssh Server (sshd). 
 System has dialog. 
 Welcome to use Crawlzilla, this install program will create a new accunt and to assist you to setup the password of crawler. <== it creates a new user, crawler, for you automatically
 Set password for crawler: 
password:    <== enter the password you want to use

 keyin the password again: 
password:    <== as usual, type the password again to confirm there was no typo

 Master IP address is: 10.57.54.168 
 Master MAC address is:  00:24:be:7a:98:18   
 Please confirm the install infomation of above :1.Yes 2.No  <== shows the IP and MAC address info; press 1 (Yes) only if it looks right
1
spawn passwd crawler
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully    <== password set successfully
Generating public/private rsa key pair.  <== generating the ssh key
Created directory '/home/crawler/.ssh'.
Your identification has been saved in /home/crawler/.ssh/id_rsa.
Your public key has been saved in /home/crawler/.ssh/id_rsa.pub.
The key fingerprint is:
4a:04:5d:34:22:87:0a:e0:f2:1a:25:b3:1c:ab:4a:f9 crawler@jamie-pc
The key's randomart image is:
+--[ RSA 2048]----+
|o   oooo+        |
|o   .+.. .       |
|++..  .          |
|o*+  .           |
|oo.   . S        |
|.o.  . .         |
|oo    .          |
|o .              |
|.  E             |
+-----------------+
Could not open a connection to your authentication agent.  <== seems harmless
 unpack success! 
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
 Make the client installation package  
 Formatting HDFS...            <== starts building HDFS
15/01/15 14:16:34 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = java.net.UnknownHostException: jamie-pc: jamie-pc
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.19.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977; compiled by 'ndaley' on Fri Feb 20 00:16:34 UTC 2009
************************************************************/
Re-format filesystem in /var/lib/crawlzilla/nutch-crawler/dfs/name ? (Y or N) y   <== re-format? I typed y; note this old prompt is case-sensitive, so a lowercase y actually aborts the format, as the next line shows
Format aborted in /var/lib/crawlzilla/nutch-crawler/dfs/name
15/01/15 14:18:25 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: jamie-pc: jamie-pc
************************************************************/
 start up name node [Namenode] ...  
starting namenode, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-namenode-jamie-pc.out  <== starting the name node
 start up job node [JobTracker] ...  
starting jobtracker, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-jobtracker-jamie-pc.out <== starting the job tracker
starting datanode, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-datanode-jamie-pc.out <== starting the data node
starting tasktracker, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-tasktracker-jamie-pc.out <== starting the task tracker
 Start up tomcat... 
.....
Using CATALINA_BASE:   /opt/crawlzilla/tomcat
Using CATALINA_HOME:   /opt/crawlzilla/tomcat
Using CATALINA_TMPDIR: /opt/crawlzilla/tomcat/temp
Using JRE_HOME:       /usr
 Tomcat has been started! 
 Installed successfully! 
 You can visit the manage website :http://10.57.54.168:8080  <== the host IP address plus port 8080
 For client install, please refer commands as follows:  <== it also tells us how to install a client:
 scp crawler@10.57.54.168:/home/crawler/crawlzilla/source/client_deploy.sh .
 ./client_deploy.sh
 Finish!!!   <== done, that's a wrap
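
Side note: the "sudo: unable to resolve host jamie-pc" and "java.net.UnknownHostException: jamie-pc" messages in the transcript come from the hostname not resolving. The same fix used in the Hadoop notes below applies here, i.e. mapping the hostname to loopback:

echo "127.0.0.1 jamie-pc" | sudo tee -a /etc/hosts   # let sudo and Hadoop resolve the local hostname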

Appendix: HDFS Web Interface
HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations. By default this is exposed on port 50070 on the NameNode. Accessing http://namenode:50070/ with a web browser will return a page containing overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).

The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml. It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.

From this interface, you can browse HDFS itself with a basic file-browser interface. Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.
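
For instance, the same overview is also available from the command line via the dfsadmin command mentioned above:

bin/hadoop dfsadmin -report   # health, capacity, and usage summary, like the :50070 page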

Hadoop single-node installation notes

I referenced the install walkthroughs from a few blogs: Hadoop 2.6.0单节点安装参考, hadoop 2.6.0单节点-伪分布式模式安装, Mac OSX 下 Hadoop 单节点集群配置, and Hadoop快速入门.

That last site even has material from the 2014 competition.

1. My environment:

OS: Ubuntu 12.04.1 LTS
Hadoop: 2.6.0
Java: jdk1.6.0_45

2. Download Hadoop
Go to the Hadoop section of the Apache site, find the latest release, and download it; I used version 2.6.0.

3. After unpacking Hadoop, go straight into the hadoop-2.6.0 directory
and edit etc/hadoop/hadoop-env.sh:

# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45

But to make sure you never have to set JAVA_HOME again each time you log into the Linux host (reference),
edit your ~/.bashrc file.
(You could also write this into /etc/profile: the system-wide environment settings live in /etc/profile, which only root can modify. When a user logs in and uses bash, /etc/profile is executed first, and only afterwards the personal files: ~/.bashrc (or ~/.bash_profile, ~/.bash_login, ~/.profile; see man bash for the details). An administrator can therefore use /etc/profile to give every user an initial environment. If a variable is set in both /etc/profile and .bashrc, the last setting read wins, i.e. the personal file overrides the system-wide one!)

vi ~/.bashrc

Append at the very end:

# set JAVA_HOME for hadoop.
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45        <== my JDK location
export PATH=$PATH:/usr/lib/jvm/jdk1.6.0_45/bin   <== also put the JDK's bin directory on PATH

4. Test it; if the help text is printed, you're OK:

bin/hadoop

5. Edit etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
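
You can sanity-check which value Hadoop actually picks up; a quick check, run from the hadoop-2.6.0 directory:

bin/hdfs getconf -confKey fs.defaultFS   # should print hdfs://localhost:9000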

6. Edit etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

7. Configure ssh so that logging into the local machine needs no password:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

8. Test that passwordless login really works:

ssh localhost
exit
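
If ssh still prompts for a password, ~/.ssh permissions are the usual culprit, since sshd ignores an authorized_keys file that is too open:

chmod 700 ~/.ssh                    # the directory must not be group/world writable
chmod 600 ~/.ssh/authorized_keys    # same for the key list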

9. Make a copy of mapred-site.xml:

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml

10. Then edit etc/hadoop/mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

11. Edit etc/hadoop/yarn-site.xml:

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

12. Make sure the local hostname resolves, to avoid java.lang.RuntimeException: java.net.UnknownHostException: myhostname: myhostname.
Edit /etc/hosts:

127.0.0.1 myhostname

13. Format the HDFS filesystem:

bin/hdfs namenode -format

14. Start the NameNode and DataNode daemons.
To view the NameNode in a browser: http://localhost:50070/ or http://0.0.0.0:50070/

sbin/start-dfs.sh

15. Start the ResourceManager and NodeManager daemons.
To view the ResourceManager in a browser: http://localhost:8088/

sbin/start-yarn.sh

16. Create the input folder and list it:

$ bin/hdfs dfs -mkdir -p /user/jamie/input
$ bin/hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - jamie supergroup          0 2015-01-19 14:30 /user/jamie/input

Note: the command to delete a folder /xxx is $ bin/hdfs dfs -rm -r /xxx

17. Copy the files to be processed into the HDFS folder (reference):

$ bin/hdfs dfs -put etc/hadoop /user/jamie/input
$ bin/hdfs dfs -ls /user/jamie/input 
Found 1 items
drwxr-xr-x   - jamie supergroup          0 2015-01-19 14:38 /user/jamie/input/hadoop

18. Run the classic wordcount, Hadoop's equivalent of hello world (reference):

# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output

Or run the following command (a map/reduce program that estimates Pi using a quasi-Monte Carlo method):
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar   pi 2 2

19. Run another MapReduce job:

# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output2 'dfs[a-z.]+'

20. Run jps to look at the processes.
(If the jps command is not found, add its directory to PATH, e.g. PATH=$PATH:/usr/lib/jvm/jdk1.6.0_45/bin/)
If everything above went correctly, jps should show five Java processes: ResourceManager, NodeManager, NameNode, SecondaryNameNode, and DataNode, roughly as follows:

$ jps      
6539 NameNode
9741 NodeManager
7053 SecondaryNameNode
5652 Launcher
7778 Jps
9071 DataNode
9509 ResourceManager
Note: each line above is a JVM process id followed by the process name.

21. View the output files on the distributed filesystem:

$ bin/hadoop fs -cat output/*

22. Once everything is done, stop the daemons:

$ sbin/stop-all.sh

Installing and configuring ZooKeeper

Apache ZooKeeper is an open-source server, actively developed and maintained, that enables highly reliable distributed coordination.

As a subproject of Hadoop, ZooKeeper is an indispensable module for Hadoop cluster management. It is mainly used to coordinate data across the cluster: it manages the NameNode of a Hadoop cluster, and handles master election and inter-server state synchronization in HBase, among other things. Beyond these basic functions, the most important thing ZooKeeper provides is a solid mechanism for managing distributed clusters: a hierarchical, directory-tree data structure whose nodes it manages effectively, on top of which you can design many kinds of distributed data-management models rather than being limited to the common scenarios above. (Source: 分佈式服務框架Zookeeper — 管理分佈式環境中的數據)

1. Download the source
As usual, head to the official site and download Apache ZooKeeper!

2. Unpack, set up the conf file, and start the server:

tar zxvf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
bin/zkServer.sh start

The resulting zoo.cfg, for reference:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1

3. Verification method 1: after startup, run jps; you should see the QuorumPeerMain Java process:

$ jps
32296 Jps
6539 NameNode
9741 NodeManager
32061 QuorumPeerMain
7053 SecondaryNameNode
9071 DataNode
9509 ResourceManager

4. Verification method 2: a quick check of whether ZooKeeper is running (reference).
Log into the ZooKeeper host and run the following command (first make sure the client port is 2181).
Check whether you get an imok reply; if you don't, ZooKeeper is not running.

echo ruok | nc 127.0.0.1 2181

To get more information about ZooKeeper, use the stat four-letter command:

echo stat | nc 127.0.0.1 2181
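
Other four-letter words work the same way; for example, conf dumps the serving configuration (available since ZooKeeper 3.3):

echo conf | nc 127.0.0.1 2181   # prints clientPort, dataDir, tickTime, ...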

5. Test with the client-side program:

$ ./zkCli.sh -server 127.0.0.1:2181
Connecting to 127.0.0.1:2181
2015-01-26 11:42:52,353 [myid:] - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-01-26 11:42:52,357 [myid:] - INFO  [main:Environment@100] - Client environment:host.name=jamie-pc
2015-01-26 11:42:52,357 [myid:] - INFO  [main:Environment@100] - Client environment:java.version=1.6.0_45
2015-01-26 11:42:52,359 [myid:] - INFO  [main:Environment@100] - Client environment:java.vendor=Sun Microsystems Inc.
2015-01-26 11:42:52,359 [myid:] - INFO  [main:Environment@100] - Client environment:java.home=/usr/lib/jvm/jdk1.6.0_45/jre
2015-01-26 11:42:52,359 [myid:] - INFO  [main:Environment@100] - Client environment:java.class.path=/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../build/classes:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../build/lib/*.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../lib/slf4j-log4j12-1.6.1.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../lib/slf4j-api-1.6.1.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../lib/netty-3.7.0.Final.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../lib/log4j-1.2.16.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../lib/jline-0.9.94.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../zookeeper-3.4.6.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../src/java/lib/*.jar:/home/jamie/share/hadoop/zookeeper-3.4.6/bin/../conf:
2015-01-26 11:42:52,359 [myid:] - INFO  [main:Environment@100] - Client environment:java.library.path=/usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64/server:/usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64:/usr/lib/jvm/jdk1.6.0_45/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2015-01-26 11:42:52,359 [myid:] - INFO  [main:Environment@100] - Client environment:java.io.tmpdir=/tmp
2015-01-26 11:42:52,360 [myid:] - INFO  [main:Environment@100] - Client environment:java.compiler=
2015-01-26 11:42:52,360 [myid:] - INFO  [main:Environment@100] - Client environment:os.name=Linux
2015-01-26 11:42:52,360 [myid:] - INFO  [main:Environment@100] - Client environment:os.arch=amd64
2015-01-26 11:42:52,360 [myid:] - INFO  [main:Environment@100] - Client environment:os.version=3.2.0-29-generic
2015-01-26 11:42:52,360 [myid:] - INFO  [main:Environment@100] - Client environment:user.name=jamie
2015-01-26 11:42:52,360 [myid:] - INFO  [main:Environment@100] - Client environment:user.home=/home/jamie
2015-01-26 11:42:52,360 [myid:] - INFO  [main:Environment@100] - Client environment:user.dir=/home/jamie/share/hadoop/zookeeper-3.4.6/bin
2015-01-26 11:42:52,362 [myid:] - INFO  [main:ZooKeeper@438] - Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@6526804e
Welcome to ZooKeeper!
2015-01-26 11:42:52,390 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@975] - Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (java.lang.SecurityException: Unable to locate a login configuration)
JLine support is enabled
2015-01-26 11:42:52,395 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@852] - Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
[zk: 127.0.0.1:2181(CONNECTING) 0] 2015-01-26 11:42:52,509 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server 127.0.0.1/127.0.0.1:2181, sessionid = 0x14b2455a60c0000, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: 127.0.0.1:2181(CONNECTED) 0]
[zk: 127.0.0.1:2181(CONNECTED) 0]  create /test01 abcd  -- create a test node
Created /test01
[zk: 127.0.0.1:2181(CONNECTED) 2] ls /
[test01, zookeeper]   -----> data on the master node; this means it worked.
[zk: 127.0.0.1:2181(CONNECTED) 3] delete /test01 -- delete it
[zk: 127.0.0.1:2181(CONNECTED) 4] ls /
[zookeeper]  -- and it's gone

6. Stop the service:

bin/zkServer.sh stop

Installing HBase (standalone install, able to interact with ZooKeeper)

HBase is Hadoop's database (install reference).
Also check whether your Hadoop and HBase versions support each other; see this page, which also covers which JDK versions each HBase release supports. This is important~~

1. Download

Find a download mirror at http://www.apache.org/dyn/closer.cgi/hbase/

2. Unpack and enter the directory:

tar zxvf hbase-0.98.9-hadoop1-bin.tar.gz
cd hbase-0.98.9-hadoop1

3. Edit the XML configuration
You could start HBase as-is at this point, but you will probably want to edit conf/hbase-site.xml first to set hbase.rootdir, which chooses the directory HBase writes its data to.
For a standalone setup, hbase-site.xml only needs the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
</configuration>

Replace DIRECTORY with the directory you want the files written to. By default hbase.rootdir points at /tmp/hbase-${user.name}, which means you will lose your data on reboot (most operating systems clean /tmp on restart).

I changed mine as shown below: (reference)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<configuration>
 <property>
 <name>hbase.rootdir</name>
 <!--<value>hdfs://localhost:8020/hbase</value>-->
 <value>/tmp/hbase</value>  <=== following dataDir=/tmp/zookeeper in ZooKeeper's zoo.cfg, the hbase directory goes under /tmp too! (it is created automatically when you run HBase)
 <description> The directory shared by RegionServers. </description>
 </property>

 <property>
 <name>hbase.zookeeper.property.dataDir</name>
 <value>/tmp/zookeeper</value>  <=== must match dataDir=/tmp/zookeeper in ZooKeeper's zoo.cfg (the values have to agree for the two to interoperate)
 <description>
 Property from ZooKeeper config zoo.cfg.
 The directory where the snapshot is stored.
 </description>
 </property>

 <property>
 <name>hbase.zookeeper.property.clientPort</name>
 <value>2182</value>    <=== compare clientPort=2181 in ZooKeeper's zoo.cfg (note the mismatch: to talk to that standalone ZooKeeper, this value would have to match its port, 2181)
 <description>Property from ZooKeeper's config zoo.cfg.
 The port at which the clients will connect.
 </description>
 </property>

</configuration>

4. Edit .bashrc to set the HBase PATH environment variable:
(The same note from the Hadoop section above about /etc/profile versus ~/.bashrc applies here.)

# set JAVA_HOME for hadoop.    ==> already set earlier in this post; just a small tweak
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

# set Hbase PATH Environmental variable  ==> the new part!!
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin

After editing, re-run your bashrc:

$ source ~/.bashrc

5. Start HBase

hbase-0.98.9-hadoop1$ bin/start-hbase.sh
starting master, logging to /home/jamie/share/hadoop/hbase-0.98.9-hadoop1/bin/../logs/hbase-jamie-master-jamie-pc.out

You are now running HBase in standalone mode. All of the services, including HBase and ZooKeeper, run in a single JVM. HBase puts its logs in the logs directory; check them when startup goes wrong.

PS: Is Java installed?
You need Oracle's Java 1.6. If typing java on the command line gets a response, Java is installed. If not, install it first, then edit conf/hbase-env.sh and point its JAVA_HOME at your Java install directory. (Already set in the .bashrc file here.)

6. Enter the HBase shell:

$ ./bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.98.9-hadoop1, r96878ece501b0643e879254645d7f3a40eaf101f, Mon Dec 15 22:36:48 PST 2014

hbase(main):001:0>       <==== waiting for your commands

A few small commands to try: (reference)

hbase(main):006:0> create 'test', 'cf'   <== create a table named test
0 row(s) in 0.5290 seconds

=> Hbase::Table - test
hbase(main):007:0> list 'table'  <== list tables matching 'table' (none; ours is named test)
TABLE
0 row(s) in 0.0110 seconds

=> []
hbase(main):008:0> list  <== plain list
TABLE
test
1 row(s) in 0.0120 seconds

=> ["test"]
hbase(main):009:0> list 'table'  <== same filtered list again
TABLE
0 row(s) in 0.0050 seconds

=> []
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'value1' <== put value1 into table test, row row1, column cf:a
0 row(s) in 0.1250 seconds

hbase(main):011:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0070 seconds

hbase(main):012:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0100 seconds

hbase(main):013:0> scan 'test'  <== scan the table
ROW                                COLUMN+CELL
 row1                              column=cf:a, timestamp=1422256698208, value=value1
 row2                              column=cf:b, timestamp=1422256703257, value=value2
 row3                              column=cf:c, timestamp=1422256709483, value=value3
3 row(s) in 0.0440 seconds

hbase(main):014:0> get 'test', 'row1' <== get one row
COLUMN                             CELL
 cf:a                              timestamp=1422256698208, value=value1
1 row(s) in 0.0220 seconds

hbase(main):015:0> disable 'test'  <== disable it first
0 row(s) in 1.5080 seconds

hbase(main):016:0> drop 'test'  <== then drop it
0 row(s) in 0.2270 seconds

hbase(main):070:0> exit  <== leave the shell (Ctrl+C also works)
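
Incidentally, the shell also reads commands from stdin, which is handy for scripting. A small sketch, run from the HBase directory:

echo "list" | bin/hbase shell   # run a shell command non-interactively and exit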

Installing Pig

1. Download it; find the Apache Pig page.

2. Unpack:

$ tar zxvf pig-0.14.0.tar.gz

3. Edit .bashrc once more:

export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PIG_HOME=/home/jamie/share/hadoop/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PIG_HOME/bin:$PIG_HOME/conf

4. After restarting your shell session, try the pig -help command.
If the help text appears, you're OK:

pig -help

Pig is said to have two modes: Local and MapReduce.

5-1. Try local mode:

$ pig -x local
2015-01-29 09:34:18,676 INFO  [main] pig.ExecTypeProvider: Trying ExecType : LOCAL
2015-01-29 09:34:18,676 INFO  [main] pig.ExecTypeProvider: Picked LOCAL as the ExecType
2015-01-29 09:34:18,724 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:01:24
2015-01-29 09:34:18,724 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/jamie/share/hadoop/pig-0.14.0/pig_1422495258723.log
2015-01-29 09:34:18,760 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/jamie/.pigbootup not found
2015-01-29 09:34:18,853 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
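
To make local mode actually do something, you can pipe it a tiny script. A minimal sketch, using /etc/passwd as sample input (any local text file will do):

pig -x local <<'EOF'
lines = LOAD '/etc/passwd' USING PigStorage(':');  -- split each line on ':'
firstfive = LIMIT lines 5;                         -- keep only five tuples
DUMP firstfive;                                    -- print them to the console
EOF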

5-2. Try MapReduce mode
This requires the HADOOP_HOME environment variable!
Edit .bashrc once again:

export HADOOP_HOME=/home/jamie/share/hadoop/hadoop-2.6.0
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PIG_HOME=/home/jamie/share/hadoop/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PIG_HOME/bin:$PIG_HOME/conf

Now MapReduce mode starts:

$ pig -x mapreduce
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-01-29 11:22:26,925 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:02:05
2015-01-29 11:22:26,925 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/jamie/pig_1422501746924.log
2015-01-29 11:22:26,941 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/jamie/.pigbootup not found
2015-01-29 11:22:27,546 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-01-29 11:22:27,546 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-01-29 11:22:27,546 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jamie/share/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jamie/share/hadoop/hbase-0.98.9-hadoop1/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2015-01-29 11:22:28,442 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-01-29 11:22:28,443 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: lsn-linux:9001
2015-01-29 11:22:28,443 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

Downloading and compiling the Hadoop source code

1. Get the source
The Hadoop source code is managed on GitHub; "Download ZIP" there downloads it directly.

2. Install the build tools (on Ubuntu); see How to Contribute to Hadoop Common:

apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev

3. Install JDK 7
Installation of Oracle Java JDK 7 (which includes the JRE, the Java browser plugin, and JavaFX) on Ubuntu:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-jdk7-installer

4. Install/upgrade protoc to version 2.5.0:

wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz  ==> download
tar zxvf protobuf-2.5.0.tar.gz  ==> unpack
cd protobuf-2.5.0                ==> enter the source directory
sudo ./configure --prefix=/usr   ==> configure the install
(If configure fails with "cpp: error trying to exec 'cc1plus': execvp: No such file or directory", install g++: sudo apt-get install g++)
sudo make         ==> build
sudo make check   ==> run the checks
sudo make install ==> install

(info:Libraries have been installed in: /usr/local/lib)
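
Before building, verify that the right protoc is first on your PATH:

protoc --version   # should print: libprotoc 2.5.0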

5. Build it (reference).
Change directory to the top-level directory of the extracted source, where you will find pom.xml, which is the build script in Maven's case:

# mvn package -Pdist -Pdoc -Psrc -Dtar -DskipTests
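
If the build succeeds, the binary distribution should land under hadoop-dist/target (path per Hadoop's BUILDING.txt conventions, so treat it as an assumption):

ls hadoop-dist/target/hadoop-*.tar.gz   # the freshly built release tarball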

2 thoughts on “A quick play with Crawlzilla and Hadoop, ZooKeeper, Pig”

  1. 陳政翰

    Sorry to bother you; I also happen to be installing Crawlzilla lately and ran into some problems I would like to ask you about.
    (1) Did you install Nutch and Tomcat before installing Crawlzilla?
    The Tomcat service never started during my installation, so I am wondering whether Tomcat and Nutch need to be installed separately first.

    (2) The installation never reached the part where the unix password is entered, let alone the Hadoop part after it, so I suspect my Hadoop path differs from the one in conf/nutch_conf/hadoop-env.sh. May I ask whether you changed the paths in conf/nutch_conf/hadoop-env.sh?

    Thank you
