"Big data" is itself a very broad term, and the Hadoop ecosystem (or the wider ecosystem around it) exists largely to process data at a scale a single machine cannot handle. Think of it as the tools needed in a kitchen: pots, bowls, knives and peelers each have their own purpose, with plenty of overlap between them. You can eat straight out of a stock pot, and you can peel with either a paring knife or a peeler; every tool has its own character, and odd combinations may work but are rarely the best choice. With big data, the first problem is simply storing it. Traditional file systems live on a single machine and cannot span many machines.
HDFS (Hadoop Distributed File System) is designed so that huge amounts of data can be spread across hundreds or thousands of machines, while what you see is still one file system rather than many. When you ask for the data at /hdfs/tmp/file1, you are referring to a single file path, but the actual data is stored on many different machines. As a user you do not need to know that, just as on a single machine you do not care which track or sector a file sits on.
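As a minimal illustration (the path is the same example path as above and assumes such a file exists, along with a configured Hadoop client), you address one logical path and HDFS fetches the blocks from whichever machines hold them:
# Read a file by its logical HDFS path; the client never needs to know which DataNodes store the blocks.
hadoop fs -cat /hdfs/tmp/file1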
HDFS manages all of that for you. Once you can store the data, the next question is how to process it. HDFS can manage the data across many machines as a whole, but the data is huge: reading terabytes or petabytes (very large indeed, say the combined size of an entire studio's back catalogue of HD video, or more) on a single machine could take days or even weeks.
For many companies that is unacceptable. If Weibo wants to update its 24-hour trending posts, it has to finish the processing within 24 hours. So if I want to use many machines, I now face problems such as how to divide the work, how to restart the corresponding task when a machine dies, and how machines communicate and exchange data to complete a complex computation. That is exactly what MapReduce / Tez / Spark are for.
MapReduce is the first-generation compute engine; Tez and Spark are second-generation. MapReduce adopts a radically simplified computation model with only two stages, Map and Reduce (connected by a Shuffle in between), yet this model already covers a large share of problems in the big-data space. So what are Map and Reduce? Suppose a huge text file is stored on something like HDFS and you want to know how often each word appears in it. You launch a MapReduce job.
In the Map phase, hundreds of machines read different parts of the file at the same time and each counts word frequencies for its own part, producing pairs such as (hello, 12100), (world, 15214) and so on (I am lumping Map and Combine together here to keep things simple). Each of these hundreds of machines produces such a set, and then hundreds of machines start the Reduce phase.
Reducer machine A receives from the Mapper machines all the counts for words starting with A, and machine B receives those starting with B (in practice the split is not really by first letter; a hash function is used so the load is spread evenly and data skew is avoided, since words starting with X are far rarer than others and you do not want some machines doing vastly more work than the rest). These Reducers then aggregate again: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292).
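A rough single-machine analogy of the Map / Shuffle / Reduce steps using ordinary shell tools, just to make the flow concrete (words.txt is a hypothetical input file, not something from this post):
# "Map": split each line of the input into one word per line.
tr -s '[:space:]' '\n' < words.txt > mapped.txt
# "Shuffle": bring identical words together, like routing each word to one reducer.
sort mapped.txt > shuffled.txt
# "Reduce": count how many times each word appears, then show the most frequent ones.
uniq -c shuffled.txt | sort -rn | head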
HBase:
A highly reliable, high-performance, column-oriented, scalable distributed storage system; with HBase you can build large structured-data clusters on commodity PC servers. Facebook, for example, uses it for large real-time applications: Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day.
Pig:
Developed at Yahoo, an engine for executing data-flow processing in parallel. It includes a scripting language called Pig Latin for describing those data flows. Pig Latin provides many of the traditional data operations and also lets users write their own functions for reading, processing and writing data. It is used heavily at LinkedIn as well.
Hive:
A data-warehouse tool led by Facebook. It maps structured data files onto database tables and offers full SQL-style querying, translating SQL statements into MapReduce jobs to run. Its big advantage is the low learning curve: simple MapReduce statistics can be expressed quickly with SQL-like statements, so data scientists can query the data directly without learning another programming interface (a rough sketch follows below).
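As a hedged sketch of what that looks like (the table and column names here are invented; nothing in this post creates them), a query like the one below is compiled by Hive into MapReduce jobs:
# Hypothetical query against a made-up "logs" table with a "word" column.
hive -e "SELECT word, COUNT(*) AS cnt FROM logs GROUP BY word ORDER BY cnt DESC LIMIT 10;"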
Cascading/Scalding:
Cascading provides higher-level abstractions for building data pipelines on top of MapReduce; Scalding is the Scala DSL that Twitter built on top of Cascading. Coursera uses Scalding as its MapReduce programming interface, running on Amazon EMR.
Zookeeper:
A distributed, open-source coordination service for distributed applications; it is an open-source implementation of Google's Chubby.
Oozie:
An open-source workflow-engine framework, contributed to Apache by Cloudera, that provides scheduling and coordination for Hadoop MapReduce and Pig jobs.
Azkaban:
Similar to the above: an open-source, Hadoop-oriented workflow system from LinkedIn that provides cron-like job management.
Tez:
An optimized execution engine championed by Hortonworks; compared with plain MapReduce, Tez delivers noticeably better performance.
Crawlzilla Installation Notes
Crawlzilla is a simple search engine and very easy to get started with, since the installation is almost fully automatic.
The project site explains it clearly: https://code.google.com/p/crawlzilla/
Quick installation guide (PDF): http://crawlzilla.googlecode.com/svn-history/r334/trunk/docs/crawlzilla_Usage_zhtw.pdf
——–
So I started playing with Crawlzilla.
But during installation I found that the fully automatic install did not succeed.
The reason was not hard to find...
My Java directory is not the one the Crawlzilla install scripts expect.
Crawlzilla hard-codes it in conf/nutch_conf/hadoop-env.sh, at around line 9:
#export JAVA_HOME=/usr/lib/jvm/java-6-sun    <== the original line
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45    <== changed to my JDK path
hadoop-env.sh also hard-codes a number of other paths:
export NUTCH_HOME=/opt/crawlzilla/nutch
export HADOOP_HOME=/opt/crawlzilla/nutch
export NUTCH_CONF_DIR=/opt/crawlzilla/nutch/conf
export HADOOP_CONF_DIR=/opt/crawlzilla/nutch/conf
export NUTCH_LOG_DIR=/var/log/crawlzilla/hadoop-logs
export HADOOP_LOG_DIR=/var/log/crawlzilla/hadoop-logs
Once that is changed, the install can proceed.
Oh, and make sure the machine can reach the internet first.
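For example, a quick connectivity check before running the installer (any reachable host will do; www.apache.org is just an arbitrary choice):
# Confirms the machine can resolve names and reach the internet.
ping -c 3 www.apache.org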
The installation transcript follows:
jamie@jamie-pc:~/share/hadoop/Crawlzilla_Install$ sudo ./install    <== remember sudo, so the installer has enough privileges
sudo: unable to resolve host jamie-pc
Identify is root
Your system information is:
Ubuntu , 12.04
It will install some packages (expect, ssh, and dialog).
Ign http://tw.archive.ubuntu.com precise InRelease
Hit http://tw.archive.ubuntu.com precise Release.gpg
(... apt-get update output for the Ubuntu, extras, security and PPA repositories trimmed ...)
Reading package lists... Done
Building dependency tree
Reading state information... Done
expect is already the newest version.
dialog is already the newest version.
ssh is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 605 not upgraded.
check_sunJava                                            <== starts checking for the required software
Crawlzilla need Sun Java JDK 1.6.x or above version
System has Sun Java 1.6 above version.
System has ssh.
System has ssh Server (sshd).
System has dialog.
Welcome to use Crawlzilla, this install program will create a new accunt and to assist you to setup the password of crawler.    <== it creates a new user named crawler for you
Set password for crawler:
password:                                                <== enter the password you want here
keyin the password again:
password:                                                <== as usual, type it again to confirm
Master IP address is: 10.57.54.168
Master MAC address is: 00:24:be:7a:98:18
Please confirm the install infomation of above :1.Yes 2.No    <== shows the IP and MAC address; press 1 (Yes) if it looks right
1
spawn passwd crawler
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully                    <== password set successfully
Generating public/private rsa key pair.                  <== starts generating the ssh key
Created directory '/home/crawler/.ssh'.
Your identification has been saved in /home/crawler/.ssh/id_rsa.
Your public key has been saved in /home/crawler/.ssh/id_rsa.pub.
The key fingerprint is:
4a:04:5d:34:22:87:0a:e0:f2:1a:25:b3:1c:ab:4a:f9 crawler@jamie-pc
(RSA key randomart image omitted)
Could not open a connection to your authentication agent.    <== seems harmless
unpack success!
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Make the client installation package
Formatting HDFS...                                       <== starts setting up HDFS
15/01/15 14:16:34 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = java.net.UnknownHostException: jamie-pc: jamie-pc
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.19.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977; compiled by 'ndaley' on Fri Feb 20 00:16:34 UTC 2009
************************************************************/
Re-format filesystem in /var/lib/crawlzilla/nutch-crawler/dfs/name ? (Y or N) y    <== re-format? I chose yes
Format aborted in /var/lib/crawlzilla/nutch-crawler/dfs/name
15/01/15 14:18:25 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: jamie-pc: jamie-pc
************************************************************/
start up name node [Namenode] ...
starting namenode, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-namenode-jamie-pc.out       <== starts the name node
start up job node [JobTracker] ...
starting jobtracker, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-jobtracker-jamie-pc.out   <== starts the job tracker
starting datanode, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-datanode-jamie-pc.out       <== starts the data node
starting tasktracker, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-tasktracker-jamie-pc.out <== starts the task tracker
Start up tomcat...
.....
Using CATALINA_BASE:   /opt/crawlzilla/tomcat
Using CATALINA_HOME:   /opt/crawlzilla/tomcat
Using CATALINA_TMPDIR: /opt/crawlzilla/tomcat/temp
Using JRE_HOME:        /usr
Tomcat has been started!
Installed successfully!
You can visit the manage website :http://10.57.54.168:8080    <== the host IP address plus port 8080
For client install, please refer commands as follows:         <== it also tells us how to install a client
scp crawler@10.57.54.168:/home/crawler/crawlzilla/source/client_deploy.sh .
./client_deploy.sh
Finish!!!                                                      <== and we are done
Appendix: HDFS Web Interface
HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations. By default this is exposed on port 50070 on the NameNode. Accessing http://namenode:50070/ with a web browser will return a page containing overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).
The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml. It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.
From this interface, you can browse HDFS itself with a basic file-browser interface. Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.
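For example, a quick sanity check from a shell (namenode is the same placeholder hostname used above; adjust host and port to your cluster):
# The status page and dfsadmin should report roughly the same capacity and usage numbers.
curl -s http://namenode:50070/ | head -n 20
bin/hadoop dfsadmin -report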
Hadoop Single-Node Installation Notes
I followed the installation steps from a few blog posts: Hadoop 2.6.0单节点安装参考, hadoop 2.6.0单节点-伪分布式模式安装, Mac OSX 下 Hadoop 单节点集群配置, and Hadoop快速入门.
That last site also has material from the 2014 competition.
1. My environment:
OS: Ubuntu 12.04.1 LTS
Hadoop: 2.6.0
Java: jdk1.6.0_45
2. Download Hadoop
Go to the hadoop section of the Apache site here and download the latest release; I used version 2.6.0.
3. After installing (unpacking) Hadoop, go straight into the hadoop-2.6.0 directory
Edit etc/hadoop/hadoop-env.sh
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45
To make sure JAVA_HOME does not have to be set again every time you log in to the Linux host (reference),
run the editor and modify your ~/.bashrc file.
(This could also go into /etc/profile. /etc/profile is the system-wide environment file and only root can modify it; when a user logs in with BASH, /etc/profile is executed first, followed by the personal files: ~/.bashrc (and ~/.bash_profile, ~/.bash_login, ~/.profile; see man bash for details). An administrator can therefore use /etc/profile to give every user an initial environment. If the same variable is set in both /etc/profile and .bashrc, the later one wins, because it overrides the earlier setting.)
vi ~/.bash_profile
Append the following at the very end:
# set JAVA_HOME for hadoop.
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45        <== my JDK location
export PATH=$PATH:/usr/lib/jvm/jdk1.6.0_45/bin   <== also add the JDK bin directory to PATH
4. Quick test: if the help text is printed, you are fine.
bin/hadoop
5. Edit etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
6. Edit etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
7. Configure ssh so that logging in to the local machine requires no password
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
8. Check that passwordless login really works
ssh localhost
exit
9. Make a copy of mapred-site.xml
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
10. Then edit etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
11. Edit etc/hadoop/yarn-site.xml:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
12. Make sure the local hostname resolves, to avoid java.lang.RuntimeException: java.net.UnknownHostException: myhostname: myhostname
Edit /etc/hosts
127.0.0.1 myhostname
13. Format the HDFS filesystem
bin/hdfs namenode -format
14. Start the NameNode and DataNode daemons
To view the NameNode in a browser: http://localhost:50070/ or http://0.0.0.0:50070/
sbin/start-dfs.sh
15. Start the ResourceManager and NodeManager daemons
To view the ResourceManager in a browser: http://localhost:8088/
sbin/start-yarn.sh
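As a quick check that the two web UIs mentioned in steps 14 and 15 are actually up (assuming the default ports), something like this will do:
# Both requests should print 200 once the daemons have finished starting.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/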
16. Create an input directory and list it
$ bin/hdfs dfs -mkdir -p /user/jamie/input
$ bin/hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - jamie supergroup          0 2015-01-19 14:30 /user/jamie/input
Note: to delete a directory /xxx, use:
$ bin/hdfs dfs -rm -r /xxx
17. Copy the files to be processed into the HDFS directory (reference)
$ bin/hdfs dfs -put etc/hadoop /user/jamie/input
$ bin/hdfs dfs -ls /user/jamie/input
Found 1 items
drwxr-xr-x   - jamie supergroup          0 2015-01-19 14:38 /user/jamie/input/hadoop
18. Run the classic wordcount, which is more or less the hello world of Hadoop (reference)
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
Or run the following instead
(A map/reduce program that estimates Pi using a quasi-Monte Carlo method.)
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 2
19. Run another MapReduce job
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output2 'dfs[a-z.]+'
20. Run jps to look at the processes
(If the jps command is not found, add its directory to PATH, e.g. PATH=$PATH:/usr/lib/jvm/jdk1.6.0_45/bin/)
If everything above was done correctly, jps should show the five Java processes ResourceManager, NodeManager, NameNode, SecondaryNameNode and DataNode, roughly as follows:
$ jps
6539 NameNode
9741 NodeManager
7053 SecondaryNameNode
5652 Launcher
7778 Jps
9071 DataNode
9509 ResourceManager
Note: each line above is the process id followed by the process name.
21. View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
22. When all of this is done, stop the daemons:
$ sbin/stop-all.sh
Installing and Configuring ZooKeeper
Apache ZooKeeper is an open-source server project whose goal is to enable highly reliable distributed coordination.
ZooKeeper, a sub-project of Hadoop, is an indispensable part of managing a Hadoop cluster. It is mainly used to manage cluster data: for example the NameNode in a Hadoop cluster, master election in HBase, and state synchronization between servers. The article cited here introduces ZooKeeper basics and a few typical usage scenarios. These are only its basic capabilities; most importantly, ZooKeeper provides a solid mechanism for managing distributed clusters, namely its hierarchical, directory-tree-like data structure and the effective management of the nodes in that tree, from which many kinds of distributed data-management models can be built, well beyond the few common scenarios mentioned above. (Source: 分佈式服務框架Zookeeper — 管理分佈式環境中的數據)
1. Download the code
As usual, go to the official site and download Apache ZooKeeper.
2. Unpack it, set up the conf file, and start it
tar zxvf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
bin/zkServer.sh start
The zoo.cfg file, for reference:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
3. Check #1: after starting, run jps and you should see a QuorumPeerMain Java process
$ jps
32296 Jps
6539 NameNode
9741 NodeManager
32061 QuorumPeerMain
7053 SecondaryNameNode
9071 DataNode
9509 ResourceManager
4. Check #2: a quick way to confirm whether ZooKeeper is running (reference)
Log in to the ZooKeeper host and run the command below (make sure the client port is 2181 first).
Check whether you get an imok reply; if you do not, ZooKeeper is not running.
echo ruok | nc 127.0.0.1 2181
To get more information about ZooKeeper:
echo status | nc 127.0.0.1 2181
5. Test it with the client program:
$ ./zkCli.sh -server 127.0.0.1:2181
Connecting to 127.0.0.1:2181
2015-01-26 11:42:52,353 [myid:] - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-01-26 11:42:52,357 [myid:] - INFO  [main:Environment@100] - Client environment:host.name=jamie-pc
2015-01-26 11:42:52,357 [myid:] - INFO  [main:Environment@100] - Client environment:java.version=1.6.0_45
(... remaining Client environment lines trimmed ...)
2015-01-26 11:42:52,362 [myid:] - INFO  [main:ZooKeeper@438] - Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@6526804e
Welcome to ZooKeeper!
2015-01-26 11:42:52,390 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@975] - Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (java.lang.SecurityException: Unable to locate a login configuration)
JLine support is enabled
2015-01-26 11:42:52,395 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@852] - Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
[zk: 127.0.0.1:2181(CONNECTING) 0] 2015-01-26 11:42:52,509 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server 127.0.0.1/127.0.0.1:2181, sessionid = 0x14b2455a60c0000, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: 127.0.0.1:2181(CONNECTED) 0]
[zk: 127.0.0.1:2181(CONNECTED) 0] create /test01 abcd    -- create a test node
Created /test01
[zk: 127.0.0.1:2181(CONNECTED) 2] ls /
[test01, zookeeper]                                      -----> the master node's data is there, so it worked.
[zk: 127.0.0.1:2181(CONNECTED) 3] delete /test01         -- delete it
[zk: 127.0.0.1:2181(CONNECTED) 4] ls /
[zookeeper]                                              -- and it is gone
6. Stop the service
bin/zkServer.sh stop
Installing HBase (standalone, and able to talk to ZooKeeper)
HBase is essentially Hadoop's database (installation reference).
Also check which Hadoop versions each HBase release supports; see this page, which also covers which JDK versions HBase supports. This really matters!
1. Download
Find a download mirror at http://www.apache.org/dyn/closer.cgi/hbase/
2. Unpack and enter the directory
tar zxvf hbase-0.98.9-hadoop1-bin.tar.gz
cd hbase-0.98.9-hadoop1
3. Edit the XML configuration
You could start HBase at this point, but you will probably want to edit conf/hbase-site.xml first and set hbase.rootdir to choose the directory HBase writes its data into.
For a standalone setup, hbase-site.xml only needs the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
</configuration>
Replace DIRECTORY with the directory you want the data written to. By default hbase.rootdir points to /tmp/hbase-${user.name}, which means the data is lost after a reboot (the operating system cleans /tmp when it restarts).
I changed mine to /home/jamie/share/hadoop/tmp/hbase: (source)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!--<value>hdfs://localhost:8020/hbase</value>-->
    <value>/tmp/hbase</value>    <=== zookeeper's zoo.cfg has dataDir=/tmp/zookeeper, so I put the hbase directory under /tmp as well (it is created automatically once you run HBase)
    <description>
      The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/tmp/zookeeper</value>    <=== taken from dataDir=/tmp/zookeeper in zookeeper's zoo.cfg; it must match for the two to talk to each other
    <description>
      Property from ZooKeeper config zoo.cfg.
      The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2182</value>    <=== compare with clientPort=2181 in zookeeper's zoo.cfg
    <description>Property from ZooKeeper's config zoo.cfg.
      The port at which the clients will connect.
    </description>
  </property>
</configuration>
4. Edit .bashrc to set the HBase PATH environment variables:
(As explained in the Hadoop section above, this could also go into /etc/profile; if the same variable is set in both places, ~/.bashrc wins because it is executed later.)
# set JAVA_HOME for hadoop.  ==> already set earlier in this post, just slightly revised
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

# set Hbase PATH Environmental variable  ==> the new part
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin
After the change, reload bashrc:
$ source ~/.bashrc
5. Start HBase
hbase-0.98.9-hadoop1$ bin/start-hbase.sh
starting master, logging to /home/jamie/share/hadoop/hbase-0.98.9-hadoop1/bin/../logs/hbase-jamie-master-jamie-pc.out
You are now running HBase in standalone mode: all of its services, including HBase itself and ZooKeeper, run inside a single JVM. HBase writes its logs to the logs directory; check them if startup goes wrong.
PS: is Java installed?
You need Oracle Java 1.6 or later. If typing java on the command line gets a response, Java is installed. If not, install it first and then edit conf/hbase-env.sh to point JAVA_HOME at your Java installation directory. (Here it is already set in .bashrc.)
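For reference, if you did need to set it in conf/hbase-env.sh instead of .bashrc, the line would look roughly like this (using the same JDK path as earlier in this post):
# In conf/hbase-env.sh: point HBase at the JDK explicitly.
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45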
6. Enter the HBase shell:
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.9-hadoop1, r96878ece501b0643e879254645d7f3a40eaf101f, Mon Dec 15 22:36:48 PST 2014

hbase(main):001:0>    <==== waiting for your command
Try a few small commands: (source)
hbase(main):006:0> create 'test', 'cf'        <== create a table named test
0 row(s) in 0.5290 seconds

=> Hbase::Table - test
hbase(main):007:0> list 'table'               <== list tables matching 'table'
TABLE
0 row(s) in 0.0110 seconds

=> []
hbase(main):008:0> list                       <== plain list
TABLE
test
1 row(s) in 0.0120 seconds

=> ["test"]
hbase(main):009:0> list 'table'               <== list tables matching 'table' again
TABLE
0 row(s) in 0.0050 seconds

=> []
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'value1'    <== put value1 into table test, row row1, column cf:a
0 row(s) in 0.1250 seconds

hbase(main):011:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0070 seconds

hbase(main):012:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0100 seconds

hbase(main):013:0> scan 'test'                <== scan the table
ROW                   COLUMN+CELL
 row1                 column=cf:a, timestamp=1422256698208, value=value1
 row2                 column=cf:b, timestamp=1422256703257, value=value2
 row3                 column=cf:c, timestamp=1422256709483, value=value3
3 row(s) in 0.0440 seconds

hbase(main):014:0> get 'test', 'row1'         <== read a value back
COLUMN                CELL
 cf:a                 timestamp=1422256698208, value=value1
1 row(s) in 0.0220 seconds

hbase(main):015:0> disable 'test'             <== disable it first
0 row(s) in 1.5080 seconds

hbase(main):016:0> drop 'test'                <== then drop it
0 row(s) in 0.2270 seconds

hbase(main):070:0> exit                       <== leave the shell (Ctrl+C also works)
Installing Pig
1. Download it: find the Apache Pig page
2. Unpack it
$ tar zxvf pig-0.14.0.tar.gz
3. Edit .bashrc once more
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PIG_HOME=/home/jamie/share/hadoop/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PIG_HOME/bin:$PIG_HOME/conf
4. Open a new shell and try the pig -help command.
If the help text appears, you are fine.
pig -help
Pig has two execution modes: Local and MapReduce.
5-1. Try Local mode
$ pig -x local
2015-01-29 09:34:18,676 INFO  [main] pig.ExecTypeProvider: Trying ExecType : LOCAL
2015-01-29 09:34:18,676 INFO  [main] pig.ExecTypeProvider: Picked LOCAL as the ExecType
2015-01-29 09:34:18,724 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:01:24
2015-01-29 09:34:18,724 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/jamie/share/hadoop/pig-0.14.0/pig_1422495258723.log
2015-01-29 09:34:18,760 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/jamie/.pigbootup not found
2015-01-29 09:34:18,853 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
5-2. Try MapReduce mode
This requires the HADOOP_HOME environment variable to be set!
Edit .bashrc once again:
export HADOOP_HOME=/home/jamie/share/hadoop/hadoop-2.6.0
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PIG_HOME=/home/jamie/share/hadoop/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PIG_HOME/bin:$PIG_HOME/conf
Now MapReduce mode can be started:
$ pig -x mapreduce
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-01-29 11:22:26,925 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:02:05
2015-01-29 11:22:26,925 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/jamie/pig_1422501746924.log
2015-01-29 11:22:26,941 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/jamie/.pigbootup not found
2015-01-29 11:22:27,546 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-01-29 11:22:27,546 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-01-29 11:22:27,546 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jamie/share/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jamie/share/hadoop/hbase-0.98.9-hadoop1/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2015-01-29 11:22:28,442 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-01-29 11:22:28,443 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: lsn-linux:9001
2015-01-29 11:22:28,443 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
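Neither mode above is shown actually running a script, so here is a minimal Pig Latin word-count sketch (the input file and relation names are made up; create a small input.txt first) that can be tried in local mode:
# Write a tiny Pig Latin script and run it with the local execution engine.
cat > wordcount.pig <<'EOF'
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
EOF
pig -x local wordcount.pig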
Downloading and Compiling the Hadoop Source Code
1. Get the source
The Hadoop source code is managed on GitHub here; "Download ZIP" downloads the code directly.
2. Install the build tools (on Ubuntu): see How to Contribute to Hadoop Common
apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
3. Install JDK 7
Installation of Oracle Java JDK 7 (which includes the JRE, the Java browser plugin and JavaFX) on Ubuntu:
#sudo add-apt-repository ppa:webupd8team/java
#sudo apt-get update
#sudo apt-get install oracle-jdk7-installer
4. Upgrade protoc to version 2.5.0
wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz    ==> download
tar zxvf protobuf-2.5.0.tar.gz                                      ==> unpack
sudo ./configure --prefix=/usr                                      ==> configure
(if this step fails with: cpp: error trying to exec 'cc1plus': execvp: No such file or directory, install g++ first: sudo apt-get install g++)
sudo make                                                           ==> build
sudo make check                                                     ==> run the checks
sudo make install                                                   ==> install (info: Libraries have been installed in: /usr/local/lib)
5. Build it (reference)
Change to the top-level directory of the extracted source, where you will find pom.xml, the Maven build script.
# mvn package -Pdist -Pdoc -Psrc -Dtar -DskipTests
Sorry to bother you; I happen to be installing Crawlzilla as well and ran into some problems I would like to ask you about.
(1) Did you install nutch and tomcat before installing Crawlzilla?
The tomcat service was not started during my installation, so I am wondering whether tomcat and nutch have to be installed separately first.
(2) The Crawlzilla installation never asked me to enter a unix password, and it never reached the hadoop part afterwards, so I suspect the hadoop path differs from the one in conf/nutch_conf/hadoop-env.sh. Did you change the path in conf/nutch_conf/hadoop-env.sh?
Thank you.
1) I installed Crawlzilla directly; remember to make sure the network connection is up first! I did not install nutch or tomcat beforehand.
2) I did not change the paths.