"Big data" is itself a very broad term, and the Hadoop ecosystem (or the wider ecosystem around it) exists largely to process data at a scale a single machine cannot handle. Think of it as the tools needed in a kitchen: pots, bowls, knives and peelers each have their own purpose, with plenty of overlap between them. You can eat straight out of a stock pot, and you can peel with either a paring knife or a peeler; every tool has its own character, and odd combinations may work but are rarely the best choice. With big data, the first problem is simply storing it. Traditional file systems live on a single machine and cannot span many machines.
HDFS (Hadoop Distributed File System) is designed so that huge amounts of data can be spread across hundreds or thousands of machines, while what you see is still one file system rather than many. When you ask for the data at /hdfs/tmp/file1, you are referring to a single file path, but the actual data is stored on many different machines. As a user you do not need to know that, just as on a single machine you do not care which track or sector a file sits on.
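As a minimal illustration (the path is the same example path as above and assumes such a file exists, along with a configured Hadoop client), you address one logical path and HDFS fetches the blocks from whichever machines hold them:
# Read a file by its logical HDFS path; the client never needs to know which DataNodes store the blocks.
hadoop fs -cat /hdfs/tmp/file1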
HDFS manages all of that for you. Once you can store the data, the next question is how to process it. HDFS can manage the data across many machines as a whole, but the data is huge: reading terabytes or petabytes (very large indeed, say the combined size of an entire studio's back catalogue of HD video, or more) on a single machine could take days or even weeks.
For many companies that is unacceptable. If Weibo wants to update its 24-hour trending posts, it has to finish the processing within 24 hours. So if I want to use many machines, I now face problems such as how to divide the work, how to restart the corresponding task when a machine dies, and how machines communicate and exchange data to complete a complex computation. That is exactly what MapReduce / Tez / Spark are for.
MapReduce is the first-generation compute engine; Tez and Spark are second-generation. MapReduce adopts a radically simplified computation model with only two stages, Map and Reduce (connected by a Shuffle in between), yet this model already covers a large share of problems in the big-data space. So what are Map and Reduce? Suppose a huge text file is stored on something like HDFS and you want to know how often each word appears in it. You launch a MapReduce job.
In the Map phase, hundreds of machines read different parts of the file at the same time and each counts word frequencies for its own part, producing pairs such as (hello, 12100), (world, 15214) and so on (I am lumping Map and Combine together here to keep things simple). Each of these hundreds of machines produces such a set, and then hundreds of machines start the Reduce phase.
Reducer machine A receives from the Mapper machines all the counts for words starting with A, and machine B receives those starting with B (in practice the split is not really by first letter; a hash function is used so the load is spread evenly and data skew is avoided, since words starting with X are far rarer than others and you do not want some machines doing vastly more work than the rest). These Reducers then aggregate again: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292).
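A rough single-machine analogy of the Map / Shuffle / Reduce steps using ordinary shell tools, just to make the flow concrete (words.txt is a hypothetical input file, not something from this post):
# "Map": split each line of the input into one word per line.
tr -s '[:space:]' '\n' < words.txt > mapped.txt
# "Shuffle": bring identical words together, like routing each word to one reducer.
sort mapped.txt > shuffled.txt
# "Reduce": count how many times each word appears, then show the most frequent ones.
uniq -c shuffled.txt | sort -rn | head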
HBase:
A highly reliable, high-performance, column-oriented, scalable distributed storage system; with HBase you can build large structured-data clusters on commodity PC servers. Facebook, for example, uses it for large real-time applications: Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day.
Pig:
Developed at Yahoo, an engine for executing data-flow processing in parallel. It includes a scripting language called Pig Latin for describing those data flows. Pig Latin provides many of the traditional data operations and also lets users write their own functions for reading, processing and writing data. It is used heavily at LinkedIn as well.
Hive:
A data-warehouse tool led by Facebook. It maps structured data files onto database tables and offers full SQL-style querying, translating SQL statements into MapReduce jobs to run. Its big advantage is the low learning curve: simple MapReduce statistics can be expressed quickly with SQL-like statements, so data scientists can query the data directly without learning another programming interface (a rough sketch follows below).
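As a hedged sketch of what that looks like (the table and column names here are invented; nothing in this post creates them), a query like the one below is compiled by Hive into MapReduce jobs:
# Hypothetical query against a made-up "logs" table with a "word" column.
hive -e "SELECT word, COUNT(*) AS cnt FROM logs GROUP BY word ORDER BY cnt DESC LIMIT 10;"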
Cascading/Scalding:
Cascading provides higher-level abstractions for building data pipelines on top of MapReduce; Scalding is the Scala DSL that Twitter built on top of Cascading. Coursera uses Scalding as its MapReduce programming interface, running on Amazon EMR.
Zookeeper:
A distributed, open-source coordination service for distributed applications; it is an open-source implementation of Google's Chubby.
Oozie:
An open-source workflow-engine framework, contributed to Apache by Cloudera, that provides scheduling and coordination for Hadoop MapReduce and Pig jobs.
Azkaban:
Similar to the above: an open-source, Hadoop-oriented workflow system from LinkedIn that provides cron-like job management.
Tez:
An optimized execution engine championed by Hortonworks; compared with plain MapReduce, Tez delivers noticeably better performance.
Crawlzilla Installation Notes
Crawlzilla is a simple search engine and very easy to get started with, since the installation is almost fully automatic.
The project site explains it clearly: https://code.google.com/p/crawlzilla/
Quick installation guide (PDF): http://crawlzilla.googlecode.com/svn-history/r334/trunk/docs/crawlzilla_Usage_zhtw.pdf
——–
So I started playing with Crawlzilla.
But during installation I found that the fully automatic install did not succeed.
The reason was not hard to find...
My Java directory is not the one the Crawlzilla install scripts expect.
Crawlzilla hard-codes it in conf/nutch_conf/hadoop-env.sh, at around line 9:
#export JAVA_HOME=/usr/lib/jvm/java-6-sun    <== the original line
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45    <== changed to my JDK path
hadoop-env.sh also hard-codes a number of other paths:
export NUTCH_HOME=/opt/crawlzilla/nutch
export HADOOP_HOME=/opt/crawlzilla/nutch
export NUTCH_CONF_DIR=/opt/crawlzilla/nutch/conf
export HADOOP_CONF_DIR=/opt/crawlzilla/nutch/conf
export NUTCH_LOG_DIR=/var/log/crawlzilla/hadoop-logs
export HADOOP_LOG_DIR=/var/log/crawlzilla/hadoop-logs
Once that is changed, the install can proceed.
Oh, and make sure the machine can reach the internet first.
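For example, a quick connectivity check before running the installer (any reachable host will do; www.apache.org is just an arbitrary choice):
# Confirms the machine can resolve names and reach the internet.
ping -c 3 www.apache.org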
The installation transcript follows:
jamie@jamie-pc:~/share/hadoop/Crawlzilla_Install$ sudo ./install    <== remember sudo, so the installer has enough privileges
sudo: unable to resolve host jamie-pc
Identify is root
Your system information is:
Ubuntu , 12.04
It will install some packages (expect, ssh, and dialog).
Ign http://tw.archive.ubuntu.com precise InRelease
Hit http://tw.archive.ubuntu.com precise Release.gpg
(... apt-get update output for the Ubuntu, extras, security and PPA repositories trimmed ...)
Reading package lists... Done
Building dependency tree
Reading state information... Done
expect is already the newest version.
dialog is already the newest version.
ssh is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 605 not upgraded.
check_sunJava                                            <== starts checking for the required software
Crawlzilla need Sun Java JDK 1.6.x or above version
System has Sun Java 1.6 above version.
System has ssh.
System has ssh Server (sshd).
System has dialog.
Welcome to use Crawlzilla, this install program will create a new accunt and to assist you to setup the password of crawler.    <== it creates a new user named crawler for you
Set password for crawler:
password:                                                <== enter the password you want here
keyin the password again:
password:                                                <== as usual, type it again to confirm
Master IP address is: 10.57.54.168
Master MAC address is: 00:24:be:7a:98:18
Please confirm the install infomation of above :1.Yes 2.No    <== shows the IP and MAC address; press 1 (Yes) if it looks right
1
spawn passwd crawler
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully                    <== password set successfully
Generating public/private rsa key pair.                  <== starts generating the ssh key
Created directory '/home/crawler/.ssh'.
Your identification has been saved in /home/crawler/.ssh/id_rsa.
Your public key has been saved in /home/crawler/.ssh/id_rsa.pub.
The key fingerprint is:
4a:04:5d:34:22:87:0a:e0:f2:1a:25:b3:1c:ab:4a:f9 crawler@jamie-pc
(RSA key randomart image omitted)
Could not open a connection to your authentication agent.    <== seems harmless
unpack success!
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Make the client installation package
Formatting HDFS...                                       <== starts setting up HDFS
15/01/15 14:16:34 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = java.net.UnknownHostException: jamie-pc: jamie-pc
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.19.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977; compiled by 'ndaley' on Fri Feb 20 00:16:34 UTC 2009
************************************************************/
Re-format filesystem in /var/lib/crawlzilla/nutch-crawler/dfs/name ? (Y or N) y    <== re-format? I chose yes
Format aborted in /var/lib/crawlzilla/nutch-crawler/dfs/name
15/01/15 14:18:25 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: jamie-pc: jamie-pc
************************************************************/
start up name node [Namenode] ...
starting namenode, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-namenode-jamie-pc.out       <== starts the name node
start up job node [JobTracker] ...
starting jobtracker, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-jobtracker-jamie-pc.out   <== starts the job tracker
starting datanode, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-datanode-jamie-pc.out       <== starts the data node
starting tasktracker, logging to /var/log/crawlzilla/hadoop-logs/hadoop-crawler-tasktracker-jamie-pc.out <== starts the task tracker
Start up tomcat...
.....
Using CATALINA_BASE:   /opt/crawlzilla/tomcat
Using CATALINA_HOME:   /opt/crawlzilla/tomcat
Using CATALINA_TMPDIR: /opt/crawlzilla/tomcat/temp
Using JRE_HOME:        /usr
Tomcat has been started!
Installed successfully!
You can visit the manage website :http://10.57.54.168:8080    <== the host IP address plus port 8080
For client install, please refer commands as follows:         <== it also tells us how to install a client
scp crawler@10.57.54.168:/home/crawler/crawlzilla/source/client_deploy.sh .
./client_deploy.sh
Finish!!!                                                      <== and we are done
Appendix: HDFS Web Interface
HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations. By default this is exposed on port 50070 on the NameNode. Accessing http://namenode:50070/ with a web browser will return a page containing overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).
The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml. It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.
From this interface, you can browse HDFS itself with a basic file-browser interface. Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.
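For example, a quick sanity check from a shell (namenode is the same placeholder hostname used above; adjust host and port to your cluster):
# The status page and dfsadmin should report roughly the same capacity and usage numbers.
curl -s http://namenode:50070/ | head -n 20
bin/hadoop dfsadmin -report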
Hadoop Single-Node Installation Notes
I followed the installation steps from a few blog posts: Hadoop 2.6.0单节点安装参考, hadoop 2.6.0单节点-伪分布式模式安装, Mac OSX 下 Hadoop 单节点集群配置, and Hadoop快速入门.
That last site also has material from the 2014 competition.
1. My environment:
OS: Ubuntu 12.04.1 LTS
Hadoop: 2.6.0
Java: jdk1.6.0_45
2. Download Hadoop
Go to the hadoop section of the Apache site here and download the latest release; I used version 2.6.0.
3. After installing (unpacking) Hadoop, go straight into the hadoop-2.6.0 directory
Edit etc/hadoop/hadoop-env.sh
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45
To make sure JAVA_HOME does not have to be set again every time you log in to the Linux host (reference),
run the editor and modify your ~/.bashrc file.
(This could also go into /etc/profile. /etc/profile is the system-wide environment file and only root can modify it; when a user logs in with BASH, /etc/profile is executed first, followed by the personal files: ~/.bashrc (and ~/.bash_profile, ~/.bash_login, ~/.profile; see man bash for details). An administrator can therefore use /etc/profile to give every user an initial environment. If the same variable is set in both /etc/profile and .bashrc, the later one wins, because it overrides the earlier setting.)
vi ~/.bash_profile
Append the following at the very end:
# set JAVA_HOME for hadoop.
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45        <== my JDK location
export PATH=$PATH:/usr/lib/jvm/jdk1.6.0_45/bin   <== also add the JDK bin directory to PATH
4. Quick test: if the help text is printed, you are fine.
bin/hadoop
5. Edit etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
6. Edit etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
7. Configure ssh so that logging in to the local machine requires no password
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
8. Check that passwordless login really works
ssh localhost
exit
9. Make a copy of mapred-site.xml
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
10. Then edit etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
11. Edit etc/hadoop/yarn-site.xml:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
12. Make sure the local hostname resolves, to avoid java.lang.RuntimeException: java.net.UnknownHostException: myhostname: myhostname
Edit /etc/hosts
127.0.0.1 myhostname
13. Format the HDFS filesystem
bin/hdfs namenode -format
14. Start the NameNode and DataNode daemons
To view the NameNode in a browser: http://localhost:50070/ or http://0.0.0.0:50070/
sbin/start-dfs.sh
15. Start the ResourceManager and NodeManager daemons
To view the ResourceManager in a browser: http://localhost:8088/
sbin/start-yarn.sh
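As a quick check that the two web UIs mentioned in steps 14 and 15 are actually up (assuming the default ports), something like this will do:
# Both requests should print 200 once the daemons have finished starting.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/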
16. Create an input directory and list it
$ bin/hdfs dfs -mkdir -p /user/jamie/input
$ bin/hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - jamie supergroup          0 2015-01-19 14:30 /user/jamie/input
Note: to delete a directory /xxx, use:
$ bin/hdfs dfs -rm -r /xxx
17. Copy the files to be processed into the HDFS directory (reference)
$ bin/hdfs dfs -put etc/hadoop /user/jamie/input
$ bin/hdfs dfs -ls /user/jamie/input
Found 1 items
drwxr-xr-x   - jamie supergroup          0 2015-01-19 14:38 /user/jamie/input/hadoop
18. Run the classic wordcount, which is more or less the hello world of Hadoop (reference)
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
Or run the following instead
(A map/reduce program that estimates Pi using a quasi-Monte Carlo method.)
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 2
19. Run another MapReduce job
# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output2 'dfs[a-z.]+'
20. Run jps to look at the processes
(If the jps command is not found, add its directory to PATH, e.g. PATH=$PATH:/usr/lib/jvm/jdk1.6.0_45/bin/)
If everything above was done correctly, jps should show the five Java processes ResourceManager, NodeManager, NameNode, SecondaryNameNode and DataNode, roughly as follows:
$ jps
6539 NameNode
9741 NodeManager
7053 SecondaryNameNode
5652 Launcher
7778 Jps
9071 DataNode
9509 ResourceManager
Note: each line above is the process id followed by the process name.
21. View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
22. When all of this is done, stop the daemons:
$ sbin/stop-all.sh
Installing and Configuring ZooKeeper
Apache ZooKeeper is an open-source server project whose goal is to enable highly reliable distributed coordination.
ZooKeeper, a sub-project of Hadoop, is an indispensable part of managing a Hadoop cluster. It is mainly used to manage cluster data: for example the NameNode in a Hadoop cluster, master election in HBase, and state synchronization between servers. The article cited here introduces ZooKeeper basics and a few typical usage scenarios. These are only its basic capabilities; most importantly, ZooKeeper provides a solid mechanism for managing distributed clusters, namely its hierarchical, directory-tree-like data structure and the effective management of the nodes in that tree, from which many kinds of distributed data-management models can be built, well beyond the few common scenarios mentioned above. (Source: 分佈式服務框架Zookeeper — 管理分佈式環境中的數據)
1. Download the code
As usual, go to the official site and download Apache ZooKeeper.
2. Unpack it, set up the conf file, and start it
tar zxvf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
bin/zkServer.sh start
The zoo.cfg file, for reference:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
3. Check #1: after starting, run jps and you should see a QuorumPeerMain Java process
$ jps
32296 Jps
6539 NameNode
9741 NodeManager
32061 QuorumPeerMain
7053 SecondaryNameNode
9071 DataNode
9509 ResourceManager
4. Check #2: a quick way to confirm whether ZooKeeper is running (reference)
Log in to the ZooKeeper host and run the command below (make sure the client port is 2181 first).
Check whether you get an imok reply; if you do not, ZooKeeper is not running.
echo ruok | nc 127.0.0.1 2181
To get more information about ZooKeeper:
echo status | nc 127.0.0.1 2181
5. Test it with the client program:
$ ./zkCli.sh -server 127.0.0.1:2181
Connecting to 127.0.0.1:2181
2015-01-26 11:42:52,353 [myid:] - INFO  [main:Environment@100] - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2015-01-26 11:42:52,357 [myid:] - INFO  [main:Environment@100] - Client environment:host.name=jamie-pc
2015-01-26 11:42:52,357 [myid:] - INFO  [main:Environment@100] - Client environment:java.version=1.6.0_45
(... remaining Client environment lines trimmed ...)
2015-01-26 11:42:52,362 [myid:] - INFO  [main:ZooKeeper@438] - Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@6526804e
Welcome to ZooKeeper!
2015-01-26 11:42:52,390 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@975] - Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (java.lang.SecurityException: Unable to locate a login configuration)
JLine support is enabled
2015-01-26 11:42:52,395 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@852] - Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
[zk: 127.0.0.1:2181(CONNECTING) 0] 2015-01-26 11:42:52,509 [myid:] - INFO  [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server 127.0.0.1/127.0.0.1:2181, sessionid = 0x14b2455a60c0000, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: 127.0.0.1:2181(CONNECTED) 0]
[zk: 127.0.0.1:2181(CONNECTED) 0] create /test01 abcd    -- create a test node
Created /test01
[zk: 127.0.0.1:2181(CONNECTED) 2] ls /
[test01, zookeeper]                                      -----> the master node's data is there, so it worked.
[zk: 127.0.0.1:2181(CONNECTED) 3] delete /test01         -- delete it
[zk: 127.0.0.1:2181(CONNECTED) 4] ls /
[zookeeper]                                              -- and it is gone
6. Stop the service
bin/zkServer.sh stop
Installing HBase (standalone, and able to talk to ZooKeeper)
HBase is essentially Hadoop's database (installation reference).
Also check which Hadoop versions each HBase release supports; see this page, which also covers which JDK versions HBase supports. This really matters!
1. Download
Find a download mirror at http://www.apache.org/dyn/closer.cgi/hbase/
2. Unpack and enter the directory
tar zxvf hbase-0.98.9-hadoop1-bin.tar.gz
cd hbase-0.98.9-hadoop1
3. Edit the XML configuration
You could start HBase at this point, but you will probably want to edit conf/hbase-site.xml first and set hbase.rootdir to choose the directory HBase writes its data into.
For a standalone setup, hbase-site.xml only needs the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
</configuration>
Replace DIRECTORY with the directory you want the data written to. By default hbase.rootdir points to /tmp/hbase-${user.name}, which means the data is lost after a reboot (the operating system cleans /tmp when it restarts).
I changed mine to /home/jamie/share/hadoop/tmp/hbase: (source)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!--<value>hdfs://localhost:8020/hbase</value>-->
    <value>/tmp/hbase</value>    <=== zookeeper's zoo.cfg has dataDir=/tmp/zookeeper, so I put the hbase directory under /tmp as well (it is created automatically once you run HBase)
    <description>
      The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/tmp/zookeeper</value>    <=== taken from dataDir=/tmp/zookeeper in zookeeper's zoo.cfg; it must match for the two to talk to each other
    <description>
      Property from ZooKeeper config zoo.cfg.
      The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2182</value>    <=== compare with clientPort=2181 in zookeeper's zoo.cfg
    <description>Property from ZooKeeper's config zoo.cfg.
      The port at which the clients will connect.
    </description>
  </property>
</configuration>
4. Edit .bashrc to set the HBase PATH environment variables:
(As explained in the Hadoop section above, this could also go into /etc/profile; if the same variable is set in both places, ~/.bashrc wins because it is executed later.)
# set JAVA_HOME for hadoop.  ==> already set earlier in this post, just slightly revised
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45
export PATH=$PATH:$JAVA_HOME/bin

# set Hbase PATH Environmental variable  ==> the new part
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin
After the change, reload bashrc:
$ source ~/.bashrc
5. Start HBase
hbase-0.98.9-hadoop1$ bin/start-hbase.sh
starting master, logging to /home/jamie/share/hadoop/hbase-0.98.9-hadoop1/bin/../logs/hbase-jamie-master-jamie-pc.out
You are now running HBase in standalone mode: all of its services, including HBase itself and ZooKeeper, run inside a single JVM. HBase writes its logs to the logs directory; check them if startup goes wrong.
PS: is Java installed?
You need Oracle Java 1.6 or later. If typing java on the command line gets a response, Java is installed. If not, install it first and then edit conf/hbase-env.sh to point JAVA_HOME at your Java installation directory. (Here it is already set in .bashrc.)
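For reference, if you did need to set it in conf/hbase-env.sh instead of .bashrc, the line would look roughly like this (using the same JDK path as earlier in this post):
# In conf/hbase-env.sh: point HBase at the JDK explicitly.
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_45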
6. Enter the HBase shell:
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.9-hadoop1, r96878ece501b0643e879254645d7f3a40eaf101f, Mon Dec 15 22:36:48 PST 2014

hbase(main):001:0>    <==== waiting for your command
Try a few small commands: (source)
hbase(main):006:0> create 'test', 'cf'        <== create a table named test
0 row(s) in 0.5290 seconds

=> Hbase::Table - test
hbase(main):007:0> list 'table'               <== list tables matching 'table'
TABLE
0 row(s) in 0.0110 seconds

=> []
hbase(main):008:0> list                       <== plain list
TABLE
test
1 row(s) in 0.0120 seconds

=> ["test"]
hbase(main):009:0> list 'table'               <== list tables matching 'table' again
TABLE
0 row(s) in 0.0050 seconds

=> []
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'value1'    <== put value1 into table test, row row1, column cf:a
0 row(s) in 0.1250 seconds

hbase(main):011:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0070 seconds

hbase(main):012:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0100 seconds

hbase(main):013:0> scan 'test'                <== scan the table
ROW                   COLUMN+CELL
 row1                 column=cf:a, timestamp=1422256698208, value=value1
 row2                 column=cf:b, timestamp=1422256703257, value=value2
 row3                 column=cf:c, timestamp=1422256709483, value=value3
3 row(s) in 0.0440 seconds

hbase(main):014:0> get 'test', 'row1'         <== read a value back
COLUMN                CELL
 cf:a                 timestamp=1422256698208, value=value1
1 row(s) in 0.0220 seconds

hbase(main):015:0> disable 'test'             <== disable it first
0 row(s) in 1.5080 seconds

hbase(main):016:0> drop 'test'                <== then drop it
0 row(s) in 0.2270 seconds

hbase(main):070:0> exit                       <== leave the shell (Ctrl+C also works)
Installing Pig
1. Download it: find the Apache Pig page
2. Unpack it
$ tar zxvf pig-0.14.0.tar.gz
3. Edit .bashrc once more
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PIG_HOME=/home/jamie/share/hadoop/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PIG_HOME/bin:$PIG_HOME/conf
4. Open a new shell and try the pig -help command.
If the help text appears, you are fine.
pig -help
Pig has two execution modes: Local and MapReduce.
5-1. Try Local mode
$ pig -x local
2015-01-29 09:34:18,676 INFO  [main] pig.ExecTypeProvider: Trying ExecType : LOCAL
2015-01-29 09:34:18,676 INFO  [main] pig.ExecTypeProvider: Picked LOCAL as the ExecType
2015-01-29 09:34:18,724 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:01:24
2015-01-29 09:34:18,724 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/jamie/share/hadoop/pig-0.14.0/pig_1422495258723.log
2015-01-29 09:34:18,760 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/jamie/.pigbootup not found
2015-01-29 09:34:18,853 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
5-2. Try MapReduce mode
This requires the HADOOP_HOME environment variable to be set!
Edit .bashrc once again:
export HADOOP_HOME=/home/jamie/share/hadoop/hadoop-2.6.0
export HBASE_HOME=/home/jamie/share/hadoop/hbase-0.98.9-hadoop1
export PIG_HOME=/home/jamie/share/hadoop/pig-0.14.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$PIG_HOME/bin:$PIG_HOME/conf
Now MapReduce mode can be started:
$ pig -x mapreduce
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/01/29 11:22:26 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-01-29 11:22:26,925 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:02:05
2015-01-29 11:22:26,925 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/jamie/pig_1422501746924.log
2015-01-29 11:22:26,941 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/jamie/.pigbootup not found
2015-01-29 11:22:27,546 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-01-29 11:22:27,546 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-01-29 11:22:27,546 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jamie/share/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jamie/share/hadoop/hbase-0.98.9-hadoop1/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2015-01-29 11:22:28,442 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-01-29 11:22:28,443 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: lsn-linux:9001
2015-01-29 11:22:28,443 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
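Neither mode above is shown actually running a script, so here is a minimal Pig Latin word-count sketch (the input file and relation names are made up; create a small input.txt first) that can be tried in local mode:
# Write a tiny Pig Latin script and run it with the local execution engine.
cat > wordcount.pig <<'EOF'
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
EOF
pig -x local wordcount.pig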
Downloading and Compiling the Hadoop Source Code
1. Get the source
The Hadoop source code is managed on GitHub here; "Download ZIP" downloads the code directly.
2. Install the build tools (on Ubuntu): see How to Contribute to Hadoop Common
apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
3. Install JDK 7
Installation of Oracle Java JDK 7 (which includes the JRE, the Java browser plugin and JavaFX) on Ubuntu:
#sudo add-apt-repository ppa:webupd8team/java
#sudo apt-get update
#sudo apt-get install oracle-jdk7-installer
4. Upgrade protoc to version 2.5.0
wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz    ==> download
tar zxvf protobuf-2.5.0.tar.gz                                      ==> unpack
sudo ./configure --prefix=/usr                                      ==> configure
(if this step fails with: cpp: error trying to exec 'cc1plus': execvp: No such file or directory, install g++ first: sudo apt-get install g++)
sudo make                                                           ==> build
sudo make check                                                     ==> run the checks
sudo make install                                                   ==> install (info: Libraries have been installed in: /usr/local/lib)
5. Build it (reference)
Change to the top-level directory of the extracted source, where you will find pom.xml, the Maven build script.
# mvn package -Pdist -Pdoc -Psrc -Dtar -DskipTests
Sorry to bother you; I happen to be installing Crawlzilla as well and ran into some problems I would like to ask you about.
(1) Did you install nutch and tomcat before installing Crawlzilla?
The tomcat service was not started during my installation, so I am wondering whether tomcat and nutch have to be installed separately first.
(2) The Crawlzilla installation never asked me to enter a unix password, and it never reached the hadoop part afterwards, so I suspect the hadoop path differs from the one in conf/nutch_conf/hadoop-env.sh. Did you change the path in conf/nutch_conf/hadoop-env.sh?
Thank you.
1) I installed Crawlzilla directly; remember to make sure the network connection is up first! I did not install nutch or tomcat beforehand.
2) I did not change the paths.