转自http://prinx.blog.163.com/blog/static/190115275201211128513868/和http://www.cnblogs.com/jie465831735/archive/2013/03/06.html
按如下顺序看效果最佳:
1. MapReduce Simplied Data Processing on Large Clusters
2. Hadoop环境的安装 By 徐伟
3. Parallel K-Means Clustering Based on MapReduce
4. 《Hadoop权威指南》的第一章和第二章
5. 迭代式MapReduce框架介绍 董的博客
6. HaLoop: Efficient Iterative Data Processing on Large Clusters
7. Twister: A Runtime for Iterative MapReduce
8. 迭代式MapReduce解决方案(一)
9. 迭代式MapReduce解决方案(二)
10. 迭代式MapReduce解决方案(三)
11. Granules: A Lightweight, Streaming Runtime for Cloud Computing With Support for Map-Reduce
12. On the Performance of Distributed Data Clustering Algorithms in File and Streaming Processing Systems
13. Spark: Cluster Computing with Working Set
14. iMapReduce: A Distributed Computing Framework for Iterative Computation
15. 《Hadoop权威指南》的第三章到第十章
16. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
17. Clustering Very Large Multi-dimensional Datasets with MapReduce
18. HBase环境的安装 By 徐伟 + HBase 测试程序
Ps:简单讲解一下上面的流程,MapReduce计算模型就是Google在(1)中提出来的,一定要仔细看这篇论文,我当初因为看的不够仔细走了很多的弯路。Hadoop是一个开源的MapReduce计算模型实现,按照(2)来安装,以及跑一遍Word Count程序,基本上就算是入门了。(3)这篇文章价值不大,但是可以通过其看一下K-Means算法是如何MapReduce化的,以后就可以举一反三了。(4)的作用就是加深对(1-3)的理解。从(5)开始就可以进入迭代MapReduce的子领域了,董是这方面的大牛。(6)(7)是(5)中提到的两篇论文,(5-7)都要仔细的看,把迭代MapReduce的基础打牢。(8-10)也是董的文章,加深一下对迭代MapReduce问题的理解。(11)(12)是Jaliya Ekanayake、Shrideep Pallickara合作的文章,他们是国外迭代MapReduce领域的发文章最多的两个人。(13)是伯克利大学的迭代MapReduce的文章,Spark是所有实验室产品中唯一已经商用推广的,赞!(14)这篇文章,我看的不是很细致,但是Collector的灵感就是来源于这篇文章。这个时候估计你已经有自己的解决方案了,要编程实现自己的设计了,需要仔细的看(15)了。(16) Map-Reduce-Merge咱们实验室曾经做过的一个问题。(17)这篇文章+Canopy算法,可以得出一些关于用MapReduce实现高质量数据抽样的思路。(18)如果需要使用HBase,可以参考这篇文章。
转自http://cloud.dlmu.edu.cn/cloudsite/index.php?action-viewnews-itemid-123-php-1
[71] Bose JH, Andrzejak A, Hogqvist M. Beyond online aggregation: Parallel and incremental data mining with online Map-Reduce. In: Tanaka K, Zhou XF, Zhang M, Jatowt A, eds. Proc. of the Workshop on Massive Data Analytics on the Cloud (WWW 2010). Raleigh: ACM Press, 2010. 3.
[72] Kumar V, Andrade H, Gedik B, Wu KL. DEDUCE: At the intersection of MapReduce and stream processing. In: Manolescu I, Spaccapietra S, Teubner J, Kitsuregawa M, Léger A, Naumann F, Ailamaki A, Ozcan F, eds. Proc. of the EDBT. Lausanne: ACM Press, 2010. 657.662.
[73] Abramson D, Dinh MN, Kurniawan D, Moench B, DeRose L. Data centric highly parallel debugging. In: Hariri S, Keahey K, eds. Proc. of the HPDC. Chicago: ACM Press, 2010. 119.129.
[74] Morton K, Friesen A, Balazinska M, Grossman D. Estimating the progress of MapReduce pipelines. In: Li FF, Moro MM, Ghandeharizadeh S, et al., eds. Proc. of the ICDE. Long Beach: IEEE Press, 2010. 681.684.
[75] Morton K, Balazinska M, Grossman D. ParaTimer: A progress indicator for MapReduce DAGs. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 507.518.
[76] Lang W, Patel JM. Energy management for MapReduce clusters. PVLDB, 2010,3(1-2):129.139.
[77] Wieder A, Bhatotia P, Post A, Rodrigues R. Brief announcement: Modeling MapReduce for optimal execution in the cloud. In: Richa AW, Guerraoui R, eds. Proc. of the PODC. Zurich: ACM Press, 2010. 408.409.
[78] Zheng Q. Improving MapReduce fault tolerance in the cloud. In: Taufer M, Rünger G, Du ZH, eds. Proc. of the Workshops and Phd Forum (IPDPS 2010). Atlanta: IEEE Presss, 2010. 1.6.
[79] Groot S. Jumbo: Beyond MapReduce for workload balancing. In: Mylopoulos J, Zhou LZ, Zhou XF, eds. Proc. of the PhD Workshop (VLDB 2010). Singapore: VLDB Endowment, 2010. 7.12.
[80] Chatziantoniou D, Tzortzakakis E. ASSET queries: A declarative alternative to MapReduce. SIGMOD Record, 2009,38(2):35.41.
[81] Bu YY, Howe B, Balazinska M, Ernst MD. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 2010,3(1-2): 285−296.
[82] Wang HJ, Qin XP, Zhang YS, Wang S, Wang ZW. LinearDB: A relational approach to make data warehouse scale like MapReduce. In: Yu JX, Kim MH, Unland R, eds. Proc. of the DASFAA. Hong Kong: Springer-Verlag, 2011. 306−320
转自 http://blog.csdn.net/zhaomirong/article/details/7832215
1. nosqldbs-NOSQL Introduction and Overview
2. system and method for data distribution(2009)
3. System and method for large-scale data processing using an application-independent framework(2010)
4. MapReduce: Simplified Data Processing on Large Clusters;
5. MapReduce-- a flexible data processing tool(2010)
6. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
7. MapReduce and Parallel DBMSs--Friends or Foes(2010)
8. Presentation:MapReduce and Parallel DBMSs:Together at Last (2010)
9. Twister: A Runtime for Iterative MapReduce(2010)
10. MapReduce Online(2009)
11. Megastore: Providing Scalable, Highly Available Storage for Interactive Services (2011,CIDR)
12. Interpreting the Data:Parallel Analysis with Sawzall
13. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (technical report 2010)
14. Large-scale Incremental Processing Using Distributed Transactions and Notifications(2010)
15. Improving MapReduce Performance in Heterogeneous Environments
16. Dremel: Interactive Analysis of WebScale Datasets(2011)
17. Large-scale Incremental Processing Using Distributed Transactions and Notifications
18. Chukwa: a scalable cloud monitoring System (presentation)
19. The Chubby lock service for loosely-coupled distributed systems
20. Paxos Made Simple(2001,Lamport)
21. Fast Paxos(2006)
22. Paxos Made Live - An Engineering Perspective(2007)
23. Classic Paxos vs. Fast Paxos: Caveat Emptor
24. On the Coordinator’s Rule for Fast Paxos(2005)
25. Paxos made code:Implementing a high throughput Atomic Broadcast (2009)
26. Bigtable: A Distributed Storage System for Structured Data(2006)
27. The Google File System
Google patent papers
1. Data processing system and method for financial debt instruments(1999)
2. Data processing system and method to enforce payment of royalties when copying softcopy books(1996)
3. Data processing systems and methods(2005)
4. Large-scale data processing in a distributed and parallel processing environment(2010)
5. METHODS AND SYSTEMS FOR MANAGEMENT OF DATA()
6. SEARCH OVER STRUCTURED DATA(2011)
7. System and method for maintaining replicated data coherency in a data processing system(1995)
8. System and method of using data mining prediction methodology(2006)
9. System and Methodology for Data Processing Combining Stream Processing and spreadsheet computation(2011)
10. Patent Factor index report of system and method of using data mining prediction methodology
11. Pregel: A System for Large-Scale Graph Processing(2010)
Hadoop
1. A simple totally ordered broadcast protocol
2. ZooKeeper: Wait-free coordination for Internet-scale systems
3. Zab: High-performance broadcast for primary-backup systems(2011)
4. wait-free syschronization(1991)
5. ON SELF-STABILIZING WAIT-FREE CLOCK SYNCHRONIZATION(1997)
6. Wait-free clock synchronization(ps format)
7. Programming with ZooKeeper - A basic tutorial
8. Hive – A Petabyte Scale Data Warehouse Using Hadoop
9. Thrift: Scalable Cross-Language Services Implementation(Facebook)
10. Hive other files: HiveMetaStore class picture, Chinese docs
11. Scaling out data preprocessing with Hive (2011)
12. HBase The Definitive Guide - 2011
13. Nova: Continuous Pig/Hadoop Workflows(yahoo,2011)
14. Pig Latin: A Not-So-Foreign Language for Data Processing(2008)
15. Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?(2009)
a. Some docs about HStreaming,Zebra
16. HIPI: A Hadoop Image Processing Interface for Image-based MapReduce Tasks
17. System Anomaly Detection in Distributed Systems through MapReduce-Based Log Analysis(2010)
18. Benchmarking Cloud Serving Systems with YCSB(2010)
19. Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework (2009)
SmallFile Combine in hadoop world
1. TidyFS: A Simple and Small Distributed File System(Microsoft)
2. Improving the storage efficiency of small files in cloud storage(chinese,2011)
3. Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications(2010)
4. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems(Facebook)
5. A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint Files(IBM,2010)
Job schedule
1. Job Scheduling for Multi-User MapReduce Clusters(Facebook)
2. MapReduce Scheduler Using Classifiers for Heterogeneous Workloads(2011)
3. Performance-Driven Task Co-Scheduling for MapReduce Environments
4. Towards a Resource Aware Scheduler in Hadoop(2009)
5. Delay Scheduling: A Simple Technique for Achieving
6. Locality and Fairness in Cluster Scheduling(yahoo,2010)
7. Dynamic Proportional Share Scheduling in Hadoop(HP)
8. Adaptive Task Scheduling for MultiJob MapReduce Environments(2010)
9. A Dynamic MapReduce Scheduler for Heterogeneous Workloads(2009)
HStreaming
1. HStreaming Cloud Documentation
2. S4: Distributed Stream Computing Platform(yahoo,2010)
3. Complex Event Processing(2009)
4. Hstreaming : http://www.hstreaming.com/resources/manuals/
5. StreamBase: http://streambase.com/developers-docs-pdfindex.htm
6. Twitter storm: http://www.infoq.com/cn/news/2011/09/twitter-storm-real-time-hadoop
7. Bulk Synchronous Parallel(BSP) computing
8. MPI
SQL/Mapreduce
1. Aster Data whilepaper:Deriving Deep Insights from Large Datasets with SQL-MapReduce (2004)
2. SQL/MapReduce: A practical approach to self-describing,polymorphic, and parallelizable user-defined functions(2009,aster)
3. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads(2009)
4. HadoopDB in Action: Building Real World Applications(2010)
5. Aster Data presentation: Making Advanced Analytics on Big Data Fast and Easy(2010)
6. A Scalable, Predictable Join Operator for
7. Highly Concurrent Data Warehouses(2009)
8. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce(2010)
9. Greenplum whilepaper:A Unified Engine for RDBMS and MapReduce(2004)
10. A Comparison of Approaches to Large-Scale Data Analysis(2009)
11. MAD Skills: New Analysis Practices for Big Data (2009)
12. C Store A Column oriented DBMS(2005)
13. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations(Microsoft)
Microsoft
1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (2007)
Amazon
1. Dynamo: Amazon’s Highly Available Key-value Store(2007)
2. Efficient Reconciliation and Flow Control for Anti-Entropy Protocols
3. The Eucalyptus Open-source Cloud-computing System
4. Eucalyptus: An Open-source Infrastructure for Cloud Computing(presentation)
5. Eucalyptus : A Technical Report on an Elastic Utility Computing Archietcture Linking Your Programs to Useful Systems (2008)
6. Zephyr: Live Migration in Shared Nothing Databases for Elastic Cloud Platforms(2011)
7. Database-Agnostic Transaction Support for Cloud Infrastructures
8. CloudScale: Elastic Resource Scaling for Multi-Tenant Cloud Systems(2011)
9. ELT: Efficient Log-based Troubleshooting System for Cloud Computing Infrastructures
Books
1. Distributed Systems Concepts and Design (5th Edition)
2. Principles of Computer Systems (7-11)
3. Distributed system(chapter)
4. Data-Intensive Text Processing with MapReduce (2010)
5. Hadoop in Action
6. 21 Recipes for Mining Twitter
7. Hadoop.The.Definitive.Guide.2nd.Edition
8. Pro hadoop
Other papers about Distributed system
1. Flexible Update Propagation for Weakly Consistent Replication(1997)
2. Providing High Availability Using Lazy Replication(1992)
3. Managing Update Conflicts in Bayou,a Weakly Connected Replicated Storage System(1995)
4. XMIDDLE: A Data-Sharing Middleware for Mobile Computing(2002)
5. design and implementation of sun network filesystem
6. Chord: A Scalable Peertopeer Lookup Service for Internet Applications(2001)
7. A Survey and Comparison of Peer-to-Peer Overlay Network Schemes(2004)
8. Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing(2001)
BI
1. 21 Recipes for Mining Twitter(Book)
2. Web Data Mining(Book)
3. Web Mining and Social Networking(Book)
4. mining the social web(book)
5. TEXTUAL BUSINESS INTELLIGENCE (Inmon)
6. Social Network Analysis and Mining for Business Applications(yahoo,2011)
7. Data Mining in Social Networks(2002)
8. Natural Language Processing with Python(book)
9. data_mining-10_methods(Chinese editation)
10. Mahout in Action(Book)
11. Text Mining Infrastructure in R(2008)
12. Text Mining Handbook(2010)
Web search engine
1. Building Efficient Multi-Threaded Search Nodes(Yahoo,2010)
2. The Anatomy of a Large-Scale Hypertextual Web Search Engine(google)
Hadoop
一个分布式系统基础架构,由Apache基金会开发。用户可以在不了解分布式底层细节的情况下,开发分布式程序。充分利用集群的威力高速运算和存储。Hadoop实现了一个分布式文件系统(Hadoop Distributed File System),简称HDFS。HDFS有着高容错性的特点,并且设计用来部署在低廉的(low-cost)硬件上。而且它提供高传输率(high throughput)来访问应用程序的数据,适合那些有着超大数据集(large data set)的应用程序。HDFS放宽了(relax)POSIX的要求(requirements)这样可以流的形式访问(streaming access)文件系统中的数据。
名字起源
起源
Hadoop logo
诸多优点
架构
HDFS
NameNode
DataNode
文件操作
Linux 集群
集群系统
应用程序
Hadoop系统安装于配置
海量数据处理平台架构介绍
Hadoop能解决哪些问题
Hadoop在国内的情景
Hadoop简介
Hadoop生态系统介绍
HDFS简介
HDFS设计原则
HDFS系统结构
HDFS文件权限
HDFS文件读取
HDFS文件写入
HDFS文件存储
HDFS文件存储结构
HDFS开发常用命令
Hadoop管理员常用命令
HDFS API简介
用Java对HDFS编程
Mapreduce简介
编写MapReduce程序的步骤
MapReduce模型
MapReduce运行步骤
MapReduce执行流程
MapReduce基本流程
JobTracker(JT)和TaskTracker(TT)简介
Mapreduce原理
使用ZooKeeper来协作JobTracker
Hadoop Job Scheduler
mapreduce的类型与格式
mapreduce的数据类型与java类型对应关系
Writable接口
实现自定义的mapreduce类型
mapreduce驱动默认的设置