Preface
Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there is a common theme, it is about raising the level of abstraction -- to create building blocks for programmers who have lots of data to store and analyze, and who don't have the time, the skill, or the inclination to become distributed systems experts in order to build the infrastructure to handle it.
Part I Hadoop Fundamentals
Chapter 1 Meet Hadoop
Data!
The individual digital footprint keeps growing, and so does machine-generated data (machine logs, RFID readers, sensor networks).
"More data usually beats better algorithm"
Data Storage and Analysis
It takes a long time to read all the data from a single disk -- and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once.
Problems with reading and writing from many disks in parallel:
1. The chance that one of many disks will fail is fairly high. Solution: replication (RAID; HDFS takes a related approach).
2. Most analysis tasks need to combine data read from different disks in some way. Solution: MapReduce, a programming model that abstracts the problem into computations over sets of keys and values (minimal sketch below).
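To make the model concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API (the example and class names -- WordCount, TokenizerMapper, IntSumReducer -- are illustrative additions, not from this chapter; the book itself develops a weather-data example later). The map phase emits (word, 1) pairs from each input line, the framework sorts and groups by key across all disks and nodes, and the reduce phase combines the grouped values:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE); // emit (word, 1)
        }
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get(); // combine all counts grouped under this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The programmer writes only the two single-machine functions; parallelizing across disks and handling disk failure are left to the framework.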
Hadoop provides: a reliable, scalable platform for storage and analysis
Querying All Your Data
MapReduce is a batch query processor: it can run an ad hoc query against the whole dataset and get results in a reasonable time.
Beyond Batch
MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis. You can't run a query and get results back in a few seconds or less.
"Hadoop" is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce.
HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk (minimal access sketch below).
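A minimal sketch of HBase's online row access from Java; it assumes a running cluster with an existing table named "users" that has a column family "info" (both names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowAccess {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Online write: update a single row in place.
      Put put = new Put(Bytes.toBytes("row-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Online read: fetch that row back by key.
      Result result = table.get(new Get(Bytes.toBytes("row-42")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```

This row-at-a-time access is exactly what plain HDFS files plus MapReduce cannot offer.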
YARN (Yet Another Resource Negotiator), a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.
Different processing patterns that work with Hadoop: [ROADMAP]
1. Interactive SQL: Impala (long-running daemons) and Hive on Tez (container reuse) bring SQL query latency down to interactive levels.
2. Iterative Processing: Spark (many algorithms in machine learning are iterative in nature; Spark can hold each intermediate working set in memory -- see the sketch after this list).
3. Stream Processing: Storm, Spark Streaming, and Samza run real-time, distributed computations on unbounded streams of data and emit results.
4. Search: Solr can run on a Hadoop cluster.
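A minimal sketch of the iterative pattern in Spark's Java API (the computation is contrived, purely to show the mechanics): cache() keeps the working set in memory, so every pass after the first avoids re-reading input from disk, which a chain of MapReduce jobs would have to do:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // The working set an iterative algorithm revisits on every pass;
      // cache() keeps it in memory after the first scan.
      JavaRDD<Double> points = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();

      // Toy fixed-point iteration that converges to the mean of `points`.
      double estimate = 0.0;
      for (int i = 0; i < 10; i++) {
        final double current = estimate;
        // Each pass re-reads `points` -- served from memory, not disk.
        double meanOffset = points.map(p -> p - current).reduce(Double::sum) / points.count();
        estimate = current + 0.5 * meanOffset;
      }
      System.out.println("estimate = " + estimate);
    }
  }
}
```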
Comparison with Other Systems
Relational Database Management System
Why can't we use databases with lots of disks to do large-scale analysis?
Seek time is improving more slowly than transfer rate. A B-Tree (the structure behind most RDBMSs) works well when reading or updating a small portion of the data, but for tasks that touch most of a dataset it is slower than a straight sequential scan, which is what MapReduce relies on (sort/merge rather than seek).
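A rough back-of-the-envelope illustration (the figures are assumptions of the right order for the era, not from the book): with ~100 MB/s transfer and ~10 ms seeks, scanning a 1 TB dataset sequentially takes about 10^12 B / 10^8 B/s = 10^4 s, under 3 hours; updating even 1% of 10^10 hundred-byte records via seek-based access costs about 10^8 seeks x 10 ms = 10^6 s, over 11 days. Once a job touches more than a small fraction of the data, streaming the whole dataset wins.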
Grid Computing (good for compute-intensive jobs)
Problems arise when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine): network bandwidth becomes the bottleneck and compute nodes sit idle.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. (Data locality)
Volunteer Computing
Volunteer computing projects (e.g., SETI@home) split a problem into small, CPU-intensive work units and farm them out to untrusted computers across the Internet; MapReduce instead targets dedicated, trusted hardware in a single data center, where data locality can be exploited.