
《Hadoop: The Definitive Guide》读书笔记 -- Chapter 1 Meet Hadoop

2024-11-06 · Source: 个人技术集锦

Preface

Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there is a common theme, it is about raising the level of abstraction -- to create building blocks for programmers who have lots of data to store and analyze, and who don't have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.


Part I Hadoop Fundamentals

Chapter 1 Meet Hadoop

Data!

The individual digital footprint keeps growing, and much of the data is machine-generated (machine logs, RFID readers, sensor networks).

"More data usually beats better algorithm"


Data Storage and Analysis

It takes a long time to read all the data from a single disk -- and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once.
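A rough back-of-the-envelope calculation makes the point. The numbers below are illustrative assumptions (a ~1 TB drive and a ~100 MB/s transfer rate, in the same ballpark as the book's example):

```latex
% Reading 1 TB from a single disk at 100 MB/s:
\frac{1\,\mathrm{TB}}{100\,\mathrm{MB/s}} = \frac{10^{6}\,\mathrm{MB}}{100\,\mathrm{MB/s}} = 10^{4}\,\mathrm{s} \approx 2.8\ \text{hours}

% Spreading the same data over 100 disks read in parallel:
\frac{10^{4}\,\mathrm{s}}{100} = 100\,\mathrm{s} \approx 2\ \text{minutes}
```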

Reading and writing from many disks in parallel raises two problems:

1. The chance that one of the disks will fail is fairly high. Solution: replication (RAID / HDFS).

2. Most analysis tasks need to combine data spread across the disks in some way. Solution: MapReduce.

Hadoop provides: a reliable, scalable platform for storage and analysis
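As a concrete taste of the storage half, here is a minimal sketch of reading a file through the Hadoop FileSystem Java API. The cluster URI and path are made-up placeholders, not values from the book:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch: stream a file stored in HDFS to stdout.
// "hdfs://namenode:8020/user/demo/input.txt" is a placeholder URI/path.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/user/demo/input.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                     // open the HDFS file
            IOUtils.copyBytes(in, System.out, 4096, false);  // copy its bytes to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```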


Querying All Your Data

MapReduce is a batch query processor: it gives you the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time.
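To make "batch query" concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API. The class names and input/output paths are illustrative, not taken from the book (which develops its own example in Chapter 2):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: the map phase emits (word, 1) pairs,
// the reduce phase sums the counts for each word.
public class WordCount {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The whole input is read and processed in bulk; results only appear when the job finishes, which is exactly the batch behavior the next section contrasts with interactive analysis.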


Beyond Batch

MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis. You can't run a query and get results back in a few seconds or less.


"Hadoop" is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce.

HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk.
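A minimal sketch of that online, per-row access using the HBase Java client. The table name, column family, row key, and values are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one row, then read it back: the online read/write access to
// individual rows that HBase layers on top of HDFS storage.
public class HBaseRowAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // "users" is a placeholder table

            // Online write of a single row.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Online read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```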

YARN (Yet Another Resource Negotiator), a cluster resource management system, allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.


Different processing patterns that work with Hadoop: [ROADMAP]

1. Interactive SQL: Impala (long-running daemons), Hive on Tez (container reuse).

2. Iterative Processing: Spark -- many machine learning algorithms are iterative in nature, so keeping the working dataset in memory between passes pays off (see the sketch after this list).

3. Stream Processing: Storm, Spark Streaming, and Samza run real-time, distributed computations on unbounded streams of data and emit results.

4. Search: Solr can run on a Hadoop cluster.
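For the iterative-processing point (item 2), here is a minimal Spark sketch in Java: the dataset is cached in memory once and reused on every pass of a simple gradient-descent loop instead of being re-read from disk. The data, learning rate, and iteration count are arbitrary placeholders, not anything from the book:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Iterative processing with Spark: the RDD is cached in memory, so each
// iteration of the loop reuses it rather than re-reading the input.
public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder data; in practice this might come from sc.textFile("hdfs://...").
            JavaRDD<Double> values = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();
            long n = values.count();

            // Gradient descent on mean squared error; x converges to the mean of the data.
            double x = 0.0;
            double lr = 0.5;                                   // learning rate
            for (int i = 0; i < 20; i++) {
                final double current = x;
                // Each pass reads the cached RDD from memory, not from disk.
                double grad = values.map(v -> current - v).reduce(Double::sum) / n;
                x -= lr * grad;
            }
            System.out.println("estimate = " + x);
        }
    }
}
```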


Comparison with Other Systems

Relational Database Management System

Why can't we use databases with lots of disks to do large-scale analysis?

Seek time is improving more slowly than transfer rate. A B-Tree works well when only a small proportion of the data is read or updated; when a large proportion must be updated, it is less efficient than the sort/merge approach MapReduce uses to rebuild the dataset.
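A rough illustration of why seeks dominate. All numbers below are assumptions for the sake of the arithmetic (a 1 TB dataset of 100-byte records, ~10 ms per seek, ~100 MB/s streaming transfer), not figures from the book:

```latex
% Updating 1% of 10^{10} records with one seek per record:
10^{8}\ \text{seeks} \times 10\,\mathrm{ms} = 10^{6}\,\mathrm{s} \approx 11.6\ \text{days}

% Streaming (sort/merge style) through the whole 1 TB instead:
\frac{10^{6}\,\mathrm{MB}}{100\,\mathrm{MB/s}} = 10^{4}\,\mathrm{s} \approx 2.8\ \text{hours}
```

Once more than a tiny fraction of the data is touched, streaming through the whole dataset beats seeking to individual records, which is the regime where MapReduce fits.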


Grid Computing (good for compute-intensive jobs)

Problems arise when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine), since network bandwidth becomes the bottleneck and compute nodes sit idle.

Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local (data locality).


Volunteer Computing



