Hadoop: HDFS & MapReduce

Apache Hadoop is an open-source framework for distributed storage (HDFS) and processing (MapReduce/YARN) of large datasets across clusters of commodity hardware. It underpins the "big data" ecosystem.

HDFS Architecture

HDFS (Hadoop Distributed File System):

NameNode (master)
  - Stores filesystem metadata (directory tree, file→block mapping)
  - Does NOT store actual data
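  - Single point of failure unless HA is configured → use NameNode HA (active/standby pair sharing edits through the Quorum Journal Manager)
  - Keeps metadata in memory (fast) and persists it to disk (FsImage snapshot + EditLog of changes since the snapshot)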

DataNodes (workers, many)
  - Store actual data blocks (default 128 MB per block)
  - Send heartbeats to NameNode every 3 seconds
  - Block reports every 6 hours (list of all blocks they hold)
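
To make the block size concrete, here is a minimal sketch of how a file length maps onto blocks, assuming the default 128 MB block size stated above. `BlockMath` and `blockLengths` are hypothetical names for illustration and are not part of the Hadoop API. The last block only occupies as many bytes as the file actually needs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Hadoop API: splits a file length
// into HDFS-style block lengths.
final class BlockMath {
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

    // Returns the length in bytes of each block for a file of fileSize bytes.
    static List<Long> blockLengths(long fileSize, long blockSize) {
        List<Long> lengths = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // Every block is full except possibly the last one.
            lengths.add(Math.min(blockSize, fileSize - offset));
        }
        return lengths;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // a 300 MB file
        List<Long> blocks = blockLengths(fileSize, DEFAULT_BLOCK_SIZE);
        System.out.println(blocks.size() + " blocks"); // 3 blocks
        long lastMb = blocks.get(blocks.size() - 1) / (1024 * 1024);
        System.out.println("last block: " + lastMb + " MB"); // last block: 44 MB
    }
}
```

With 3 replicas per block, that 300 MB file consumes roughly 900 MB of raw DataNode disk.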

Block replication (default: 3 copies):
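  - Replica 1: the writer's own node if the writer runs on a DataNode (otherwise a random node)
  - Replica 2: a node on a different rack
  - Replica 3: a different node on the same rack as replica 2
  - Tolerates: the loss of 1 rack or of any 2 DataNodes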
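
The placement rule can be sketched in a few lines. This is a simplified model, assuming the writer runs on a DataNode so the first replica is local. Real HDFS picks randomly among eligible nodes and also weighs load and free space, while this sketch just takes the first match. `ReplicaPlacement`, the node names, and the rack names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the default placement rule, not HDFS code.
final class ReplicaPlacement {
    // topology maps DataNode name -> rack name, iterated in insertion order.
    static List<String> place(String writer, Map<String, String> topology) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer); // replica 1: the writer's own node
        String writerRack = topology.get(writer);

        // Replica 2: first node found on a different rack than the writer.
        String second = null;
        for (Map.Entry<String, String> e : topology.entrySet()) {
            if (!e.getValue().equals(writerRack)) {
                second = e.getKey();
                break;
            }
        }
        if (second == null) {
            return replicas; // single-rack cluster: not modeled in this sketch
        }
        replicas.add(second);

        // Replica 3: a different node on the same rack as replica 2.
        String secondRack = topology.get(second);
        for (Map.Entry<String, String> e : topology.entrySet()) {
            if (e.getValue().equals(secondRack) && !e.getKey().equals(second)) {
                replicas.add(e.getKey());
                break;
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, String> topology = new LinkedHashMap<>();
        topology.put("dn1", "rackA");
        topology.put("dn2", "rackA");
        topology.put("dn3", "rackB");
        topology.put("dn4", "rackB");
        System.out.println(place("dn1", topology)); // [dn1, dn3, dn4]
    }
}
```

Two of the three copies share a rack, which keeps cross-rack write traffic to a single transfer while still surviving the loss of either rack.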

Secondary NameNode (NOT a backup!)
  - Periodically merges EditLog into FsImage (checkpointing)
  - Reduces NameNode restart time
  - Use NameNode HA + Standby NameNode for real HA
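
Checkpointing is easier to picture with a toy model. The sketch below treats the FsImage as a map from path to file size and the EditLog as a list of changes, then replays the log onto the snapshot. All names (`CheckpointSketch`, `Edit`, `checkpoint`) are hypothetical, and the real on-disk formats and log rolling are considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of checkpointing, not the NameNode's actual data structures.
final class CheckpointSketch {
    // One logged namespace change. A null size means "delete this path".
    static final class Edit {
        final String path;
        final Long size;

        Edit(String path, Long size) {
            this.path = path;
            this.size = size;
        }
    }

    // Replays the edit log on top of the last snapshot and returns the new snapshot.
    static Map<String, Long> checkpoint(Map<String, Long> fsImage, List<Edit> editLog) {
        Map<String, Long> merged = new TreeMap<>(fsImage);
        for (Edit e : editLog) {
            if (e.size == null) {
                merged.remove(e.path);
            } else {
                merged.put(e.path, e.size);
            }
        }
        editLog.clear(); // merged edits no longer need replaying at startup
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> fsImage = new TreeMap<>();
        fsImage.put("/a", 10L);

        List<Edit> editLog = new ArrayList<>();
        editLog.add(new Edit("/b", 20L)); // create /b
        editLog.add(new Edit("/a", null)); // delete /a

        System.out.println(checkpoint(fsImage, editLog)); // {/b=20}
        System.out.println(editLog.size()); // 0
    }
}
```

A shorter log means fewer edits to replay when the NameNode restarts, which is where the reduced restart time comes from.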

HDFS Commands

# HDFS file operations (hdfs dfs or hadoop fs — equivalent)
hdfs dfs -ls /user/hadoop/
hdfs dfs -ls -h /data/           # human-readable sizes
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put localfile.txt /user/hadoop/input/
hdfs dfs -put -f localfile.txt /hdfs/path/   # force overwrite
hdfs dfs -get /user/hadoop/output/part-00000 ./local-output.txt
hdfs dfs -cat /user/hadoop/input/data.csv
hdfs dfs -head /user/hadoop/input/data.csv   # first 1KB
hdfs dfs -tail /user/hadoop/output/result.txt
hdfs dfs -cp /source/file.txt /dest/
hdfs dfs -mv /old/path /new/path
hdfs dfs -rm /user/hadoop/file.txt
hdfs dfs -rm -r /user/hadoop/directory/     # recursive
hdfs dfs -rm -r -skipTrash /path/           # skip trash (permanent)

# Check filesystem health
hdfs dfsadmin -report                        # cluster status
hdfs dfsadmin -safemode get                  # check safe mode
hdfs fsck /                                  # filesystem check
hdfs fsck /path/to/file -files -blocks -locations

# Disk usage
hdfs dfs -du -h /user/hadoop/
hdfs dfs -du -s -h /data/                   # summary only
hdfs dfs -count /user/hadoop/               # dir count, file count, size

# Permissions (POSIX-style)
hdfs dfs -chmod 755 /user/hadoop/output/
hdfs dfs -chown hadoop:hadoop /user/hadoop/
hdfs dfs -chgrp datagrp /shared/

MapReduce

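MapReduce is a programming model for processing large datasets in parallel. A job consists of a Map phase (process each record independently and emit key/value pairs) and a Reduce phase (aggregate all values that share a key). Between the two, the framework shuffles and sorts the map output so that every value for a given key reaches the same reducer.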

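// WordCount example (classic MapReduce)
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {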

    // Mapper: input line → (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // key = byte offset of the line in the file, value = the line itself
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }

    // Reducer: (word, [1,1,1,...]) → (word, count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local reducer on mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
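
To see what the combiner buys, the same job can be modeled in plain Java without a cluster. This is only an illustrative sketch: `LocalWordCount` and its methods are hypothetical names, each input string stands in for one mapper's split, and the list returned by `shuffleInput` represents the records that would cross the network to the reducers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a WordCount job; nothing here touches the Hadoop API.
final class LocalWordCount {
    // Map: one input line -> a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(token, 1));
            }
        }
        return out;
    }

    // Combine and reduce share one rule: sum the values seen for each key.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Returns the pairs that would be shuffled; each split is one mapper.
    static List<Map.Entry<String, Integer>> shuffleInput(List<String> splits, boolean combine) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            if (combine) {
                shuffled.addAll(sumByKey(mapped).entrySet()); // pre-aggregate per mapper
            } else {
                shuffled.addAll(mapped);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("to be or not to be", "to do is to be");
        System.out.println(shuffleInput(splits, false).size()); // 11
        System.out.println(shuffleInput(splits, true).size()); // 8
        System.out.println(sumByKey(shuffleInput(splits, true)));
        // {be=3, do=1, is=1, not=1, or=1, to=4}
    }
}
```

Reusing the reducer as the combiner only works because summing is associative and commutative. An operation such as averaging would need a separate combiner that carries partial sums and counts.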

# Run MapReduce job
hadoop jar wordcount.jar WordCount /input /output

# Monitor via ResourceManager web UI: http://resourcemanager:8088

# Job history
mapred job -history job_1234567890_0001     # takes a job ID or a .jhist history file
yarn logs -applicationId application_1234567890_0001

Hadoop: YARN, Ecosystem & Modern Alternatives

YARN (Resource Management)

YARN (Yet Another Resource Negotiator) decouples resource management from application logic. It allows running diverse workloads (MapReduce, Spark, Flink, Tez) on the same cluster.

YARN components:

ResourceManager (master)
  - Scheduler: allocates cluster resources to applications
  - ApplicationsManager: manages application lifecycle

NodeManager (on each worker node)
  - Launches and manages containers
  - Reports resource usage to ResourceManager
  - Monitors container health

ApplicationMaster (per application)
  - Runs inside a container
  - Negotiates resources from ResourceManager
  - Coordinates task execution

Container: unit of resources (CPU cores + memory)
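
A back-of-the-envelope sketch of how container size limits parallelism on one node. It assumes both memory and vcores are enforced, which depends on the scheduler's resource calculator setting, and it ignores memory reserved for the OS and daemons. `ContainerMath` and the node sizes are hypothetical.

```java
// Hypothetical capacity arithmetic, not a YARN API.
final class ContainerMath {
    // Number of containers of the requested size that fit on one NodeManager.
    // The scarcer resource (memory or vcores) sets the limit.
    static int containersPerNode(int nodeMemMb, int nodeVcores, int reqMemMb, int reqVcores) {
        int byMemory = nodeMemMb / reqMemMb;
        int byVcores = nodeVcores / reqVcores;
        return Math.min(byMemory, byVcores);
    }

    public static void main(String[] args) {
        // A node offering 64 GB and 16 vcores to YARN.
        System.out.println(containersPerNode(65536, 16, 4096, 1)); // 16
        System.out.println(containersPerNode(65536, 16, 8192, 1)); // 8 (memory-bound)
        System.out.println(containersPerNode(65536, 8, 4096, 1)); // 8 (vcore-bound)
    }
}
```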

Schedulers:
  FIFO Scheduler    — simple, single queue, not suitable for shared clusters
  Capacity Scheduler — multiple queues with guaranteed capacity (default)
  Fair Scheduler    — shares resources equally among all apps
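
An equal split gets more interesting when some apps need less than their equal share. The sketch below shows max-min fair sharing, the general idea behind fair allocation: small demands are met in full and the leftover is divided among the rest. It is a simplified illustration with a hypothetical `FairShareSketch` class, not the Fair Scheduler's actual algorithm, which also handles weights, queues, and preemption.

```java
import java.util.Arrays;

// Simplified max-min fair sharing over a single resource (e.g. GB of memory).
final class FairShareSketch {
    static int[] maxMinFairShares(int capacity, int[] demands) {
        int n = demands.length;
        int[] share = new int[n];
        boolean[] satisfied = new boolean[n];
        int remaining = capacity;
        int unsatisfied = n;

        while (unsatisfied > 0 && remaining > 0) {
            int equal = remaining / unsatisfied; // equal split of what is left
            if (equal == 0) {
                break;
            }
            boolean anyCapped = false;
            // Apps that need no more than the equal split get their full demand.
            for (int i = 0; i < n; i++) {
                if (!satisfied[i] && demands[i] <= equal) {
                    share[i] = demands[i];
                    remaining -= demands[i];
                    satisfied[i] = true;
                    unsatisfied--;
                    anyCapped = true;
                }
            }
            // Nobody fits under the equal split: hand it out and stop.
            if (!anyCapped) {
                for (int i = 0; i < n; i++) {
                    if (!satisfied[i]) {
                        share[i] = equal;
                        remaining -= equal;
                    }
                }
                break;
            }
        }
        return share;
    }

    public static void main(String[] args) {
        // 100 GB cluster, three apps asking for 10, 40 and 80 GB.
        System.out.println(Arrays.toString(maxMinFairShares(100, new int[] {10, 40, 80})));
        // [10, 40, 50]
    }
}
```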

# YARN commands
yarn node -list                          # list NodeManagers
yarn application -list                   # running applications
yarn application -status <app_id>
yarn application -kill <app_id>
yarn logs -applicationId <app_id>

# Queue status
yarn queue -status default

# ResourceManager web UI: http://resourcemanager:8088

Hadoop Ecosystem

Data Storage:
  HDFS          — distributed file system (core)
  HBase         — NoSQL on top of HDFS (random read/write, wide-column)
  Kudu          — columnar storage with fast analytics + random access

Data Processing:
  MapReduce     — batch processing (low-level, rarely used directly now)
  Spark         — in-memory batch/streaming (replaces MapReduce for most use cases)
  Tez           — DAG execution engine (used by Hive, Pig)
  Flink         — streaming-first, low-latency processing

SQL on Hadoop:
  Hive          — SQL interface over HDFS/ORC (batch queries)
  Impala        — fast SQL on HDFS (interactive, no Tez/MapReduce)
  Presto/Trino  — federated SQL (HDFS, S3, databases)
  Spark SQL     — SQL within Spark

Data Ingestion:
  Sqoop         — import/export between RDBMS and HDFS
  Kafka         — high-throughput event streaming
  Flume         — log/event collection
  NiFi          — data flow automation (drag-and-drop)

Orchestration:
  Oozie         — workflow scheduler for Hadoop jobs
  Airflow       — Python-based DAG scheduler (now preferred)

Serialization and file formats:
  Avro          — row-based, schema evolution, Kafka-native
  Parquet       — columnar, great for analytics (Spark/Hive preferred)
  ORC           — columnar, optimized for Hive

Hive Basics

-- Create external table (data stays in HDFS)
CREATE EXTERNAL TABLE logs (
    `timestamp` STRING,   -- backticks needed: TIMESTAMP is a reserved word in Hive
    level     STRING,
    message   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/';

-- Create ORC table (optimized format)
CREATE TABLE sales_orc (
    sale_id   INT,
    amount    DECIMAL(10,2),
    sale_date DATE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Partitioned table (critical for performance)
CREATE TABLE events (
    user_id INT,
    action  STRING
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET;

-- Load data with partition
INSERT INTO events PARTITION (year=2024, month=3, day=15)
SELECT user_id, action FROM raw_events WHERE dt = '2024-03-15';

-- Query (partition pruning is automatic)
SELECT COUNT(*) FROM events WHERE year=2024 AND month=3;
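
Partition pruning works because each partition is its own directory, named with `key=value` path segments. The sketch below mimics that layout and filters directories the way a `WHERE year=2024 AND month=3` predicate would. The `/warehouse/events` table location and the `PartitionPruning` class are hypothetical, and Hive's real planner resolves partitions through the metastore rather than by string matching.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of Hive-style partition directories, not Hive code.
final class PartitionPruning {
    // Builds the directory for one partition, e.g. .../year=2024/month=3/day=15.
    static String partitionPath(String tableDir, Map<String, Integer> spec) {
        StringBuilder sb = new StringBuilder(tableDir);
        for (Map.Entry<String, Integer> e : spec.entrySet()) {
            sb.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Keeps only the directories whose key=value segments match every filter entry.
    static List<String> prune(List<String> partitionDirs, Map<String, Integer> filter) {
        List<String> kept = new ArrayList<>();
        for (String dir : partitionDirs) {
            boolean matches = true;
            for (Map.Entry<String, Integer> f : filter.entrySet()) {
                // Match a whole path segment so month=3 does not match month=30.
                String segment = "/" + f.getKey() + "=" + f.getValue() + "/";
                if (!(dir + "/").contains(segment)) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                kept.add(dir);
            }
        }
        return kept;
    }

    static Map<String, Integer> spec(int year, int month, int day) {
        Map<String, Integer> s = new LinkedHashMap<>();
        s.put("year", year);
        s.put("month", month);
        s.put("day", day);
        return s;
    }

    public static void main(String[] args) {
        String table = "/warehouse/events"; // hypothetical table location
        List<String> dirs = new ArrayList<>();
        dirs.add(partitionPath(table, spec(2023, 3, 15)));
        dirs.add(partitionPath(table, spec(2024, 2, 28)));
        dirs.add(partitionPath(table, spec(2024, 3, 1)));
        dirs.add(partitionPath(table, spec(2024, 3, 15)));

        Map<String, Integer> filter = new LinkedHashMap<>();
        filter.put("year", 2024);
        filter.put("month", 3);
        System.out.println(prune(dirs, filter).size()); // 2
    }
}
```

Only the two matching directories would be scanned. A query that filters on a non-partition column instead has to read every partition.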

When to Use Hadoop vs Alternatives

Hadoop strengths: petabyte-scale data already on HDFS, existing cluster investment, tight HBase integration.
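Use Spark instead of MapReduce: often much faster (up to 100x is claimed for in-memory, iterative workloads; gains on disk-bound batch jobs are smaller), same HDFS/YARN, Python/Scala/SQL APIs.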
Use cloud object storage (S3, GCS) instead of HDFS: cheaper, no NameNode bottleneck, managed service.
Use Trino/Athena for interactive SQL: no ETL needed, query S3 directly. Athena is billed per query; Trino runs on a cluster you operate or rent.
Use Kafka + Flink for streaming: lower latency than Spark Streaming, exactly-once semantics.
Managed services: AWS EMR, Google Dataproc, Azure HDInsight — skip Hadoop cluster management.
Hadoop is mature but declining: new projects typically choose Spark + cloud storage + Airflow.
