HDFS & MapReduce
Hadoop: HDFS & MapReduce Apache Hadoop is an open-source framework for distributed storage (HDFS) and processing (MapReduce/YARN) of large datasets across clust…
Hadoop: HDFS & MapReduce
Apache Hadoop is an open-source framework for distributed storage (HDFS) and processing (MapReduce/YARN) of large datasets across clusters of commodity hardware. It underpins the "big data" ecosystem.
HDFS Architecture
HDFS (Hadoop Distributed File System):
NameNode (master)
- Stores filesystem metadata (directory tree, file→block mapping)
- Does NOT store actual data
- Single point of failure → use NameNode HA or Quorum Journal Manager
- Keeps metadata in memory (fast) + written to disk (EditLog + FsImage)
DataNodes (workers, many)
- Store actual data blocks (default 128 MB per block)
- Send heartbeats to NameNode every 3 seconds
- Block reports every 6 hours (list of all blocks they hold)
Block replication (default: 3 copies):
- Replica 1: same rack as writer
- Replica 2: different rack
- Replica 3: same rack as replica 2
- Tolerates: 1 rack failure or 2 DataNode failures
Secondary NameNode (NOT a backup!)
- Periodically merges EditLog into FsImage (checkpointing)
- Reduces NameNode restart time
- Use NameNode HA + Standby NameNode for real HAHDFS Commands
# HDFS file operations (hdfs dfs or hadoop fs — equivalent)
hdfs dfs -ls /user/hadoop/
hdfs dfs -ls -h /data/ # human-readable sizes
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put localfile.txt /user/hadoop/input/
hdfs dfs -put -f localfile.txt /hdfs/path/ # force overwrite
hdfs dfs -get /user/hadoop/output/part-00000 ./local-output.txt
hdfs dfs -cat /user/hadoop/input/data.csv
hdfs dfs -head /user/hadoop/input/data.csv # first 1KB
hdfs dfs -tail /user/hadoop/output/result.txt
hdfs dfs -cp /source/file.txt /dest/
hdfs dfs -mv /old/path /new/path
hdfs dfs -rm /user/hadoop/file.txt
hdfs dfs -rm -r /user/hadoop/directory/ # recursive
hdfs dfs -rm -r -skipTrash /path/ # skip trash (permanent)
# Check filesystem health
hdfs dfsadmin -report # cluster status
hdfs dfsadmin -safemode get # check safe mode
hdfs fsck / # filesystem check
hdfs fsck /path/to/file -files -blocks -locations
# Disk usage
hdfs dfs -du -h /user/hadoop/
hdfs dfs -du -s -h /data/ # summary only
hdfs dfs -count /user/hadoop/ # dir count, file count, size
# Permissions (POSIX-style)
hdfs dfs -chmod 755 /user/hadoop/output/
hdfs dfs -chown hadoop:hadoop /user/hadoop/
hdfs dfs -chgrp datagrp /shared/MapReduce
MapReduce is a programming model for processing large datasets in parallel. A job consists of a Map phase (process each record independently) and a Reduce phase (aggregate results).
// WordCount example (classic MapReduce)
public class WordCount {
// Mapper: input line → (word, 1) pairs
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one); // emit (word, 1)
}
}
}
// Reducer: (word, [1,1,1,...]) → (word, count)
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class); // local reducer on mapper node
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}# Run MapReduce job
hadoop jar wordcount.jar WordCount /input /output
# Monitor via ResourceManager web UI: http://namenode:8088
# Job history
mapred job -history /output/_logs/history/
yarn logs -applicationId application_1234567890_0001