Introduction to Data Storage and Processing
Installing the Hadoop Distributed File System (HDFS)
• Defining key design assumptions and architecture
• Configuring and setting up the file system
• Issuing commands from the console
• Reading and writing files
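For example, files can be moved in and out from the console (hdfs dfs -put and hdfs dfs -cat) or programmatically. Below is a minimal sketch of the Java FileSystem API, assuming a running cluster and a hypothetical /user/demo path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/hello.txt"); // hypothetical path

            // Write a small file, overwriting it if it already exists
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello, HDFS");
            }

            // Read it back
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }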
Setting the stage for MapReduce
• Reviewing the MapReduce approach
• Introducing the computing daemons
• Dissecting a MapReduce job
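To make the anatomy of a job concrete, here is a minimal word-count sketch (class names are placeholders; the driver that configures and submits the job is omitted). The map phase emits a (word, 1) pair per token, the framework shuffles and sorts by key, and the reduce phase sums the counts per word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE); // emit (word, 1)
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // emit (word, total)
            }
        }
    }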
Defining Hadoop Cluster Requirements
Planning the architecture
• Selecting appropriate hardware
• Designing a scalable cluster
Building the cluster
• Installing Hadoop daemons (see the startup sketch after this list)
• Optimizing the network architecture
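Once the hardware and network are in place, the daemons are started per role. A sketch using the Hadoop 3 command syntax (earlier releases use hadoop-daemon.sh and yarn-daemon.sh instead):

    hdfs --daemon start namenode          # on the master node
    hdfs --daemon start datanode          # on each worker node
    yarn --daemon start resourcemanager   # on the resource master
    yarn --daemon start nodemanager       # on each worker node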
Configuring a Cluster
Preparing HDFS
• Setting basic configuration parameters
• Configuring block allocation, redundancy, and replication
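As an illustration, the replication factor and block size live in hdfs-site.xml; the values below are the common defaults, not recommendations:

    <!-- hdfs-site.xml (illustrative values) -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>           <!-- copies kept of each block -->
      </property>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>   <!-- 128 MB blocks -->
      </property>
    </configuration>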
Deploying MapReduce
• Installing and setting up the MapReduce environment
• Delivering redundant load balancing via Rack Awareness
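Rack awareness is enabled by giving Hadoop a topology mapping. One common approach is to point core-site.xml at a script (the path here is hypothetical) that maps each host or IP address to a rack path such as /rack1:

    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

With the mapping in place, HDFS spreads replicas across racks, so losing an entire rack does not lose every copy of a block, and the scheduler can prefer rack-local work when node-local slots are unavailable.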
Maximizing HDFS Robustness
Creating a fault-tolerant file system
Leveraging NameNode Federation
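In a federated deployment, several independent NameNodes each manage a slice of the namespace. A sketch of the relevant hdfs-site.xml entries (nameservice IDs and hosts are hypothetical):

    <property>
      <name>dfs.nameservices</name>
      <value>ns1,ns2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns1</name>
      <value>nn1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns2</name>
      <value>nn2.example.com:8020</value>
    </property>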
Introducing YARN
Managing Resources and Cluster Health
Allocating resources
Maintaining HDFS
Administering MapReduce
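Routine health and resource checks can be run from the console; for example:

    hdfs dfsadmin -report   # capacity and live/dead DataNodes
    hdfs fsck /             # namespace health and block integrity
    yarn node -list         # NodeManagers known to the ResourceManager
    mapred job -list        # currently running MapReduce jobs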
Extending Hadoop
Simplifying information access
• Enabling SQL-like querying with Hive (see the query sketch after this list)
• Installing Pig to create MapReduce jobs
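As an illustration (the table name and HDFS location are hypothetical), Hive projects a schema onto files already in HDFS and queries them with SQL; a Pig Latin script expressing the same load/group/count pipeline would compile to equivalent MapReduce jobs:

    CREATE EXTERNAL TABLE page_views (ip STRING, url STRING, ts BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views';

    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url;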
Integrating additional elements of the ecosystem
• Imposing a tabular view on HDFS with HBase (see the client sketch after this list)
• Configuring Oozie to schedule workflows
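A minimal sketch of the HBase Java client (the table, column family, and row key are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row "u1", family "info", qualifier "name"
                Put put = new Put(Bytes.toBytes("u1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Ada"));
                table.put(put);

                // Read the cell back
                Result r = table.get(new Get(Bytes.toBytes("u1")));
                System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }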
Implementing Data Ingress and Egress
Facilitating generic input/output
• Moving bulk data into and out of Hadoop
• Transmitting HDFS data over HTTP with WebHDFS
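WebHDFS exposes the same file operations over a REST interface. A sketch with placeholder host and path (9870 is the usual Hadoop 3 NameNode HTTP port; Hadoop 2 used 50070):

    # Read a file; -L follows the redirect to the DataNode serving the data
    curl -L "http://namenode.example.com:9870/webhdfs/v1/user/demo/hello.txt?op=OPEN"

    # List a directory
    curl "http://namenode.example.com:9870/webhdfs/v1/user/demo?op=LISTSTATUS"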
Acquiring application-specific data
• Collecting multi-sourced log files with Flume
• Importing and exporting relational information with Sqoop
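For example, a Sqoop import (the JDBC URL, credentials, and table are hypothetical) pulls a relational table into HDFS, and sqoop export reverses the direction; Flume instead runs agents configured as source -> channel -> sink chains that stream log events into HDFS continuously:

    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username analyst -P \
      --table orders \
      --target-dir /data/orders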