Virtualizing a Hadoop Cluster (two videos)

January 28, 2015

Tom Phelan, Chief Architect at BlueData, talks about when it is appropriate to virtualize Hadoop, either in containers or in virtual machines. In evaluating those situations, he explains which questions you should and should not be asking.


  • What BlueData has learned about running Hadoop jobs
  • Under what situations should one virtualize a Hadoop cluster
  • The shape and components of a physical cluster
    • Both master and worker controllers contain:
      • Disks, server, NameNode, and ResourceManager
  • Types of virtualization:
    • public cloud
    • private cloud / hypervisor (strong fault isolation)
    • private cloud / containers (weak fault isolation)
    • paravirtualization
  • The appropriate infrastructure for various situations
    • Questions NOT to ask
    • Questions to ask
  • Performance and Data Locality
  • Five use-cases and their infrastructure needs

Pt 2: Orchestration with Docker

Joel Baxter, also at BlueData, leads a breakout session on what an ideal orchestration manager for Hadoop clusters and their associated data would look like. The state of the art is evolving, but it is not there yet.


  • Hadoop cluster orchestration with Docker
  • Wishlist for an ideal orchestration manager
    • Add/remove a host from a list of peers
    • Track resource consumption/availability per host
    • Library of pre-prepared application images
    • Multi-tenant protection
      • Performance isolation and data security
      • Containers are less secure than virtual machines
      • Quotas and priority
    • Matching containers with physical hosts
    • Spreading containers across fault-zones
    • Packing containers in contiguous memory
  • Migrations vs stateless servers
  • Circumventing Docker and doing IP address allocation
  • DNS settings for bidirectional Hadoop communication
  • Wishlist for storage
    • Docker volumes
    • Volumes for fragments of HDFS are hard to reconstruct on container resurrection
    • Ship your data to another database, or paravirtualize the storage
  • Solutions?
    • There is not yet a turn-key orchestration system that solves all the items in the wishlists
    • The area is rapidly evolving
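
To make the placement items in the orchestration wishlist above more concrete (adding and removing hosts, tracking resources per host, matching containers with physical hosts, and spreading containers across fault zones), here is a minimal Python sketch. It is not taken from the talk and does not use any Docker or BlueData API: the `Host`, `ContainerRequest`, and `Placer` names, the resource fields, and the greedy tie-breaking rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ContainerRequest:
    """Hypothetical request for one container of a Hadoop cluster."""
    name: str
    cluster: str
    cpus: int
    mem_gb: int


@dataclass
class Host:
    """Hypothetical physical host with its remaining capacity."""
    name: str
    fault_zone: str
    free_cpus: int
    free_mem_gb: int
    containers: List[ContainerRequest] = field(default_factory=list)


class Placer:
    """Toy placement manager: tracks hosts and places containers greedily."""

    def __init__(self) -> None:
        self.hosts: Dict[str, Host] = {}

    def add_host(self, host: Host) -> None:
        self.hosts[host.name] = host

    def remove_host(self, name: str) -> Optional[Host]:
        return self.hosts.pop(name, None)

    def _zone_load(self, zone: str, cluster: str) -> int:
        # Number of containers of this cluster already running in the zone.
        return sum(
            1
            for h in self.hosts.values()
            if h.fault_zone == zone
            for c in h.containers
            if c.cluster == cluster
        )

    def place(self, req: ContainerRequest) -> Optional[Host]:
        # Keep only hosts with enough free CPU and memory.
        candidates = [
            h for h in self.hosts.values()
            if h.free_cpus >= req.cpus and h.free_mem_gb >= req.mem_gb
        ]
        if not candidates:
            return None
        # Prefer the fault zone holding the fewest containers of this
        # cluster, then the host with the most free memory, then by name.
        best = min(
            candidates,
            key=lambda h: (
                self._zone_load(h.fault_zone, req.cluster),
                -h.free_mem_gb,
                h.name,
            ),
        )
        best.free_cpus -= req.cpus
        best.free_mem_gb -= req.mem_gb
        best.containers.append(req)
        return best


# Example: three workers of one cluster over two fault zones.
placer = Placer()
placer.add_host(Host("a1", "zone-a", free_cpus=8, free_mem_gb=32))
placer.add_host(Host("a2", "zone-a", free_cpus=8, free_mem_gb=64))
placer.add_host(Host("b1", "zone-b", free_cpus=8, free_mem_gb=16))

placed = [
    placer.place(ContainerRequest(f"worker-{i}", "hadoop", cpus=2, mem_gb=8))
    for i in range(3)
]
for host in placed:
    print(host.name if host else "unplaced")
```

A real orchestration manager would also have to cover the other wishlist items, such as multi-tenant quotas, IP address allocation, and storage volumes, which this sketch leaves out.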