Sunday, September 8, 2019

kubernetes


  • A recent performance benchmark completed by Intel and BlueData using the BigBench benchmarking kit has shown that the performance ratios for container-based Hadoop workloads on BlueData EPIC are equal to, and in some cases better than, bare-metal Hadoop.

Under the hood, BlueData used several enhancements to boost the I/O performance and scalability of container-based clusters.

container-based Spark cluster vs. bare-metal Spark cluster.
For instance, the scatter/gather pattern can be used to implement a MapReduce-like batch processing architecture on top of Kubernetes. Similarly, event-driven stream data processing is a lot easier to implement as microservices running on top of Kubernetes.
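As a rough illustration of the scatter/gather pattern mentioned above, here is a minimal Python sketch. It is only an assumption-laden toy: threads stand in for the worker pods a Kubernetes deployment would use, and the function names (`count_words`, `scatter_gather`) are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Worker step (the "map" side): count words in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def scatter_gather(lines, workers=3):
    """Scatter lines across workers, then gather and merge the partial counts."""
    # Scatter: split the input into one chunk per worker (round-robin).
    chunks = [lines[i::workers] for i in range(workers)]
    # Threads are a stand-in for worker pods in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Gather: merge the partial results into one answer (the "reduce" side).
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    print(scatter_gather(["a b a", "b c", "a c c"]))
```

On Kubernetes the root service would fan requests out to leaf pods over the network instead of calling a local function, but the split, process, merge shape stays the same.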

Pachyderm provides a distributed Pachyderm File System (PFS) and a data-aware scheduler, the Pachyderm Pipeline System (PPS), on top of Kubernetes.
Pachyderm uses the default Kubernetes scheduler to implement fault-tolerance and incremental processing.
In addition, for PFS, Pachyderm uses a copy-on-write paradigm inspired by Git.
Pachyderm applies version control to your data as it is processed, so processing jobs run only on the diff.
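The idea of running jobs only on the diff can be sketched in a few lines of Python. This is not Pachyderm's API or implementation; it is a hypothetical toy in which a "commit" is just a mapping from path to content hash, and all function names are assumptions made for illustration.

```python
import hashlib


def snapshot(files):
    """Map each path to a content hash, roughly like one commit's view of the data."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changed_paths(old_snapshot, new_snapshot):
    """Return paths that are new or whose content hash differs from the old snapshot."""
    return sorted(p for p, h in new_snapshot.items() if old_snapshot.get(p) != h)


def process_incrementally(files, old_snapshot, results, fn):
    """Run fn only on changed files, updating results in place.

    Returns the new snapshot so the caller can pass it to the next run.
    """
    new_snapshot = snapshot(files)
    for path in changed_paths(old_snapshot, new_snapshot):
        results[path] = fn(files[path])
    # Drop results for files that no longer exist in the new snapshot.
    for path in set(results) - set(new_snapshot):
        del results[path]
    return new_snapshot


if __name__ == "__main__":
    files = {"a.txt": b"hello", "b.txt": b"world"}
    results = {}
    snap = process_incrementally(files, {}, results, len)
    files["b.txt"] = b"world!"  # only this file is reprocessed on the next run
    process_incrementally(files, snap, results, len)
    print(results)
```

Unchanged files keep their previous results, which is where the savings come from when a large dataset receives small updates.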

Custom schedulers

Kubernetes custom schedulers specifically optimised for big data workloads
The Kubernetes scheduler is responsible for scheduling pods onto nodes. Kubernetes ships with a default scheduler which provides a range of scheduling features. To schedule pods onto nodes, the default scheduler considers several factors, including individual and collective resource requirements, quality of service requirements, hardware constraints, affinity or anti-affinity specifications, data locality, inter-workload interference and so on. Using the default scheduler's node affinity feature, you can ensure that certain pods are scheduled only on nodes with specialized hardware, such as GPU, memory-optimised or I/O-optimised nodes. Similarly, pod affinity features allow you to place pods relative to one another.
Kubernetes allows you to run multiple schedulers simultaneously.
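To make node affinity and multiple schedulers more concrete, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON. The field names (`schedulerName`, `nodeAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`) are standard Pod spec fields; the scheduler name, label key, label value, pod name and image are hypothetical placeholders chosen for this example.

```python
import json


def pod_manifest(name, image, scheduler_name, label_key, label_values):
    """Build a Pod manifest that selects a scheduler and requires a node label."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Which scheduler should place this pod; omit to use the default one.
            "schedulerName": scheduler_name,
            "affinity": {
                "nodeAffinity": {
                    # Hard requirement: only nodes whose label matches are eligible.
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": label_key,
                                        "operator": "In",
                                        "values": list(label_values),
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    manifest = pod_manifest(
        name="spark-worker",                    # hypothetical pod name
        image="example.com/spark:latest",       # hypothetical image
        scheduler_name="my-custom-scheduler",   # hypothetical second scheduler
        label_key="accelerator",                # hypothetical node label
        label_values=["gpu"],
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming a scheduler with that name is actually running in the cluster and some node carries the label.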

The performance of Spark with the native Kubernetes scheduler can be improved by running HDFS inside Kubernetes.
This enables HDFS data locality by discovering the mapping of Kubernetes containers to physical nodes to HDFS datanode daemons.
Basically, having HDFS in Kubernetes makes the scheduler data-aware.
It is possible to use YARN as a Kubernetes custom scheduler.
Heron is a real-time, distributed stream processing engine developed at Twitter.
It can be considered as a drop-in replacement for Apache Storm.
Just like Apache Storm, Heron has a concept of topology.
A topology is a directed acyclic graph (DAG) used to process streams of data and it can be stateless or stateful.
A Heron topology is essentially a set of pods that can be scheduled by Kubernetes.
The Heron scheduler converts the packing plan for a topology into pod definitions, which are then submitted to the Kubernetes scheduler via APIs.
A topology can be updated (scaled up or down based on load) without having to build a new JAR to submit to the cluster.
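The notion of a topology as a directed acyclic graph can be illustrated with a small Python sketch. This is not Heron's API: it is a toy that runs each component once over a batch, in dependency order, using the standard library's `graphlib` (Python 3.9+). The component names and functions are assumptions made for this example, and a real stream processor would process tuples continuously rather than in one pass.

```python
from collections import Counter
from graphlib import TopologicalSorter

# Each component maps to the set of upstream components it reads from.
TOPOLOGY = {
    "sentences": set(),        # source of data
    "split": {"sentences"},    # splits sentences into words
    "count": {"split"},        # counts words
}


def emit_sentences(_inputs):
    """Source component: ignores inputs and emits a fixed batch."""
    return ["the cat", "the dog"]


def split_words(inputs):
    """Split every incoming sentence into words."""
    return [word for sentence in inputs for word in sentence.split()]


def count_words(inputs):
    """Count occurrences of each incoming word."""
    return Counter(inputs)


COMPONENTS = {"sentences": emit_sentences, "split": split_words, "count": count_words}


def run(topology, components):
    """Run each component after all of its upstream components have run."""
    outputs = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(topology).static_order():
        upstream = [item for dep in sorted(topology[name]) for item in outputs[dep]]
        outputs[name] = components[name](upstream)
    return outputs


if __name__ == "__main__":
    print(run(TOPOLOGY, COMPONENTS)["count"])
```

The acyclic requirement is what makes a valid execution order exist; a cycle in the graph is rejected before any component runs.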

Storage provisioning
Storage options have been another big roadblock in porting data workloads to Kubernetes, particularly for stateful data workloads like ZooKeeper, Cassandra, etc.
These new Kubernetes storage options have enabled us to deploy more fault-tolerant stateful data workloads on Kubernetes without the risk of data loss. For instance, by leveraging PersistentVolumes, a custom Cassandra Seed Provider, and StatefulSets, we can provide a resilient installation of Cassandra.
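As a hedged sketch of how StatefulSets and PersistentVolumes fit together, the Python snippet below builds a StatefulSet manifest whose `volumeClaimTemplates` request one persistent volume per replica. The resource names, image tag, mount path, replica count and storage size are assumptions for illustration, and the custom seed provider mentioned above is not shown.

```python
import json


def cassandra_statefulset(replicas=3, storage="10Gi"):
    """Build a StatefulSet manifest with one persistent volume claim per replica."""
    labels = {"app": "cassandra"}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "cassandra"},
        "spec": {
            # Assumes a headless Service named "cassandra" exists for stable DNS names.
            "serviceName": "cassandra",
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "cassandra",
                            "image": "cassandra:3.11",  # assumed image tag
                            "volumeMounts": [
                                # Assumed data directory inside the container.
                                {"name": "data", "mountPath": "/var/lib/cassandra"}
                            ],
                        }
                    ]
                },
            },
            # Each replica gets its own claim, which outlives pod restarts.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(cassandra_statefulset(), indent=2))
```

Because the claim is tied to the replica's stable identity rather than to a particular pod instance, a rescheduled pod reattaches to the same data.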
https://www.abhishek-tiwari.com/kubernetes-for-big-data-workloads/
  • What is a container?

It’s similar to a virtual machine (VM), but it avoids a great deal of the trouble because it virtualizes the operating system (OS) rather than the underlying hardware.
This enables engineers to quickly develop applications that will run consistently across a large number of machines and software environments.
What is Docker?
The Docker Container Platform is an excellent tool for building and deploying containerized applications.
The platform helps developers easily isolate software into containers as they create it. It’s also an effective way to prepare existing applications for the cloud.
What is Kubernetes?
Kubernetes is an open source solution for container orchestration.

Kubernetes and Docker: Finding your best container solution
While Docker does have its own container orchestration solution called Docker Swarm, Kubernetes and Docker mostly solve different problems and thus can coexist. Later versions of Docker even have built-in integration with Kubernetes.
https://www.ibm.com/blogs/cloud-computing/2018/07/30/kubernetes-docker-vs/