Tuesday, September 25, 2018

Distributed Computing Systems

  • Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, and to perform computations at in-memory speed and at any scale.


Apache Flink is a distributed system and requires compute resources in order to execute applications. Flink integrates with all common cluster resource managers such as Hadoop YARN, Apache Mesos, and Kubernetes but can also be setup to run as a stand-alone cluster.

Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations. Ordered ingestion is not required to process bounded streams because a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.
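The batch-processing idea above can be sketched in a few lines of plain Python (not the Flink API): because the data set is bounded, we can ingest everything first, sort it, and only then compute, with no requirement that events arrive in order.

```python
# Toy batch job over a bounded stream: ingest all, sort, then aggregate.
# Event names and values are made up for illustration.
events = [("sensor-2", 3), ("sensor-1", 5), ("sensor-2", 4), ("sensor-1", 1)]

# Ordered ingestion is not required: a bounded data set can always be sorted
# after the fact.
events.sort(key=lambda e: e[0])

totals = {}
for key, value in events:
    totals[key] = totals.get(key, 0) + value

print(totals)  # {'sensor-1': 6, 'sensor-2': 7}
```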

Run Applications at any Scale
Flink is designed to run stateful streaming applications at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster.
Therefore, an application can leverage virtually unlimited amounts of CPUs, main memory, disk and network IO.
Moreover, Flink easily maintains very large application state. Its asynchronous and incremental checkpointing algorithm ensures minimal impact on processing latencies while guaranteeing exactly-once state consistency.
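As a toy illustration of the incremental-checkpointing idea: rather than copying the whole state on every checkpoint, persist only the keys that changed since the last one. This is a sketch in plain Python; Flink's actual algorithm is asynchronous and state-backend-specific, and all names here are made up.

```python
# Minimal incremental checkpointing sketch: track "dirty" keys and persist
# only the delta at each checkpoint.
state = {}
dirty = set()
checkpoints = []  # each entry is the delta persisted at one checkpoint

def update(key, value):
    state[key] = value
    dirty.add(key)

def checkpoint():
    delta = {k: state[k] for k in dirty}  # only keys changed since last time
    checkpoints.append(delta)
    dirty.clear()
    return delta

update("count", 1)
update("sum", 10)
checkpoint()            # first checkpoint persists both keys
update("count", 2)
print(checkpoint())     # second persists only the changed key: {'count': 2}
```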
A modern container is more than just an isolation mechanism: it also includes an image—the files that make up the application that runs inside the container.
Within Google, MPM (Midas Package Manager) is used to build and deploy container images.
Containerization transforms the data center from being machine-oriented to being application-oriented.

Containers encapsulate the application environment, abstracting away many details of machines and operating
systems from the application developer and the deployment infrastructure.

Because well-designed containers and container images are scoped to a single application, managing containers
means managing applications rather than machines. This shift of management APIs from machine-oriented to
application-oriented dramatically improves application deployment and introspection.

The original purpose of the cgroup, chroot, and namespace facilities in the kernel was to protect applications from
noisy, nosey, and messy neighbors. Combining these with container images created an abstraction that also isolates
applications from the (heterogeneous) operating systems on which they run.

This decoupling of image and OS makes it possible to provide the same deployment environment in
both development and production, which, in turn, improves deployment reliability and speeds up development by
reducing inconsistencies and friction.

https://flink.apache.org/flink-architecture.html


Hadoop vs Spark vs Flink – Big Data Frameworks Comparison
  • Apache Flink on DC/OS (and Apache Mesos)

For the uninitiated: Flink is a stateful stream processing framework that supports high-throughput, low-latency applications. Flink is lightweight and fault-tolerant, providing strict accuracy guarantees in case of failures with minimal impact on performance. Flink deployments cover a range of use cases; Alibaba, for example, uses Flink to optimize search results in real time.

Flink on DC/OS
In its Mesos user survey, Mesosphere found that 87% of new Mesos users are running DC/OS, and so Flink’s Mesos support wouldn’t be complete without DC/OS support, too. To run Flink on DC/OS, first install DC/OS from the official site.
Note that DC/OS includes Mesos, Marathon (a service that will supervise your applications and maintain their state in case of failures), and ZooKeeper, all pre-configured out of the box.

Flink on Mesos
First things first: the Mesos documentation will walk you through initial Mesos setup. Next you should also install Marathon, since you usually want your cluster to be highly available (HA). In order to run Marathon, you also need a ZooKeeper quorum running. We assume for the following that ZooKeeper is reachable under node:2181. Lastly, we recommend installing a distributed file system where Flink can store its checkpoints. We assume for the following that HDFS is installed and can be reached via hdfs://node/.
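The setup above maps onto a handful of Flink configuration entries. A hedged sketch of what the relevant flink-conf.yaml lines could look like, assuming the ZooKeeper quorum at node:2181 and HDFS at hdfs://node/ from the text; the Mesos host/port and directory paths are made up, and key names should be checked against the configuration reference for your Flink version:

```yaml
# Mesos master (hypothetical host/port; adjust to your cluster)
mesos.master: node:5050

# High availability via the ZooKeeper quorum mentioned above
high-availability: zookeeper
high-availability.zookeeper.quorum: node:2181
high-availability.storageDir: hdfs://node/flink/ha

# Store checkpoints on the distributed file system
state.checkpoints.dir: hdfs://node/flink/checkpoints
```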


https://mesosphere.com/blog/apache-flink-on-dcos-and-apache-mesos/


This repository is a sample setup to run an Apache Flink job in Kubernetes.

https://github.com/sanderploegsma/flink-k8s


Flink session cluster on Kubernetes

A Flink session cluster is executed as a long-running Kubernetes Deployment. Note that you can run multiple Flink jobs on a session cluster. Each job needs to be submitted to the cluster after the cluster has been deployed.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html
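As an illustration of the long-running Deployment idea, here is a minimal sketch of a JobManager Deployment for a session cluster, modeled loosely on the stock examples in the Flink Kubernetes docs; the image tag, labels, and ports are assumptions to adapt:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: flink:latest      # pin a specific version in practice
          args: ["jobmanager"]
          ports:
            - containerPort: 6123  # RPC
            - containerPort: 8081  # web UI
```

A matching TaskManager Deployment and a Service exposing these ports complete the session cluster; jobs are then submitted to the running cluster.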


Flink Helm Chart

https://github.com/docker-flink/examples

  • Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

http://samza.apache.org/


  • Choose between the Confluent Open Source and Confluent Enterprise editions. Both are built on the world’s most popular streaming platform, Apache Kafka.

Know and respond to every single event in your organization in real time.
Like a central nervous system, Confluent creates an Apache Kafka-based streaming platform to unite your organization around a single source of truth.
Confluent Platform adds administration, data management, operations tools and robust testing to your Kafka environment.
An architecture based on a highly scalable, distributed streaming platform like Confluent will grow quickly with your business and the expanding data pipelines that come with it, accelerating both performance and cost savings.
https://www.confluent.io/

  • Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
https://spark.apache.org

  • Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
https://spark.apache.org/

Apache Spark is an open-source cluster-computing framework.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
https://en.wikipedia.org/wiki/Apache_Spark
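The implicit-data-parallelism idea can be sketched with plain Python over partitions (this is not the Spark API; a real engine would distribute the partitions across the cluster and run them concurrently):

```python
# Map/reduce over partitions: the user supplies the per-element and combining
# logic; the engine decides how to split and distribute the work.
from functools import reduce

data = list(range(1, 11))
partitions = [data[0:5], data[5:10]]  # the engine would place these on workers

# Each partition is processed independently (in parallel on a real cluster).
partials = [sum(x * x for x in part) for part in partitions]

# Partial results are then combined into the final answer.
total = reduce(lambda a, b: a + b, partials)
print(total)  # 385 (sum of squares 1..10)
```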


Building Clustered Applications with Kubernetes and Docker

  • Implement batch and streaming data processing jobs that run on any execution engine.

https://beam.apache.org/

  • The Grizzly NIO framework

Writing scalable server applications in the Java™ programming language has always been difficult. Before the advent of the Java New I/O API (NIO), thread management issues made it impossible for a server to scale to thousands of users. The Grizzly NIO framework has been designed to help developers to take advantage of the Java™ NIO API. Grizzly’s goal is to help developers to build scalable and robust servers using NIO as well as offering extended framework components: Web Framework (HTTP/S), WebSocket, Comet
https://javaee.github.io/grizzly/
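The event-driven model that Java NIO (and frameworks like Grizzly) build on can be sketched with Python's stdlib selectors module: one selector multiplexes readiness events for many channels, so a single thread can serve thousands of connections instead of one thread per user. This is an illustrative sketch, not Grizzly itself.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

a, b = socket.socketpair()   # "a" stands in for an accepted client connection
a.setblocking(False)
b.setblocking(False)
sel.register(a, selectors.EVENT_READ)

b.send(b"ping")              # the peer writes; "a" becomes readable

# The event loop: wait until some registered channel is ready, then handle it.
for key, mask in sel.select(timeout=1):
    conn = key.fileobj       # the ready channel (here: "a")
    data = conn.recv(1024)
    conn.send(data.upper())  # echo back, transformed

reply = b.recv(1024)
print(reply)                 # b'PING'
```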


  • Netty is an asynchronous event-driven network application framework for rapid development of maintainable high-performance protocol servers and clients. Netty is an NIO client-server framework which enables quick and easy development of network applications such as protocol servers and clients. It greatly simplifies and streamlines network programming such as TCP and UDP socket servers.

https://netty.io/

  • How WebSockets Work

WebSockets provide a persistent connection between a client and server that both parties can use to start sending data at any time.
The client establishes a WebSocket connection through a process known as the WebSocket handshake. This process starts with the client sending a regular HTTP request to the server.
WebSocket URLs use the ws scheme. There is also wss for secure WebSocket connections which is the equivalent of HTTPS.
Now that the handshake is complete, the initial HTTP connection is replaced by a WebSocket connection that uses the same underlying TCP/IP connection.
At this point either party can start sending data.
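On the server side, the handshake boils down to one computation defined in RFC 6455: hash the client's Sec-WebSocket-Key together with a fixed GUID and send the result back as Sec-WebSocket-Accept. A minimal sketch:

```python
import base64
import hashlib

# Fixed GUID from RFC 6455; concatenated with the client's key and hashed so
# the server proves it actually understood the WebSocket upgrade request.
MAGIC_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    digest = hashlib.sha1((sec_websocket_key + MAGIC_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Sample key/accept pair from RFC 6455:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```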

With WebSockets you can transfer as much data as you like without incurring the overhead associated with traditional HTTP requests. Data is transferred through a WebSocket as messages, each of which consists of one or more frames containing the data you are sending (the payload). In order to ensure the message can be properly reconstructed when it reaches the client each frame is prefixed with 4-12 bytes of data about the payload. Using this frame-based messaging system helps to reduce the amount of non-payload data that is transferred, leading to significant reductions in latency.
having the ability to open bidirectional, low latency connections enables a whole new generation of real-time web applications.
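The frame layout described above can be sketched for the simplest case. Per RFC 6455, a small unmasked text frame carries only two header bytes (FIN plus opcode, then the payload length; client-to-server frames add a 4-byte masking key, and longer payloads an extended length field). A hedged sketch of the encoder, not a full protocol implementation:

```python
import os

def encode_text_frame(message: str, masked: bool = False) -> bytes:
    """Build a single-frame RFC 6455 text message (FIN=1, opcode=0x1)."""
    payload = message.encode("utf-8")
    header = bytearray([0x81])  # FIN bit set + text opcode
    n = len(payload)
    mask_bit = 0x80 if masked else 0
    if n < 126:
        header.append(n | mask_bit)
    elif n < 65536:
        header.append(126 | mask_bit)          # 16-bit extended length follows
        header += n.to_bytes(2, "big")
    else:
        header.append(127 | mask_bit)          # 64-bit extended length follows
        header += n.to_bytes(8, "big")
    if masked:
        mask = os.urandom(4)                   # clients must mask their frames
        header += mask
        payload = bytes(b ^ mask[i % 4] for i, b in enumerate(payload))
    return bytes(header) + payload

print(encode_text_frame("Hello"))  # b'\x81\x05Hello' (2-byte header + payload)
```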

https://blog.teamtreehouse.com/an-introduction-to-websockets

  • HTTP protocol is connection-less

only the client can request information from a server
HTTP is purely half-duplex
a server can answer a client request only once
Some websites or web applications require the server to update the client from time to time

There were a few ways to do so
the client requests the server at a regular interval to check if there is new information available (polling)
the client sends a request and the server answers as soon as it has information to provide to the client (also known as long polling)
those methods have many drawbacks due to HTTP limitations

So a new protocol has been designed: websockets
allows two-way (full-duplex) communication between a client and a server, over a single TCP connection
WebSockets re-use the HTTP connection they were initialized on
uses the standard TCP ports

How does WebSocket work?
The most important part is the “Connection: Upgrade” header, which lets the server know the client wants to switch to another protocol, whose name is provided by the “Upgrade: websocket” header
the TCP connection used for the HTTP request/response exchange is re-used for the WebSocket: whenever a peer wants to interact with the other peer, it can use it
The WebSocket ends as soon as one peer decides to close it or the TCP connection is closed
https://www.haproxy.com/blog/websockets-load-balancing-with-haproxy/

  • HTML5 WebSocket: A Quantum Leap in Scalability for the Web

defines a full-duplex communication channel that operates through a single socket over the Web
especially for real-time, event-driven web applications
how HTML5 Web Sockets can offer such an incredibly dramatic reduction of unnecessary network traffic and latency

Normally when a browser visits a web page, an HTTP request is sent to the web server that hosts that page. The web server acknowledges this request and sends back the response
If you want to get the most up-to-date "real-time" information, you can constantly refresh that page manually
Current attempts to provide real-time web applications largely revolve around polling and other server-side push technologies

HTML5 Web Sockets represents the next evolution of web communications—a full-duplex, bidirectional communications channel that operates through a single socket over the Web
HTML5 Web Sockets provides a true standard that you can use to build scalable, real-time web applications

To establish a WebSocket connection, the client and server upgrade from the HTTP protocol to the WebSocket protocol during their initial handshake
Once established, WebSocket data frames can be sent back and forth between the client and the server in full-duplex mode
Both text and binary frames can be sent full-duplex, in either direction at the same time.
The data is minimally framed with just two bytes.
In the case of text frames, each frame starts with a 0x00 byte, ends with a 0xFF byte, and contains UTF-8 data in between. WebSocket text frames use a terminator while binary frames use a length prefix.
it cannot deliver raw binary data to JavaScript because JavaScript does not support a byte type. Therefore, binary data is ignored if the client is JavaScript—but it can be delivered to other clients that support it.
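That early-draft text framing (a 0x00 start byte, UTF-8 payload, 0xFF terminator, i.e. the "just two bytes" of framing mentioned above) is simple enough to sketch directly. Note this is the historical framing described in this article; the final RFC 6455 protocol replaced it with length-prefixed, masked frames.

```python
def encode_hixie_text_frame(message: str) -> bytes:
    # Early-draft text frame: 0x00, UTF-8 payload, 0xFF terminator.
    return b"\x00" + message.encode("utf-8") + b"\xff"

def decode_hixie_text_frame(frame: bytes) -> str:
    assert frame[0] == 0x00 and frame[-1] == 0xFF, "not a draft-style text frame"
    return frame[1:-1].decode("utf-8")

frame = encode_hixie_text_frame("Hello")
print(frame)                           # b'\x00Hello\xff'
print(decode_hixie_text_frame(frame))  # Hello
```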

Today, only Google's Chrome browser supports HTML5 Web Sockets natively
To leverage the full power of HTML5 Web Sockets, Kaazing provides a ByteSocket library for binary communication and higher-level libraries for protocols like Stomp, AMQP, XMPP, IRC and more, built on top of WebSocket.

http://www.websocket.org/quantum.html
SIMPLEX, HALF-DUPLEX, FULL-DUPLEX

  • Who said REST can’t be realtime? Stop polling your REST APIs and have Kaazing dynamically deliver updates from them the moment they are available to any client, anywhere

A distributed cloud-based service ensures the highest reliability and performance.

kaazing.io seamlessly connects to a Kafka instance and subscribes to messages that need to be delivered over the web.
using Server Sent Events (SSE) it streams these messages to either browsers or mobile apps
SSE is a HTML5 standard
Advanced security capabilities include end-to-end encryption and access control (controlling which app users receive which data), while performance innovations ensure that users will receive the right data, at the right time, in the right form
https://kaazing.com/

  • Once established, a websocket connection does not have to send headers with its messages so we can expect the total data transfer per message to be less than an equivalent HTTP request

Although most often used in the context of HTTP, Representational State Transfer (REST) is an architectural design pattern and not a transport protocol. The HTTP protocol is just one implementation of the REST architecture.
When it comes to scalability, one advantage of a REST architecture is the statelessness which means that any server can handle any request and there is no need to synchronize any shared state other than the database
https://blog.feathersjs.com/http-vs-websockets-a-performance-comparison-da2533f13a77
  • CLOUD SCALABILITY: SCALE UP VS SCALE OUT

IT Managers run into scalability challenges on a regular basis. It is difficult to predict growth rates of applications, storage capacity usage, and bandwidth.
When a workload reaches its capacity limits, the question is: how is performance maintained while preserving the efficiency to scale?

Scale-up vs Scale-out
Infrastructure scalability handles the changing needs of an application by adding or removing resources to meet changing application demands as needed. In most cases, this is handled by scaling up (vertical scaling) and/or scaling out (horizontal scaling).

Scale-up or Vertical Scaling
Scale-up is done by adding more resources to an existing system to reach a desired state of performance. For example, a database or web server needs additional resources to continue performance at a certain level to meet SLAs. More compute, memory, storage, or network can be added to that system to keep the performance at desired levels. When this is done in the cloud, applications often get moved onto more powerful instances and may even migrate to a different host and retire the server they were on. These types of scale-up operations have been happening on-premises in datacenters for decades. However, the time it takes to procure additional resources to scale-up a given system could take weeks or months in a traditional on-premises environment, while scaling-up in the cloud can take only minutes.

Scale-out or Horizontal Scaling
Scale-out is usually associated with distributed architectures. There are two basic forms of scaling out:
  • Adding additional infrastructure capacity in pre-packaged blocks of infrastructure or nodes (i.e. hyper-converged)
  • Using a distributed service that can retrieve customer information but be independent of applications or services
Horizontal scaling makes it easy for service providers to offer “pay-as-you-grow” infrastructure and services. Hyper-converged infrastructure has become increasingly popular for use in private cloud and even tier 2 service providers. This approach is not quite as loosely coupled as other forms of distributed architectures but it does help IT managers that are used to traditional architectures make the transition to horizontal scaling and realize the associated cost benefits.
Loosely coupled distributed architecture allows for scaling of each part of the architecture independently. This means a group of software products can be created and deployed as independent pieces, even though they work together to manage a complete workflow. Each application is made up of a collection of abstracted services that can function and operate independently. This allows for horizontal scaling at the product level as well as the service level. Even more granular scaling capabilities can be delineated by SLA or customer type (e.g. bronze, silver, or gold) or even by API type if there are different levels of demand for certain APIs. This can promote an efficient use of scaling within a given infrastructure.
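The "pay-as-you-grow" arithmetic behind horizontal scaling can be sketched as a toy capacity calculation: rather than replacing one node with a bigger one, compute how many identical nodes the current load requires (all numbers here are made up):

```python
import math

def nodes_needed(load_rps: int, capacity_per_node_rps: int) -> int:
    """How many identical nodes a horizontally scaled service needs
    to serve load_rps requests per second; always at least one."""
    return max(1, math.ceil(load_rps / capacity_per_node_rps))

print(nodes_needed(900, 400))   # 3
print(nodes_needed(2500, 400))  # 7
```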

https://blog.turbonomic.com/blog/on-technology/cloud-scalability-scale-vs-scale
