Gemini

Relational Databases

Structure: Data is organized into tables with rows and columns.

Relationships: Tables are linked through relationships (primary and foreign keys) to establish connections between data.

Data Integrity: Strong data integrity through ACID properties (Atomicity, Consistency, Isolation, Durability).

Query Language: SQL (Structured Query Language) is used for data manipulation and retrieval.

Examples: MySQL, PostgreSQL, Oracle, SQL Serve

Non-Relational Databases (NoSQL)

Structure: Data is stored in various models like key-value, document, columnar, or graph.

Flexibility: Highly flexible schema, accommodating diverse data types.

Scalability: Designed for handling large volumes of data and high traffic.

Performance: Often optimized for specific data access patterns.

Examples: MongoDB, Cassandra, Redis, Neo4

When to Use Which

Relational Databases:

Complex relationships between data

Strong data consistency requirements

ACID compliance is essential

Predictable data access patterns

Non-Relational Databases:

Large volumes of unstructured or semi-structured data

High write throughput and scalability

Rapid development needs

Flexible data models

In conclusion, the choice between relational and non-relational databases depends on the specific requirements of the application. Many modern applications use a combination of both to handle diverse data needs effectively

ChatGPT

1. Data Model

Relational Databases:
- Structure: Data is organized into tables (relations) with rows and columns. Each table represents an entity, and columns represent attributes of that entity.
- Schema: A predefined schema defines the structure of the tables, including data types and constraints. The schema must be established before data is inserted.
Non-Relational Databases:

Structure: Data is stored in various formats, including key-value pairs, documents, wide-columns, or graphs, depending on the type of non-relational database.
Schema: Schemas are flexible or schema-less. Data can be stored without a fixed schema, allowing for dynamic and evolving data structures.

2. Query Language

Relational Databases:
- Language: Use SQL (Structured Query Language) for querying and managing data. SQL provides a powerful and standardized way to perform operations like querying, updating, and joining data.
Non-Relational Databases:

Language: Use various query languages or APIs depending on the database type. For example, MongoDB uses BSON (Binary JSON) queries, while key-value stores may use simple key-based retrieval

3. Scalability

Relational Databases:

Scalability: Typically scale vertically by adding more resources (CPU, RAM) to a single server. Horizontal scaling (scaling out across multiple servers) can be challenging due to the rigid schema and complex joins.

Non-Relational Databases:

Scalability: Designed to scale horizontally by distributing data across multiple servers or nodes. This makes them well-suited for handling large volumes of data and high traffic loads.

4. Data Integrity and Transactions

Relational Databases:
- ACID Compliance: Generally adhere to ACID (Atomicity, Consistency, Isolation, Durability) principles to ensure data integrity and reliable transactions.
- Transactions: Support complex transactions involving multiple operations, ensuring that all operations within a transaction are completed successfully.
Non-Relational Databases:
- Eventual Consistency: Often use eventual consistency models rather than strict ACID compliance. They prioritize availability and partition tolerance, accepting that data may be temporarily inconsistent.
- Transactions: May have limited support for multi-document transactions, focusing instead on single-document operations or simple transactions.

5. Use Cases

Relational Databases:
- Use Cases: Well-suited for applications requiring complex queries, relationships between data, and strong data integrity. Examples include financial systems, CRM systems, and traditional enterprise applications.
Non-Relational Databases:
- Use Cases: Ideal for applications with rapidly changing data, high volumes of unstructured data, or requiring flexible schemas. Examples include content management systems, social media platforms, and real-time analytics.

6. Examples

Relational Databases:

Examples: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server.

Non-Relational Databases:

Examples: MongoDB (document-based), Cassandra (wide-column store), Redis (key-value store), Neo4j (graph database).

Summary

Relational Databases use a structured schema and SQL for managing data, excelling in scenarios requiring complex queries, strong data integrity, and consistent transactions. They typically scale vertically and are suited for traditional, well-defined data relationships.
Non-Relational Databases offer flexibility in schema design and scalability through horizontal distribution. They are designed for handling large volumes of unstructured or semi-structured data and are ideal for dynamic and rapidly changing applications where high availability is a priority.

Both types of databases have their unique advantages and are chosen based on specific application requirements and use cases.

NoSQL

some of the

as:

easy

http://nosql-database.org

What is NoSQL Database?

in order

SQL-like query language to be used

In short, NoSQL

particularly

is compensated

As such, NoSQL

retrieve

are categorized

, in general,

encodings

Encodings

as well as

This kind of database is designed

whose relations are well represented

kind of

The data could be stored

datatype

Database Transactions: ACID

–

is hidden

concurrentlyA

persistentOnce

(

)

transaction!the

sane

the resource is returned

Transaction_Management_and_Automatic_Versioning

ACID (Atomicity, Consistency, Isolation, Durability)

are processed

the context of

a single logical operation on the data is called

the database state is left

be obtained

were executed

a transaction has been committed

in the event

What is ACID in Database world?

Only valid data is written

Transactions committed to database will never be lost

Atomicity:

either

one of the users is required

the other user transaction is completed

If B is required

A’s transaction is completed

is completed

is updated

A’s transaction is logged

the 2 out of 3 is really 1 out of 2: It's really just A vs C!

is achieved

Paritioning

What is the CAP Theorem?

result in

and

The server is not allowed

the network will be allowed

SQL vs. Hadoop: Acid vs. Base

Big Data and NoSQL Databases Tutorial(BaSE)- Part 3

Big Data and NoSQL Databases Tutorial-(CAP theorem)- Part 5

ACID versus BASE for database transactions

the entire transaction is rolled

etc

book store

in the process of

to eventually be

Myth: Eric Brewer On Why Banks Are BASE Not ACID - Availability Is Revenue

ATMs are designed

is chosen

down

is written

down

is reconciled

an ATM is disconnected

be made

NoSQL provides an alternative to ACID called BASE.

BASE means:

Basic Availability

Soft state

Eventual consistency

https://www.wisdomjobs.com/e-university/nosql-interview-questions.html

What is Basic Availability, Soft State and Eventual Consistency (BASE)

1. ACID properties are to RDBMS as BASE properties are to NoSQL systems. BASE refers to Basic availability, Soft state and Eventual consistency. Basic availability implies continuous system availability despite network failures and tolerance to temporary inconsistency. Eventual consistency means that if no further updates are made to a given updated database item for long enough period of time , all users will see the same value for the updated item. Soft state refers to state change without input which is required for eventual consistency

https://www.igi-global.com/dictionary/database-systems-for-big-data-storage-and-retrieval/62445

In the past, when we wanted to store more data or increase our processing power, the common option was to scale vertically (get more powerful machines) or further optimize the existing code base. However, with the advances in parallel processing and distributed systems, it is more common to expand horizontally, or have more machines to do the same task in parallel

to effectively pick

partition

are sufficiently replicated

If you ever worked with any NoSQL database, you must have heard about CAP theorem

)

some

back

system

is misunderstood

is driven

be met

NoSQL DEFINITION: Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

as:

easy

nosql

sql

should be seen as

Misc

C++,Concurrency

Misc

GridFS

Misc

CouchDB

MapReduceR

Funcs

Tuple

stronger then

Bigdata

very

Misc

dynamic

sharding

Misc

Multimodel

http://nosql-database.org

What is Apache Spark?

A n00bs guide to Apache Spark

Spark Tutorial: Real Time Cluster Computing Framework

Hadoop vs Spark – Choosing the Right Big Data Software

Spark vs. Hadoop – Who Wins?

Apache Spark vs Hadoop MapReduce – Feature Wise Comparison [Infographic]

Hadoop and Spark: Friends or Foes?

Chapter 4. In-Memory Computing with Spark

park for Big Data Analytics [Part 1]

Presto is a high performance, distributed SQL query engine for big data.

https://prestosql.io/

Spark SQL is Apache Spark's module for working with structured data.

https://spark.apache.org/sql/

Differences Between to Spark SQL vs Presto

Presto in simple terms is ‘SQL Query Engine’, initially developed for Apache Hadoop. It’s an open source distributed SQL query engine designed for running interactive analytic queries against data sets of all sizes.

Spark SQL is a distributed in-memory computation engine with a SQL layer on top of structured and semi-structured data sets. Since its in-memory processing, the processing will be fast in Spark SQL

https://www.educba.com/spark-sql-vs-presto/

What is Database Mirroring?

can be used to design high-availability and high-performance solutions for database redundancy.

In database mirroring, transaction log records are sent directly from the principal database to the mirror database.

This helps to keep the mirror database up to date with the principal database, with no loss of committed data.

If the principal server fails, the mirror server automatically becomes the new principal server and recovers the principal database using a witness server under high-availability mode.

Principal database is the active live database that supports all the commands, Mirror is the hot standby and witness which allows for a quorum in case of automatic switch over

How does Database Mirroring works?

In database mirroring, the transaction log records for a database are directly transferred from one server to another, thereby maintaining a hot standby server. As the principal server writes the database’s log buffer to disk, it simultaneously sends that block of log records to the mirror instance. The mirror server continuously applies the log records to its copy of the database. Mirroring is implemented on a per-database basis, and the scope of protection that it provides is restricted to a single-user database. Database mirroring works only with databases that use the full recovery model.

What is a Principal server?

Principal server is the server which serves the databases requests to the Application.

What is a Mirror?

This is the Hot standby server which has a copy of the database.

What is a Witness Server?

This is an optional server. If the Principal server goes down then Witness server controls the fail over process.

What is Synchronous and Asynchronous mode of Database Mirroring?

In synchronous mode, committed transactions are guaranteed to be recorded on the mirror server. Should a failure occur on the primary server, no committed transactions are lost when the mirror server takes over. Using synchronous mode provides transaction safety because the operational servers are in a synchronized state, and changes sent to the mirror must be acknowledged before the primary can proceed

In asynchronous mode, committed transactions are not guaranteed to be recorded on the mirror server. In this mode, the primary server sends transaction log pages to the mirror when a transaction is committed. It does not wait for an acknowledgement from the mirror before replying to the application that the COMMIT has completed. Should a failure occur on the primary server, it is possible that some committed transactions may be lost when the mirror server takes over.

What are the benefits of that Database Mirroring?

Database mirroring architecture is more robust and efficient than Database Log Shipping.

It can be configured to replicate the changes synchronously to minimized data loss.

It has automatic server failover mechanism.

Configuration is simpler than log shipping and replication, and has built-in network encryption support (AES algorithm)

Because propagation can be done asynchronously, it requires less bandwidth than synchronous method (e.g. host-based replication, clustering) and is not limited by geographical distance with current technology.

Database mirroring supports full-text catalogs.

Does not require special hardware (such as shared storage, heart-beat connection) and cluster ware, thus potentially has lower infrastructure cost

What are the Disadvantages of Database Mirroring?

Potential data lost is possible in asynchronous operation mode.

It only works at database level and not at server level.It only propagates changes at database level, no server level objects, such as logins and fixed server role membership, can be propagated.

Automatic server failover may not be suitable for application using multiple databases.

What are the Restrictions for Database Mirroring?

A mirrored database cannot be renamed during a database mirroring session.

Only user databases can be mirrored. You cannot mirror the master, msdb, tempdb, or model databases.

Database mirroring does not support FILESTREAM. A FILESTREAM filegroup cannot be created on the principal server. Database mirroring cannot be configured for a database that contains FILESTREAM filegroups.

On a 32-bit system, database mirroring can support a maximum of about 10 databases per server instance.

Database mirroring is not supported with either cross-database transactions or distributed transactions.

What are the minimum requirements for Database Mirroring?

Database base recovery model should be full

Database name should be same on both SQL Server instances

Server should be in the same domain name

Mirror database should be initialized with principle server

What are the operating modes of Database Mirroring?

High Availability Mode

High Protection Mode

High Performance Mode

What is High Availability operating mode?

It consists of the Principal, Witness and Mirror in synchronous communication.If Principal is lost Mirror can automatically take over.If the network does not have the bandwidth, a bottleneck could form causing performance issue in the Principal.In this mode SQL server ensures that each transaction that is committed on the Principal is also committed in the Mirror prior to continuing with next transactional operation in the principal.

What is High Protection operating mode?

What is High Performance operating mode?

It consists of only the Principal and the Mirror in asynchronous communication. Since the safety is OFF, automatic failover is not possible, because of possible data loss; therefore, a witness server is not recommended to be configured for this scenario. Manual failover is not enabled for the High Performance mode. The only type of failover allowed is forced service failover, which is also a manual operation.

What are Recovery models support Database Mirroring?

Database Mirroring is supported with Full Recovery model.

What is Log Hardening?

Log hardening is the process of writing the log buffer to the transaction log on disk, a process called.

What is Role-switching?

Inter changing of roles like principal and mirror are called role switching.

https://www.dbamantra.com/sql-server-dba-interview-questions-answers-database-mirroring-1/

What is Database Mirroring?

Database mirroring is a solution for increasing the availability of a SQL Server database. Mirroring is implemented on a per-database basis and works only with databases that use the full recovery model.

Database mirroring is the creation and maintenance of redundant copies of a database. The purpose is to ensure continuous data availability and minimize or avoid downtime that might otherwise result from data corruption or loss, or from a situation when the operation of a network is partially compromised.

https://www.dbrnd.com/2016/05/sql-server-database-mirroring-interview-questions-and-answers-day-1/

LOG SHIPPING VS. MIRRORING VS. REPLICATION

Log Shipping:

It automatically sends transaction log backups from one database (Known as the primary database) to a database (Known as the Secondary database) on another server. An optional third server, known as the monitor server, records the history and status of backup and restore operations. The monitor server can raise alerts if these operations fail to occur as scheduled.

Mirroring::

Database mirroring is a primarily software solution for increasing database availability.

It maintains two copies of a single database that must reside on different server instances of SQL Server Database Engine.

Replication::

It is a set of technologies for copying and distributing data and database objects from one database to another and then synchronizing between databases to maintain consistency. Using replication, you can distribute data to different locations and to remote or mobile users over local and wide area networks, dial-up connections, wireless connections, and the Internet.

https://bhanusqldba-hub.blogspot.com/2014/09/log-shipping-vs-mirroring-vs-replication.html

Master-Slave Replication in MongoDB

the master-slave replication was used for failover, backup, and read scaling.

replaced by replica sets

Replica Set in MongoDB

A replica set consists of a group of mongod (read as Mongo D) instances that host the same data set.

In a replica set, the primary mongod receives all write operations and the secondary mongod replicates the operations from the primary and thus both have the same data set.

The primary node receives write operations from clients.

A replica set can have only one primary and therefore only one member of the replica set can receive write operations.

A replica set provides strict consistency for all read operations from the primary.

The primary logs any changes or updates to its data sets in its oplog(read as op log).

The secondaries also replicate the oplog of the primary and apply all the operations to their data sets.

When the primary becomes unavailable, the replica set nominates a secondary as the primary.

By default, clients read data from the primary.

Automatic Failover in MongoDB

When the primary node of a replica set stops communicating with other members for more than 10 seconds or fails, the replica set selects another member as the new primary.

The selection of the new primary happens through an election process and whichever secondary node gets majority of the votes becomes the primary.

you may deploy a replica set in multiple data centers or manipulate the primary election by adjusting the priority of members

replica sets support dedicated members for functions, such as reporting, disaster recovery, or backup.

In addition to the primary and secondaries, a replica set can also have an arbiter.

Unlike secondaries, arbiters do not replicate or store data.

arbiters play a crucial role in selecting a secondary to take the place of the primary when the primary becomes unavailable.

What is Sharding in MongoDB?

Sharding in MongoDB is the process of distributing data across multiple servers for storage.

MongoDB uses sharding to manage massive data growth.

With an increase in the data size, a single machine may not be able to store data or provide an acceptable read and write throughput

MongoDB sharding supports horizontal scaling and thus is capable of distributing data across multiple machines.

Sharding in MongoDB allows you to add more servers to your database to support data growth and automatically balances data and load across various servers.

MongoDB sharding provides additional write capacity by distributing the write load over a number of mongod instances.

It splits the data set and distributes them across multiple databases, or shards.

Each shard serves as an independent database, and together, shards make a single logical database.

MongoDB sharding reduces the number of operations each shard handles and as a cluster grows, each shard handles fewer operations and stores lesser data. As a result, a cluster can increase its capacity and input horizontally.

For example, to insert data into a particular record, the application needs to access only the shard that holds the record.

If a database has a 1 terabyte data set distributed amongst 4 shards, then each shard may hold only 256 Giga Byte of data.

If the database contains 40 shards, then each shard will hold only 25 Giga Byte of data.

When Can I Use Sharding in MongoDB?

consider deploying a MongoDB sharded cluster when your system shows the following characteristics:

The data set outgrows the storage capacity of a single MongoDB instance

The size of the active working set exceeds the capacity of the maximum available RAM

A single MongoDB instance is unable to manage write operations

What is a MongoDB Shard?

A shard is a replica set or a single mongod instance that holds the data subset used in a sharded cluster.

Shards hold the entire data set for a cluster.

Each shard is a replica set that provides redundancy and high availability for the data it holds.

MongoDB shards data on a per collection basis and lets you access the sharded data through mongos instances.

every database contains a “primary” shard that holds all the un-sharded collections in that database.

https://www.simplilearn.com/replication-and-sharding-mongodb-tutorial-video

adding redundancy to a database through replication

sharding, a scalable partitioning strategy used to improve performance of a database that has to handle a large number of operations from different clients

REPLICATION

Replication refers to a database setup in which several copies of the same dataset are hosted on separate machines.

The main reason to have replication is redundancy.

If a single database host machine goes down, recovery is quick since one of the other machines hosting a replica of the same database can take over

A quick fail-over to a secondary machine minimizes downtime, and keeping an active copy of the database acts as a backup to minimize loss of data

a Replica Set: a group of MongoDB server instances that are all maintaining the same data set

One of those instances will be elected as the primary, and will receive all write operations from client applications

The primary records all changes in its operation log, and the secondaries asynchronously apply those changes to their own copies of the database.

If the machine hosting the primary goes down, a new primary is elected from the remaining instances

By default, client applications read only from the primary.

However, you can take advantage of replication to increase data availability by allowing clients to read data from secondaries.

Keep in mind that since the replication of data happens asynchronously, data read from a secondary may not reflect the current state of the data on the primary.

DATABASE SHARDING

Having a large number of clients performing high-throughput operations can really test the limits of a single database instance.

Sharding is a strategy that can mitigate this by distributing the database data across multiple machines.

It is essentially a way to perform load balancing by routing operations to different database servers.

https://deadline.thinkboxsoftware.com/feature-blog/2017/5/26/advanced-database-features-replication-and-sharding

Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

available

given

Additionally, Cloud

available

The trade off is increased

Divide a data store into a set of horizontal partitions or shards. This can improve scalability when storing and accessing large volumes of data.

For example, a customer might only have one billing address, yet I might choose to put the billing address information into a separate table with a CustomerId reference so that I have the flexibility to move that information into a separate database

For example, I might shard my customer database using CustomerId as a shard key - I'd store ranges 0-10000 in one shard and 10001-20000 in a different shard. When choosing a shard key, the DBA will typically look at data-access patterns and space issues to ensure that they are distributing load and space across shards evenly.

Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infra- structure to address efficiently.

that is

Clickstreams

Traditional database systems were designed

Traditional database systems are also designed

The Big Data landscape is dominated

is primarily captured

"Big Data" caught on quickly as a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

reportand

Doug Laney

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.

be dealt

in a timely manner

all types of

IBM is unique in having developed an enterprise class big data and analytics platform – Watson Foundations - that allows you to address the full spectrum of big data business challenges. Information management is key to that platform helping organizations discover fresh insights, operate in a timely fashion and establish trust to act with confidence.

Big Data

year

Low latency (capital markets)

parity is achieved

demonstrate

Low latency allows human-unnoticeable delays between an input being processed and the corresponding output providing real time characteristics

Commodity hardware

rather,

CPUs

q=content

Commodity hardware, in an IT context, is a device or device component that is relatively inexpensive, widely available and more or less interchangeable with other hardware of its type.

mean time

is dedicated

running

are often considered

, as such,

are replaced

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster

MapReduce is a programming model and an associated

mapreduce

Map stage : The map or mapper’s job is to process the input data. Generally the input data is in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

which will be stored

Yet Another Resource Negotiator (YARN)

YARN was designed

CSA Releases the Expanded Top Ten Big Data Security & Privacy Challenges

Security and privacy issues are magnified

The CSA’s Big Data Working Group followed a three-step process to arrive at top security and privacy challenges presented by Big Data:

1. Interviewed CSA members and surveyed security-practitioner oriented trade journals to draft an initial list of high priority security and privacy problems

2. Studied published solutions.

3. Characterized a problem as a challenge if the proposed solution does not cover the problem scenarios.

Top 10 challenges

Secure computations in distributed programming frameworks

Security best practices for non-relational data stores

Secure data storage and transactions logs

End-point input validation/filtering

Real-Time Security Monitoring

Scalable and composable privacy-preserving data mining and analytics

Cryptographically enforced data centric security

Granular access control

Granular audits

Data Provenance The Expanded Top 10 Big Data challenges has evolved from the initial list of challenges presented at CSA Congress to an expanded version that addresses three new distinct issues:

cyber

Analysis: finding tractable solutions based on the threat model

casesfor

The challenges themselves can be organized

as follows

https://cloudsecurityalliance.org/articles/csa-releases-the-expanded-top-ten-big-data-security-privacy-challenges/

Bigtable

is designed to scale

petabytes

all of

BigTable

BigTable

SSTable

other

It is not distributed

Google File System

GoogleFS

[

It is designed

is codenamed

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage

The project includes these modules:
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache:
Cassandra: A scalable multi-master database with no single points of failure.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
ZooKeeper: A high-performance coordination service for distributed applications.

http://hadoop.apache.org/

Hue is a Web interface for analyzing data with Apache Hadoop

http://gethue.com/

Two options that are backed by Amazon S3 cloud

• org.apache.hadoop.fs.s3.S3FileSystem
• http://wiki.apache.org/hadoop/AmazonS3
– org.apache.hadoop.fs.kfs.KosmosFileSystem
• Backed by CloudStore
• http://code.google.com/p/kosmosfs

Apache HBase

Apache HBase is the Hadoop database, a distributed, scalable, big data store.
Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the
Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
http://hbase.apache.org/

what is HBase?

Column-Oriented data store, known as “Hadoop Database”
Type of “NoSQL” DB
– Does not provide a SQL based access
5 – Does not adhere to Relational Model for storage
HDFS won't do well on anything under 5 nodes anyway;
particularly with a block replication of 3
• HBase is memory and CPU intensive

When to Use HBase
Two well-known use cases
– Lots and lots of data (already mentioned)
– Large amount of clients/requests (usually cause a lot of
data)
Great for single random selects and range scans by key
Great for variable schema
– Rows may drastically differ
– If your schema has many columns and most of them are null

When NOT to Use HBase
Bad for traditional RDBMs retrieval
– Transactional applications
– Relational Analytics
• 'group by', 'join', and 'where column like', etc....

Currently bad for text-based search access
– There is work being done in this arena
• HBasene: https://github.com/akkumar/hbasene/wiki
• HBASE-3529: 100% integration of HBase and Lucene
based on HBase' coprocessors
– Some projects provide solution that use HBase
• Lily=HBase+Solr http://www.lilyproject.org

HBase Access
HBase Shell
Native Java API
Avro Server
    http://avro.apache.org
HBql
    http://www.hbql.com
PyHBase
    https://github.com/hammer/pyhbase
AsyncHBase
    https://github.com/stumbleupon/asynchbase
JPA/JPO access to HBase via DataNucleous
    http://www.datanucleus.org
HBase-DSL
    https://github.com/nearinfinity/hbase-dsl
Native API is not the only option
    REST Server
        Complete client and admin APIs
        Requires a REST gateway server
        Supports many formats: text, xml, json, protocol buffers,raw binary
    Thrift
        Apache Thrift is a cross-language schema compiler
        http://thrift.apache.org
        Requires running Thrift Server

http://courses.coreservlets.com/Course-Materials/pdf/hadoop/03-HBase_1-Overview.pdf

Runtime Modes

Local (Standalone) Mode
    Comes Out-of-the-Box, easy to get started
    Uses local filesystem (not HDFS), NOT for production
    Runs HBase & Zookeeper in the same JVM

Pseudo-Distributed Mode
    Requires HDFS
    Mimics Fully-Distributed but runs on just one host
    Good for testing, debugging and prototyping
    Not for production use or performance benchmarking!
    Development mode used in class
Fully-Distributed Mode
    Run HBase on many machines
    Great for production and development clusters

http://courses.coreservlets.com/Course-Materials/pdf/hadoop/03-HBase_2-InstallationAndShell.pdf

Designing to Store Data in Tables

Two options: Tall-Narrow OR Flat-Wide
Tall Narrow is generally recommended
    Store parts of cell data in the row id
    Faster data retrieval
    HBase splits on row boundaries

Flat-Wide vs. Tall-Narrow
Tall Narrow has superior query granularity
    query by rowid will skip rows, store files!
    Flat-Wide has to query by column name which won’t skip rows or storefiles
Tall Narrow scales better
    HBase splits at row boundaries meaning it will shard at blog boundary
    Flat-Wide solution only works if all the blogs for a single user can fit into a single row
Flat-Wide provides atomic operations
    atomic on per-row-basis
    There is no built in concept of a transaction for multiple rows or tables

http://courses.coreservlets.com/Course-Materials/pdf/hadoop/03-HBase_6-KeyDesign.pdf
Hadoop vs MPP

The Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Thrift Framework

Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml.
Originally developed at Facebook, Thrift was open sourced in April 2007 and entered the Apache Incubator in May, 2008.

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages
http://thrift.apache.org/

Apache Ambari

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
http://ambari.apache.org/

Apache Ambari
A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters
http://hortonworks.com/apache/ambari/

Apache Avro

Apache Avro™ is a data serialization system.
Avro provides:
    Rich data structures.
    A compact, fast, binary data format.
    A container file, to store persistent data.
    Remote procedure call (RPC).
    Simple integration with dynamic languages
http://avro.apache.org/

Cascading

Cascading is the proven application development platform for building data applications on Hadoop.
http://www.cascading.org/

Fully-featured data processing and querying library for Clojure or Java.

http://cascalog.org/

Cloudera Impala

Open Source, Interactive SQL for Hadoop
Cloudera Impala is massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop
http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html

PigPen

PigPen is an eclipse plugin that helps users create pig-scripts, test them using the example generator and then submit them to a hadoop cluster as well.
https://wiki.apache.org/pig/PigPen

Apache Crunch

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines.
Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
https://crunch.apache.org

Apache Giraph

Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in a 2010
http://giraph.apache.org/

Apache Accumulo

The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift.
https://accumulo.apache.org/

Apache Chukwa is an open source data collection system for monitoring large distributed systems. Apache Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Apache Chukwa also includes a ﬂexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data

http://chukwa.apache.org/

Apache Bigtop

Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.
http://bigtop.apache.org/

Hortonworks Data Platform 2.1

100% Open Source Enterprise Apache Hadoop
http://hortonworks.com/

Why use Storm?

Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
http://storm.apache.org

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

http://kafka.apache.org

Apache Spark Components – Spark Streaming

Spark Streaming: What Is It and Who’s Using It?

Performance Tuning of an Apache Kafka/Spark Streaming System - Telecom Case Study

The Tools of API Management — The Full Stack

The Latest in API Orchestration, Mediation, and Integration

OpenLogic stacks | Rogue Wave

What is Couchbase Server? The Scalability of NoSQL + The Flexibility of JSON + The Power of SQL

https://www.slideshare.net/BigDataSpain/migration-and-coexistence-between-relational-and-nosql-databases-by-manuel-hurtado

A query language for your API

GraphQL is a query language for APIs and a runtime for fulfilling those queries with your existing data. GraphQL provides a complete and understandable description of the data in your API, gives clients the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time, and enables powerful developer tools.
https://graphql.org/

COUCHBASE IS THE NOSQL DATABASE FOR BUSINESS-CRITICAL APPLICATIONS

https://www.couchbase.com/

Apache Kafka GraphQL Couchbase PostgreSQL

Hadoop Distributions

Cloudera
Cloudera offers CDH, which is 100% open source, as a free download as well as a free edition of their Cloudera Manager console for administering and managing Hadoop clusters of up to 50 nodes.
The enterprise version on the other hand combines CDH and a more sophisticated Manager plus an enterprise support package

Hortonworks
The major difference from Cloudera and MapR is that HDP uses Apache Ambari for cluster management and monitoring.

MapR
The major differences to CDH and HDP is that MapR uses their proprietary file system MapR-FS instead of HDFS.
The reason for their Unix-based file system is that MapR considers HDFS as a single point of failure.

http://www.ymc.ch/en/hadoop-overview-of-top-3-distributions

Neo4j Community Edition is an open source product licensed under GPLv3.Neo4j is the world’s leading Graph Database. It is a high performance graph store with all the features expected of a mature and robust database, like a friendly query language and ACID transactions.

https://github.com/neo4j/neo4j

Apache Cassandra

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
http://cassandra.apache.org/

Redis

Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
http://redis.io/

RocksDB is an embeddable persistent key-value store for fast storage.

https://rocksdb.org/

mongoDB

MongoDB (from "humongous") is a scalable, high-performance, open source, document-oriented database.
http://www.mongodb.org/

GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB.

When to Use GridFS
In MongoDB, use GridFS for storing files larger than 16 MB.
In some situations, storing large files may be more efficient in a MongoDB database than on a system-level filesystem.

If your filesystem limits the number of files in a directory, you can use GridFS to store as many files as needed.
When you want to access information from portions of large files without having to load whole files into memory, you can use GridFS to recall sections of files without reading the entire file into memory.
When you want to keep your files and metadata automatically synced and deployed across a number of systems and facilities, you can use GridFS. When using geographically distributed replica sets, MongoDB can distribute files and their metadata automatically to a number of mongod instances and facilities.
https://docs.mongodb.com/manual/core/gridfs/

Riak

Riak is an open source, distributed database
http://basho.com/riak/

CouchDB

Apache CouchDB is a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API
https://couchdb.apache.org/

Pivotal Greenplum Database (Pivotal GPDB) manages, stores and analyzes terabytes to petabytes of data in large-scale analytic data warehouses

Pivotal GPDB extracts the data you need to determine which modern applications will best serve customers in context, at the right location, and during the right activities using a shared-nothing, massively parallel processing (MPP) architecture and flexible column- and row-oriented storage
http://www.pivotal.io/big-data/pivotal-greenplum-database

Commercial Hadoop Distributions

Amazon Web Services Elastic MapReduce
EMR is Hadoop in the cloud leveraging Amazon EC2 for compute, Amazon S3 for storage and other services.

Cloudera
Enterprise customers wanted a management and monitoring tool for Hadoop, so Cloudera built Cloudera Manager,
Enterprise customers wanted a faster SQL engine for Hadoop, so Cloudera built Impala using a massively parallel processing (MPP) architecture

IBM InfoSphere BigInsights
IBM has advanced analytics tools, a global presence and implementation services, so it can offer a complete big data solution that will be attractive to many customers

MapR
MapR Technologies has added some unique innovations to its Hadoop distribution, including support for Network File System (NFS), running arbitrary code in the cluster, performance enhancements for HBase, as well as high-availability and disaster recovery features

Pivotal Greenplum Database
In addition to the columnar Greenplum Database technology it brought from EMC, Pivotal's Hadoop distribution has an MPP Hadoop SQL engine called HAWQ that provides MPP-like SQL performance on Hado

Teradata
The Teradata distribution for Hadoop includes integration with Teradata's management tool and SQL-H, a federated SQL engine that lets customers query data from its data warehouse and Hadoop

Intel Delivers Hardware-Enhanced Performance, Security for Hadoop

Microsoft Windows Azure HDInsight
Microsoft Windows Azure HDInsight Service is designed specifically for the Windows Azure cloud. HDInsight and Hadoop for Windows (a version of Hortonworks Data Platform) comprise the only Hadoop distributions that run in a Windows environment.

http://www.cio.com/article/2368763/big-data/146238-How-the-9-Leading-Commercial-Hadoop-Distributions-Stack-Up.html

MapR

MapR is a complete distribution for Apache Hadoop that packages more than a dozen projects from the Hadoop ecosystem to provide a broad set of Big Data capabilities for the user. Projects such as Apache HBase, Hive, Pig, Mahout, Flume, Avro, Sqoop, Oozie and Whirr are included along with non-Apache projects such as Cascading and Impala.
https://www.mapr.com/why-hadoop/why-mapr

Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data.
Amazon EMR uses Hadoop, an open source framework, to distribute your data and processing across a resizable cluster of Amazon EC2 instances. Amazon EMR is used in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics
http://aws.amazon.com/elasticmapreduce/

InfoSphere BigInsights

IBM makes it simpler to use Hadoop to get value out of big data and build big data applications. It enhances open source technology to withstand the demands of your enterprise, adding administrative, discovery, development, provisioning, security, and support, along with best-in-class analytical capabilities
http://www-01.ibm.com/software/data/infosphere/biginsights/

oracle golden gate

Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments.
http://www.oracle.com/technetwork/middleware/goldengate/overview/index.html

The GridGain® in-memory computing platform includes an In-Memory Compute Grid which enables parallel processing of CPU or otherwise resource intensive tasks including traditional High Performance Computing (HPC) and Massively Parallel Processing (MPP). Queries are broken down into sub-queries which are sent to the relevant nodes in the compute grid of the GridGain cluster and processed on the local CPUs where the data resides.

https://www.gridgain.com/technology/in-memory-computing-platform/compute-grid

In-Memory Speed and Massive Scalability

In-memory computing solutions deliver real-time application performance and massive scalability by creating a copy of your data stored in disk-based databases in RAM. When data is stored in RAM, applications run 1,000x faster because data does not need to be retrieved from disk and moved into RAM prior to processing. Performance is further increased across the distributed in-memory computing cluster by using massive parallel processing to distribute processing across the cluster nodes, leveraging the computing power of all of the nodes in the cluster. In-memory computing platforms provide massive application scalability by allowing users to expand the size of their in-memory CPU pool by adding more nodes to the cluster. In-memory computing platforms can scale up by upgrading existing nodes with more powerful servers which have more cores, RAM and/or computing power.

https://www.gridgain.com/technology/in-memory-computing-platform

What is Spark In-memory Computing?

In in-memory computation, the data is kept in random access memory(RAM) instead of some slow disk drives and is processed in parallel. Using this we can detect a pattern, analyze large data. This has become popular because it reduces the cost of memory. So, in-memory processing is economic for applications. The two main columns of in-memory computation are-
RAM storage
Parallel distributed processing.
https://data-flair.training/blogs/spark-in-memory-computing/

Why In-Memory Computing Is Cheaper And Changes Everything

What is in-memory computing?

All forms of flash today are used like disk drives. Even though we may remove the controller as a bottleneck, the applications are still doing I/O to a flash drive or a flash board. It is getting much more reliable and cheaper, so it is going to become a persistence mechanism replacing disk

Today, the reliability of flash is longer than that of disk drives. If you replace your hardware every three to four years, and you have flash SSD and disk, you will probably not see a failure on the flash at all in that period of time, but I guarantee you that you will change disk drives.

When we talk about in-memory, we are talking about the physical database being in-memory rather than as it is “traditionally” done: on disk.

https://timoelliott.com/blog/2013/04/why-in-memory-computing-is-cheaper-and-changes-everything.html

In-Memory Database Computing – A Smarter Way of Analyzing Data

Querying data from physical disks was traditional approach, In-memory database computing (IMDBC) replaced this approach, as in IMDBC data is queried through the computer’s random access memory i.e. RAM. Upon querying data from RAM results in shorter query response time and allows analytics applications to support faster business decisions

https://www.xoriant.com/blog/big-data-analytics/memory-database-computing-faster-smarter-analysis-big-data-world.html

In-Data Computing vs In-Memory Computing vs Traditional Model(Storage based)

In-Data Computing (also known as In-Place Computing) is an abstract model in which all data is kept in an infinite and persistent memory space for both storage and computing. In-data computing is a data-centric approach, where data is computed in the same space it is stored. Instead of moving data to the code, code is moved to the data space for processing. With today’s 64-bit architecture and virtualization technology, BigObject® depicts a persistent and nearly infinite memory space for data
http://www.bigobject.io/technology.html

SAP HANA is an in-memory database: • - It is a combination of hardware and software made to process massive real time data using In-Memory computing.

• - It combines row-based, column-based database technology.
• - Data now resides in main-memory (RAM) and no longer on a hard disk.
• - It’s best suited for performing real-time analytics, and developing and deploying real-time applications.

http://saphanatutorial.com/sap-hana-database-introduction/

in-memory database

An in-memory database (IMDB); also main memory database system or MMDB or memory resident database) is a database management system that primarily relies on main memory for computer data storage.

It is contrasted with database management systems that employ a disk storage mechanism. Main memory databases are faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
http://en.wikipedia.org/wiki/In-memory_database

The in-memory database defined

An in-memory database is a type of nonrelational database that relies primarily on memory for data storage, in contrast to databases that store data on disk or SSDs.
In-memory databases are designed to attain minimal response time by eliminating the need to access disks.
Because all data is stored and managed exclusively in main memory, it is at risk of being lost upon a process or server failure.
In-memory databases can persist data on disks by storing each operation in a log or by taking snapshots.
In-memory databases are ideal for applications that require microsecond response times and can have large spikes in traffic coming at any time such as gaming leaderboards, session stores, and real-time analytics.

Use cases
Real-time bidding
In-memory databases are ideal choices for ingesting, processing, and analyzing real-time data with submillisecond latency.
Caching
A cache is a high-speed data storage layer which stores a subset of data, typically transient in nature, so that future requests for that data are served up faster than is possible by accessing the data’s primary storage location. Caching allows you to efficiently reuse previously retrieved or computed data. The data in a cache is generally stored in fast access hardware such as RAM (Random-access memory) and may also be used in correlation with a software component. A cache's primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer.

Popular in-memory databases

Amazon Elasticache for Redis
in-memory data store that provides submillisecond latency to power internet-scale, real-time applications.
Developers can use ElastiCache for Redis as an in-memory nonrelational database.
ElastiCache for Redis also provides the ability to add and remove shards from a running cluster.
The ElastiCache for Redis cluster configuration supports up to 15 shards and enables customers to run Redis workloads with up to 6.1 TB of in-memory capacity in a single cluster.
ElastiCache for Redis also provides the ability to add and remove shards from a running cluster.

Amazon ElastiCache for Memcached
Memcached-compatible in-memory key-value store service
can be used as a cache or a data store
for use cases where frequently accessed data must be in-memory.

Aerospike
in-memory database solution for real-time operational applications
This high-performance nonrelational database can be installed as a persistent in-memory service with a RAM cluster
or greater data sizes using local SSD instances
https://aws.amazon.com/nosql/in-memory/

Difference between In-Memory cache and In-Memory Database

the differences between In-Memory cache(redis, memcached), In-Memory data grids (gemfire) and In-Memory database (VoltDB)

Cache - By definition means it is stored in memory. Any data stored in memory (RAM) for faster access is called cache. Examples: Ehcache, Memcache Typically you put an object in cache with String as Key and access the cache using the Key. It is very straight forward. It depends on the application when to access the cahce vs database and no complex processing happens in the Cache. If the cache spans multiple machines, then it is called distributed cache. For example, Netflix uses EVCAche which is built on top of Memcache to store the users movie recommendations that you see on the home screen.

In Memory Database - It has all the features of a Cache plus come processing/querying capabilities. Redis falls under this category. Redis supports multiple data structures and you can query the data in the Redis ( examples like get last 10 accessed items, get the most used item etc). It can span multiple machine and is usually very high performant and also support persistence to disk if needed. For example, Twitter uses Redis database to store the timeline information.
https://stackoverflow.com/questions/37015827/difference-between-in-memory-cache-and-in-memory-database

Redis vs. Memcached: In-Memory Data Storage Systems

Redis and Memcached are both in-memory data storage systems
Memcached is a high-performance distributed memory cache service
Redis is an open-source key-value store
Similar to Memcached, Redis stores most of the data in the memory

In Memcached, you usually need to copy the data to the client end for similar changes and then set the data back.
The result is that this greatly increases network IO counts and data sizes.
In Redis, these complicated operations are as efficient as the general GET/SET operations.Therefore, if you need the cache to support more complicated structures and operations, Redis is a good choice.

Different memory management scheme
In Redis, not all data storage occurs in memory. This is a major difference between Redis and Memcached. When the physical memory is full, Redis may swap values not used for a long time to the disks

Difference in Cluster Management
Memcached is a full-memory data buffering system. Although Redis supports data persistence, the full-memory is the essence of its high performance. For a memory-based store, the size of the memory of the physical machine is the maximum data storing capacity of the system. If the data size you want to handle surpasses the physical memory size of a single machine, you need to build distributed clusters to expand the storage capacity.
Compared with Memcached which can only achieve distributed storage on the client side, Redis prefers to build distributed storage on the server side.

https://medium.com/@Alibaba_Cloud/redis-vs-memcached-in-memory-data-storage-systems-3395279b0941

Amazon ElastiCache offers fully managed Redis and Memcached. Seamlessly deploy, run, and scale popular open source compatible in-memory data stores.

Build data-intensive apps or improve the performance of your existing apps by retrieving data from high throughput and low latency in-memory data stores
https://aws.amazon.com/elasticache/

Distributed cache

In computing, a distributed cache is an extension of the traditional concept of cache used in a single locale.
A distributed cache may span multiple servers so that it can grow in size and in transactional capacity
It is mainly used to store application data residing in database and web session data
The idea of distributed caching has become feasible now because main memory has become very cheap and network cards have become very fast, with 1Gbit now standard everywhere and 10Gbit gaining traction.
Also, a distributed cache works well on lower cost machines usually employed for web servers as opposed to database servers which require expensive hardware.
https://en.wikipedia.org/wiki/Distributed_cache

Infinispan

Infinispan is a distributed in-memory key/value data store with optional schema
Available as an embedded Java library or as a language-independent service accessed remotely over a variety of protocols (Hot Rod, REST, Memcached and WebSockets)
Use it as a cache or a data grid.
Advanced functionality such as transactions, events, querying, distributed processing, off-heap and geographical failover
http://infinispan.org/

In-Memory Distributed Caching

The primary use case for caching is to keep frequently accessed data in process memory to avoid constantly fetching this data from disk, which leads to the High Availability (HA) of that data to the application running in that process space (hence, “in-memory” caching)
Most of the caches were built as distributed in-memory key/value stores that supported a simple set of ‘put’ and ‘get’ operations and optionally some sort of read-through and write-through behavior for writing and reading values to and from underlying disk-based storage such as an RDBMS. Depending on the product, additional features like ACID transactions, eviction policies, replication vs. partitioning, active backups, etc. also became available as the products matured.

In-Memory Data Grid
The feature of data grids that distinguishes them from distributed caches the most was their ability to support co-location of computations with data in a distributed context and consequently provided the ability to move computation to data.This capability was the key innovation that addressed the demands of rapidly growing data sets that made moving data to the application layer increasing impractical. Most of the data grids provide some basic capabilities to move the computations to the data.Another uniquely new characteristic of in-memory data grids is the addition of distributed MPP processing based on standard SQL and/or MapReduce, that allows to effectively compute over data stored in-memory across the cluster.Just as distributed caches were developed in response to a growing need for data HA, in-memory data grids were developed to respond to the growing complexities of data processing.Adding distributed SQL and/or MapReduce type processing required a complete re-thinking of distributed caches, as focus has shifted from pure data management to hybrid data and compute management.This new and very disruptive capability of in-memory data grids also marked the start of the in-memory computing revolution.
https://www.gridgain.com/resources/blog/cache-data-grid-database

From distributed caches to in-memory data grids

What do high-load applications need from cache?
Low latency
Linear horizontal scalability
Distributed cache

Cache access patterns: Client Cache Aside For reading data:
Cache access patterns: Client Read Through For reading data
Cache access patterns: Client Write Through For writing data
Cache access patterns: Client Write Behind For writing data
https://www.slideshare.net/MaxAlexejev/from-distributed-caches-to-inmemory-data-grids

the difference between two major categories in in-memory computing: In-Memory Database and In-Memory Data Grid

In-Memory Data Grids (IMDGs) are sometimes (but not very frequently) called In-Memory NoSQL/NewSQL Databases
there are also In-Memory Compute Grids and In-Memory Computing Platforms that include or augment many of the features of In-Memory Data Grids and In-Memory Databases.

Tiered Storage
what we mean by “In-Memory”. Surprisingly - there’s a lot of confusion here as well as some vendors refer to SSDs, Flash-on-PCI, Memory Channel Storage, and, of course, DRAM as “In-Memory”

https://www.gridgain.com/resources/blog/in-memory-database-vs-in-memory-data-grid-revisited

Cache < Data Grid < Database

In-Memory Distributed Cache
In-Memory Data Grid
In-Memory Database

In-Memory Distributed Caching
The primary use case for caching is to keep frequently accessed data in process memory to avoid constantly fetching this data from disk, which leads to the High Availability (HA) of that data to the application running in that process space (hence, “in-memory” caching).

In-Memory Data Grid
The feature of data grids that distinguishes them from distributed caches was their ability to support co-location of computations with data in a distributed context and consequently provided the ability to move computation to data.

In-Memory Database
The feature that distinguishes in-memory databases over data grids is the addition of distributed MPP processing based on standard SQL and/or MapReduce, that allows to compute over data stored in-memory across the cluster.

https://dzone.com/articles/cache-data-grid-database

What’s the Difference Between an In-Memory Data Grid and In-Memory Database?

The differences between IMDGs and IMDBs are more technical in nature, but both offer ways to accelerate development and deployment of data-driven applications.

What’s an In-Memory Data Grid (IMDG)?
IMDGs are designed to provide high availability and scalability by distributing data across multiple machines. The rise of cloud, social media, and the Internet of Things (IoT) has created demand for applications that need to be extremely fast and capable of processing millions of transactions per second.

So, What’s an In-Memory Database (IMDB)?
An IMDB is a database management system that primarily relies on main memory for computer data storage. It contrasts with database management systems that employ a disk-storage mechanism. An IMDB eliminates the latency and overhead of hard-disk storage and reduces the instruction set that’s required to access data.

https://www.electronicdesign.com/embedded-revolution/what-s-difference-between-memory-data-grid-and-memory-database

How To Choose An In-Memory NoSQL Solution: Performance Measuring

The main purpose of this work is to show results of benchmarking some of the leading in-memory NoSQL databases with a tool named YCSB
We selected three popular in-memory database management systems:
Redis (standalone and in-cloud named Azure Redis Cache), Tarantool and CouchBase and one cache system Memcached.
Memcached is not a database management system and does not have persistence. But we decided to take it, because it is also widely used as a fast storage system

In-memory NoSQL database management system is a database management system that stores all the data in the main memory and persists each update on disk. Persistency is provided by saving each data modification request in a binary log. Since the log is written in append-only mode, it is rarely a bottleneck. Both read and write workloads are processed without significant disk head movement

Memcached is a general-purpose distributed memory caching system
Tarantool is an open-source NoSQL database management system and Lua application server developed in my.com.
Couchbase Server, originally known as Membase, is an open-source, distributed NoSQL document-oriented database.

Yahoo! Cloud Serving Benchmark
Yahoo! Cloud Serving Benchmark, or YCSB is a powerful utility for performance measuring of a wide range of NoSQL databases including in-memory and on-disk solutions

http://highscalability.com/blog/2015/12/30/how-to-choose-an-in-memory-nosql-solution-performance-measur.html

Potential of in memory NO SQL database

What is No SQL?
A No-SQL (often interpreted as Not Only SQL) database provides a mechanism for storage and retrieval of data that is modelled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability.

What is In Memory Database?
An in-memory database (IMDB; also main memory database system or MMDB or memory resident database) is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism. Main memory databases are faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.

Why In Memory NoSQL ?
Application developers have been frustrated with the impedance mismatch between the relational data structures and the in-memory data structures of the application. Using NoSQL databases allows developers to develop without having to convert in-memory structures to relational structures.

The NoSQL realm already has in-memory DBMS options such as Aerospike
In-memory DBMS vendors MemSQL and VoltDB are taking the trend in the other direction, recently adding flash- and disk-based storage options to products that previously did all their processing entirely in memory. The goal here is to add capacity for historical data for long-term analysis.

https://www.rebaca.com/potentiality-of-in-memory-no-sql-database/

Ehcache

Ehcache is an open source, standards-based cache for boosting performance, offloading your database, and simplifying scalability. It's the most widely-used Java-based cache because it's robust, proven, and full-featured
http://ehcache.org/

Memcached

Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
http://memcached.org/

OSCache

OSCache is a widely used, high performance J2EE caching framework. In addition to it's servlet-specific features, OSCache can be used as a generic caching solution for any Java application. A few of its generic features include:

* Caching of Arbitrary Objects - You are not restricted to caching portions of JSP pages or HTTP requests. Any Java object can be cached.
* Comprehensive API - The OSCache API gives you full programmatic control over all of OSCache's features.
* Persistent Caching - The cache can optionally be disk-based, thereby allowing expensive-to-create data to remain cached even across application restarts.
* Clustering - Support for clustering of cached data can be enabled with a single configuration parameter. No code changes required.
* Expiry of Cache Entries - You have a huge amount of control over how cached objects expire, including pluggable RefreshPolicies if the default functionality does not meet your requirements.
http://www.opensymphony.com/oscache/

ShiftOne

ShiftOne Java Object Cache is a Java library that implements several strict object caching policies, decorators that add behavior, and a light framework for configuring them for an application.
http://jocache.sourceforge.net/

SwarmCache

SwarmCache is a simple but effective distributed cache. It uses IP multicast to efficiently communicate with any number of hosts on a LAN. It is specifically designed for use by clustered, database-driven web applications
http://swarmcache.sourceforge.net/

Whirlycache

Whirlycache is a fast, configurable in-memory object cache for Java.
It can be used, for example, to speed up a website or an application by caching objects that would otherwise have to be created by querying a database or by another expensive procedure
http://java.net/projects/whirlycache

cache4j

cache4j is a cache for Java objects with a simple API and fast implementation.
It features in-memory caching, a design for a multi-threaded environment, both synchronized and blocking implementations, a choice of eviction algorithms (LFU, LRU, FIFO), and the choice of either hard or soft references for object storage.
http://cache4j.sourceforge.net/

JCS is a distributed caching system written in java. It is intended to speed up applications by providing a means to manage cached data of various dynamic natures. Like any caching system, JCS is most useful for high read, low put applications. Latency times drop sharply and bottlenecks move away from the database in an effectively cached system.
https://commons.apache.org/proper/commons-jcs//

Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.
http://lucene.apache.org/solr/

Tapestry

Component oriented framework for creating dynamic, robust, highly scalable web applications in Java.
http://tapestry.apache.org/

Apache Lucene

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java
http://lucene.apache.org/java/docs/index.html

JiBX

JiBX is a tool for binding XML data to Java objects. It's extremely flexible, allowing you to start from existing Java code and generate an XML schema, start from an XML schema and generate Java code, or bridge your existing code to a schema that represents the same data. It also provides very high performance, outperforming all other Java data binding tools across a wide variety of tests.
http://jibx.sourceforge.net/

Hazelcast

Open source clustering and highly scalable data distribution platform for Java
http://www.hazelcast.com

fakecineaste

Monday, April 14, 2014

Big Data

Relational Databases

Structure: Data is organized into tables with rows and columns.

Relationships: Tables are linked through relationships (primary and foreign keys) to establish connections between data.

Data Integrity: Strong data integrity through ACID properties (Atomicity, Consistency, Isolation, Durability).

Query Language: SQL (Structured Query Language) is used for data manipulation and retrieval.

Examples: MySQL, PostgreSQL, Oracle, SQL Serve

Non-Relational Databases (NoSQL)

Structure: Data is stored in various models like key-value, document, columnar, or graph.

Flexibility: Highly flexible schema, accommodating diverse data types.

Scalability: Designed for handling large volumes of data and high traffic.

Performance: Often optimized for specific data access patterns.

Examples: MongoDB, Cassandra, Redis, Neo4

When to Use Which

Relational Databases:

Complex relationships between data

Strong data consistency requirements

ACID compliance is essential

Predictable data access patterns

Non-Relational Databases:

Large volumes of unstructured or semi-structured data

High write throughput and scalability

Rapid development needs

Flexible data models

In conclusion, the choice between relational and non-relational databases depends on the specific requirements of the application. Many modern applications use a combination of both to handle diverse data needs effectively

1. Data Model

2. Query Language

3. Scalability

4. Data Integrity and Transactions

5. Use Cases

6. Examples

Summary

SQL vs. Hadoop: Acid vs. Base

Big Data and NoSQL Databases Tutorial(BaSE)- Part 3

Big Data and NoSQL Databases Tutorial-(CAP theorem)- Part 5

2 comments:

Labels

Blog Archive

Followers

Search This Blog