Apache Spark Ecosystem
[uh-pach-ee spahrk] n. The Apache Spark ecosystem is an open-source, distributed cluster-computing framework. Spark is a data processing engine developed to provide faster and easier analytics than Hadoop MapReduce.
Background: Apache Spark started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010. Many of the ideas behind the system were presented in various research papers over the years.
After being released, Spark grew into a broad developer community, and moved to the Apache Software Foundation in 2013. Today, the project is developed collaboratively by a community of hundreds of developers from hundreds of organizations.
Since its release, Apache Spark, the unified analytics engine, has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in big data, with over 1000 contributors from 250+ organizations.
The team that started the Spark research project at UC Berkeley founded Databricks in 2013.
Apache Spark is 100% open source, hosted at the vendor-independent Apache Software Foundation. Databricks is fully committed to maintaining this open development model. Together with the Spark community, Databricks continues to contribute heavily to the Apache Spark project, through both development and community evangelism.
Concepts and terminology:
RDD: The Resilient Distributed Dataset is Spark’s core abstraction: an immutable, distributed collection of objects. The data can be held in memory or spilled to disk across the cluster, and it is logically partitioned so that operations run in parallel on each partition. Because an RDD cannot be modified in place, new RDDs are derived from existing ones through transformations. RDDs are also fault tolerant: if a partition is lost due to a failure, Spark rebuilds it automatically from the lineage graph of transformations that produced it.
DataFrame: Like an RDD, a DataFrame is an immutable distributed collection of data, but it is organized into named columns, like a table in a relational database. This design makes processing large datasets easier: it lets developers impose a structure on a distributed collection, providing a higher-level abstraction.
Dataset: The Dataset API lets users express transformations on domain objects while still benefiting from the performance and robustness of the Spark SQL execution engine. Datasets are available in Scala and Java; in Python, the DataFrame API plays this role.
Spark SQL: the Spark module for working with structured data. It supports mixed workloads, allowing SQL queries to be combined with complex, algorithm-based analytics in the same application.
Spark Streaming: uses Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches. This design lets the same application code written for batch analytics be reused for streaming analytics, facilitating implementation of the lambda architecture.
PySpark: the Python API for Apache Spark. It exposes the Spark programming model, including RDDs, DataFrames, and Spark SQL, to Python.
Pandas: is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
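By contrast with Spark, pandas operates on in-memory, single-machine data; a brief sketch of its groupby-style analysis (the column names here are illustrative):

```python
import pandas as pd

# A pandas DataFrame is an in-memory table, often used alongside Spark
# for smaller, interactive analysis on the driver.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "temp": [4, 22, 6]})

# groupby/mean expresses a per-group aggregation in one line.
mean_temp = df.groupby("city")["temp"].mean()
oslo_mean = float(mean_temp["Oslo"])
```

Here `oslo_mean` is `5.0`. PySpark DataFrames can be converted to pandas with `toPandas()` once a result is small enough to fit on one machine.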