
Connecting Apache Spark to different Relational Databases (Locally and AWS) using PySpark

Rajan Sahu


In this post, we will learn how to connect a Spark application to a locally installed relational database, as well as to AWS RDS.

Table of Contents

  • Apache Spark
  • PySpark
  • Connect PySpark with a locally installed MySQL RDB
  • Connect PySpark with an AWS MySQL RDS
  • Connect PySpark with a locally installed Postgres RDB
  • Connect PySpark with Postgres AWS RDS
  • Connect PySpark with a locally installed Oracle RDB
  • Connect PySpark with Oracle AWS RDS
  • References

Apache Spark

  • Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple nodes or computers.

PySpark

  • PySpark is the Python API for Apache Spark, used to process large datasets on a distributed cluster.
  • It lets you write and run Apache Spark jobs using Python.
  • PySpark itself only tells Spark what work to do; the actual processing is carried out by the Spark engine. A minimal starter example is shown below.
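A minimal sketch of what a PySpark program looks like (the app name and sample data here are placeholders, not taken from the original post):

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("hello-spark").getOrCreate()

# Build a tiny DataFrame and ask Spark to display it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()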

Requirements

  • Python
  • PySpark
  • Java JDK
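Once these are installed, a quick sanity check (optional, not part of the original post) is to confirm that Python and PySpark are importable:

import sys
import pyspark

# Print the interpreter and PySpark versions to confirm the installation
print(sys.version)
print(pyspark.__version__)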

Connecting a PySpark application with a locally installed MySQL database

  1. Before writing your script, you need to place the MySQL connector jar file inside the jars directory of your PySpark installation:
/home/<username>/.local/lib/python3.8/site-packages/pyspark/jars/

mysql-connector-java-8.0.22.jar

Download the jar file from the link below.
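
With the connector jar in place, a connection script might look like the sketch below. The host, port, database name (testdb), table (employees), user, and password are placeholder values for illustration; replace them with your own settings.

from pyspark.sql import SparkSession

# Start a Spark session; spark.jars is only needed if the connector jar
# is not already inside pyspark/jars/
spark = (
    SparkSession.builder
    .appName("mysql-local-read")
    .config("spark.jars", "/path/to/mysql-connector-java-8.0.22.jar")
    .getOrCreate()
)

# Read a table from the local MySQL server over JDBC
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "employees")
    .option("user", "root")
    .option("password", "your_password")
    .load()
)

df.show()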
