
Connecting Apache Spark to different Relational Databases (Locally and AWS) using PySpark

Rajan Sahu


In this post, we will learn how to connect a Spark application to a locally installed relational database, as well as to AWS RDS.

Table of Contents

  • Apache Spark
  • PySpark
  • Connect PySpark with a locally installed MySQL RDB
  • Connect PySpark with an AWS MySQL RDS
  • Connect PySpark with a locally installed Postgres RDB
  • Connect PySpark with Postgres AWS RDS
  • Connect PySpark with a locally installed Oracle RDB
  • Connect PySpark with Oracle AWS RDS
  • References

Apache Spark

  • Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple nodes or computers.

PySpark

  • PySpark is the Python API for Apache Spark, used to process large datasets on a distributed cluster.
  • It lets you write and run Apache Spark jobs using Python.
  • PySpark itself only tells Spark what work to do; the actual processing is carried out by the Spark engine. A minimal starter example is shown below.
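A minimal sketch of what a PySpark program looks like (the app name and sample data here are placeholders, not taken from the original post):

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("hello-spark").getOrCreate()

# Build a tiny DataFrame and ask Spark to display it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()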

Requirements

  • Python
  • PySpark
  • Java JDK
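Once these are installed, a quick sanity check (optional, not part of the original post) is to confirm that Python and PySpark are importable:

import sys
import pyspark

# Print the interpreter and PySpark versions to confirm the installation
print(sys.version)
print(pyspark.__version__)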

Connecting a PySpark application with a locally installed MySQL database

  1. Before writing your script, you need to place the MySQL connector jar file inside the jars directory of your PySpark installation:
/home/<username>/.local/lib/python3.8/site-packages/pyspark/jars/

mysql-connector-java-8.0.22.jar

Download the jar file from the link below.
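
With the connector jar in place, a connection script might look like the sketch below. The host, port, database name (testdb), table (employees), user, and password are placeholder values for illustration; replace them with your own settings.

from pyspark.sql import SparkSession

# Start a Spark session; spark.jars is only needed if the connector jar
# is not already inside pyspark/jars/
spark = (
    SparkSession.builder
    .appName("mysql-local-read")
    .config("spark.jars", "/path/to/mysql-connector-java-8.0.22.jar")
    .getOrCreate()
)

# Read a table from the local MySQL server over JDBC
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "employees")
    .option("user", "root")
    .option("password", "your_password")
    .load()
)

df.show()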
