DE Zoomcamp 2024 – Project1

This repository contains a brief description of my DE Zoomcamp 2024 Project 1

Problem statement

The Retailrocket has collected a large dataset of E-commerce i.e a file with behaviour data (events.csv), a file with item properties (item_properties.сsv) and a file, which describes category tree (category_tree.сsv). The data has been collected from a real-world ecommerce website. It is raw data, i.e. without any content transformations, however, all values are hashed due to confidential issues. The purpose of publishing is to motivate researches in the field of recommender systems with implicit feedback. The goal of this project is to create a streamlined and efficient process for ingesting and analyzing e-commerce on Cloud by implementing Data Engineering concepts.

About the Dataset

Retailrocket recommender system

The dataset consists of three context files i.e. :

a file with behaviour data (events.csv)
a file, which describes category tree (category_tree.сsv).
a file with item properties (item_properties_part1.сsv & item_properties_part2.csv)

The data has been collected from a real-world ecommerce website. It is raw data, i.e. without any content transformations, however, all values are hashed due to confidential issues.

The behaviour data, i.e. events like clicks, add to carts, transactions, represent interactions that were collected over a period of 4.5 months. A visitor can make three types of events, namely view, addtocart or transaction.

Technologies / Tools

Containerisation : Docker
Cloud : GCP
Infrastructure as code (IaC) : Terraform
Workflow orchestration : Mage-ai
Data Warehouse : BigQuery
Batch processing : pyspark SQL
IDE : VS Code, Jupyter Notebook
Language : Python
Visualisation : Google Looker Studio

Project Architecture

The end-to-end data pipeline includes the below steps:

Kaggle dataset is downloaded into the Google VM.
The downloaded CSV files (raw) are then uploaded to a folder in Google Cloud bucket (parquet) as Data Like.
Next, the data will be stored in BigQuery with format and values same as the GCP bucket files.
Last new tables are created from those original tables by using Spark SQL with correct data types as well as partitioned by Month and Clustered for optimised performance. These tables would be Data Warehouse tables.
Spin up a dataproc clusters (master and worker) and execute the pyspark jobs for procusts analys purposes
Configure Google Looker Studio to power dashboards from BigQuery Data Warehouse tables

You can find the detailed Architecture on the diagram below:

Reproducing from Scratch

Setup GCP

Create GCP Account.
Setup New Project and write down your Project ID.
Configure Service Account to get access to the project and download auth-keys (.json). Change auth-keys name if required.
Please provide the service account the permissions below (sorted by name):

BigQuery Admin
Cloud SQL Client
Compute Admin
Compute Engine Service Agent
Compute Network Admin
Compute Storage Admin
Dataproc Service Agent
Editor
Logs Bucket Writer
Owner
Storage Admin
Storage Object Admin

Enable the following options under the APIs and services section:

Identity and Access Management (IAM) API
IAM service account credentials API
Cloud Dataproc API
Compute Engine API (if you are going to use VM instance)

Terraform as Internet as Code (IaC) to build infrastructure

Download Terraform from here: https://www.terraform.io/downloads
Under terraform folder, create files main.tf (required) and variables.tf (optional) to store terraform variables.
main.td containt the following resources want to be applied:

Google Provider Versions
resource "google_service_account"
resource "google_project_iam_member"
resource "google_compute_firewall"
resource "google_storage_bucket"
resource "google_storage_bucket_iam_member"
resource "google_bigquery_dataset"
resource "google_dataproc_cluster" (cluster_config : master_config, worker_config, software_config : image_version = "2.2.10-debian12"
    optional_components   = ["DOCKER", "JUPYTER"])

terraform init or terraform init -upgrade: command initializes the directory, downloads, teh necesary plugins for the cinfigured provider, and prepares for use.
terraform plan : too see execution plan
terraform apply : to apply the changes

If you would like to remove your stack from the Cloud, use the terraform destroy command.

Reproducibility

Assign External IP Address After terraform apply complete succeesfully, assign External IP Address for Master and Workers instances using Console. From VM Instance (Compute Engine) –> SSH
Setting up Mage-ai, PostgreSQL and pgAdmin through the Master VM Instance SSH. Copy repsistories.sh into VM. repsistories.sh is script for installing docker network and bring up docker containers of Mage-ai, postgresql and pgAdmin.

#############Install Docker network#############
#create a network most containers will use
sudo docker network create dockernet >> /root/dockernet.log
sudo docker network ls >> /root/dockernet.log
 
#############Bring up docker containers############
cat > /root/docker-compose.yml <<- "SCRIPT"
 
version: '3'
services:
  magic:
    image: mageai/mageai:latest
    command: mage start dezoomcamp
    container_name: dezoomcamp-mage
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      USER_CODE_PATH: /home/src/dezoomcamp
      POSTGRES_DBNAME: dezoomcampdb
      POSTGRES_SCHEMA: public
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres316
      POSTGRES_HOST: vm-ikg-dezoomcamp
      POSTGRES_PORT: 5432
    ports:
      - 6789:6789
    volumes:
      - .:/home/src/
      - /root/.google/credentials/key-ikg-dezoomcamp-2024.json
    restart: on-failure:5
  postgres:
    image: postgres:14
    restart: on-failure
    container_name: dezoomcamp-postgres
    environment:
      POSTGRES_DB: dezoomcampdb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres316
    ports:
      - 5432:5432
  pgadmin:
    image: dpage/pgadmin4
    container_name: dezoomcamp-pgadmin
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - 8080:80
 
SCRIPT
 
sudo docker compose -f /root/docker-compose.yml up -d

chmod +x repositories.sh

sudo ./repositories.sh

==> Mage-ai, postgresql and pgAdmin would be installed and up running.

Check mage :

Check pgadmin :

Restart Juypyer Notebook

Stop :
```
sudo systemctl stop jupyter
```
 
Start by using port 8888 :
```
jupyter-notebook  --port=8888 --ip=0.0.0.0 --no-browser
```
Note: we use 0.0.0.0 just for demo purpose. Don't use this in production!
 
![image](https://github.com/garjita63/retailrocket-ecommerce-batch/assets/77673886/e78fc04d-9055-4aeb-ac27-5b877a99e1ec)

Increase memory size for cluster if required, and then restart Jupyter Notebook

1	`jupyter notebook --generate-config`

Open /home//.jupyter/jupyter_notebook_config.py

Edit file and modify parameter: c.NotebookApp.max_buffer_size

Spark master and worker clusters Edit ~/.bashrc file and add lines below: export SPATH=$SPARK_HOME/bin:$SPARK/sbin:$PATH source ~/.bashrc which start-all.sh Start master and worker clusters start-all.sh Try run spark by using dataset on hdfs Copy dataset folder into /user/ hdfs dfs -mkdir /user/<some_folder> hdfs dfs -copyFromLocal ecommerce-dataset/ /user/s<some_folder> Login to Master Cluster web Login to Worker Cluster web

Mage-ai orchestration pipelines

Create two runtime variables: bucket_name and dataset. These variables used by all pipeline blocks.

All pipeline and its blocks available in mage-project1.tar files.

Put mage-project1.tar into VM. You must have file copy authority to the master VM.

gcloud auth login
gcloud config set project <project_name>
--> allow to access of your google account

gcloud compute scp mage-project1.tar <username>@<project_name>:/home/<username>

Open SSH on Master instance
Copy mage-project1.tar into mage container (in this project named: dezoomcamp-mage)

docker cp mager-project1.tar <mage_container>:/home/src/<some_folder>   # in this project is "dezoomcamp'

Go to mage container

1	`docker exec -it <mage_container> bash`

Extract (untar) mage-project.tar file

cd /home/src/<some_folder>
tar -xvf mage-project1.tar

Open Mage application website : http://:6789

BigQuery Tables

events preview

events_dwh preview

item_properties preview

item_properties_dwh preview

Dashboard

The 10 Sold Items

Percentage of Events

Search This Blog

gar-oracledba

Data Engineering Zoomcamp 2024 – Project 1