RevoU Python Assignment

Project Summary

Imported and joined dataset from the source files.
Conducted data cleaning such as removing null values, removing outliers, removing irrelevant values, and feature engineering.
Conducted exploratory data analysis on the dataset and convey the important findings
Created user segmentation using k-means clustering

Insights

As time goes by the number of orders is increasing until it reaches the highest number of orders in November 2017 of 6,581 orders. Then the number of orders decreased and increased until the last in August 2018 as many as 6,313 orders.
It can be concluded that the peak day is Tuesday, where the total number of orders reached 17660 orders.
The most obtained were credit cards with a percentage of 73.21%, followed by boleto with a percentage of 19.33%, vouchers at 5.93%, and finally debit cards at 1.53%.
Obtained 4 clusters based on elbow method and silhouette analysis.

Project Files

For a more comprehensive analysis and visualization, please open the project files.

Project Background

Python is the most challenging yet also exciting data programming languages. In this assignment we practiced python skills such as data cleaning and exploratory data analysis using python code. Also, we practiced more advanced python skills such as user segmentation using cluster analysis. We used Google Collab as a python notebook tools.

Data Scope, Goals & Objectives

For this assignment we used data from kaggle and provided by olist, brazilian e-commerce company Customers Dataset. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

Goals

Gustavo just started an e-commerce startup based in Brazil that recently opened an online website to sell their product. Fortunately, Gustavo is launcing their website when the covid-19 hits and making them grow faster than ever. However, Gustavo is still not using targeted marketing which hurts their marketing budget as only a fraction of their user comes back to their website. Gustavo needs your help to increase their marketing conversion rate by doing more targeted marketing using customer segmentation so that it will not hurt their budget.

Objectives

Conducting data cleaning using various methods so the result would be more accurate.
Exploratory data analysis of the dataset to help gustavo find the problem within his e-commerce company and to help gustavo make important decision.
Creating user segmentation using cluster analysis to help targeted marketing.

Data Analysis

Note : only important steps shown to simplify the analysis explanation.

Data Preparation & Cleaning

Data Prep

Python environment preparation by loading the necessary library.

Import Dataset

Imported dataset from google sheets using csv interpreter.

Handling Data

Removed irrelevant features from the dataset.

Removed duplicate values from the dataset.

Feature Format

Changed the feature format to datetime, this is necessary to analyze the data further.

Handling Outliers

Removed the outliers from the data to make analysis results more statistically significant.

Combining Data

Combined data using join function.

Exploratory Data Analysis

Evaluative descriptive statistics

Numerical feature desciption using describe function in python.

Actual Delivery Time

The minimum actual delivery time value is -7 and the maximum is 28 which results in an actual delivery time range of 35. This value is quite large when compared to the average actual delivery time of 10.09. The mean > median > mode indicates a positive skew. The standard deviation of 6.06 Kurtosis 0.12 indicates the data is fairly distributed with the standard.

Payment Sequential

The minimum payment sequential value is 1 and the maximum is 29 which results in a sequential payment range of 28. This value is quite large when compared to the average sequential payment of 1.10. The mean > median > mode indicates a positive skew. The standard deviation of 0.74 Kurtosis 338.85 indicates peaked data compared to the normal distribution.

Payment Installments

The minimum payment installments value is 0 and the maximum is 24 which results in a payment installments range of 24. This value is quite large when compared to the average payment installments of 2.61. The mean > median > mode indicates a positive skew. The standard deviation of 2.44 Kurtosis 3.10 indicates peaked data compared to the normal distribution.

Payment Value

The minimum payment value is 0 and the maximum is 344.34 which results in a payment value time range of 344.34. This value is quite large when compared to the average payment value of 109.37. The mean > median > mode indicates a positive skew. The standard deviation of 72.53 Kurtosis 0.52 indicates that the data is fairly distributed with a standard.

Categorical feature desciption using describe function in python.

Number of orders per month

The graph shows a comparison between the number of orders vs the current month from October 2016 to August 2018. Over time the number of orders has increased until it reaches the highest number of orders in November 2017 of 6,581 orders. Then the number of orders decreased and increased until the last in August 2018 as many as 6,313 orders.

Daily orders trend

The graph shows the comparison between the number of orders vs the day on which the order was made. This chart is made to see Daily Trends/peak days where buyers place orders. It can be concluded that the peak day is Tuesday, where the total number of orders reached 17660 orders.

Percentage of orders per payment type

The graph shows the comparison between the number of orders vs. payment type. This graph is made to see what payment types are most often used by buyers. The most obtained were credit cards with a percentage of 73.21%, followed by boleto with a percentage of 19.33%, vouchers at 5.93%, and finally debit cards at 1.53%.

User Segmentation

Preparing the data for cluster analysis

How many clusters? We used Elbow Method and Silhouette Analysis

The elbow method show 4 clusters is the optimal cluster number.

The silhouette analysis also show 4 clusters is the optimal cluster number.

Creating cluster using K-Means

The K-Means clustering used because the data have more numerical features than categorical features.

The distribution of data on each cluster quite good (no cluster with small count).
Cluster 0: prefer to pay with credit card with relatively normal amounts of payment value >> can be named: "Middle Class"
Cluster 1: prefer to pay with credit card (predominantly) with relatively huge amounts of payment value >> can be named: "The Real Sultans"
Cluster 2: prefer to pay with boleto (cash) with relatively normal amounts of payment >> can be named: "Boleto Lovers"
Cluster 3: prefer to pay with voucher & debit card (cash) with relatively small amounts of payment >> can be named: "Frugal Living"

Recommendation

The market predominantly populated with middle class, using discount or promo would be helpful to boost sales.
The real sultans also have significant amount of population, and these people tend to be more loyal and come back to buy more. Increasing benefit or creating customer loyalty program campaign would be helpful to retain these customers since they are can be quite a money maker.

Home Projects