KMART — Data Analysis for Sales Strategy Using Python
Background
Kmart is a leading online retailer in the US and as part of their annual sales review meeting, they need to decide on their sales strategy for the year 2020 based on the insights from the sales data in 2019.
You are given month wise sales data for 2019 and the task is to generate key insights which will help the sales team of Kmart to take some key business decisions towards finetuning their sales strategy.
PROBLEMS STATEMENTS
1. What was the best month for sales? How much was earned that month?
2. What city sold the most product?
3. What time should we display advertisements to maximize the likelihood of customer’s buying products?
4. What Products are most often sold together?
5. What product sold the most? Why do you think it did?
Pre-modelling methods:
We can see that there are 12 csv files having month wise sales data for electronic items. Each csv file has order details consisting of order ID, Product, Quantity, Price and Date along with purchase address.
To analyse the data from these csv files, let us first import them into each dataframe using pandas library.
After loading csv data into 12 dataframes, we will add these 12 dataframes into a list “sales”
Add this list (“sales”) into a single dataframe using pd.concat()
We will try to understand what are the datatypes of each column in this dataset.
We can see all columns have data type as object only and there are a total of 1,86,850 records in this dataframe.
We will now add a “Month” column to this dataset by extracting the month details from the Order Date column. First 2 characters of the Order data column refers to month.
As we can see the error says there are few rows which have NaN values in the Order Date column.
We realise that there are 545 rows having NaN values in this dataframe. Let us drop all such rows from our dataframe.
We can see that the number of rows has reduced from 1,86,850 to 1,86,305 i.e. by 545 count.
Let us again try to add a Month column to this dataset.
It seems there are few non-integer values in the Order date column, let us check what exactly are these non-integer values
Let us remove these strings from the dataset.
After removing the string characters, we can see that the Month column has only numbers from 1 to 12 each corresponding to respective month.
Now let us add a “Sales” column to this dataframe by multiplying Quantity ordered and Price of each item.
This error shows that the data in Quantity and Price columns is not numeric.
We will convert both these columns to numeric data type and add a sales column to this dataframe.
Now our dataframe is ready for further analysis.
Q1 .What was the best month for sales ? How much was earned that month?
Now that we have added sales column and month column to our dataframe, we can analyse which month had best sales.
We will first create a new dataset which will have all the records grouped by Month and each group will have a sum of all the sales made in that month.
We can clearly see that month 12 (December) is the highest sales in 2019 with approximately $4,810,000.
Q2 . Which city had the highest number of sales ?
To answer this question, obviously we need to create a new column called “City” column. How do we get that? As usual, we will check the top 5 data in our dataframe to figure out where we can get our “City” column using .head() method.
The “Purchase Address” column contains the city information. We can’t get it directly, we need to extract the data. We will extract this data into a separate dataframe.
We can see now the City column has a city name which is extracted from the Purchase Address column.
We will group this dataframe by City and each group will have a sum of all the sales in that city.
We can conclude from the graph that San Francisco has the highest number of sales.
Q3: Recommend the most appropriate time to display advertising to maximize the likelihood of customers buying the product/s?
To answer this question, we need to understand at what time of the day KMART had the maximum Sales. Let us look at the data to take the necessary action.
As per the above data, the Order Date Column has the date time stamp. We need to extract the Hours from the Order Date. Given below is the code used to extract the Hours.
Let us look at the head of the data frame.
Next, we have grouped the data by Hours and total sales. Then, sorting the data in descending order gives the hours where the sales were high.
The sales were maximum at 7 pm and 12 pm. Let, try an alternate way to validate this visually by plotting the graph of total Quantity Ordered with respect to each hour. Here are the code snippet and the output:
We can clearly see a peak at 12 pm as well as 7 pm.
Thus, we can say that the most appropriate time to display advertising to maximize the likelihood of customers buying the product/s is right before 12 pm and/or before 7 pm.
Maybe, 11:30 am and 6:30 pm could be the right time to display the advertisement.
Q4. What products are most often sold together?
By having a look at the data, we observed that to answer this question, we need to group the product by the Order ID to know which products were sold together. Let’s find the duplicate values of the “Order ID” by using the .duplicated() method.
Next, we need to create a new column called “Product Bundle” to join values from multiple rows into a single row. We will use the .transform() method here.
We can see a lot of duplicate values as the products were merged for each Order Id. We need to drop these duplicates first.
Now that we have removed the duplicates, we need to count the pair of products. We need to import two new libraries itertools and collections in order to count all the combinations of the product bundles. Next, to display the top 10 combinations with the count.
We can clearly see that the most often products sold together are iPhone and Lightning Charging Cable with 1005 transactions.
Q5. What products sold the most ?, Why do you think it sold the most ?
To answer this question, we first need to find the sum of the “Quantity Ordered” by grouping by “Product”
Let us also have a visual representation of the Quantity Ordered for each Product.
The output clearly shows that the Product that was ordered the most was AAA Batteries (4-pack) and the Quantity Ordered was 31017.
Let us also have a visual representation of the Price of each grouped by Product.
The above visualization reveals that the price of AAA Batteries (4-pack) is the lowest ($2.99). Hence, it is the product which was sold the most.