
Ink to Insights: Comparing SQL and Python Queries Using Bookshop Analytics | by John Lenehan | Sep, 2023

SQL is the bread and butter of any data scientist's toolbox — the ability to quickly pull data from a data source for analysis is an essential skill for anyone working with large amounts of data. In this post I wanted to give some examples of the basic queries I typically use in SQL over the course of an EDA process. I'll compare these queries against equivalent scripts in Python which produce the same output, as a comparison between the two approaches.

For this analysis I'll be using some synthetic data on last year's highest rated books from a hypothetical chain of bookshops (the Total Fiction Bookstore). A link to the GitHub folder for this project can be found here, where I go into the details of running the analysis.

As a side note — while I primarily focus on the SQL queries in this article, it's worth noting that these queries can be integrated fairly seamlessly with Python using the pandaSQL library (as I've done for this project). This can be seen in detail in the Jupyter notebook on this project's GitHub link, but the structure of these queries generally goes as follows:

question = """
SELECT * FROM DATA
"""

output = sqldf(question,locals())
output

PandaSQL is a very practical library for those who have more familiarity with SQL querying than the typical Pandas dataset manipulation — and is often much easier to read, as I'll show here.
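For instance, here is a minimal self-contained sketch of the pattern (the demo DataFrame and its columns are hypothetical, purely to show the mechanics):

import pandas as pd
from pandasql import sqldf

# Hypothetical demo DataFrame, purely to show the mechanics
demo = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [3, 1, 2]})

# sqldf runs a SQL string against any DataFrame in the supplied scope
query = """
SELECT name, value
FROM demo
ORDER BY value DESC
"""

# The result comes back as an ordinary pandas DataFrame
print(sqldf(query, locals()))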

The Dataset

A snippet of the dataset can be seen below — there are columns for the book title and the year it was published, the number of pages, the genres, the book's average rating, the author, the number of units sold, and the book's revenue.
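Since the data is synthetic, a small stand-in DataFrame with the same schema is enough to follow along with the snippets in this article — the rows below are illustrative placeholders, not the actual dataset:

import pandas as pd

# Illustrative stand-in for the bookshop dataset (values are made up)
df = pd.DataFrame({
    'Title': ['Book A', 'Book B', 'Book C', 'Book D'],
    'YearPublished': [1964, 1987, 2001, 1995],
    'NumPages': [312, 420, 288, 515],
    'Genres': ['Fiction', 'Mystery', 'Fiction', 'Fantasy'],
    'Rating': [4.2, 3.8, 4.5, 4.0],
    'Author': ['Ava Mitchell', 'Liam Parker', 'Emma Hayes', 'Ava Mitchell'],
    'UnitsSold': [1500, 900, 2100, 1200],
    'Revenue': [18000, 10800, 25200, 14400],
})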

Revenue Analysis by Decade

Let's say I want to know which decade has published the most profitable books for the bookshop. The original dataset doesn't have a column for the decade the books were published in — however this is relatively straightforward to add to the data. I run a subquery to divide the year using floor division and multiply by 10 to get the decade, before aggregating and averaging the revenue by decade. I then order the results by total revenue to get the most profitable decades of published books in the bookshop.

WITH bookshop AS (
    SELECT TITLE, YEARPUBLISHED,
           (YEARPUBLISHED / 10) * 10 AS DECADE,
           NUMPAGES, GENRES, RATING, AUTHOR, UNITSSOLD, REVENUE
    FROM df
)

SELECT DECADE,
       SUM(REVENUE) AS TOTAL_REVENUE,
       ROUND(AVG(REVENUE), 0) AS AVG_REVENUE
FROM bookshop
GROUP BY DECADE
ORDER BY TOTAL_REVENUE DESC

By comparison, an equivalent output in Python would look something like the code snippet below. I apply floor division to the publication year (multiplying by 10 to recover the decade), and from there I aggregate the revenue by decade and sort the result by total revenue.

# creating df bookshop
bookshop = df.copy()
bookshop['Decade'] = (bookshop['YearPublished'] // 10) * 10

# group by decade, aggregate revenue by sum and mean
result = (bookshop.groupby('Decade')
          .agg({'Revenue': ['sum', 'mean']})
          .reset_index())

result.columns = ['Decade', 'Total_Revenue', 'Avg_Revenue']

# sorting by total revenue, descending (to match the SQL ORDER BY)
result = result.sort_values('Total_Revenue', ascending=False)

Note the greater number of separate steps in the Python script to achieve the same result — the functions are awkward and harder to parse at first glance. By comparison, the SQL script is much clearer in its presentation and much easier to read.
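To feed the SQL output into matplotlib, the query result first has to be captured as a DataFrame. A minimal sketch, assuming the decade query above is stored in a string named decade_query (a hypothetical name):

# Capture the SQL output as a DataFrame for plotting
# (decade_query is assumed to hold the decade SQL above as a string)
agg_decade = sqldf(decade_query, locals())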

I can now take this query output and visualise it to get a sense of the book revenue trends across the decades, setting up a matplotlib graph with the following script — bar charts show the total revenue by decade, with a scatter plot on the secondary axis to show average book revenue.

import matplotlib.pyplot as plt

# Creating primary y-axis (total revenue)
fig, ax1 = plt.subplots(figsize=(15, 9))
ax1.bar(agg_decade['DECADE'], agg_decade['TOTAL_REVENUE'],
        width=0.4, align='center', label='Total Revenue (Dollars)')
ax1.set_xlabel('Decade')
ax1.set_ylabel('Total Revenue (Dollars)', color='blue')

# Adjusting gridlines on the primary y-axis
ax1.grid(color='blue', linestyle='--', linewidth=0.5, alpha=0.5)

# Creating secondary y-axis (avg revenue)
ax2 = ax1.twinx()
ax2.scatter(agg_decade['DECADE'], agg_decade['AVG_REVENUE'],
            marker='o', color='purple', label='Avg Revenue (Dollars)')
ax2.set_ylabel('Avg Revenue (Dollars)', color='purple')

# Adjusting gridlines on the secondary y-axis
ax2.grid(color='purple', linestyle='--', linewidth=0.5, alpha=0.5)

# Scaling both y-axes from zero with 10% headroom
ax1.set_ylim(0, 1.1 * max(agg_decade['TOTAL_REVENUE']))
ax2.set_ylim(0, 1.1 * max(agg_decade['AVG_REVENUE']))

# Combining legends for both axes
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2, loc='upper left')

# Set title
plt.title('Total and Avg Revenue by Decade')

# Show the plot
plt.show()

The visualisation can be seen below — books published in the 1960s are apparently the most profitable for the bookshop, generating over $192,000 in revenue for Total Fiction Bookstore. By comparison, books on the list from the 1900s are more profitable on average, though they didn't sell as well as books from the 1960s.

Average book revenue follows a similar trend to total revenue across all decades of published books — apart from books from the 1900s and 1980s, which are more profitable on average but not overall.

Author Analysis

Now, suppose I want to get data on the top 10 authors in the list, ordered by their total generated revenue. For this query I want to know the number of books they've written that appear on the list, the total revenue they've generated on those books, their average revenue per book, and the average rating of those books in the bookshop. Simple enough question to answer using SQL — I can use a count statement to get the total number of books they've written, and avg statements to get the mean revenue and rating per author. Following that I can group these statements by author.

SELECT AUTHOR,
       COUNT(TITLE) AS NUM_BOOKS,
       SUM(REVENUE) AS TOTAL_REVENUE,
       ROUND(AVG(REVENUE), 0) AS AVG_REVENUE,
       ROUND(AVG(RATING), 2) AS AVG_RATING_PER_BOOK
FROM bookshop
GROUP BY AUTHOR
ORDER BY TOTAL_REVENUE DESC
LIMIT 10

An equivalent Python script would look like this — roughly the same length, but much more convoluted for the same output. I group the values by author before specifying how to aggregate each column in the agg function, then sort the values by total revenue. Again, the SQL script is much clearer by comparison.

result = (bookshop.groupby('Author')
          .agg({
              'Title': 'count',
              'Revenue': ['sum', 'mean'],
              'Rating': 'mean'
          })
          .reset_index())

result.columns = ['Author', 'Num_Books', 'Total_Revenue',
                  'Avg_Revenue', 'Avg_Rating_per_Book']

# Sorting by total revenue
result = result.sort_values('Total_Revenue', ascending=False)

# top 10
result_top10 = result.head(10)

The output from this query can be seen below — Ava Mitchell leads the field, with total revenue of over $152,000 from her book sales. Emma Hayes takes second place at over $85,000, with Liam Parker close behind at over $83,000.

Visualising this in matplotlib with the following script, we can generate bar plots of total revenue with data points showing the average book revenue per author. The average rating per author is also plotted on a secondary axis.

# Creating figure and axis
fig1, ax1 = plt.subplots(figsize=(15, 9))

# Plotting bar chart of total revenue
ax1.bar(agg_author['Author'], agg_author['TOTAL_REVENUE'],
        width=0.4, align='center', color='silver', label='Total Revenue (Dollars)')
ax1.set_xlabel('Author')
ax1.set_xticklabels(agg_author['Author'], rotation=-45, ha='left')
ax1.set_ylabel('Total Revenue (Dollars)', color='blue')

# Adjusting gridlines on the primary y-axis
ax1.grid(color='blue', linestyle='--', linewidth=0.5, alpha=0.5)

# Creating scatter plot of avg revenue per book
ax1.scatter(agg_author['Author'], agg_author['AVG_REVENUE'],
            marker='D', color='blue', label='Avg Revenue per Book (Dollars)')

# Creating scatter plot of avg rating on secondary axis
ax2 = ax1.twinx()
ax2.scatter(agg_author['Author'], agg_author['AVG_RATING_PER_BOOK'],
            marker='^', color='purple', label='Avg Rating per Book')
ax2.set_ylabel('Avg Rating', color='purple')

# Adjusting gridlines on the secondary y-axis
ax2.grid(color='purple', linestyle='--', linewidth=0.5, alpha=0.5)

# Combining legends for both axes
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper right')

# Set title
plt.title('Top 10 Authors by Revenue, Rating')

# Show the plot
plt.show()

Running this, we get the following graph:

This graph points to a fairly clear conclusion — revenue doesn't correlate with average rating for each author. Ava Mitchell has the highest revenue but sits at the median rating among the authors listed above. Olivia Hudson is highest by average rating while placing eighth in total revenue; there is no observable trend between an author's revenue and their popularity.
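This can also be sanity-checked numerically — a short sketch, assuming the agg_author DataFrame from the plotting step above (the exact coefficient will depend on the synthetic data):

# Pearson correlation between total revenue and average rating
# across the top 10 authors (agg_author is the query output above)
corr = agg_author['TOTAL_REVENUE'].corr(agg_author['AVG_RATING_PER_BOOK'])
print(f"Revenue vs rating correlation: {corr:.2f}")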

Comparing Book Length vs Revenue

Lastly, let's assume I want to show how book revenue differs based on book length. To answer this question, I first want to divide the books equally into four categories based on the book length quartiles, which will give a better idea of overall revenue vs book length trends.

Firstly, I define the quartiles in SQL, using a subquery to generate these values, before sorting the books into buckets using a CASE WHEN statement.

WITH PERCENTILES AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY NUMPAGES)
            AS PERCENTILE_25,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY NUMPAGES)
            AS MEDIAN,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY NUMPAGES)
            AS PERCENTILE_75
    FROM bookshop
)

SELECT
    TITLE, REVENUE, NUMPAGES,
    CASE
        WHEN NUMPAGES < (SELECT PERCENTILE_25 FROM PERCENTILES)
            THEN 'Quartile 1'
        WHEN NUMPAGES BETWEEN (SELECT PERCENTILE_25 FROM PERCENTILES)
            AND (SELECT MEDIAN FROM PERCENTILES) THEN 'Quartile 2'
        WHEN NUMPAGES BETWEEN (SELECT MEDIAN FROM PERCENTILES)
            AND (SELECT PERCENTILE_75 FROM PERCENTILES) THEN 'Quartile 3'
        WHEN NUMPAGES > (SELECT PERCENTILE_75 FROM PERCENTILES)
            THEN 'Quartile 4'
    END AS PAGELENGTH_QUARTILE
FROM bookshop
ORDER BY REVENUE DESC

Alternatively (for SQL dialects that don't support percentile functions, like SQLite), the quartiles can be calculated separately before being input manually into the CASE WHEN statement.

--For SQLite dialect
SELECT TITLE, REVENUE, NUMPAGES,
    CASE
        WHEN NUMPAGES < 318 THEN 'Quartile 1'
        WHEN NUMPAGES BETWEEN 318 AND 375 THEN 'Quartile 2'
        WHEN NUMPAGES BETWEEN 375 AND 438 THEN 'Quartile 3'
        WHEN NUMPAGES > 438 THEN 'Quartile 4'
    END AS PAGELENGTH_QUARTILE
FROM bookshop
ORDER BY REVENUE DESC

Running this same query in Python, I define the percentiles using numpy before using the cut function to sort the books into their buckets, then sorting the values by book length in pages. As before, this process is noticeably more convoluted than the equivalent script in SQL.

import numpy as np
import pandas as pd

# Define the percentiles using numpy
percentiles = np.percentile(bookshop['NumPages'], [25, 50, 75])

# Define the bin edges using the calculated percentiles
bin_edges = [-float('inf'), *percentiles, float('inf')]

# Define the labels for the buckets
bucket_labels = ['Quartile 1', 'Quartile 2', 'Quartile 3', 'Quartile 4']

# Create the 'PAGELENGTH_QUARTILE' column based on bin edges and labels
bookshop['PAGELENGTH_QUARTILE'] = pd.cut(bookshop['NumPages'], bins=bin_edges,
                                         labels=bucket_labels)

result = bookshop[['Title', 'Revenue',
                   'NumPages', 'PAGELENGTH_QUARTILE']].sort_values(by='NumPages',
                                                                   ascending=False)

The output of this query can be visualised as boxplots using seaborn — a snippet of the script used to generate the boxplots can be seen below. Note that the page length buckets were manually sorted into the correct order so they're presented properly.

import seaborn as sns

# Set the style for the plots
sns.set(style="whitegrid")

# Setting order of page length buckets
pagelength_bucket_order = ['Quartile 1', 'Quartile 2',
                           'Quartile 3', 'Quartile 4']

# Create the boxplot
plt.figure(figsize=(16, 10))
sns.boxplot(x='PAGELENGTH_QUARTILE', y='Revenue',
            data=pagelength_output, order=pagelength_bucket_order,
            showfliers=True)

# Add labels and title
plt.xlabel('Page Length Quartile')
plt.ylabel('Revenue (Dollars)')
plt.title('Boxplot of Revenue by Page Length Bucket')

# Show the plot
plt.show()

The boxplots can be seen below — note the median revenue for each book length quartile trends upwards as the books get longer. This suggests that longer books are more profitable for the bookshop.

Additionally, the range of the 4th quartile is much wider compared to the other quartiles, indicating there's more variation in the price point for longer books.
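Both observations can be confirmed numerically with a quick aggregation — a sketch, assuming the pagelength_output DataFrame used for the boxplots:

# Median and spread of revenue within each page length quartile
summary = (pagelength_output.groupby('PAGELENGTH_QUARTILE')['Revenue']
           .agg(['median', 'min', 'max']))
print(summary)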

Final Thoughts and Further Applications

To conclude, using SQL for data analysis queries is often much more straightforward than performing the equivalent operations in Python; the language is easier to write than Python, while broadly capable of producing the same results. I wouldn't argue that either is better than the other — I've used a combination of both languages in this analysis — rather, I believe that using both languages together can produce more efficient and effective data analysis.

Therefore, given the higher readability of SQL queries over their Python equivalents, I think it's much more natural to use SQL when performing the initial EDA for a project. SQL is much easier to read and write, as I've shown in this article, making it especially advantageous for these early exploratory tasks. I often use it when beginning a project, and I'd recommend this approach to anyone who already has a decent grasp of SQL querying.