Build Your Own PandasAI with LlamaIndex

Pandas AI is a Python library that leverages the power of generative AI to supercharge Pandas, the popular data analysis library. With just a simple prompt, Pandas AI lets you perform complex data cleaning, analysis, and visualization that previously required many lines of code.

Beyond crunching the numbers, Pandas AI understands natural language. You can ask questions about your data in plain English, and it will provide summaries and insights in everyday language, sparing you from deciphering complex graphs and tables.

In the example below, we provided a Pandas dataframe and asked the generative AI to create a bar chart. The result is impressive.

pandas_ai.run(df, prompt="Plot the bar chart of type of media for each year release, using different colors.")

Note: the code example is from the Pandas AI: Your Guide to Generative AI-Powered Data Analysis tutorial.

In this post, we will use LlamaIndex to create a similar tool that can understand a Pandas dataframe and produce complex results like the one shown above.

LlamaIndex enables natural language querying of data via chat and agents. It allows large language models to interpret private data at scale without retraining on new data, and it integrates large language models with various data sources and tools. LlamaIndex is a data framework that allows for the easy creation of Chat with PDF applications with just a few lines of code.

You can install the Python library by using the pip command.

pip install llama-index

By default, LlamaIndex uses the OpenAI gpt-3.5-turbo model for text generation and text-embedding-ada-002 for retrieval and embeddings. To run the code hassle-free, we must set up the OPENAI_API_KEY. We can register and get the API key for free on a new API token page.

import os
os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"
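Hardcoding the key in a script makes it easy to leak through version control. As a minimal sketch, assuming the key was exported in the shell beforehand, a small check can fail early with a clear message. The helper name get_openai_key is hypothetical and is not part of LlamaIndex or the OpenAI client.

```python
import os


def get_openai_key(env=None):
    """Return the OpenAI API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of LlamaIndex.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    return key
```

Calling get_openai_key() at the top of a script surfaces a missing key before any request is sent.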

LlamaIndex also supports integrations with Anthropic, Hugging Face, PaLM, and more models. You can learn everything about them by reading the Module's documentation.

Let's get to the main topic of creating your own PandasAI. After installing the library and setting up the API key, we will create a simple city dataframe with the city name and population as the columns.

import pandas as pd
from llama_index.query_engine.pandas_query_engine import PandasQueryEngine
df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

Using the PandasQueryEngine, we will create a query engine to load the dataframe and index it.

After that, we will write a query and display the response.

query_engine = PandasQueryEngine(df=df)

response = query_engine.query(
    "What is the city with the lowest population?",
)

As we can see, it has generated the Python code for displaying the least populated city in the dataframe.

> Pandas Instructions:
```
eval("df.loc[df['population'].idxmin()]['city']")
```
eval("df.loc[df['population'].idxmin()]['city']")
> Pandas Output: Islamabad

And, if you print the response, you will get "Islamabad." It is simple but impressive. You don't have to come up with your own logic or experiment with the code. Just type the question, and you will get the answer.

You can also print the code behind the result using the response metadata.
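To see why the generated expression returns the right city, it helps to run it step by step in plain pandas. The sketch below rebuilds the three-city dataframe from this example and needs neither LlamaIndex nor an API key; the intermediate variable names are ours.

```python
import pandas as pd

# The three cities used in the example above.
df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

# idxmin() returns the index label of the smallest population value.
row_label = df["population"].idxmin()

# loc[] selects that row; ["city"] then picks the city name from it.
least_populated = df.loc[row_label]["city"]
print(least_populated)  # Islamabad
```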

print(response.metadata["pandas_instruction_str"])
eval("df.loc[df['population'].idxmin()]['city']")
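The instruction stored in the metadata is ordinary Python. As a rough sketch of the idea only (an assumption about the general approach, not LlamaIndex's actual implementation), the inner expression can be evaluated against a namespace that exposes the dataframe as df. The function name run_pandas_instruction is hypothetical.

```python
import pandas as pd


def run_pandas_instruction(instruction: str, df: pd.DataFrame):
    """Evaluate a single pandas expression with `df` bound to the dataframe.

    Hypothetical helper for illustration. eval() runs arbitrary code, so it
    should only ever see instructions you trust or have reviewed.
    """
    # An empty __builtins__ limits, but does not eliminate, what the expression can reach.
    return eval(instruction, {"__builtins__": {}}, {"df": df, "pd": pd})


df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)
result = run_pandas_instruction("df.loc[df['population'].idxmin()]['city']", df)
print(result)  # Islamabad
```

Because the model writes the expression, it is worth reading the instruction string before trusting a result, especially on data you cannot afford to modify.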

In the second example, we will load the Global YouTube Statistics 2023 dataset from Kaggle and perform some fundamental analysis. It is a step up from the simple examples.

We will use read_csv to load the dataset and pass it to the query engine. Then we will write a prompt to display only the columns with missing values, along with the number of missing values in each.

df_yt = pd.read_csv("Global YouTube Statistics.csv")
query_engine = PandasQueryEngine(df=df_yt, verbose=True)

response = query_engine.query(
    "List the columns with missing values and the number of missing values. Only show missing values columns.",
)

> Pandas Instructions:
```
df.isnull().sum()[df.isnull().sum() > 0]
```
df.isnull().sum()[df.isnull().sum() > 0]
> Pandas Output: category                                    46
Country                                    122
Abbreviation                               122
channel_type                                30
video_views_rank                             1
country_rank                               116
channel_type_rank                           33
video_views_for_the_last_30_days            56
subscribers_for_last_30_days               337
created_year                                 5
created_month                                5
created_date                                 5
Gross tertiary education enrollment (%)    123
Population                                 123
Unemployment rate                          123
Urban_population                           123
Latitude                                   123
Longitude                                  123
dtype: int64
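The generated instruction is a common pandas idiom: count the missing cells per column, then filter the counts with a boolean mask. Here is a minimal sketch on a tiny made-up dataframe (the rows are invented for illustration and are not taken from the Kaggle file).

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Kaggle dataset.
df = pd.DataFrame(
    {
        "Youtuber": ["channel_a", "channel_b", "channel_c"],
        "category": ["Music", None, "Education"],
        "Country": [None, None, "India"],
        "subscribers": [100, 200, 300],
    }
)

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = df.isnull().sum()

# The boolean mask keeps only the columns with at least one missing value.
only_missing = missing_per_column[missing_per_column > 0]
print(only_missing)
```

Storing the counts in a variable avoids computing df.isnull().sum() twice, which the generated one-liner does.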

Now, we will ask direct questions about popular channel types. In my opinion, the LlamaIndex query engine is highly accurate, and it has not yet produced any hallucinations in my testing.

response = query_engine.query(
    "Which channel type have the most views.",
)

> Pandas Instructions:
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
> Pandas Output: Entertainment
Entertainment
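The instruction here is a group-and-aggregate: sum the views per channel type, then take the label of the largest total. A minimal sketch with made-up numbers (not the real dataset) shows the two steps separately.

```python
import pandas as pd

# Made-up numbers for illustration; not taken from the Kaggle dataset.
df = pd.DataFrame(
    {
        "channel_type": ["Music", "Entertainment", "Music", "Education", "Entertainment"],
        "video views": [50, 70, 30, 40, 60],
    }
)

# Total views for each channel type.
views_by_type = df.groupby("channel_type")["video views"].sum()

# idxmax() returns the group label with the largest total.
top_type = views_by_type.idxmax()
print(top_type)  # Entertainment
```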

Finally, we will ask it to visualize a bar chart, and the results are amazing.

response = query_engine.query(
    "Visualize bar chart of top ten youtube channels based on subscribers and add the title.",
)

> Pandas Instructions:
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title="Top Ten YouTube Channels Based on Subscribers")")
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title="Top Ten YouTube Channels Based on Subscribers")")
> Pandas Output: AxesSubplot(0.125,0.11;0.775x0.77)

With a simple prompt and a query engine, we can automate our data analysis and perform complex tasks. There is so much more to LlamaIndex. I highly recommend that you read the official documentation and try to build something amazing.
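For the bar chart query, the printed AxesSubplot(...) line is just the text representation of the Matplotlib axes object that pandas returns from plot, so in a script the figure still has to be shown or saved. As a minimal sketch on made-up data, assuming matplotlib is installed, the selection and the plotting can be separated. The function name save_bar_chart and the output file name are hypothetical.

```python
import pandas as pd

# Made-up channels and subscriber counts for illustration only.
df = pd.DataFrame(
    {
        "Youtuber": ["channel_a", "channel_b", "channel_c", "channel_d"],
        "subscribers": [120, 450, 300, 80],
    }
)

# nlargest keeps the n rows with the highest subscriber counts, sorted descending.
top = df.nlargest(3, "subscribers")[["Youtuber", "subscribers"]]


def save_bar_chart(data, path="top_channels.png"):
    """Plot the selection as a bar chart and save it to `path` (needs matplotlib)."""
    import matplotlib

    matplotlib.use("Agg")  # non-interactive backend, so no window is required
    ax = data.plot(kind="bar", x="Youtuber", y="subscribers", title="Top Channels by Subscribers")
    ax.get_figure().savefig(path, bbox_inches="tight")
    return ax
```

Calling save_bar_chart(top) writes the chart to top_channels.png; in a notebook the figure is rendered inline, which is why the query engine example only prints the axes object.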

In summary, LlamaIndex is an exciting new tool that allows developers to create their own PandasAI, leveraging the power of large language models for intuitive data analysis and conversation. By indexing and embedding your dataset with LlamaIndex, you can enable advanced natural language capabilities on your private data without compromising security or retraining models.

This is just a start. With LlamaIndex you can build Q&A over documents, Chatbots, Automated AI, Knowledge Graph, AI SQL Query Engine, Full-Stack Web Application, and private generative AI applications.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.