Pandas tips I wish I knew before
How does pivot work? What is the main pandas building block? And more …
I’ve been a pandas power user for a few years now. There were times when I thought I’ve mastered it and after a few months discovered I was coding like a noob. We’ve all been there ???? The tips I’m sharing here with you are the ones I learned most recently.
To run the examples download this Jupyter notebook.
To Step Up Your Pandas Game, read:
- 5 lesser-known pandas tricks
- Exploratory Data Analysis with pandas
- How NOT to write pandas code
- 5 Gotchas With Pandas
- Pandas tips that will save you hours of head-scratching
- Display Customizations for pandas Power Users
- 5 New Features in pandas 1.0 You Should Know About
- pandas analytics server
- Pandas presentation tips I wish I knew earlier
- Pandas analysis of coronavirus pandemic
Let’s start with an Explosive tip ????
Pandas has a function explode. Don’t worry, it is completely safe to use it ???? The function is useful when you have lists stored in a DataFrame column. It unpacks the values in a list and it duplicates all other values — hence explode ????
Let’s create a DataFrame with a column that has a random number of elements in lists.
n = 10
df = pd.DataFrame(
{
"list_col": [[random.randint(0, 10) for _ in range(random.randint(3, 5))] for _ in range(10)],
}
)
df.shape(10, 1) # output
Now, let’s execute the explode function.
df = df.explode("list_col")
df.shape(40, 1) #output
In the table below, we can observe that the index was duplicated due to unpacking values from the list.
DataFrame is not pandas main building block
Pandas is almost synonymous with DataFrames, but you may be surprised if I tell you that it is not its main building block. Pandas DataFrame is build on top of pandas Series. So each column in a DataFrame is a Series. We could go deeper, what is pandas Series based on? Numpy! But let’s don’t go that deep.
I even wrap Python’s list in a pandas Series when benefits outweigh the cost of installing pandas in the project — I am referring to the backend and not analytics projects here.
Let’s calculate a rolling sum of values in a list with window 2.
values = [1, 2, 0, 0, 0, 0, 1, 2, 3]
sr = pd.Series(values)
sr.rolling(2).sum()
The code is much cleaner this way.
Pivot — Data Scientist’s Essential tool
What is a pivot? A pivot table is a table of statistics that summarizes the data of a more extensive table. Sounds useful! Let’s try it.
Let’s say we have a fruit shop and a few customers.
df = pd.DataFrame(
{
"fruit": ["apple", "orange", "apple", "avocado", "orange"],
"customer": ["ben", "alice", "ben", "josh", "steve"],
"quantity": [1, 2, 3, 1, 2],
}
)
Now, let’s pivot the table with fruit and customer columns and aggregate quantity values. We expect a table where fruits are on one axis and customers on the other. Quantity values should be aggregated
df.pivot(index="fruit", columns="customer", values="quantity")
We get a “ValueError: Index contains duplicate entries, cannot reshape”. What does it mean?
Let’s try pivot with a different command (but easier to interpret), which has the same effect as pivot function.
df.set_index(["fruit", "customer"])["quantity"].unstack()
The command returns the same error as the pivot function, but it is more clear what is happening behind the scenes. It seems that fruit and customer columns when combined don’t have a unique index.
Now, let’s try with pivot_table function, which is also the recommended approach. Note, that the function also supports aggfunc, which is np.mean by default. We use np.sum in the example below.
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum)
Few worthy mentions
Accessing elements
Accessing the first element (and the last element) in a DataFrame is as easy as:
df = pd.DataFrame({"col": [1, 2, 3, 4, 5]})df.iloc[0]df.iloc[-1] # the last element
You might ask what is the difference between loc and iloc? We use loc when accessing rows by index, where index can be string, integer or some other type.
We also use loc to set a new column value on a certain index (we cannot do that with iloc — it would return ValueError):
df.loc[0, "col"] = 1
Appending values
Pandas has an append function that enables you to append the values to the DataFrame. But the trick is that the values appended also need to be in the DataFrame.
df = df.append(pd.DataFrame({'col': [999]}))
df
Rows to columns
My favorite way to turn rows into columns is to use transpose. It also works in numpy.
df.T
Conclusion
These were the tips that I wish someone would tell me when I started using pandas. Sometimes, a friend asks me, how to turn rows into columns or how to change certain values in a DataFrame. It is as simple as described above. Hopefully, these tips save you a minute or two of unnecessary googling.
This post is originally from Towards Data Science.