Pandas

Language: Python

Data Science

Pandas was created by Wes McKinney in 2008 to provide a high-performance, user-friendly data analysis tool for Python. It has become the standard library for data manipulation in Python, widely used in data science, finance, research, and analytics.

Pandas is a powerful Python library for data manipulation and analysis. It provides fast, flexible, and expressive data structures such as Series and DataFrame for working with structured data.
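
As a quick illustration of these two structures, the sketch below builds a Series and a DataFrame from plain Python lists and dictionaries; the names and values are made up for the example.

import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame is a two-dimensional table of labeled columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [34, 28, 41],
})

print(s)
print(df)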

Installation

pip: pip install pandas
conda: conda install pandas

Usage

Pandas makes it easy to read, write, and manipulate data from many sources, including CSV files, Excel spreadsheets, and SQL databases. You can filter, aggregate, group, pivot, merge, and reshape datasets efficiently.
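
As a sketch of the I/O side, the file names and table name below are placeholders, while the reader and writer functions themselves (read_csv, read_excel, read_sql, to_csv) are part of the pandas API.

import sqlite3
import pandas as pd

# Read from CSV and Excel (file names are placeholders)
df_csv = pd.read_csv('data.csv')
df_xlsx = pd.read_excel('data.xlsx')   # .xlsx support requires the openpyxl package

# Read from a SQL database via a DB-API connection
conn = sqlite3.connect('example.db')
df_sql = pd.read_sql('SELECT * FROM employees', conn)

# Write a DataFrame back out to CSV without the index column
df_csv.to_csv('output.csv', index=False)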

Loading a CSV and viewing data

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())

Reads a CSV file into a DataFrame and displays the first five rows.

Selecting columns and rows

print(df['column_name'])
print(df.iloc[0])

Access a column by name with bracket indexing and a row by integer position with .iloc.
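
For a self-contained illustration (the column names and index labels here are invented for the example), this sketch contrasts position-based access with .iloc and label-based access with .loc:

import pandas as pd

df = pd.DataFrame(
    {'name': ['Alice', 'Bob'], 'age': [34, 28]},
    index=['r1', 'r2'],
)

print(df['name'])           # a single column by name
print(df.iloc[0])           # first row by integer position
print(df.loc['r1'])         # row by index label
print(df[['name', 'age']])  # several columns as a new DataFrame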

Filtering and grouping data

filtered = df[df['age'] > 30]
grouped = filtered.groupby('department').mean(numeric_only=True)

Filter rows based on a condition, then group by a column and compute the mean of each numeric column.
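
A minimal, self-contained version of the same pattern, with made-up column values, might look like this:

import pandas as pd

df = pd.DataFrame({
    'department': ['IT', 'IT', 'HR', 'HR'],
    'age': [25, 40, 35, 50],
    'salary': [50000, 70000, 60000, 65000],
})

# Keep only rows where age exceeds 30, then average numeric columns per department
filtered = df[df['age'] > 30]
grouped = filtered.groupby('department').mean(numeric_only=True)
print(grouped)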

Merging DataFrames

merged = pd.merge(df1, df2, on='id', how='inner')

Combine two DataFrames on a common column using an inner join.
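
A runnable sketch of the merge, using small hypothetical tables and showing a left join for contrast:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'salary': [70000, 60000, 65000]})

inner = pd.merge(df1, df2, on='id', how='inner')  # only ids present in both tables
left = pd.merge(df1, df2, on='id', how='left')    # all ids from df1, NaN where df2 has no match
print(inner)
print(left)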

Pivot tables

pivot = df.pivot_table(index='department', columns='gender', values='salary', aggfunc='mean')

Create a pivot table that summarizes mean salary by department and gender.
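
A self-contained sketch with invented data, matching the call above:

import pandas as pd

df = pd.DataFrame({
    'department': ['IT', 'IT', 'HR', 'HR'],
    'gender': ['F', 'M', 'F', 'M'],
    'salary': [72000, 68000, 61000, 59000],
})

# Rows are departments, columns are genders, cells are mean salaries
pivot = df.pivot_table(index='department', columns='gender',
                       values='salary', aggfunc='mean')
print(pivot)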

Error Handling

FileNotFoundError: Ensure the file path is correct when reading CSV/Excel files.
KeyError: Verify column names exist before accessing them.
ValueError: Check the shape and alignment of DataFrames when merging or concatenating.
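
One way to guard against the first two, sketched with the file and column names used earlier in this document:

import pandas as pd

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print('data.csv not found; check the path')
    raise

# Verify a column exists before using it to avoid a KeyError
if 'age' in df.columns:
    adults = df[df['age'] > 30]
else:
    print("column 'age' is missing from the file")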

Best Practices

Use vectorized operations instead of loops for performance (see the sketch after this list).

Clean data before analysis: handle missing values, duplicates, and inconsistent types.

Use descriptive column names for readability.

Leverage built-in aggregation functions for efficiency.

Profile large datasets with df.info() and df.describe() before processing.
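
To make the first and last points concrete, here is a small sketch (the columns are hypothetical) comparing a Python loop with the equivalent vectorized expression, followed by the two profiling calls:

import pandas as pd

df = pd.DataFrame({'price': [10.0, 12.5, 7.2], 'quantity': [3, 1, 4]})

# Slow: explicit Python loop over rows
totals = []
for _, row in df.iterrows():
    totals.append(row['price'] * row['quantity'])
df['total_loop'] = totals

# Fast: vectorized column arithmetic
df['total'] = df['price'] * df['quantity']

# Quick profiling before heavier processing
df.info()
print(df.describe())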