Introduction to Numpy and Pandas

Master 1 (203) in Financial Markets

Juan F. Imbet

Paris Dauphine - PSL University

2025-10-28

Part 1: NumPy Foundations

NumPy: History and Motivation

Created in 2005 by Travis Oliphant, building on Numeric and Numarray
Problem: Python lists are flexible but slow for numerical computations
Solution: NumPy provides efficient array operations in C
Used by: Pandas, SciPy, scikit-learn, and most scientific Python libraries
Key advantage: Vectorized operations eliminate slow Python loops

Why NumPy Arrays are Faster than Lists

Homogeneous types: All elements have the same type (no type checking overhead)
Contiguous memory: Data stored in adjacent memory locations
C implementation: Core operations written in optimized C code
Vectorization: Operations applied to entire arrays at once
No Python object overhead: Direct access to raw data

import numpy as np
import time

# Python list
python_list = list(range(1000000))
start = time.perf_counter()  # perf_counter is the clock intended for timing code
result = [x * 2 for x in python_list]
print(f"Python list: {time.perf_counter() - start:.4f}s")

# NumPy array
numpy_array = np.arange(1000000)
start = time.perf_counter()
result = numpy_array * 2
print(f"NumPy array: {time.perf_counter() - start:.4f}s")
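A single timed run can be noisy. As a hedged alternative, the sketch below repeats each measurement with the standard-library timeit module and keeps the best run. The names n, list_time and numpy_time are illustrative, and the actual timings depend on the machine.

```python
import timeit
import numpy as np

n = 1_000_000
python_list = list(range(n))
numpy_array = np.arange(n)

# Run each statement 5 times and keep the fastest run,
# which reduces noise from other processes on the machine.
list_time = min(timeit.repeat(lambda: [x * 2 for x in python_list],
                              number=1, repeat=5))
numpy_time = min(timeit.repeat(lambda: numpy_array * 2,
                               number=1, repeat=5))

print(f"Python list: {list_time:.4f}s")
print(f"NumPy array: {numpy_time:.4f}s")

# Both approaches compute the same values
same_values = [x * 2 for x in python_list[:5]] == (numpy_array[:5] * 2).tolist()
print(same_values)  # True
```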

NumPy Data Types

Integers: int8, int16, int32, int64 (signed), uint8, uint16, etc. (unsigned)
Floats: float16, float32, float64 (default)
Complex: complex64, complex128
Boolean: bool_
Memory efficiency: Choose appropriate type for your data

import numpy as np

# Explicitly specify data type
arr_int = np.array([1, 2, 3], dtype=np.int32)
arr_float = np.array([1.0, 2.0, 3.0], dtype=np.float64)

print(f"Integer array uses: {arr_int.itemsize} bytes per element")
print(f"Float array uses: {arr_float.itemsize} bytes per element")
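To make the memory-efficiency point concrete, here is a small sketch comparing the total size of the same data stored as float64 and float32, and showing the limited range of a small integer type. The array names are illustrative only.

```python
import numpy as np

n = 1_000_000
prices_64 = np.ones(n, dtype=np.float64)
prices_32 = prices_64.astype(np.float32)  # copy with a smaller type

# nbytes = number of elements * itemsize
print(prices_64.nbytes)  # 8000000
print(prices_32.nbytes)  # 4000000

# Small integer types save memory but have a limited range
info = np.iinfo(np.int8)
print(info.min, info.max)  # -128 127
```

The trade-off is precision: float32 carries roughly 7 significant decimal digits, float64 roughly 15 to 16.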

Arrays of Different Dimensions

0D (scalar): Single value
1D (vector): List of values [1, 2, 3]
2D (matrix): Rows and columns [[1, 2], [3, 4]]
3D and beyond: Tensors for complex data structures
Shape attribute: Returns dimensions as tuple

import numpy as np

scalar = np.array(42)
vector = np.array([1, 2, 3, 4])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(f"Scalar shape: {scalar.shape}")  # ()
print(f"Vector shape: {vector.shape}")  # (4,)
print(f"Matrix shape: {matrix.shape}")  # (2, 3)
print(f"Tensor shape: {tensor.shape}")  # (2, 2, 2)
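Besides shape, the attributes ndim and size are often useful, and reshape changes the dimensions without changing the data. A minimal sketch with illustrative names:

```python
import numpy as np

flat = np.arange(6)            # values 0..5 in one dimension
matrix = flat.reshape(2, 3)    # same 6 values, arranged as 2 rows x 3 columns

print(flat.ndim, flat.shape)      # 1 (6,)
print(matrix.ndim, matrix.shape)  # 2 (2, 3)
print(matrix.size)                # 6 (total number of elements)

# -1 asks NumPy to infer that dimension from the total size
auto = flat.reshape(3, -1)
print(auto.shape)                 # (3, 2)
```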

Creating NumPy Arrays

import numpy as np

# From Python list
arr1 = np.array([1, 2, 3, 4, 5])

# Range of values
arr2 = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

# Evenly spaced values
arr3 = np.linspace(0, 1, 5)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Zeros and ones
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))

# Random values
random_arr = np.random.rand(3, 3)  # Uniform [0, 1)
normal_arr = np.random.randn(3, 3)  # Standard normal
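The np.random.rand and np.random.randn functions above belong to NumPy's legacy random interface. A hedged sketch of the same idea with the newer Generator interface follows; the seed value and the return parameters (mean 0.0005, volatility 0.01, 250 days) are arbitrary assumptions for illustration.

```python
import numpy as np

# A seeded generator makes the random numbers reproducible
rng = np.random.default_rng(seed=42)

uniform_arr = rng.random((3, 3))          # Uniform [0, 1)
normal_arr = rng.standard_normal((3, 3))  # Standard normal

# Hypothetical daily returns: normal with chosen mean and volatility
returns = rng.normal(loc=0.0005, scale=0.01, size=250)

# Re-creating the generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=42)
reproducible = np.array_equal(uniform_arr, rng_again.random((3, 3)))
print(reproducible)  # True
```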

Operations Between Arrays: Element-wise

Arithmetic: +, -, *, /, ** applied element-by-element
Broadcasting: Operations between arrays of different but compatible shapes
No loops needed: All operations vectorized
Much faster: Than iterating through Python lists
Intuitive syntax: Mathematical operations look natural

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise operations
print(a + b)    # [11, 22, 33, 44]
print(a * b)    # [10, 40, 90, 160]
print(a ** 2)   # [1, 4, 9, 16]
print(b / a)    # [10.0, 10.0, 10.0, 10.0]

Broadcasting Rules

import numpy as np

# Broadcasting: scalar with array
arr = np.array([1, 2, 3, 4])
result = arr + 10  # [11, 12, 13, 14]

# Broadcasting: different shapes
matrix = np.array([[1, 2, 3], [4, 5, 6]])
vector = np.array([10, 20, 30])
result = matrix + vector  
# [[11, 22, 33],
#  [14, 25, 36]]

# Broadcasting: column vector
col_vector = np.array([[1], [2]])
result = matrix + col_vector
# [[2, 3, 4],
#  [6, 7, 8]]
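The rule behind these examples: NumPy compares shapes from the last dimension backwards, and two dimensions are compatible when they are equal or when one of them is 1 (missing leading dimensions are treated as 1). The sketch below checks a few shape pairs with np.broadcast_shapes, which assumes NumPy 1.20 or later, and shows what happens when shapes are incompatible. The name bad_vector is illustrative.

```python
import numpy as np

# Resulting shape for compatible pairs
print(np.broadcast_shapes((2, 3), (3,)))    # (2, 3)
print(np.broadcast_shapes((2, 3), (2, 1)))  # (2, 3)
print(np.broadcast_shapes((2, 1), (1, 3)))  # (2, 3)

matrix = np.array([[1, 2, 3], [4, 5, 6]])
bad_vector = np.array([10, 20])  # shape (2,): last dimension 2 does not match 3

try:
    matrix + bad_vector
    failed = False
except ValueError as err:
    failed = True
    print(f"Cannot broadcast: {err}")

# Adding an axis turns shape (2,) into (2, 1), which is compatible with (2, 3)
fixed = matrix + bad_vector[:, np.newaxis]
print(fixed)
# [[11 12 13]
#  [24 25 26]]
```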

Matrix Operations

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise multiplication
elementwise = A * B  # [[5, 12], [21, 32]]

# Matrix multiplication (dot product)
matmul1 = A @ B      # [[19, 22], [43, 50]]
matmul2 = np.dot(A, B)  # Same result

# Transpose
A_T = A.T  # [[1, 3], [2, 4]]

# Inverse (only for square, non-singular matrices)
A_inv = np.linalg.inv(A)
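As a quick check of the inverse, and as a sketch of a common alternative, the example below verifies that A times its inverse gives the identity and solves a linear system with np.linalg.solve instead of forming the inverse explicitly. The right-hand side b is an arbitrary illustrative vector.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

# The inverse times the original matrix gives the identity (up to rounding)
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True

# To solve A x = b, solve() avoids computing the inverse explicitly
x = np.linalg.solve(A, b)
print(x)                      # approximately [-4.   4.5]
print(np.allclose(A @ x, b))  # True

# A determinant of zero would mean A has no inverse
det = np.linalg.det(A)        # approximately -2.0
```

Solving the system directly is generally preferred over multiplying by the inverse, both for speed and for numerical accuracy.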

Useful NumPy Functions

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Statistical functions
print(np.mean(arr))        # 3.5
print(np.std(arr))         # 1.707...
print(np.sum(arr))         # 21
print(np.sum(arr, axis=0)) # [5, 7, 9] (sum of each column)
print(np.sum(arr, axis=1)) # [6, 15] (sum of each row)

# Other useful functions
print(np.max(arr))         # 6
print(np.argmax(arr))      # 5 (index of max in the flattened array)
print(np.sqrt(arr))        # Element-wise square root
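One detail worth knowing about np.std: by default it divides by N (population standard deviation). Passing ddof=1 divides by N - 1 (sample standard deviation), which is what most statistical estimates use. The sketch below also combines an axis reduction with broadcasting to demean each column; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

pop_std = np.std(arr)             # divides by N      -> about 1.708
sample_std = np.std(arr, ddof=1)  # divides by N - 1  -> about 1.871

# Column means, then subtract them from every row via broadcasting
col_means = arr.mean(axis=0)      # [2.5 3.5 4.5]
demeaned = arr - col_means
print(demeaned)
# [[-1.5 -1.5 -1.5]
#  [ 1.5  1.5  1.5]]
```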

Part 2: Pandas Fundamentals

What is Pandas?

Created in 2008 by Wes McKinney
Name from: “Panel Data” - econometric term for multidimensional data
Purpose: Data manipulation and analysis tool
Built on NumPy: Uses NumPy arrays under the hood
Key structures: Series (1D) and DataFrame (2D)

Pandas is Built on NumPy

Series: 1D labeled array backed by NumPy array
DataFrame: 2D labeled data structure with NumPy arrays for each column
Inherits speed: Vectorized operations from NumPy
Adds flexibility: Labels, missing data handling, heterogeneous types
Best of both worlds: NumPy speed + high-level data manipulation

import pandas as pd
import numpy as np

# Creating a Series from NumPy array
arr = np.array([10, 20, 30, 40])
series = pd.Series(arr, index=['a', 'b', 'c', 'd'])

# Accessing underlying NumPy array (.to_numpy() is the recommended equivalent of .values)
print(series.values)  # NumPy array: [10 20 30 40]
print(type(series.values))  # <class 'numpy.ndarray'>
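The "labels" and "missing data handling" points can be illustrated with a small sketch: arithmetic between two Series aligns on index labels rather than positions, and labels present in only one Series produce NaN. The Series s1 and s2 are made-up examples.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Values are matched by label: only 'b' and 'c' exist in both
total = s1 + s2
print(total)
# a     NaN
# b    21.0
# c    32.0
# d     NaN

# Pandas skips NaN by default; plain NumPy propagates it
print(total.sum())                # 53.0
print(np.sum(total.to_numpy()))   # nan
```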

Creating DataFrames

import pandas as pd
import numpy as np

# From dictionary
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# From NumPy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame(data, columns=['A', 'B', 'C'])

# From CSV file
df3 = pd.read_csv('data.csv')

# Basic info
print(df1.head())      # First 5 rows
df1.info()             # Data types and memory (prints directly, returns None)
print(df1.describe())  # Statistical summary

Operations Between Columns

Arithmetic operations: Just like NumPy arrays
Apply functions: Use .apply() for custom operations
Vectorized: Column arithmetic is fast; .apply() with a Python function runs element by element and is slower
Create new columns: Assign results to new column names
Combine columns: Mathematical or logical operations

import pandas as pd

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [10, 5, 8],
    'discount': [0.1, 0.2, 0.15]
})

# Create new columns from operations
df['total'] = df['price'] * df['quantity']
df['discounted_price'] = df['price'] * (1 - df['discount'])
df['revenue'] = df['discounted_price'] * df['quantity']

# Apply function to column
df['price_category'] = df['price'].apply(
    lambda x: 'High' if x > 150 else 'Low'
)
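For a simple two-way rule like the price category, np.where is one vectorized alternative to .apply() with a lambda. The sketch below, using the same illustrative prices, checks that both approaches give the same labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, 200, 150]})

# Vectorized: the condition is evaluated on the whole column at once
df['price_category'] = np.where(df['price'] > 150, 'High', 'Low')

# Same rule, applied one element at a time
via_apply = df['price'].apply(lambda x: 'High' if x > 150 else 'Low')

print(df['price_category'].tolist())           # ['Low', 'High', 'Low']
print((df['price_category'] == via_apply).all())  # True
```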

Selecting Rows: Index-Based

.loc[]: Label-based indexing
.iloc[]: Integer position-based indexing
Single row: Returns a Series
Multiple rows: Returns a DataFrame
Slicing: Select ranges of rows

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['Paris', 'London', 'Berlin', 'Madrid']
}, index=['A', 'B', 'C', 'D'])

# Label-based selection
print(df.loc['A'])           # Single row (Series)
print(df.loc[['A', 'C']])    # Multiple rows (DataFrame)
print(df.loc['A':'C'])       # Slice (inclusive)

# Position-based selection
print(df.iloc[0])            # First row
print(df.iloc[[0, 2]])       # First and third rows
print(df.iloc[1:3])          # Rows 1 and 2 (exclusive end)
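Both indexers also accept a second argument to select columns at the same time. A short sketch on the same illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['Paris', 'London', 'Berlin', 'Madrid']
}, index=['A', 'B', 'C', 'D'])

# Label-based: row label, then column label
age_a = df.loc['A', 'age']                       # 25
subset = df.loc[['A', 'C'], ['name', 'city']]    # 2 rows x 2 columns

# Position-based: row position, then column position
age_first = df.iloc[0, 1]                        # 25
block = df.iloc[1:3, 0:2]                        # rows 1-2, columns 0-1

print(subset)
print(block)
```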

Selecting Rows: Logical Conditions

Boolean indexing: Use logical conditions to filter rows
Multiple conditions: Combine with & (and), | (or), ~ (not)
.isin(): Check if values are in a list
.between(): Check if values are in a range
Query method: String-based filtering for cleaner syntax

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, 55000, 75000]
})

# Single condition
young = df[df['age'] < 35]

# Multiple conditions (use parentheses!)
filtered = df[(df['age'] > 25) & (df['salary'] > 55000)]

# isin method
selected = df[df['name'].isin(['Alice', 'Charlie'])]

# Query method
result = df.query('age > 30 and salary < 70000')
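The bullets above also mention .between(), | (or) and ~ (not). A sketch of each on the same illustrative data follows; the variable min_age is an assumption added to show how query() can reference Python variables with @.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, 55000, 75000]
})

# between: both endpoints are included by default
in_range = df[df['age'].between(30, 40)]

# or: either condition may hold
either = df[(df['age'] < 30) | (df['salary'] > 70000)]

# not: invert a boolean mask
others = df[~df['name'].isin(['Alice', 'Charlie'])]

# query can use Python variables with the @ prefix
min_age = 30
older = df.query('age > @min_age')

print(in_range['name'].tolist())  # ['Bob', 'Charlie', 'David']
print(either['name'].tolist())    # ['Alice', 'David']
print(others['name'].tolist())    # ['Bob', 'David']
print(older['name'].tolist())     # ['Charlie', 'David']
```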

GroupBy Operations: The Split-Apply-Combine Pattern

Split: Divide data into groups based on criteria
Apply: Perform operations on each group independently
Combine: Aggregate results back into a data structure
Common aggregations: sum(), mean(), count(), min(), max()
Powerful tool: Essential for data analysis

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'salary': [50000, 60000, 55000, 65000, 52000]
})

# Group by department and calculate mean salary
avg_salary = df.groupby('department')['salary'].mean()

# Multiple aggregations
stats = df.groupby('department').agg({
    'salary': ['mean', 'min', 'max', 'count']
})

# Apply custom function
df.groupby('department')['salary'].apply(lambda x: x.max() - x.min())
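When several aggregations are needed, named aggregation is one way to get flat, readable column names instead of a two-level column header. A sketch on the same illustrative data; the output column names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'salary': [50000, 60000, 55000, 65000, 52000]
})

# Each keyword is an output column: (input column, aggregation)
stats = df.groupby('department').agg(
    mean_salary=('salary', 'mean'),
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max'),
    headcount=('employee', 'count'),
)
print(stats)
```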

Advanced GroupBy Examples

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'region': np.random.choice(['North', 'South'], 100),
    'sales': np.random.randint(100, 1000, 100)
})

# Group by multiple columns
multi_group = df.groupby(['product', 'region'])['sales'].sum()

# Transform: keep original shape
df['pct_of_product_total'] = df.groupby('product')['sales'].transform(
    lambda x: x / x.sum()
)

# Filter groups
high_sales = df.groupby('product').filter(
    lambda x: x['sales'].sum() > 10000
)
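Two follow-ups that may help when reading these results, sketched on reproducible random data (the seed 0 is an arbitrary choice): checking that the transform-based shares sum to 1 within each product, and reshaping the two-key groupby result into a product-by-region table with unstack().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame({
    'product': rng.choice(['A', 'B', 'C'], 100),
    'region': rng.choice(['North', 'South'], 100),
    'sales': rng.integers(100, 1000, 100),
})

# Passing the name 'sum' lets pandas use its built-in fast path
df['pct_of_product_total'] = (
    df['sales'] / df.groupby('product')['sales'].transform('sum')
)

# Sanity check: shares add up to 1 within each product
check = df.groupby('product')['pct_of_product_total'].sum()
print(check)

# Turn the inner group key (region) into columns
table = df.groupby(['product', 'region'])['sales'].sum().unstack()
print(table)
```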

Practical Exercise

Exercise: Titanic Dataset Analysis

Dataset: Titanic passenger data from Kaggle

Tasks:
1. Load the data and explore its structure
2. Calculate survival rates by passenger class
3. Find average age by gender and survival status
4. Identify which deck had the highest survival rate
5. Create a new feature combining age groups and class

You will practice:
- Loading data with Pandas
- Selecting rows and columns
- GroupBy operations
- Creating new features

Loading and Exploring the Data

import pandas as pd
import numpy as np

# Load the Titanic dataset
df = pd.read_csv('titanic.csv')

# Explore the structure
print(df.head())
df.info()  # prints directly and returns None, so no print() needed
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Basic statistics
print(f"Total passengers: {len(df)}")
print(f"Survival rate: {df['Survived'].mean():.2%}")
print(f"\nPassengers per class:")
print(df['Pclass'].value_counts().sort_index())

Solution Part 1: Survival by Class

# Task 2: Calculate survival rates by passenger class
survival_by_class = df.groupby('Pclass')['Survived'].agg(
    count='count',
    survived='sum',
    survival_rate='mean'
)

print("Survival rates by class:")
print(survival_by_class)

# Insight (approximate values on the Kaggle training set):
# 1st class: ~63% survival
# 2nd class: ~47% survival  
# 3rd class: ~24% survival
# Clear pattern: higher class = higher survival rate

Solution Part 2: Average Age Analysis

# Task 3: Average age by gender and survival status
age_analysis = df.groupby(['Sex', 'Survived'])['Age'].mean()
print("\nAverage age by gender and survival:")
print(age_analysis)

# More detailed view
age_detail = df.groupby(['Sex', 'Survived']).agg({
    'Age': ['mean', 'median', 'std', 'count']
})
print("\nDetailed age statistics:")
print(age_detail)

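# Insight: compare the mean age of survivors and non-survivors within each sex.
# The "women and children first" pattern shows up in survival rates rather than
# in mean ages; to see it, compare e.g. df.groupby('Sex')['Survived'].mean()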

Solution Part 3: Survival by Deck

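# Task 4: Survival rate by deck (extracted from Cabin)
# First, extract deck letter from Cabin
# Cabin is missing for many passengers: those rows get NaN as Deck
# and are excluded by groupby below
df['Deck'] = df['Cabin'].str[0]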

# Calculate survival rates by deck
deck_survival = df.groupby('Deck')['Survived'].agg([
    'count', 'mean'
]).sort_values('mean', ascending=False)

print("\nSurvival rates by deck:")
print(deck_survival)

# Filter out decks with few passengers for more reliable statistics
reliable_decks = deck_survival[deck_survival['count'] >= 10]
print("\nDecks with 10+ passengers:")
print(reliable_decks)
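Because groupby drops rows whose key is NaN by default, passengers without a recorded cabin silently disappear from the deck table. One option, sketched here on a small made-up DataFrame (toy, not the real Titanic data), is to label them explicitly so they appear as their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with the same column names as the Titanic file
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'E46', np.nan, 'C123', np.nan],
    'Survived': [1, 0, 0, 1, 1, 0],
})

# Default behaviour: rows with a missing deck are left out
dropped = toy.groupby(toy['Cabin'].str[0])['Survived'].count()
print(dropped.sum())  # 3 (only half of the 6 rows are counted)

# Give missing cabins an explicit label instead
toy['Deck'] = toy['Cabin'].str[0].fillna('Unknown')
deck_survival = toy.groupby('Deck')['Survived'].agg(['count', 'mean'])
print(deck_survival)
```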

Solution Part 4: Feature Engineering

# Task 5: Create age groups and combine with class
# Define age groups
def categorize_age(age):
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return 'Child'
    elif age < 35:
        return 'Young Adult'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'

df['Age_Group'] = df['Age'].apply(categorize_age)

# Combine with passenger class
df['Class_Age_Group'] = (df['Pclass'].astype(str) + '_' + 
                          df['Age_Group'])

# Analyze survival by this new feature
survival_by_combo = df.groupby('Class_Age_Group')['Survived'].agg([
    'count', 'mean'
]).sort_values('mean', ascending=False)

print("\nSurvival rates by class and age group:")
print(survival_by_combo.head(10))
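As a possible vectorized alternative to categorize_age with .apply(), pd.cut can bin the ages in one call. The sketch below uses the same cut-offs (18, 35, 60) on a few made-up ages; right=False makes each bin include its lower edge, matching the "age < 18" style of the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including boundary values and a missing value
ages = pd.Series([5, 17.9, 18, 34, 35, 59, 60, 80, np.nan])

# Bins are [0, 18), [18, 35), [35, 60), [60, inf)
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 60, np.inf],
    labels=['Child', 'Young Adult', 'Adult', 'Senior'],
    right=False,
)

# Missing ages stay NaN after cut; add an explicit 'Unknown' category for them
age_group = age_group.cat.add_categories(['Unknown']).fillna('Unknown')

print(age_group.tolist())
# ['Child', 'Child', 'Young Adult', 'Young Adult', 'Adult', 'Adult',
#  'Senior', 'Senior', 'Unknown']
```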