Day 1: Large Language Models (LLMs)

APIs + Hugging Face quickstart

Juan F. Imbet

Agenda

  • What LLMs are (quick intuition)
  • Use commercial LLM APIs from Python (OpenAI, Anthropic)
  • Good practices: auth, costs, streaming, retries
  • Hugging Face: install, pick a model, run locally
  • CPU-friendly demo models for laptops (tiny-gpt2, distil models)

How LLMs work (under the hood)

  • LLMs read text in small chunks (called tokens) and learn patterns about which words tend to follow others.
  • They are trained to predict the next token, billions of times, over very large text collections.
  • After pretraining, they’re refined to follow instructions and be helpful/safe.
  • At generation time, the model picks the next token repeatedly to form sentences and paragraphs; you can make it more or less random with the temperature setting.

Plain-language picture: break the input into tokens → look at the context → predict the next token → repeat until a stop condition is reached.
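
A minimal sketch of this loop, assuming transformers and the small distilgpt2 checkpoint (both introduced later in the Hugging Face section) are installed:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

ids = tokenizer("The stock market", return_tensors="pt").input_ids  # break input into tokens

for _ in range(5):                               # repeat until a stop condition (here: 5 new tokens)
    logits = model(ids).logits[0, -1]            # look at context, score every candidate next token
    next_id = torch.argmax(logits).view(1, 1)    # greedy pick = temperature-0 behaviour
    ids = torch.cat([ids, next_id], dim=1)       # append and continue

print(tokenizer.decode(ids[0]))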

Using commercial APIs from Python (overview)

  • Providers: OpenAI, Anthropic, Azure OpenAI, Google AI Studio.
  • Typical flow: set API key (env var), send prompt, get text/tool outputs.
  • Billing by tokens; control with temperature, max tokens; prefer shorter prompts and structured outputs (JSON).
  • Keep keys in environment variables, never hard-code.

OpenAI API (chat completions)

Installation:

pip install openai

Environment:

export OPENAI_API_KEY=...   # macOS/Linux (zsh)

Python (finance example):

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # or rely on env var only

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise finance tutor."},
        {"role": "user", "content": "Explain the CAPM intuition in 3 bullets for a portfolio manager."}
    ],
    temperature=0.2,
)

print(resp.choices[0].message.content)

OpenAI chat completions: key arguments (from docs)

  • model: model ID (e.g., “gpt-4o-mini”).
  • messages: array of chat messages with {role, content}.
  • temperature: 0–2; higher is more random. Adjust this or top_p, not both.
  • top_p: nucleus sampling; consider only the top p probability mass.
  • max_completion_tokens: upper bound on generated tokens.
  • stop: up to 4 stop sequences.
  • n: number of choices to generate (cost scales with choices).
  • presence_penalty: -2 to 2; encourage talking about new topics.
  • frequency_penalty: -2 to 2; discourage repeating tokens.
  • logprobs: whether to return log probabilities of output tokens.
  • top_logprobs: 0–20; how many highest-probability tokens to include with logprobs.
  • logit_bias: map of token_id -> bias [-100,100] to adjust likelihoods.
  • tools: list of tools the model may call; see function calling.
  • tool_choice: control whether/which tool is called (none | auto | required | specific function).
  • response_format: configure JSON mode or JSON schema for structured outputs (see the sketch after this list).
  • parallel_tool_calls: enable parallel function calls during tool use.
  • service_tier: processing tier (auto | default | flex | priority).
  • store: whether to store this completion for later listing.
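
A hedged example combining a few of these arguments (JSON mode via response_format, a cap on generated tokens, low temperature); the exact wording of the reply depends on the model, so treat it as a sketch:

from openai import OpenAI
import os, json

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Return JSON with keys 'ticker' and 'risk_level'."},
        {"role": "user", "content": "Classify the risk of holding AAPL for one year."}
    ],
    temperature=0.2,
    max_completion_tokens=100,
    response_format={"type": "json_object"},  # JSON mode: the model must emit valid JSON
)

data = json.loads(resp.choices[0].message.content)
print(data)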

OpenAI: log probabilities (finance example)

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a finance assistant."},
        {"role": "user", "content": "Continue the sentence: 'In CAPM, expected return equals the risk-free rate plus beta times the ...'"}
    ],
    temperature=0.0,
    logprobs=True,
    top_logprobs=5,
)

choice = resp.choices[0]
print("Output:", choice.message.content)
if choice.logprobs and choice.logprobs.content:
    # Show top candidate tokens with log probabilities for the first generated token
    first = choice.logprobs.content[0]
    for cand in first.top_logprobs:
        print(cand.token, cand.logprob)

Notes:

  • logprobs returns per-token log probabilities for the generated output.
  • top_logprobs controls how many alternatives are returned for each position.
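
Since these are natural-log values, exponentiate to recover plain probabilities; a small continuation of the example above:

import math

# `first` is the first generated token's logprob entry from the previous snippet
for cand in first.top_logprobs:
    print(f"{cand.token!r}: p = {math.exp(cand.logprob):.3f}")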

Anthropic (Claude) API

Installation:

pip install anthropic

Environment:

export ANTHROPIC_API_KEY=...

Python (finance example):

from anthropic import Anthropic
import os

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])  # or via env only

msg = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=300,
    messages=[
        {"role": "user", "content": "Give a 1-paragraph intuition for hedging with index futures for an equity portfolio."}
    ],
)

print(msg.content[0].text)

Practical tips for API usage

  • Guardrails: set max tokens; keep temperature low for determinism; validate outputs (e.g., pydantic/JSON schema).
  • Retries: exponential backoff for rate limits and transient errors.
  • Streaming: render partial tokens for better UX on long responses (both sketched below).
  • Cost control: prefer smaller models first, cache prompts, measure with provider usage endpoints.
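
A minimal sketch of the retry and streaming tips together, using the OpenAI SDK (model and question are placeholders):

import time
from openai import OpenAI, RateLimitError, APIConnectionError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_with_retry(question: str, max_retries: int = 5) -> None:
    delay = 1.0
    for _ in range(max_retries):
        try:
            stream = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": question}],
                stream=True,  # yield partial tokens as they arrive
            )
            for chunk in stream:
                piece = chunk.choices[0].delta.content
                if piece:
                    print(piece, end="", flush=True)
            print()
            return
        except (RateLimitError, APIConnectionError):
            time.sleep(delay)  # exponential backoff before retrying
            delay *= 2
    raise RuntimeError("Gave up after repeated rate-limit or connection errors")

ask_with_retry("One-line definition of duration for a bond portfolio?")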

Build a Finance LLM CLI (Typer + Rich)

Goal: a friendly command-line app that answers finance questions via an LLM with a finance-focused system message.

Install dependencies:

pip install openai typer rich python-dotenv

We’ll place the runnable script at src/lectures/day1-llms/finance_llm_cli.py.

CLI app: imports, config, and client

"""
src/lectures/day1-llms/finance_llm_cli.py
"""
import os
import sys
from typing import Optional

import typer
from rich.console import Console
from rich.panel import Panel
from rich.markdown import Markdown
from rich.table import Table
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # loads .env if present
console = Console()

DEFAULT_SYSTEM = (
    "You are a finance research assistant specialized in financial markets. "
    "Answer concisely, show clear assumptions, and prefer plain-language explanations."
)

def create_client() -> OpenAI:
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        console.print(
            Panel.fit(
                "Missing OPENAI_API_KEY. Set it in your environment or .env file.",
                title="Configuration error",
                style="bold red",
            )
        )
        raise typer.Exit(code=1)
    return OpenAI(api_key=api_key)

CLI app: define the Typer command

app = typer.Typer(help="Finance LLM CLI — ask questions about markets")

@app.command()
def ask(
    question: str = typer.Argument(..., help="Your finance question"),
    model: str = typer.Option("gpt-4o-mini", help="Model ID"),
    temperature: float = typer.Option(0.2, min=0.0, max=2.0, help="Sampling temperature"),
    max_tokens: int = typer.Option(500, help="Max new tokens for the answer"),
    system: Optional[str] = typer.Option(None, help="Override the finance system message"),
):
    """Ask a finance question and print a nicely formatted answer."""
    client = create_client()
    system_msg = system or DEFAULT_SYSTEM
    console.rule("Finance LLM")
    console.print(Panel.fit(question, title="Question", style="bold cyan"))

    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        max_completion_tokens=max_tokens,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": question},
        ],
    )

    content = resp.choices[0].message.content
    console.print(Panel(Markdown(content), title="Answer", border_style="green"))

    # Show token usage if available
    if getattr(resp, "usage", None):
        usage = resp.usage
        table = Table(title="Token usage")
        table.add_column("prompt_tokens", justify="right")
        table.add_column("completion_tokens", justify="right")
        table.add_column("total_tokens", justify="right")
        table.add_row(str(usage.prompt_tokens), str(usage.completion_tokens), str(usage.total_tokens))
        console.print(table)

def main():
    try:
        app()
    except KeyboardInterrupt:
        console.print("\nInterrupted.")

if __name__ == "__main__":
    main()

Run the CLI

python src/lectures/day1-llms/finance_llm_cli.py ask \
  "Summarize key drivers of yield curve inversion in 5 bullet points" \
  --temperature 0.2 --max-tokens 400

Tip:

  • Use --system to override the default finance assistant instructions.
  • Use --model to try other models (e.g., gpt-4o-mini, gpt-4o).

Secrets with .env and .gitignore

Create a .env file (same folder as you run the command or project root):

# .env
OPENAI_API_KEY=sk-...your-key-here...

Load automatically with python-dotenv (already in the code via load_dotenv()).

Add .env to your .gitignore to avoid committing secrets:

# never commit secrets
.env

Optional: provide a .env.example without real keys for collaborators.
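
For example, a .env.example that documents the variable names without real values:

# .env.example
OPENAI_API_KEY=
ANTHROPIC_API_KEY=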

Hugging Face: install and pick a model

Installation (CPU-friendly baseline):

pip install -U torch transformers accelerate

Notes:

  • For Apple Silicon: recent PyTorch wheels support the mps backend out of the box; no extra setup is needed.
  • Use tiny models for demos: sshleifer/tiny-gpt2, distilgpt2, or google/flan-t5-small for instruction tasks.

Quick start: pipeline (text-generation)

from transformers import pipeline

# Tiny model: fast on CPU for classroom demos
pipe = pipeline("text-generation", model="sshleifer/tiny-gpt2")

out = pipe(
    "Write one sentence explaining the equity risk premium:",
    max_new_tokens=30,
    do_sample=False,
)
print(out[0]["generated_text"])

Local LLM with Auto classes

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "distilgpt2"  # small, runs on CPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    device_map="auto",  # uses MPS/GPU if available, otherwise CPU
)

prompt = "Summarize the yield curve and why it can invert, in 2 lines:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Optional: slightly larger models and efficiency

  • Quantization (int8/int4) and low-precision loading reduce memory; see bitsandbytes and AutoGPTQ models.
  • Example (NVIDIA Linux recommended): load 4-bit quantized weights; on macOS prefer smaller FP16/FP32 models instead.
  • For instruction-following locally: try TinyLlama/TinyLlama-1.1B-Chat-v1.0 with enough RAM.
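
A hedged sketch for TinyLlama chat (it needs a few GB of RAM and is slow on CPU), using the tokenizer's chat template:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

messages = [
    {"role": "system", "content": "You are a concise finance tutor."},
    {"role": "user", "content": "Explain duration vs. convexity in two sentences."},
]
# Format the prompt with the model's own chat template
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=120)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))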

When to use API vs local

  • Start with commercial APIs for best quality and tools; great for prototyping and production.
  • Use local models for privacy, offline use, or cost control; expect lower quality and more ops work.
  • Hybrid: route easy prompts to small local models, hard ones to API.

References

  • Vaswani et al. (2017) “Attention Is All You Need”
  • Hugging Face docs: https://huggingface.co/docs/transformers
  • OpenAI Python SDK: https://github.com/openai/openai-python
  • Anthropic Python SDK: https://github.com/anthropics/anthropic-sdk-python