APIs, Web Scraping & Async HTTP
HTTP Fundamentals
Every web API interaction is an HTTP message exchange. Understanding the protocol makes debugging trivial and error handling natural.
HTTP Methods — each has a semantic contract:
| Method | Semantic | Body? | Idempotent? |
|--------|----------|-------|-------------|
| GET | Read a resource | No | Yes |
| POST | Create a resource | Yes | No |
| PUT | Replace a resource entirely | Yes | Yes |
| PATCH | Partially update a resource | Yes | No |
| DELETE | Remove a resource | No | Yes |
Idempotent means calling it N times has the same effect as calling it once. A GET on /users/42 always returns the same user (assuming no concurrent writes). A POST to /orders creates a new order each time.
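A toy in-memory "server" makes the distinction concrete (the function and resource names below are invented for illustration, not a real API):

```python
# Toy in-memory resource store illustrating idempotency.
orders = {}
next_id = 1

def post_order(item):
    """POST /orders: creates a NEW order on every call (not idempotent)."""
    global next_id
    order_id = next_id
    next_id += 1
    orders[order_id] = {"item": item}
    return order_id

def put_order(order_id, item):
    """PUT /orders/{id}: replaces the resource; repeating it changes nothing further."""
    orders[order_id] = {"item": item}

post_order("coffee")
post_order("coffee")           # a second POST creates a second order
assert len(orders) == 2

put_order(1, "tea")
put_order(1, "tea")            # a repeated PUT leaves the same final state
assert orders[1] == {"item": "tea"} and len(orders) == 2
```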
Status Code Groups:
- `1xx` — Informational (rarely seen in REST APIs)
- `2xx` — Success: `200 OK`, `201 Created`, `204 No Content`
- `3xx` — Redirection: `301 Moved Permanently`, `302 Found` (temporary)
- `4xx` — Client error: `400 Bad Request`, `401 Unauthorized`, `403 Forbidden`, `404 Not Found`, `422 Unprocessable Entity`, `429 Too Many Requests`
- `5xx` — Server error: `500 Internal Server Error`, `503 Service Unavailable`
Key Headers:
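The headers that come up constantly in API work are `Accept` and `Content-Type` (content negotiation), `Authorization` (credentials), `User-Agent` (client identification), and, on responses, `Retry-After` (rate limiting). A minimal sketch of attaching request headers with stdlib `urllib`; the URL and token are placeholders:

```python
import urllib.request

# The URL and token are placeholders; the header names are the standard ones.
req = urllib.request.Request(
    "https://api.example.com/users/42",
    headers={
        "Accept": "application/json",       # format we want back
        "Authorization": "Bearer <token>",  # credentials (placeholder)
        "User-Agent": "my-client/1.0",      # identify your client to servers
    },
)

# Nothing is sent until urlopen(req); here we just inspect the request object.
assert req.get_header("Accept") == "application/json"
```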
urllib: The Standard Library HTTP Client
Pyodide (which runs Python in your browser) doesn't include the requests library, but urllib is part of the standard library and always available. In practice, you'd use requests or httpx in real projects — they have better ergonomics — but urllib teaches the underlying mechanics clearly.
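As a concrete sketch: `urlopen()` returns a file-like response whose bytes you read and decode yourself. A `data:` URL stands in for a real endpoint here so the example runs without network access; with an `https://` URL the code is identical:

```python
import json
import urllib.request
from urllib.parse import quote

# A data: URL stands in for a real endpoint so this runs offline;
# urlopen() handles it through the same machinery as http(s) URLs.
payload = '{"id": 42, "name": "Ada"}'
url = "data:application/json," + quote(payload)

with urllib.request.urlopen(url) as resp:
    body = resp.read()           # raw bytes, as always with urllib
    user = json.loads(body)      # json.loads accepts bytes directly

assert user["name"] == "Ada"
```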
JSON APIs: Parsing and Authentication
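The moving parts are: serialize the payload to bytes, set `Content-Type`, and attach credentials in an `Authorization` header. A sketch with stdlib `urllib`; the endpoint and token are hypothetical:

```python
import json
import urllib.request

# Hypothetical endpoint and token, for illustration only.
API_URL = "https://api.example.com/orders"
TOKEN = "secret-token"

payload = {"item": "coffee", "qty": 2}
req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),  # request bodies are bytes
    headers={
        "Content-Type": "application/json",    # tell the server what we're sending
        "Authorization": f"Bearer {TOKEN}",    # bearer-token auth scheme
    },
    method="POST",
)

# The request is only sent when you call urlopen(req); here we inspect it.
assert req.get_method() == "POST"
assert json.loads(req.data) == payload
```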
Error Handling for HTTP: Retries, Timeouts, and Circuit Breakers
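The retry half of this can be sketched as exponential backoff with full jitter over a pluggable request function (the helper below and its status sets are illustrative; a circuit breaker would additionally stop calling a failing service for a cooldown period):

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}   # rate limiting + transient server errors

def fetch_with_retry(do_request, max_attempts=5, base=0.5, cap=8.0):
    """Call do_request() -> (status, body) until success or attempts run out.

    Full jitter: sleep a random amount in [0, min(cap, base * 2**attempt)].
    """
    for attempt in range(max_attempts):
        status, body = do_request()
        if status < 400:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"client error {status}, not retrying")
        if attempt < max_attempts - 1:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("out of retries")

# Simulated flaky endpoint: fails twice with 503, then succeeds.
responses = iter([(503, ""), (503, ""), (200, "ok")])
result = fetch_with_retry(lambda: next(responses), base=0.01, cap=0.02)
assert result == "ok"
```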
Web Scraping Concepts & HTML Parsing
Web scraping extracts structured data from HTML pages. In a browser-based environment we can't make network requests, but we can demonstrate full parsing logic with mock HTML:
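Real projects usually reach for BeautifulSoup or lxml, but the stdlib `html.parser` shows the mechanics. A sketch that extracts every link from mock HTML (the markup below is invented for the demo):

```python
from html.parser import HTMLParser

MOCK_HTML = """
<html><body>
  <h1>Products</h1>
  <div class="product"><a href="/p/1">Widget</a> <span class="price">$9.99</span></div>
  <div class="product"><a href="/p/2">Gadget</a> <span class="price">$24.50</span></div>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None           # href of the <a> we're currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

parser = LinkExtractor()
parser.feed(MOCK_HTML)
assert parser.links == [("/p/1", "Widget"), ("/p/2", "Gadget")]
```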
Async HTTP with asyncio
Synchronous HTTP is simple but wasteful. When making 10 API calls to 10 different servers, synchronous code waits for each one to complete before starting the next. With I/O-bound work, asyncio lets you start all 10 requests, then collect results as they arrive — the total time becomes roughly equal to the slowest single request, not the sum of all.
The key concept: await suspends this coroutine and gives control back to the event loop, which can run other coroutines while we wait for I/O. No threads, no locks — just cooperative multitasking.
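A minimal demonstration, using `asyncio.sleep` to simulate I/O latency (no real network involved): three 0.1-second "requests" complete in roughly 0.1 seconds total, not 0.3.

```python
import asyncio
import time

async def fake_fetch(name, delay):
    # Stands in for an I/O-bound request: await suspends this coroutine,
    # letting the event loop run the other "requests" meanwhile.
    await asyncio.sleep(delay)
    return name

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_fetch("a", 0.10),
        fake_fetch("b", 0.10),
        fake_fetch("c", 0.10),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
assert results == ["a", "b", "c"]
assert elapsed < 0.25    # ~max(delays), not the 0.30 sum
```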
httpx: Modern HTTP for Sync and Async
httpx is the modern successor to requests. It has an almost identical API for synchronous use and adds first-class async support.
Key advantages over requests:
- Built-in async support with `httpx.AsyncClient`
- HTTP/2 support
- `raise_for_status()` is clean and standard
- Connection pooling via context managers
- Timeouts configurable per-operation or per-client
PROJECT: Weather Data Aggregator
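One way to sketch the aggregation core of this project is against mock responses; the field names below are invented for illustration, and a real weather API's schema will differ:

```python
# Mock API payloads standing in for real fetches.
mock_responses = [
    {"city": "Oslo",   "temp_c": 4.0,  "humidity": 81},
    {"city": "Lisbon", "temp_c": 18.5, "humidity": 62},
    {"city": "Cairo",  "temp_c": 29.0},            # humidity missing!
]

def aggregate(responses):
    """Summarize a batch of weather payloads, tolerating missing fields."""
    temps = [r["temp_c"] for r in responses if "temp_c" in r]
    hums = [r["humidity"] for r in responses if r.get("humidity") is not None]
    return {
        "cities": [r.get("city", "?") for r in responses],
        "avg_temp_c": round(sum(temps) / len(temps), 1) if temps else None,
        "avg_humidity": round(sum(hums) / len(hums), 1) if hums else None,
    }

summary = aggregate(mock_responses)
assert summary["avg_temp_c"] == 17.2      # (4.0 + 18.5 + 29.0) / 3
assert summary["avg_humidity"] == 71.5    # (81 + 62) / 2, missing value skipped
```

Note the defensive `.get()` calls: one incomplete payload degrades the summary gracefully instead of raising `KeyError`.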
PROJECT: Async URL Processor
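The skeleton of this project is `asyncio.gather` with `return_exceptions=True`, so one bad URL doesn't abort the batch. A sketch with a simulated fetch (URLs containing "bad" stand in for failures):

```python
import asyncio

async def process(url):
    # Stand-in for an HTTP fetch; URLs containing "bad" simulate failures.
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"failed: {url}")
    return f"ok: {url}"

async def process_all(urls):
    # return_exceptions=True: failures come back as exception objects
    # in the results list instead of cancelling the whole batch.
    return await asyncio.gather(*(process(u) for u in urls),
                                return_exceptions=True)

urls = ["https://a.example", "https://bad.example", "https://c.example"]
results = asyncio.run(process_all(urls))

oks = [r for r in results if isinstance(r, str)]
errs = [r for r in results if isinstance(r, Exception)]
assert len(oks) == 2 and len(errs) == 1
```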
Rate Limiting and Being a Good API Citizen
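Beyond respecting `Retry-After` headers and caching, you can throttle yourself client-side. A minimal token-bucket sketch (the demo uses a deliberately fast rate so it runs in milliseconds; real settings would be far slower):

```python
import time

class TokenBucket:
    """Client-side token-bucket limiter: at most `rate` requests/sec on
    average, with bursts of up to `capacity` requests."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=100, capacity=2)   # fast rate so the demo is quick
start = time.monotonic()
for _ in range(5):
    bucket.acquire()          # first 2 pass instantly (burst), rest must wait
elapsed = time.monotonic() - start
assert elapsed >= 0.025       # 3 refills at 100 tokens/sec take ~0.03 s
```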
Key Takeaways
- HTTP methods carry semantic contracts: GET is safe and idempotent, POST creates resources, PUT replaces, PATCH partially updates — using the wrong verb breaks caching and causes unexpected client behavior
- `4xx` = client's fault, `5xx` = server's fault: only `5xx` and `429` should trigger retries; retrying on `400`/`404` is wasteful and masks bugs in your code
- Exponential backoff with jitter prevents thundering herds: when many clients retry simultaneously, random jitter spreads the load; full jitter (`random(0, cap)`) is better than equal-interval retries
- `urllib.request` is stdlib; `requests`/`httpx` are ergonomic wrappers: in constrained environments `urllib` works everywhere; in real projects `httpx` adds HTTP/2, async support, and cleaner error handling
- Async HTTP is O(1) in memory and O(max_latency) in time: `asyncio.gather(task1, task2, ..., task10)` sends all requests at once and waits for the slowest — not the sum of all latencies
- `asyncio.gather` with `return_exceptions=True` prevents one failure from aborting the rest: without it, a single failed request in a batch cancels all others; use it when partial results are useful
- Rate limiting is your responsibility as an API consumer: implement token-bucket or leaky-bucket limiting client-side, respect `Retry-After` headers, and cache aggressively
- Parse and validate API responses defensively: use `.get()` with defaults, check for missing fields, and handle API version changes — remote data is untrusted data