Architecting an AI Inference Stack
Notes on architecting AI inference stacks and TPUs from Google's learning path, "Inference on TPUs".

Notes on architecting multi-agent systems from Google's learning path, "Architect Multi-Agent Systems with Agent Development Kit".

Notes and ideas on annotations and LLMs: using annotations in conjunction with LLM dev tooling, as well as generating annotation processors with LLMs.

I never write about caches and caching, so I thought I'd cover some basics on LLM caching. Covers inference and prompt caching.

Notes on the TensorZero LLM gateway. Covers templates, schemas, feedback, retries, evals, DICL, MIPRO, model-prompt-inference optimization.

Some basics on Ollama. Includes some details on quantization, vector DBs, model storage, model formats, and Modelfiles.

A comparison of the OpenAI service offering with Anthropic's. Includes context windows, rate limits, and model optimization.

My notes on the design of Anthropic's APIs and some general design considerations for provider-based APIs and SDKs. Covers rate limiting, service tiers, SSE flow, and some of the REST API endpoints.

Notes on AI/LLM guardrails and safety patterns from a book on "Agentic Design Patterns" by one of Google's Distinguished Engineers, Antonio Gulli.

Notes on workflows and agents from Anthropic's course on Claude. Covers the evaluator-optimizer pattern, chaining, routing, and parallelization.
