Anass Ez-zouaine — Senior Backend Engineer · Software Architect · AI Engineer — Architecture

Caching for speed: Redis and semantic layers in RAG

Tue, 26 May 2026 00:00:00 GMT

You finally shipped your RAG pipeline. It works. The retrieval is accurate. The LLM is snappy. But then you look at your cloud bill and your P99 latency. Every single query — even "what are your shipping times?" asked for the tenth time — triggers a full chain of embedding, vector search, and an expensive LLM call.

At scale, this is a disaster. You are essentially paying for the same computation over and over again. Your users are waiting two seconds for answers that should take twenty milliseconds. Your "denial of wallet" risk is through the roof.

The solution isn't a bigger model or a faster vector DB. It's a smarter cache. I'm talking about semantic caching with Redis. On a cache hit, it cuts latency from seconds to milliseconds, and on repetitive workloads it can slash your API costs by up to 80 percent.

Here is how I build these systems to handle production traffic.

The two-tier cache architecture

Standard caching relies on exact matches. If a user asks "How do I reset my password?" and another asks "how do i reset my password", they might hit the same key if you normalize the string. But if the second user asks "Can you help me change my password?", a traditional cache fails.

In a modern RAG stack, I use a two-tier approach.

Exact cache — a simple key-value store in Redis. I normalize the query (lowercase, trim, strip punctuation) and hash it. It's your first line of defense. It costs almost nothing and has zero false positives.
Semantic cache — if the exact cache misses, I embed the query and look for "near enough" matches in a Redis vector index. If I find a previous question whose similarity clears my threshold (0.90 or higher is my usual starting point), I serve that cached response instead of hitting the LLM.

This architecture ensures that you never do the heavy lifting twice for the same intent.
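A minimal sketch of the two-tier lookup, assuming a plain dict stands in for Redis and that `semantic_lookup` is a hypothetical callable wrapping your vector index. The key prefix `cache:exact:` is also just an illustration.

```python
import hashlib
import string


def normalize(query: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = query.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_key(query: str) -> str:
    """Hash the normalized query into a fixed-length cache key."""
    digest = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
    return f"cache:exact:{digest}"


def lookup(query, exact_store, semantic_lookup):
    """Tier 1: exact match. Tier 2: semantic match. None means a full miss."""
    cached = exact_store.get(exact_key(query))
    if cached is not None:
        return cached
    return semantic_lookup(query)  # hypothetical wrapper around the vector index


# Usage: a dict stands in for Redis, and the semantic tier always misses here
store = {exact_key("How do I reset my password?"): "Go to Settings > Security."}
answer = lookup("how do i reset my password", store, lambda q: None)
```

The two spellings of the password question normalize to the same key, so the second one never reaches the semantic tier.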

Why Redis is the king of semantic caching

Most developers think of Redis as just a key-value store. But with the Redis Vector Library (RedisVL), it becomes a high-performance vector database.

Why use Redis for this instead of your main vector DB like Pinecone or Weaviate?

Latency.

Your main vector DB is likely optimized for searching through millions of document chunks. Your semantic cache is much smaller — it only stores recent queries and answers. By co-locating this cache in Redis, which likely already sits in your application tier, you reduce network hops.

I typically see vector lookups in Redis finish in under 5ms. The semantic tier still pays for the embedding call (around 100ms), but a hit skips the document retrieval and an LLM generation that takes 1500ms. The math is simple.

Implementing the semantic layer

The trick to a good semantic cache is the similarity threshold. Too low, and you give users wrong answers (the "semantic trap"). Too high, and you never get a cache hit.

I usually start with a distance threshold of 0.1 for cosine distance, which translates to roughly 90 percent similarity. You can implement this quickly using the RedisVL extensions.

from redisvl.extensions.llmcache import SemanticCache

# Initialize the cache with a conservative threshold
llm_cache = SemanticCache(
    name="production_rag_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,
)


def answer(query: str) -> str:
    # check() returns a list of cached entries within the threshold (empty on a miss)
    hits = llm_cache.check(prompt=query)
    if hits:
        return hits[0]["response"]

    # On a miss, run the full RAG pipeline and store the result for next time
    response = run_rag_pipeline(query)  # your existing pipeline, not defined here
    llm_cache.store(prompt=query, response=response)
    return response


print(answer("how do i update my billing info?"))

This simple wrapper handles the embedding of the incoming query, the vector search in Redis, and the logic for returning the most relevant cached response.
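To make the threshold concrete, here is a small sketch of how a cosine distance of 0.1 relates to similarity. It assumes distance is defined as 1 minus cosine similarity, which is the convention Redis uses for its COSINE metric. The toy two-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    # 0.0 means identical direction, 1.0 means unrelated (orthogonal)
    return 1.0 - cosine_similarity(a, b)


def is_cache_hit(query_vec, cached_vec, distance_threshold=0.1):
    # A distance threshold of 0.1 is the same cutoff as a similarity of 0.9
    return cosine_distance(query_vec, cached_vec) <= distance_threshold
```

Lowering `distance_threshold` makes the cache stricter, raising it makes the cache more permissive.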

Avoid the semantic trap: context and versioning

Semantic caching is powerful but dangerous if you aren't careful. If your underlying data changes, your cache might still be serving old, incorrect information.

I always include a context_version in my cache keys or metadata. If I re-index my product catalog or update my documentation, I bump the version. The cache immediately starts missing for old entries, forcing a refresh with the new data.

Another trap is tenant isolation. If User A asks "what is my balance?", you absolutely cannot serve that cached response to User B. I solve this by partitioning the cache:

Use namespaces — cache:tenant_id:query_hash
Include metadata — add tenant_id to the vector index filters.

This ensures that semantic matches only happen within the correct security boundary. For more on building secure, multi-tenant systems, check out my thoughts on Laravel multi-tenancy which shares similar isolation principles.
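A rough sketch of both isolation ideas together. The key layout (tenant, then version, then a truncated hash) and the `filter_hits` helper are my own illustration, not a RedisVL API. In a real index you would push the tenant and version filter into the vector query itself instead of filtering afterwards.

```python
import hashlib


def scoped_cache_key(tenant_id: str, context_version: str, normalized_query: str) -> str:
    """Build an exact-cache key that is unique per tenant and per index version."""
    query_hash = hashlib.sha256(normalized_query.encode("utf-8")).hexdigest()[:16]
    return f"cache:{tenant_id}:{context_version}:{query_hash}"


def filter_hits(hits, tenant_id, context_version):
    """Drop semantic hits that belong to another tenant or an older index version."""
    return [
        hit for hit in hits
        if hit["tenant_id"] == tenant_id and hit["context_version"] == context_version
    ]
```

Bumping `context_version` changes every key, so old entries simply stop matching and age out through their TTL.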

Managing TTL and staleness

In a standard cache, you just set an expiry of 3600 seconds and forget it. With a semantic cache, I prefer a tiered TTL strategy.

Exact matches — 1 hour TTL. If the user asks the exact same thing, they probably want the exact same answer.
Semantic matches — 4 hour TTL. These are more expensive to generate, so we want to keep them longer, but we also include a "last validated" timestamp.
Proactive invalidation — if my Shopify store updates a product price, I trigger a Redis worker to purge all cache entries related to that product ID.

This hybrid approach keeps the system responsive without serving stale data. I've written about similar event-driven patterns if you want to dive deeper into how to handle these updates at scale.
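Here is a minimal in-memory sketch of the tiered TTL idea plus purge-by-product. The `TieredCache` class and its method names are hypothetical. In Redis you would set the expiry on each key and keep a set of keys per product ID to drive the purge.

```python
import time

EXACT_TTL_SECONDS = 60 * 60          # 1 hour
SEMANTIC_TTL_SECONDS = 4 * 60 * 60   # 4 hours


class TieredCache:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at, product_ids)

    def put(self, key, value, tier, product_ids=()):
        ttl = EXACT_TTL_SECONDS if tier == "exact" else SEMANTIC_TTL_SECONDS
        self._entries[key] = (value, self._clock() + ttl, frozenset(product_ids))

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at, _ = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave like a miss
            return None
        return value

    def purge_product(self, product_id):
        """Proactive invalidation: drop every entry tagged with this product."""
        stale = [k for k, (_, _, ids) in self._entries.items() if product_id in ids]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

A price-update webhook would call `purge_product` with the affected ID, so those answers are regenerated on the next request instead of waiting for the TTL.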

Measuring success: hit rate and precision

Don't just turn on the cache and walk away. You need to monitor two specific metrics:

Cache hit rate — what percentage of queries are being handled by Redis? I aim for 30–50 percent for general FAQ-style bots.
Semantic precision — are the cached answers actually correct?

I log every semantic hit along with its similarity score. Once a week, I sample hits that only just cleared the threshold (for a 0.90 cutoff, scores between 0.90 and 0.93) and manually review them. If I see too many "near misses" that are actually different questions, I tighten the threshold.
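The weekly review can be sketched like this, assuming each logged hit is a dict with a `similarity` field and each reviewed hit gets a `correct` flag from a human. Both field names are hypothetical.

```python
import random


def sample_borderline_hits(hit_log, low, high, sample_size, seed=None):
    """Pick a random sample of hits whose similarity falls in the review band."""
    borderline = [hit for hit in hit_log if low <= hit["similarity"] <= high]
    rng = random.Random(seed)
    return rng.sample(borderline, min(sample_size, len(borderline)))


def semantic_precision(reviewed_hits):
    """Share of reviewed hits that a human marked as correct."""
    if not reviewed_hits:
        return None
    correct = sum(1 for hit in reviewed_hits if hit["correct"])
    return correct / len(reviewed_hits)
```

If the precision of the borderline sample drops, that is the signal to lower the distance threshold.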

Final takeaways for senior engineers

Implementing Redis as a semantic layer isn't just about speed. It's about making your AI systems sustainable. If you are serious about moving from a prototype to a production-ready SaaS, caching is not optional.

Here is your checklist for next week:

Install redisvl and set up a basic vector index in your dev environment.
Implement a two-tier lookup (exact then semantic).
Set your distance threshold conservatively (start at 0.05 or 0.1).
Add a tenant_id or context_version to your metadata to avoid cross-talk.
Monitor your hit rate and watch your API bill drop.

Building in public means sharing the war stories, not just the successes. For more technical deep dives into modern architecture, I suggest looking at 7 RAG mistakes in production to see what else might be slowing you down.

What is the one query in your system that keeps hitting your LLM unnecessarily? Drop a note via contact — let's figure out if a semantic cache would have caught it. 🤘

Scaling on demand: smart auto-scaling for modern AI apps

Mon, 25 May 2026 00:00:00 GMT

Your AI application is lagging, users are complaining, but your cloud dashboard says everything is fine. Your CPU usage is hovering at a comfortable 20 percent while your inference requests are timing out.

This is the classic scaling trap for AI engineers. Traditional auto-scaling is built for web servers where CPU and memory are the primary bottlenecks. In the world of large language models and vector databases, those metrics are practically useless.

If you wait for your CPU to hit 80 percent before spinning up a new pod, your service will be dead in the water long before the second instance even starts its boot sequence. GPU-bound workloads require a completely different playbook.

To build a resilient, cost-effective AI SaaS, you need to move beyond reactive hardware metrics. You need to scale on intent, queue pressure, and the specific physics of GPU memory.

Why the CPU lie is killing your UX

Most horizontal pod autoscalers (HPA) are configured to watch CPU utilization by default. For a Laravel or Node.js API, this works great. The work is linear — more requests equal more CPU cycles.

AI models are different. The CPU handles the "boring" stuff like tokenization, request routing, and managing HTTP headers. The heavy lifting happens on the GPU.

I have seen production clusters where the GPU is pinned at 100 percent while the CPU sits idle. Kubernetes sees the low CPU usage and thinks the pod is healthy. It might even try to pack more pods onto that node, leading to a catastrophic failure.

GPU utilization vs occupancy: the hardware layer

When you finally switch to monitoring GPUs, you encounter two confusing metrics: utilization and occupancy.

GPU utilization is essentially a duty cycle. It tells you the percentage of time the GPU was active over a sample period. It is a lagging indicator. By the time it hits 90 percent, your request queue has likely been building for 30 seconds.

Occupancy is more granular. It measures how many "warps" or hardware slots are filled within the streaming multiprocessors (SM). You can have high utilization but low occupancy if your batch size is too small.

For scaling, utilization is the baseline, but it isn't the truth. You need to look at what is happening before the request even hits the silicon.

Queue depth: your best leading indicator

If you want to stop fires before they start, monitor your queue depth. In vLLM or SGLang, this is the number of requests waiting for a slot in the inference engine.

Queue depth is a direct predictor of latency. If you know your model can handle 16 concurrent requests before P99 latency starts to climb, set your scaling trigger at 12.

Scaling on queue depth lets you provision capacity while the current hardware is still performing within SLO. It gives you that 60-second head start you need to pull a fresh container and load a 20GB model weights file into memory.
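As a sketch, the trigger boils down to the same proportional rule HPA-style autoscalers use: total load divided by the target load per replica, rounded up. The function name, the `max_replicas` cap, and counting load as running plus waiting requests are my assumptions.

```python
import math


def desired_replicas(in_flight_requests, target_per_replica=12, max_replicas=10):
    """Replicas needed to keep each one at or below the target load.

    in_flight_requests: running plus waiting requests across the whole fleet.
    target_per_replica: the trigger point, set below the level where P99 degrades.
    """
    needed = math.ceil(in_flight_requests / target_per_replica)
    return max(1, min(max_replicas, needed))
```

With a target of 12, thirty in-flight requests ask for three replicas, which leaves headroom below the 16-request limit from the example above.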

Token velocity and the KV cache

In generative AI, not all requests are created equal. A 10-token summary request is light. A 4,000-token RAG retrieval analysis is a heavyweight.

This is where token velocity and KV-cache usage come in. The KV cache is the memory on the GPU that stores the context of current conversations. If your KV cache is 95 percent full, the next long request will trigger an eviction or a "swap to CPU" event.

Latency will skyrocket. Your P99 will look like a mountain range.

I recommend scaling based on a combination of:

Token velocity — total tokens per second across all active instances.
KV-cache pressure — the percentage of available cache blocks currently occupied.

When the cache is full, it doesn't matter how low your GPU utilization is. You cannot fit more work onto that chip. You must scale.
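A minimal sketch of that combined check. The limits (90 percent mean cache usage, 5,000 tokens per second) are placeholder numbers for illustration, and you would feed the function from whatever counters your inference engine exports.

```python
from statistics import mean


def token_velocity(tokens_at_start, tokens_at_end, interval_seconds):
    """Tokens per second, computed from a monotonically increasing token counter."""
    return (tokens_at_end - tokens_at_start) / interval_seconds


def should_scale_out(kv_usage_by_replica, tokens_per_second,
                     kv_limit=0.90, velocity_limit=5000.0):
    """Scale when the fleet is short on cache blocks or saturated on throughput."""
    cache_pressure = mean(kv_usage_by_replica) >= kv_limit
    throughput_pressure = tokens_per_second >= velocity_limit
    return cache_pressure or throughput_pressure
```

Either signal alone is enough to scale, which matches the point that a full cache blocks new work regardless of GPU utilization.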

Predictive scaling with ARIMA

Reactive scaling is always playing catch-up. Even with fast boot times, there is a delay. For enterprise apps with predictable traffic patterns, I use ARIMA (Auto-Regressive Integrated Moving Average) models to forecast load.

If I know traffic historically spikes at 9:00 am every Monday, I don't wait for the queue to grow. I use a time-series forecast to spin up the "base load" pods at 8:55 am.

This turns your infrastructure into a proactive system rather than a reactive one. You pay for what you use, but you ensure the capacity is there before the first user clicks "Generate."
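To show the shape of the idea without pulling in a time-series library, here is a deliberately simpler stand-in: a seasonal average per weekday and hour. This is not ARIMA. A real ARIMA forecast would come from a statistics package and would replace `seasonal_forecast`, while the pre-warm step stays the same. All names and numbers are illustrative.

```python
import math
from collections import defaultdict
from statistics import mean


def seasonal_forecast(history):
    """history: iterable of (weekday, hour, requests_per_minute) samples.

    Returns the average load seen in each (weekday, hour) slot.
    """
    buckets = defaultdict(list)
    for weekday, hour, rpm in history:
        buckets[(weekday, hour)].append(rpm)
    return {slot: mean(values) for slot, values in buckets.items()}


def prewarm_replicas(forecast, weekday, hour, rpm_per_replica, min_replicas=1):
    """Replicas to have running before the slot starts."""
    expected_rpm = forecast.get((weekday, hour), 0.0)
    return max(min_replicas, math.ceil(expected_rpm / rpm_per_replica))
```

A scheduler would call `prewarm_replicas` a few minutes before each slot, for example at 8:55 am for the Monday 9:00 am spike.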

Practical steps for your stack

Implementing this doesn't have to be a nightmare. Here is how I structure it:

Use KEDA — Kubernetes Event-driven Autoscaling is the gold standard. It lets you scale on Prometheus metrics like queue depth or P99 latency instead of just CPU.
Set TTFT SLOs — measure time-to-first-token (TTFT). This is the most critical metric for user perception. If TTFT P99 exceeds 500ms, you need more replicas.
Blur the lines — don't rely on a single metric. Create a composite score of GPU utilization, queue depth, and cache pressure.
Fix your RAG — sometimes the scaling issue is actually a retrieval issue. If your vector search is slow, the inference engine waits longer, hogging the GPU. Check out these common RAG mistakes to ensure your bottleneck isn't upstream.
Optimize the frontend — for Shopify apps or custom SaaS, ensure your agentic workflows handle retries gracefully when the infrastructure is scaling up.
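The composite score from the list above could look something like this. The weights are arbitrary starting values, not a recommendation, and every input is assumed to be normalized to the range 0 to 1 (queue depth as a fraction of the depth where latency starts to degrade).

```python
def composite_pressure(gpu_utilization, queue_ratio, kv_cache_usage,
                       weights=(0.2, 0.5, 0.3)):
    """Weighted blend of three signals, each clamped to the range 0.0 to 1.0.

    Queue depth gets the largest weight because it is the leading indicator.
    """
    signals = (gpu_utilization, queue_ratio, kv_cache_usage)
    clamped = [min(1.0, max(0.0, s)) for s in signals]
    return sum(w * s for w, s in zip(weights, clamped))
```

You would then export this single number as a Prometheus metric and let the autoscaler trigger on it, for example at a score of 0.7.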

Scaling AI isn't about having the biggest GPUs. It is about having the smartest triggers. By moving to service-level metrics, you save money on idle compute and save your users from the dreaded "thinking..." spinner.

Are you still scaling on CPU, or have you made the jump to queue-based triggers yet? Drop a note via contact — I love this conversation. 🤘

GPU-aware load balancing: managing AI compute like a pro

Sun, 24 May 2026 00:00:00 GMT

You just scaled your RAG application to a hundred concurrent users. Suddenly, your latency spikes. Some users get their answers in two seconds, while others are staring at a loading spinner for thirty. You check your load balancer and it says everything is fine. CPU is at 40%. RAM is stable. But your GPUs are screaming, and your P99 latency is in the gutter.

The problem is that you are treating your AI models like traditional web servers. Sending a 4,000-token prompt to the same GPU that is currently generating a 50-token summary is a recipe for disaster. Round-robin routing is a relic of the past when it comes to LLM inference. If you don't account for the unique way GPUs handle compute and memory, you aren't just wasting money. You are killing your user experience.

The solution isn't just "more GPUs." It is building a load balancer that actually understands what is happening inside the model. We need to talk about GPU-aware routing, prefill vs decode disaggregation, and why your KV cache is the most valuable asset in your stack.

Why round-robin is a trap for LLMs

In traditional software development, a request is a request. Whether it's a GET /users or a POST /orders, the variance in resource consumption is usually predictable and small. Standard load balancers like Nginx or HAProxy work great here. They look at basic health checks and send traffic to the next available worker.

AI is different. A single request to an LLM has a massive variance in "weight." One user might ask "what is 2+2?" while another uploads a 50-page PDF and asks for a deep analysis. If your load balancer sends both to the same GPU, the heavy request will hog the compute resources, forcing the light request to wait in a queue.

This is why CPU-based metrics are useless. A GPU can be at 100% utilization while performing very different types of work. Some work is compute-bound, meaning it needs raw processing power. Other work is memory-bound, meaning it is limited by how fast data can move in and out of VRAM. To solve this, we have to look deeper into the inference lifecycle.

Prefill vs decode: the performance gap

LLM inference happens in two distinct phases. Understanding the difference between them is the "aha!" moment for GPU load balancing.

The first phase is prefill. This is when the model reads your entire prompt and processes all the tokens at once. It is a heavy, compute-intensive task that builds something called the KV cache (key-value cache). Prefill loves big batches and high-performance tensor cores. It is where the "heavy lifting" happens.

The second phase is decode. This is where the model generates the response one token at a time. Each new token attends to everything that came before it (the prompt and the tokens generated so far), but it reads that context from the KV cache instead of recomputing it. This phase is surprisingly light on compute but incredibly heavy on memory bandwidth. It is slow and long-lived.

When you mix these two on the same GPU without a smart scheduler, the "prefill" of a new request will often pause the "decode" of existing requests. This causes the jittery, stuttering text generation that users hate. By using GPU-aware load balancing, we can prioritize these phases differently across our fleet.

Metrics for the real world

To build a better router, you need to stop looking at CPU and start looking at these four metrics:

Token queue depth — how many tokens are waiting to be processed? This is a much more accurate representation of "load" than simple request counts.
KV cache utilization — GPUs have a limited amount of VRAM. The KV cache stores the "memory" of ongoing conversations. If a GPU's VRAM is 90% full of KV cache, it literally cannot accept a large new prompt, even if it's currently "idle."
Time to first token (TTFT) — this measures the latency of the prefill phase. If your TTFT is climbing, your prefill pool is congested.
Inter-token latency (ITL) — this measures the speed of the decode phase. If this is high, your GPUs are likely memory-bandwidth constrained.

I often recommend using tools like vLLM because they expose these metrics out of the box. You can pipe these into a custom gateway that makes routing decisions based on real-time VRAM availability rather than just "is the server up?"
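A minimal sketch of a routing decision based on those metrics. `NodeStats` and its fields are hypothetical. In practice you would populate them from the metrics your inference servers export.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    kv_cache_usage: float  # fraction of KV-cache blocks in use, 0.0 to 1.0
    queued_tokens: int     # tokens waiting to be processed on this node
    healthy: bool = True


def pick_node(nodes, kv_limit=0.90):
    """Choose the healthy node with the shortest token queue that still has cache room."""
    candidates = [n for n in nodes if n.healthy and n.kv_cache_usage < kv_limit]
    if not candidates:
        return None  # caller should queue the request or shed load
    return min(candidates, key=lambda n: (n.queued_tokens, n.kv_cache_usage))
```

Note that the node with the shortest queue loses if its KV cache is nearly full, which is exactly the case a plain health check misses.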

The prefix-aware hack: SkyWalker-style routing

Here is a secret — the most expensive part of a RAG request is often re-processing the same system prompt or long context over and over again. If you send five consecutive questions about the same document to five different GPUs, each GPU has to perform the "prefill" phase for that document from scratch.

This is where prefix-aware routing (sometimes called SkyWalker-style routing) comes in. Instead of routing randomly, your load balancer tokenizes the start of the prompt and looks for a GPU that already has that specific content in its KV cache.

By matching the "prefix" of a prompt to a specific GPU, you can skip the prefill work for the cached portion of the prompt; only the new tokens after the shared prefix still need to be processed. For long shared documents, this can cut time to first token from hundreds of milliseconds to a small fraction of that. It is one of the most effective ways to optimize costs in production RAG systems. I've written before about common RAG mistakes, and ignoring cache locality is definitely one of them.
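Here is a simplified sketch of prefix-aware routing. It hashes the first 256 characters of the prompt as a stand-in for tokenizing it, and it remembers which node first saw each prefix. The class name and the round-robin fallback for unseen prefixes are my own assumptions, not how any particular router implements it.

```python
import hashlib


def prefix_key(prompt: str, prefix_chars: int = 256) -> str:
    """Fingerprint the start of the prompt (characters stand in for tokens here)."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


class PrefixRouter:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.prefix_owner = {}  # prefix fingerprint -> node that has it cached
        self._next = 0

    def route(self, prompt: str) -> str:
        key = prefix_key(prompt)
        node = self.prefix_owner.get(key)
        if node is None:
            # Unseen prefix: fall back to round-robin and remember the owner
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            self.prefix_owner[key] = node
        return node
```

A production version would also expire owners when the node evicts the prefix from its KV cache, and would override affinity when the owner is overloaded.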

Splitting the fleet into specialized pools

As you scale, you should stop treating every GPU as a generalist. A senior move is to create disaggregated inference fleets.

I like to split my GPUs into two pools:

The prefill pool — high-compute GPUs (like H100s) optimized for processing massive amounts of context quickly. These nodes handle the initial prompt and then "hand off" the state.
The decode pool — memory-optimized GPUs (like A100s or even cheaper L40s) that focus on churning out tokens for existing requests.

This separation lets you scale based on your specific workload. If your users are uploading huge documents but only asking for short summaries, you scale your prefill pool. If they are having long, chatty conversations, you scale your decode pool.

This is the same logic we use in modern DevOps with Coolify. You wouldn't put your heavy database on the same tiny instance as your frontend — why would you mix your heavy prefill work with your light decode work?

Implementing your first GPU-aware router

You don't need to build a custom engine from scratch to start doing this. Here is the practical path I follow when setting this up for a new SaaS:

Centralize your metrics — use Prometheus to scrape vLLM or TGI metrics from every GPU node.
Use a smart gateway — implement a middleware in Go or Rust (or even a heavy-duty Lua script in OpenResty) that queries these metrics before choosing a target.
Prioritize KV cache — check if the conversation_id has been seen by a specific node recently. If it has, and that node isn't at 100% KV utilization, send it there.
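Set hard limits — if a GPU's KV cache reaches 85% usage, take it out of the rotation for new prompts until some sessions finish. Watch the KV-cache metric rather than raw VRAM, because engines like vLLM preallocate most of the VRAM at startup.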
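The last two steps can be sketched as one function. It assumes you already have a dict of KV-cache usage per node (from your metrics scrape) and a dict remembering which node last served each conversation. Both structures and the function name are hypothetical.

```python
def route_request(conversation_id, kv_usage, affinity, new_prompt_limit=0.85):
    """Prefer the node that already holds this conversation, else the emptiest open node.

    kv_usage: dict of node name -> KV-cache usage (0.0 to 1.0).
    affinity: dict of conversation_id -> node name, updated in place.
    """
    preferred = affinity.get(conversation_id)
    # Sticky route: reuse the cached context unless that node is completely full
    if preferred in kv_usage and kv_usage[preferred] < 1.0:
        return preferred

    # New or displaced conversation: only nodes under the hard limit are eligible
    open_nodes = {name: usage for name, usage in kv_usage.items() if usage < new_prompt_limit}
    if not open_nodes:
        return None
    chosen = min(open_nodes, key=open_nodes.get)
    affinity[conversation_id] = chosen
    return chosen
```

Existing conversations keep their node until it is full, while new prompts are kept away from anything above the limit.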

Managing AI compute is about moving from "black box" infrastructure to "context-aware" infrastructure. When your load balancer knows the difference between a 10-token greeting and a 10,000-token context window, your costs go down and your users stay happy.

It's easy to get lost in the hype of "agentic systems" and context-aware agents, but none of that matters if your underlying infrastructure is buckling under the weight of unoptimized routing.

If you are still using round-robin for your AI models, what is the biggest bottleneck you are seeing in your P99 latency right now? Drop a note via contact — I love this conversation. 🤘

Circuit breakers: preventing cascading failures in your vector DB

Sat, 23 May 2026 00:00:00 GMT

You built a beautiful RAG pipeline. It works perfectly on your machine with a few hundred vectors. Then you launch. Traffic spikes. Suddenly your managed vector database starts sweating. A single similarity search that used to take 50ms is now taking 5 seconds. Your web workers are all tied up waiting for responses that aren't coming. The database isn't technically down — but it is slow enough to kill your entire application. Your users see spinning loaders until the request finally times out. This is a classic cascading failure, and it is the fastest way to drain your "innovation budget" and your users' patience.

The problem is that we often treat external APIs and databases as if they are always healthy. We write code that assumes the vector DB will return results. When it doesn't, we wait. And while we wait, we hold onto memory and CPU cycles. The solution is an old-school electrical engineering concept applied to software: the circuit breaker.

In this guide I want to show you how to wrap your AI infrastructure in protective logic so a slow dependency doesn't take your whole SaaS down with it.

Understanding the three states

The circuit breaker pattern is a state machine that sits between your application code and your external service. It monitors every call you make. It has three specific states that dictate how it handles traffic.

Closed — the healthy state. In the closed state, the circuit is complete. Requests flow through to your vector database or LLM provider normally. The breaker is silently watching. It keeps a count of how many requests failed or took too long. As long as the failure rate stays below your threshold, it stays closed. This is the "everything is fine" mode.

Open — the fail-fast state. Once the failure threshold is hit — let's say 50% of requests failed in the last 30 seconds — the breaker "trips" and moves to the open state. Now, every time your application tries to call the vector DB, the breaker immediately throws an error or returns a fallback response without even attempting the network call. This gives your database room to breathe and recover. It also ensures your application doesn't waste time waiting on a service that is clearly struggling.

Half-open — the recovery test. After a cooldown period, the breaker moves to the half-open state. It allows a small number of "test" requests to pass through. If these test calls succeed, the breaker assumes the service is healthy again and moves back to the closed state. If they fail, it immediately goes back to open for another cooldown cycle. This is a controlled way to probe the system before fully re-engaging.
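The three states translate into a small state machine. This is a minimal sketch, not a production library: it trips on a count of consecutive failures instead of a failure rate over a time window, and it only counts exceptions, so you would need to give the wrapped call its own timeout for slow responses to register as failures. All names are illustrative.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("dependency is cooling down")
            self.state = "half_open"  # cooldown over: let a probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and resets the failure count
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Usage is one line per dependency call, for example `breaker.call(search_vectors, query)` where `search_vectors` is your own client function, with a handler for `CircuitOpenError` that returns the fallback.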

Why your RAG pipeline needs this

RAG pipelines are particularly vulnerable because they usually involve multiple high-latency network hops. You have to embed the query, search the vector DB, and then call the LLM. If any of these pieces fail or slow down, the whole experience breaks.

Most developers make the mistake of only handling hard errors like a 404 or a 500 status code. But in production, "slow" is often more dangerous than "down." A slow vector DB creates a bottleneck that backs up your entire request queue. By the time you realize there is a problem, your server is out of memory because it is holding open thousands of connections.

If you have read my previous post on 7 RAG mistakes in production, you know that reliability is the difference between a demo and a product. The circuit breaker is your insurance policy against these types of outages.

Implementing fallback strategies

Tripping the breaker shouldn't always mean showing an error message to the user. The best AI systems use fallbacks to maintain a level of service even when parts of the stack are failing.

Hot and cold tiers. You can think of your vector DB as your "hot" knowledge tier. If it fails, you should have a "cold" fallback. Maybe you fall back to a standard keyword search in your primary Postgres or MySQL database. The results might not be as contextually relevant as a vector search, but a "decent" answer is always better than a "timed out" error.

Cached responses. Another great strategy is semantic caching. If the circuit is open, you can check a Redis cache for similar queries that were answered recently. Even if you can't generate a fresh answer, you might be able to serve a cached one. This keeps the user moving while your backend recovers.

LLM-only mode. If your retrieval step is what's failing, you can still send the user's prompt to the LLM with a note that external knowledge is currently unavailable. The LLM can then answer based on its general training data. It is a degraded experience, but it is still functional. Transparency here is key — tell the user that the "live" data isn't available so they know to verify the response.
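These fallbacks chain naturally. Below is a rough sketch that tries each retrieval strategy in order and drops to LLM-only mode if none of them returns anything. The strategy callables are placeholders for your own vector search, cache lookup, and keyword search.

```python
def retrieve_with_fallbacks(query, strategies):
    """Try each (label, callable) in order and return (label, result) for the first that works.

    Returns ("llm_only", None) when every strategy fails or comes back empty, so the
    caller can prompt the LLM without context and tell the user about it.
    """
    for label, strategy in strategies:
        try:
            result = strategy(query)
        except Exception:
            continue  # includes an open circuit breaker: move to the next tier
        if result:
            return label, result
    return "llm_only", None
```

Returning the label alongside the result lets the UI say which tier answered, which supports the transparency point above.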

Building it in Laravel

Since I spend a lot of time in the Laravel ecosystem, I often use tools that make this easy to implement. You don't need to write the state machine from scratch. A circuit-breaker package such as ackintosh/ganesha, or a custom wrapper around Laravel's HTTP client (illuminate/http), can get the job done.

The goal is to wrap your API calls in a block that understands these states. Here is a simplified look at how that logic looks in practice.

When you call your vector DB client, you wrap it in the breaker. If the call fails multiple times, the breaker trips. In the catch block, you handle the CircuitBreakerOpenException by returning your fallback data. This keeps your controllers clean and your architecture robust.

You can also integrate this with your SaaS hosting on Coolify to ensure that your containers don't get killed by health checks just because an external API is slow. The breaker prevents the resource bloat that usually triggers those health-check failures.

Live telemetry and smart routing

Senior engineers don't just set a circuit breaker and walk away. They monitor it. You need live telemetry to see how often your circuits are tripping. Tools like Prometheus or even simple logs piped to a dashboard can tell you a lot.

If you see that your primary vector DB in us-east-1 is constantly tripping but your secondary in eu-west-1 is healthy, you can implement smart routing. Your circuit breaker can act as a signal to your load balancer or your internal router to shift traffic to the healthy region.

This kind of event-driven architecture makes your system self-healing. It doesn't wait for a human to wake up at 3am to fix a database. It detects the failure, trips the breaker, uses the fallback, and tries to recover automatically.

Practical steps to get started

If you are ready to harden your AI infrastructure, start here:

Identify your weakest links — list every external call in your RAG pipeline. Usually it is the embedding API and the vector DB.
Define your thresholds — how many slow requests are you willing to tolerate? Start with a 50% failure rate over 30 seconds and a 2-second timeout.
Choose your fallbacks — decide what happens when the breaker is open. Do you show an error, use a cache, or switch to keyword search?
Wrap your clients — use a library to wrap your HTTP or database calls. Don't try to build the state machine logic yourself unless you have a very specific use case.
Monitor the trips — set up an alert when a circuit stays open for more than a few minutes. This usually indicates a major provider outage that needs your attention.

The goal is to fail gracefully. Every system has issues, but the ones that survive are the ones that don't let a small fire in a dependency burn down the whole house.

Have you ever had a slow dependency take down your entire application, or are you still relying on long timeouts and luck? Drop a note via contact — I love this conversation. 🤘

Message queues: handling the heavy lifting of document processing

Fri, 22 May 2026 00:00:00 GMT

If you are running your document embeddings inside your request-response cycle, you are playing with fire. I have seen too many junior devs build a beautiful RAG application that falls over the second a user uploads a 50MB PDF. The browser spins, the Nginx timeout hits, and the database locks up while your worker tries to chunk 500 pages of legal jargon in real time.

This is the classic "heavy lifting" problem in AI engineering. Document processing — OCR, text extraction, semantic chunking, and embedding — is slow, unpredictable, and resource-heavy. Trying to force it into a synchronous web request is a recipe for a bad user experience and a fragile system.

The solution is decoupling. I'm talking about message queues. In this guide, I'll walk you through why async work belongs in a queue and how to build a production-grade ingestion pipeline that doesn't melt your server.

The synchronous trap

Imagine a user uploads a document to your SaaS. Your code receives the file, sends it to an extraction API, waits for the response, loops through the text to create chunks, sends each chunk to an embedding model, and finally saves it to pgvector.

If any of those steps take more than 30 seconds, the connection drops. If the embedding API has a momentary blip, the whole process fails, and the user has to start over. Worse, while your server is busy doing this heavy work, it's not responding to other users.

This is where we apply the first rule of senior engineering: if it takes more than 100ms, consider making it async. By moving this work to a message queue, you give your users immediate feedback ("we're processing your file!") while the heavy lifting happens safely in the background.

The anatomy of a document pipeline

A robust RAG pipeline isn't just one big function. It's a series of decoupled stages. I like to break it down into modular steps, each triggered by a message in a queue. This lets you scale different parts of the system independently.

Here is how I usually structure it:

Ingestion & discovery — a user uploads a file. You save it to S3 and push a small message to the queue containing the file_path and tenant_id.
Parsing & normalization — a worker picks up the message, downloads the file, and runs it through a parser like pdfplumber or an OCR service. It emits the raw text to the next queue.
Chunking — this worker takes the text and splits it into semantic sections. Doing this in its own stage means you can easily swap chunking strategies (e.g., recursive character vs semantic) without re-running the heavy parsing step.
Embedding & indexing — the final stage batches the chunks, hits your embedding API (like OpenAI or a local model), and pushes the vectors into your vector DB.

This stage-based approach is exactly what I discuss in my post on 7 RAG mistakes to avoid in production. It provides backpressure control — if your vector DB slows down, the "index" queue grows, but the "parsing" workers keep humming along.
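To show the shape of the four stages, here is a toy version that uses in-process `queue.Queue` objects in place of a real broker. The parser and embedder are stand-ins that return fake data, and every function name here is hypothetical.

```python
import queue


def run_stage(inbox, outbox, handler):
    """Drain one queue, apply the stage handler, and forward results to the next queue."""
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return
        for result in handler(message):
            outbox.put(result)


def parse(message):
    # Stand-in for downloading the file and running a parser or OCR service
    yield {**message, "text": "alpha beta gamma delta"}


def chunk(message, size=2):
    words = message["text"].split()
    for i in range(0, len(words), size):
        yield {"tenant_id": message["tenant_id"], "chunk": " ".join(words[i:i + size])}


def embed(message):
    # Stand-in for an embedding API call
    yield {**message, "vector": [float(len(message["chunk"]))]}


uploads, parsed, chunks, indexed = (queue.Queue() for _ in range(4))
uploads.put({"file_path": "s3://bucket/doc.pdf", "tenant_id": "t1"})  # reference only, not the file
run_stage(uploads, parsed, parse)
run_stage(parsed, chunks, chunk)
run_stage(chunks, indexed, embed)
```

Because each stage only knows its inbox and outbox, swapping the chunking strategy means replacing one handler and replaying the `parsed` queue.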

Retries and the beauty of dead letters

In the real world, things break. APIs time out. PDFs are malformed. Workers crash.

When you use a message queue like Redis (with BullMQ or Laravel Queues) or SQS, you get retries for free. If a worker fails, the message goes back onto the queue to be tried again after a short delay. Exponential backoff is your best friend here — don't hammer a failing API every 5 seconds. Wait 10, then 60, then 300.

But what happens when a document simply cannot be processed? Maybe it's a password-protected PDF or a corrupted file. You don't want it retrying forever and clogging up your workers.

This is where a Dead Letter Queue (DLQ) comes in. After a certain number of failed attempts, the message is moved to the DLQ. This acts as a "quarantine" zone. I can then inspect these failed jobs, fix the underlying issue, and manually re-queue them. It's a safety net that keeps your main production line moving.
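Retries and the DLQ fit together like this. It is a simplified sketch: a real queue system reschedules the job instead of sleeping inside the worker, and the dead-letter store here is just a list. The delays mirror the 10, 60, 300 schedule mentioned above.

```python
import time


def process_with_retries(message, handler, dead_letters,
                         backoff_seconds=(10, 60, 300), sleep=time.sleep):
    """Run the handler, retrying with growing delays; quarantine the message if all fail."""
    last_error = None
    for delay in (0, *backoff_seconds):  # first attempt runs immediately
        if delay:
            sleep(delay)
        try:
            return handler(message)
        except Exception as exc:
            last_error = exc
    # Every attempt failed: move the message to the dead letter queue for inspection
    dead_letters.append({"message": message, "error": repr(last_error)})
    return None
```

Storing the last error next to the message makes the later manual inspection much faster.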

Batching for efficiency

If you are processing 10,000 chunks, you do not want to make 10,000 individual API calls to your embedding provider. That's slow and expensive.

Most embedding APIs and vector databases perform much better with batches. A good worker pattern involves pulling multiple messages from the queue (or aggregating them in memory) and sending them as a single bulk request.
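
To make the batching pattern concrete, here is a small sketch. The `$embedBatch` callable stands in for whatever bulk embedding call your provider offers (it is a placeholder, not a real client), and the batch size of 100 is an arbitrary choice.

```php
<?php
// Group chunks into fixed-size batches so each embedding request carries many inputs.
// $embedBatch stands in for your provider's bulk embedding call (hypothetical).
function embedInBatches(array $chunks, int $batchSize, callable $embedBatch): array
{
    $vectors = [];
    foreach (array_chunk($chunks, $batchSize) as $batch) {
        // One API call per batch instead of one per chunk.
        foreach ($embedBatch($batch) as $vector) {
            $vectors[] = $vector;
        }
    }
    return $vectors;
}

// Fake embedder for illustration: counts calls and returns a dummy vector per chunk.
$calls = 0;
$fakeEmbed = function (array $batch) use (&$calls): array {
    $calls++;
    return array_map(fn (string $text) => [strlen($text)], $batch);
};

$chunks  = array_fill(0, 10000, 'some chunk text');
$vectors = embedInBatches($chunks, 100, $fakeEmbed);
```

With a batch size of 100, the 10,000 chunks from the example above turn into 100 calls.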

In a Laravel environment, I often use job batching to track the progress of a large document. I can see exactly when 95% of a PDF is processed and update a progress bar for the user. If you're interested in how this fits into a larger architecture, check out my thoughts on event-driven pub/sub systems.

Event-driven prefetching

Here is a "senior" tip — queues aren't just for ingestion. You can use them for prefetching.

If a user is chatting with an AI agent and the conversation is heading toward a specific topic, you can fire off a background job to fetch related documents and warm up the cache before the user even asks the next question. This makes your AI feel lightning fast because the context is already "ready" when the retrieval step hits.

By using an event bus, you can decouple the chat interface from these optimization tasks. The chat app just emits a user_asked_question event, and a background worker decides whether it should pre-fetch more data or update the semantic cache.

Monitoring the heart of your app

Once you move to a queue-based system, your most important metric is no longer just "request latency." You need to watch your queue depth.

If the queue depth is growing faster than your workers can clear it, you have a bottleneck. This is where tools like Docker and Coolify make life easy — I can spin up five more worker containers to handle a sudden surge in document uploads. You can read more about how I manage this infra in my Coolify and Docker guide.
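
A back-of-the-envelope way to turn queue depth into a worker count is sketched below. Every input is an assumption you would replace with your own measurements, and the function name is hypothetical.

```php
<?php
// Estimate how many workers are needed to drain the backlog within a target time.
// All parameters are assumed inputs taken from your own metrics.
function desiredWorkers(
    int $queueDepth,
    float $jobsPerWorkerPerMinute,
    float $targetMinutes,
    int $min = 1,
    int $max = 20
): int {
    $needed = (int) ceil($queueDepth / ($jobsPerWorkerPerMinute * $targetMinutes));
    return max($min, min($max, $needed)); // clamp to the allowed pool size
}

// 1,200 waiting jobs, 30 jobs per worker per minute, drain within 10 minutes.
$workers = desiredWorkers(1200, 30, 10);
```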

Practical takeaways for your pipeline

Never store large files in the queue — only pass references (like an S3 key). Keep messages small for better performance.
Make tasks idempotent — assume a message might be processed twice. Use upsert instead of insert in your vector DB to avoid duplicates.
Use structured logging — every worker log should include the doc_id and tenant_id. Searching for "why did this file fail?" is impossible without it.
Scale on queue depth — set up your autoscaler to add workers based on how many messages are waiting, not just CPU usage.
Separate worker pools — have one set of workers for "fast" tasks (like metadata updates) and another for "slow" tasks (like OCR/embedding). Don't let a huge PDF upload block a simple name change.
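
The idempotency point is easiest to see with deterministic IDs. This is a minimal sketch: the `upsert` function is an in-memory stand-in for a vector DB call, and the ID scheme is one possible choice rather than a required one.

```php
<?php
// Deterministic ID: the same document chunk always maps to the same key,
// so reprocessing a message overwrites the record instead of duplicating it.
function chunkId(string $tenantId, string $docId, int $chunkIndex): string
{
    return hash('sha256', "{$tenantId}:{$docId}:{$chunkIndex}");
}

// In-memory stand-in for a vector DB upsert (hypothetical store).
function upsert(array &$store, string $id, array $vector): void
{
    $store[$id] = $vector;
}

$store = [];
$id = chunkId('tenant-42', 'doc-7', 0);
upsert($store, $id, [0.1, 0.2]);
upsert($store, $id, [0.1, 0.2]); // duplicate delivery of the same message
```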

Building a document pipeline is about respecting the time it takes to process data. By moving that work into a queue, you build a system that is resilient, scalable, and — most importantly — provides a smooth experience for your users.

How are you currently handling long-running AI tasks? Are you still fighting with request timeouts, or have you embraced the queue? Drop a note via contact — I love this conversation. 🤘

Rate limiting: protecting your AI wallet

Thu, 21 May 2026 00:00:00 GMT

One runaway agent loop is all it takes to wake up to a $5,000 OpenAI bill.

If you're building AI-powered SaaS or RAG systems, your biggest threat isn't a server crash. It's a "denial of wallet" attack. A buggy client, a malicious user, or even your own experimental agent can spam your API endpoints and burn through your tokens (and credits) in minutes.

Traditional web apps care about requests per second to keep the CPU from melting. In the world of LLMs, we care about tokens per minute to keep the bank account from draining. Standard rate limiting isn't enough anymore. You need an architecture that understands cost, context, and the "noisy neighbor" problem before a single prompt even hits your vector DB.

Why requests per second (RPS) is a lie for AI

In a standard Laravel or Node app, a request is a request. Sure, some take longer than others, but they generally consume similar resources. In AI engineering, one request might be a 50-token greeting, while another is a 128,000-token context dump for a RAG pipeline.

If you only limit request counts, a single user can stay within their "10 requests per minute" limit while still costing you 100× more than a typical user. This is where the noisy neighbor problem becomes a financial crisis.
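
The gap is easy to quantify. The sketch below uses a placeholder price per million input tokens (not a real provider rate) to compare the two requests mentioned earlier.

```php
<?php
// Input cost of a request at an assumed price per million tokens.
// The price is a placeholder, not a real provider rate.
function inputCostUsd(int $tokens, float $usdPerMillionTokens): float
{
    return $tokens / 1_000_000 * $usdPerMillionTokens;
}

$price       = 2.50; // hypothetical USD per 1M input tokens
$greeting    = inputCostUsd(50, $price);
$contextDump = inputCostUsd(128_000, $price);
$ratio       = $contextDump / $greeting;
```

Both calls count as a single request under a request-based limit, yet the second costs 2,560 times more than the first, whatever the actual price is.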

You aren't just protecting your infrastructure. You're protecting your margins. To do this effectively, we have to move from counting "pings" to counting "value."

The anatomy of a denial of wallet (DoW) attack

A denial of wallet attack is the AI equivalent of a DDoS. The goal isn't necessarily to take your site down. It's to exhaust your API quotas or financial budget until your service stops functioning — or you're forced to pay a massive bill.

I've seen this happen in three ways:

The agentic loop — an autonomous agent gets stuck in a logic loop, calling your tool-use functions repeatedly without a "max steps" ceiling.
The scrapers — malicious bots trying to exfiltrate your entire knowledge base by querying every possible permutation of your RAG system.
The dev mistake — a frontend developer accidentally puts an LLM-powered "autocomplete" on a search bar that triggers on every keystroke.
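
For the first case, the fix is a hard ceiling on steps. Here is a minimal sketch, where `$step` is a hypothetical callable that performs one agent iteration and returns true when the task is finished.

```php
<?php
// Run an agent loop with a hard ceiling on steps.
// $step is a hypothetical callable that returns true once the task is done.
function runAgent(callable $step, int $maxSteps = 10): array
{
    for ($i = 1; $i <= $maxSteps; $i++) {
        if ($step($i)) {
            return ['status' => 'done', 'steps' => $i];
        }
    }
    // Ceiling reached: stop spending tokens and surface the failure.
    return ['status' => 'aborted', 'steps' => $maxSteps];
}

$stuck = runAgent(fn (int $i) => false, 5);     // never finishes
$ok    = runAgent(fn (int $i) => $i === 3, 5);  // finishes on step 3
```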

Without token-aware rate limiting, your provider (like OpenAI or Anthropic) will eventually hit you with a 429 error. But by that time, the damage to your wallet is already done.

Solving the noisy neighbor with hierarchical limits

To solve this, I implement a three-layer rate limiting strategy at the API gateway level. This ensures that even if one tenant goes rogue, the rest of the platform stays healthy.

1. The global provider layer

This is your final line of defense. If your OpenAI quota is 500,000 tokens per minute (TPM), set your internal global limit to 450,000. This leaves a safety buffer and prevents you from actually hitting the provider's hard ceiling, which can sometimes lead to temporary account bans or throttled priority.

2. The tenant layer

Every customer gets their own bucket. I usually tie this to their subscription tier. A "Pro" user might get 50,000 TPM, while a "Free" user is capped at 2,000. This ensures no single company can eat up your entire global quota.

3. The user/session layer

Inside a single tenant, you still need limits. You don't want one single employee at a customer's company hogging all the tokens allocated to that entire organization. I set these at about 20% of the total tenant capacity.
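
Put together, the three layers amount to one check that must pass at every level. The sketch below uses the figures from this section; the function name and array layout are assumptions, and a real gateway would read the usage counters from Redis rather than from a local array.

```php
<?php
// Check a request against all three layers; it passes only if every layer has room.
// Usage numbers are tokens already consumed in the current minute.
function checkLimits(int $requestTokens, array $usage, array $limits): ?string
{
    foreach (['user', 'tenant', 'global'] as $layer) {
        if ($usage[$layer] + $requestTokens > $limits[$layer]) {
            return $layer; // name of the first layer that rejects the request
        }
    }
    return null; // allowed
}

// Figures from the text: 450,000 global, 50,000 for a "Pro" tenant, 20% of that per user.
$limits = ['global' => 450_000, 'tenant' => 50_000, 'user' => 10_000];
$usage  = ['global' => 200_000, 'tenant' => 30_000, 'user' => 9_500];

$blockedBy = checkLimits(1_000, $usage, $limits); // the user layer is nearly full
```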

Implementation: the token bucket algorithm

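For most of my builds, I use a "token bucket" algorithm backed by Redis. (Its cousin, the "leaky bucket," smooths traffic to a fixed output rate instead of allowing bursts.) The token bucket is a common choice for absorbing bursty traffic while capping the long-run average.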

Here is the logic: each user has a "bucket" of tokens. Every time they send a prompt, we estimate the total cost (input tokens + expected max_tokens). If the bucket has enough, they proceed and the tokens are deducted. The bucket refills at a constant rate over time.

If you're modernizing your stack or building a SaaS on the LEMP stack, you can implement this efficiently in Laravel using middleware and a fast storage layer like Redis.

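// A simplified token-bucket check in Laravel middleware.
// $this->tokenizer and $this->limiter are app-specific services, assumed to be injected via the constructor.
public function handle($request, Closure $next)
{
    $tenantId = $request->user()->tenant_id;

    // Estimate the full cost: prompt tokens plus the largest completion allowed.
    // The 1024 fallback is an assumed default, not a provider value.
    $estimatedTokens = $this->tokenizer->estimate($request->input('prompt'))
        + (int) $request->input('max_tokens', 1024);

    if (!$this->limiter->consume("tenant:{$tenantId}:tokens", $estimatedTokens)) {
        return response()->json(['error' => 'token budget exceeded'], 429);
    }

    return $next($request);
}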
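
The middleware leaves the limiter itself undefined. As a sketch of the arithmetic behind a `consume` call, here is an in-memory token bucket. The class and its parameters are illustrative; a production limiter would keep this state in Redis and update it atomically.

```php
<?php
// In-memory token bucket. A production limiter would keep this state in Redis
// and update it atomically; this version only illustrates the arithmetic.
final class TokenBucket
{
    private float $tokens;
    private float $lastRefill;

    public function __construct(
        private float $capacity,        // maximum burst size
        private float $refillPerSecond, // steady refill rate
        float $now
    ) {
        $this->tokens = $capacity;
        $this->lastRefill = $now;
    }

    public function consume(float $cost, float $now): bool
    {
        // Refill based on elapsed time, never above capacity.
        $elapsed = max(0.0, $now - $this->lastRefill);
        $this->tokens = min($this->capacity, $this->tokens + $elapsed * $this->refillPerSecond);
        $this->lastRefill = $now;

        if ($cost > $this->tokens) {
            return false; // not enough budget; nothing is deducted
        }
        $this->tokens -= $cost;
        return true;
    }
}

// A "Free" tier bucket: 2,000 tokens per minute.
$bucket = new TokenBucket(capacity: 2000, refillPerSecond: 2000 / 60, now: 0.0);
$first  = $bucket->consume(1500, 0.0);  // fits in the initial burst
$second = $bucket->consume(1500, 1.0);  // only ~533 tokens available after 1 s
$third  = $bucket->consume(1500, 31.0); // refilled to ~1533 after 30 more seconds
```

Passing the clock in as `$now` keeps the example deterministic; real code would read the current time itself.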

Token-budget routing and adaptive throttling

What happens when a user hits their limit? Most devs just throw a 429 error. But as a senior engineer, I prefer a more graceful degradation. We call this adaptive throttling.

Instead of a hard "no," you can:

Degrade the model — switch the request from GPT-4o to a cheaper, faster model like GPT-4o-mini.
Truncate the context — if the user is over budget, strip out some of the retrieved RAG documents to lower the input token count.
Queue the request — for non-interactive tasks (like background summarization), move the request to a message queue and process it when the token bucket refills.

This keeps the user experience intact while protecting your margins. It's about being smart, not just being a gatekeeper.
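
One way to encode that decision is sketched below. The ordering of the checks and the action names are illustrative choices, not a fixed policy, and falling back to a cheaper model only helps if budgets are tracked in cost rather than raw token counts.

```php
<?php
// Pick a degradation strategy instead of rejecting outright.
// The ordering and the action names are illustrative choices, not a fixed policy.
function throttleAction(
    int $remainingTokens,
    int $requestTokens,
    int $contextTokens,
    bool $interactive
): string {
    if ($requestTokens <= $remainingTokens) {
        return 'proceed';
    }
    if (!$interactive) {
        return 'queue';            // background work can wait for a refill
    }
    if ($requestTokens - $contextTokens <= $remainingTokens) {
        return 'truncate_context'; // dropping retrieved documents makes it fit
    }
    return 'degrade_model';        // fall back to the cheaper model
}

$action = throttleAction(1000, 3000, 2500, true);
```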

The RAG context: limiting the "hidden" calls

In a RAG (retrieval-augmented generation) system, one user query often triggers multiple backend actions:

One embedding call for the query.
One search query to the vector database.
One (or more) LLM calls for the final answer.

If you only rate limit the final LLM call, your vector database might still get hammered by search queries. You need to treat the entire "RAG flow" as a single unit of work with its own combined budget. I write about common RAG production mistakes often, but rate limiting the whole flow is the fix I see overlooked most.

Practical steps to protect your system today

If you're launching an AI feature this week, do these three things:

Set a hard monthly spend cap — most API providers let you set a maximum dollar amount per month. Set it. It's your parachute.
Enforce max_tokens — never let a user request an uncapped response. Always set a sane default for max_tokens in every API call.
Implement a per-request timeout — if an LLM call takes longer than 30 seconds, kill it. Slow calls are often the symptom of a system that is about to spiral out of control.

Rate limiting isn't just a "security" feature. In the AI era, it's a core part of your business model. You can't scale a product that allows a single user to run up a thousand-dollar bill in their first hour.

Build for fairness. Build for cost. Build for the "noisy neighbor."

Have you ever seen a "denial of wallet" happen in the wild, or are you still running on a wing and a prayer with global provider limits? Drop a note via contact — I love this conversation. 🤘

API Gateway: the front door of your AI stack

Wed, 20 May 2026 00:00:00 GMT

Stop exposing your models to the wild.

If you are building a production AI app, sending requests directly from your frontend to a RAG orchestrator or — god forbid — straight to an LLM provider is a liability. It is slow. It is insecure. And it is the fastest way to wake up to a five-figure bill you didn't plan for.

I have spent over a decade building software, and if there is one thing I have learned, it is that engineering for "it works" is not the same as engineering for "it scales." In the world of AI, scale isn't just about traffic. It is about cost, latency, and data safety.

Imagine a "denial of wallet" attack where a malicious script spams your completions endpoint. Without a gatekeeper, your API keys are just sitting ducks. Or worse, imagine a multi-tenant app where one user's prompt accidentally retrieves another user's private data from your vector DB.

This is where the API gateway comes in. It is the first line of defense and the brain of your infrastructure. It handles the boring but critical stuff so your RAG logic can stay focused on actually being smart.

The gatekeeper pattern

At its core, an API gateway is a reverse proxy that sits between your users and your backend services. But for an AI stack, it does more than just forward traffic. It acts as a centralized brain for auth, routing, and rate limiting.

When a request hits your gateway, it goes through a gauntlet of checks before it ever touches a model. This "gatekeeper" ensures that every millisecond of GPU time or every cent of token cost is intentional.

Authentication and tenant isolation

In a typical SaaS, authentication is about knowing who the user is. In an AI-powered SaaS, it is about data sovereignty.

If you are building a RAG system, your biggest risk is cross-tenant data leakage. If you want to avoid common RAG mistakes, you must handle identity at the very edge.

I prefer using JWTs (JSON Web Tokens) with custom claims. When a request hits the gateway, I validate the token and extract the tenant_id. That ID is then injected into the headers of the request before it is passed to the RAG orchestrator.

This means the orchestrator doesn't have to "guess" who the user is. It receives a verified x-tenant-id header and uses it to apply metadata filters on the vector database. The user only "sees" data they are allowed to see. No tenant ID? No query. Period.
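
Here is a minimal sketch of that gateway step using only PHP built-ins. It assumes HS256-signed tokens and a `tenant_id` claim, and it skips expiry and algorithm-header checks for brevity; in production you would normally rely on a maintained JWT library. All function names are hypothetical.

```php
<?php
function base64UrlEncode(string $value): string
{
    return rtrim(strtr(base64_encode($value), '+/', '-_'), '=');
}

function base64UrlDecode(string $value): string
{
    // Restore the padding that base64url strips before decoding.
    $padded  = str_pad($value, strlen($value) + (4 - strlen($value) % 4) % 4, '=');
    $decoded = base64_decode(strtr($padded, '-_', '+/'), true);
    return $decoded === false ? '' : $decoded;
}

// Verify an HS256 token and return the headers to forward upstream, or null to reject.
function tenantHeadersFromJwt(string $jwt, string $secret): ?array
{
    $parts = explode('.', $jwt);
    if (count($parts) !== 3) {
        return null;
    }
    [$header, $payload, $signature] = $parts;

    $expected = base64UrlEncode(hash_hmac('sha256', "{$header}.{$payload}", $secret, true));
    if (!hash_equals($expected, $signature)) {
        return null; // signature mismatch
    }

    $claims = json_decode(base64UrlDecode($payload), true);
    if (!is_array($claims) || empty($claims['tenant_id'])) {
        return null; // no tenant ID, no query
    }
    return ['x-tenant-id' => (string) $claims['tenant_id']];
}

// Build a demo token to exercise the check.
$secret = 'demo-secret';
$h = base64UrlEncode(json_encode(['alg' => 'HS256', 'typ' => 'JWT']));
$p = base64UrlEncode(json_encode(['sub' => 'user-1', 'tenant_id' => 'tenant-42']));
$token = "{$h}.{$p}." . base64UrlEncode(hash_hmac('sha256', "{$h}.{$p}", $secret, true));

$headers = tenantHeadersFromJwt($token, $secret);
```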

Smart routing for model flexibility

The AI world moves fast. Today you are using GPT-4o. Tomorrow, Claude 3.5 Sonnet might be the better play. Next week, you might want to test a fine-tuned Llama 3 model running on your own infrastructure via Docker and Coolify.

If your model logic is hardcoded into your frontend or a single monolithic backend, switching models is a nightmare. An API gateway solves this with smart routing.

I use the gateway to create "model aliases." Instead of the frontend calling a specific model, it calls a generic endpoint like /v1/chat/completions. The gateway then decides where to send that request based on:

User tier — free users get routed to a cheaper, faster model like GPT-4o-mini. Pro users get the heavy hitters.
Versioning — run an A/B test by routing 10% of traffic to a new model version without changing a single line of client-side code.
Failover — if OpenAI is having an outage, the gateway can automatically reroute traffic to an Anthropic backup.

This level of abstraction is what separates a weekend project from a resilient SaaS product.
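
The three routing rules can be sketched as a single resolver. The model names for the free and pro tiers come from the text; the candidate and backup names, the 10% split, and the function itself are assumptions for illustration.

```php
<?php
// Resolve a model alias at the gateway. The candidate and backup names
// and the 10% split are placeholders for illustration.
function resolveModel(string $tier, int $userId, bool $primaryHealthy): string
{
    if (!$primaryHealthy) {
        return 'anthropic-backup';   // failover target (placeholder name)
    }
    if ($tier === 'free') {
        return 'gpt-4o-mini';
    }
    // Stable A/B bucket: the same user always lands in the same group.
    $bucket = crc32((string) $userId) % 100;
    return $bucket < 10 ? 'gpt-4o-candidate' : 'gpt-4o';
}

$freeModel     = resolveModel('free', 1, true);
$failoverModel = resolveModel('pro', 1, false);
```

Hashing the user ID instead of picking a random number keeps each user on the same variant for the duration of the test.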

Rate limiting: protecting the wallet

We used to rate limit to protect our CPUs. Now, we rate limit to protect our bank accounts.

AI requests are asymmetric. A user sends a 50-word prompt, and the model might generate a 1,000-word response. The cost difference is massive.

A good API gateway implementation allows for tiered rate limiting. Set global limits to prevent your entire system from being overwhelmed, but also set per-tenant or per-user limits.

I usually implement this using Redis. The gateway checks the user's quota in real time. If they have exceeded their daily token limit or their requests-per-minute (RPM) cap, the gateway returns a 429 Too Many Requests immediately.

This saves your backend from doing expensive work that you won't get paid for. It also stops "noisy neighbors" — one user scripting an automated tool that hogs all your capacity and makes the app slow for everyone else.

Handling the AI-specific quirks

Gateways for AI need to handle two things differently than traditional web apps: streaming and long-running requests.

Streaming support

Most modern AI apps use Server-Sent Events (SSE) to stream responses word by word. Some older gateways or load balancers try to "buffer" the entire response before sending it to the client. This kills the user experience.

Make sure your gateway (whether you are using Kong, Tyk, or a custom Laravel solution) is configured to disable buffering for AI routes. The data should flow through the gateway like water through a pipe, not like a bucket that needs to be filled.

Extended timeouts

Traditional APIs expect a response in 1–2 seconds. A complex RAG query involving multiple vector searches and a large model generation might take 30 seconds or more.

You need to adjust your gateway's "upstream timeout" settings. Defaults vary by product, and if yours is shorter than your slowest generation, your users will see constant 504 Gateway Timeout errors even when your models are working perfectly.

Practical steps for your stack

You don't need a massive team to set this up. Here is how I usually approach it depending on the project size:

For startups — use a cloud-native gateway like AWS API Gateway or Azure API Management. They are serverless, scale automatically, and integrate directly with Cognito or Entra ID for auth.
For self-hosters — Kong is the gold standard. It has a great ecosystem of plugins for rate limiting and auth. If you are comfortable with PHP, a thin Laravel app acting as a gateway works surprisingly well for custom logic.
For Shopify devs — if you are building agentic commerce tools, use the gateway to handle the specific Shopify HMAC validation before passing the request to your AI agents.
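
For the Shopify case, webhook verification is compact enough to show. Shopify sends a base64-encoded HMAC-SHA256 of the raw request body in the `X-Shopify-Hmac-Sha256` header; the function name and the sample secret below are placeholders.

```php
<?php
// Verify a Shopify webhook: the X-Shopify-Hmac-Sha256 header carries a
// base64-encoded HMAC-SHA256 of the raw request body, keyed with the app secret.
function isValidShopifyWebhook(string $rawBody, string $hmacHeader, string $secret): bool
{
    $expected = base64_encode(hash_hmac('sha256', $rawBody, $secret, true));
    return hash_equals($expected, $hmacHeader); // constant-time comparison
}

$secret = 'shpss_example';                       // placeholder secret
$body   = '{"id":1001,"topic":"orders/create"}'; // raw body, exactly as received
$header = base64_encode(hash_hmac('sha256', $body, $secret, true));

$valid = isValidShopifyWebhook($body, $header, $secret);
```

The check has to run on the raw body before any JSON parsing, since re-encoding the payload can change the bytes and break the signature.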

Wrapping up

The API gateway isn't just a piece of infrastructure. It is a design philosophy. It says that your AI logic is too valuable — and too expensive — to be left unprotected.

By centralizing auth, routing, and rate limiting, you make your system more modular. You can swap models, change pricing tiers, and update security policies without touching the core code that makes your AI "smart."

Are you still letting your frontend talk directly to your LLM providers? If so, what is the one thing stopping you from putting a gateway in front of it?

Stay sharp. — a senior dev

Actionable takeaways

Centralize auth — never let your RAG orchestrator handle raw user authentication. Do it at the gateway.
Inject tenant context — use the gateway to verify the user and inject a tenant_id header to enforce data isolation.
Implement global + per-user limits — protect your wallet from both malicious attacks and accidental bugs.
Configure for streaming — ensure your gateway doesn't buffer responses, or your "typing" effect will break.
Use model aliases — route to /chat/pro instead of a specific model name to keep your stack flexible.

Scaling with RabbitMQ: why message brokers matter

Sat, 16 May 2026 00:00:00 GMT

The monolith is screaming. Every time a user hits the "checkout" button, your server has to generate a PDF, send a welcome email, update the inventory, and ping three different third-party APIs. Your request/response cycle is hanging by a thread. If any of those external services take more than two seconds to respond, your user sees a 504 gateway timeout.

It starts with a small delay. Then it becomes a bottleneck. Before you know it, you are throwing more RAM at a problem that cannot be solved by bigger hardware. This is the "monolith wall." When everything is synchronous, a single failure in a secondary task brings down the entire user experience.

I have been in these trenches. I have watched dashboards turn red during a marketing spike because the database was too busy processing background reports to handle new signups. The solution isn't just "faster code." It is a change in how your services talk to each other. It is about decoupling. It is about RabbitMQ.

Why your request path is too crowded

In a standard web application, we often fall into the trap of doing too much inside the controller. A user makes a request, and we feel the need to finish every related task before sending back a "200 OK." This is fine for a side project with ten users. For a scaling SaaS, it is a recipe for disaster.

Think of it like a coffee shop. If the person taking your order also has to grind the beans, froth the milk, and hand-draw the logo on the cup before taking the next order, the line will wrap around the block. The shop fails because the cashier is "tightly coupled" to the barista's work.

To scale, you need a system where the cashier takes the order, writes it on a slip, and hands it off. They are immediately free for the next customer. The work happens "in the background." That slip of paper is your message. The counter where they put the slips is your message broker.

The RabbitMQ magic: more than just a queue

RabbitMQ is an open-source message broker that acts as the "middleware" for your architecture. It doesn't just store messages — it routes them with surgical precision.

At its core, RabbitMQ uses a few key concepts:

Producers — your web applications or APIs that create a task.
Exchanges — the "post office" that decides which queue a message should go to based on rules.
Queues — the temporary storage where messages sit until they are processed.
Consumers — the background workers (often running in Docker containers) that actually do the heavy lifting.

By putting RabbitMQ in the middle, your web tier only needs to do one thing: tell RabbitMQ that a task needs to be done. This takes milliseconds. The user gets an instant confirmation, while the heavy work happens whenever your workers are ready.
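
To show how producers, exchanges, and queues relate, here is a toy in-memory model of a direct exchange. It is not the RabbitMQ client API, only an illustration of routing by binding key, and the queue and routing key names are made up.

```php
<?php
// Toy model of a direct exchange: bindings map a routing key to one or more queues.
final class DirectExchange
{
    private array $bindings = []; // routing key => list of queue names
    public array $queues = [];    // queue name => list of messages

    public function bind(string $queue, string $routingKey): void
    {
        $this->bindings[$routingKey][] = $queue;
        $this->queues[$queue] ??= [];
    }

    public function publish(string $routingKey, string $body): int
    {
        $targets = $this->bindings[$routingKey] ?? [];
        foreach ($targets as $queue) {
            $this->queues[$queue][] = $body;
        }
        return count($targets); // 0 means the message was unroutable
    }
}

$exchange = new DirectExchange();
$exchange->bind('emails', 'user.registered');
$exchange->bind('pdfs', 'order.placed');
$exchange->bind('inventory', 'order.placed');

$delivered = $exchange->publish('order.placed', '{"order_id":1}');
```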

Smoothing out the spikes

One of the biggest pains in custom web development is handling "noisy neighbors" or sudden traffic bursts. If a large enterprise client uploads a 100,000-row CSV for processing, you don't want that to slow down the login page for everyone else.

With RabbitMQ, those 100,000 rows become 100,000 individual messages in a queue. Your workers will chew through them at a steady pace. If the queue gets too long, you don't need to scale your entire application — you just spin up more worker instances.

This is called horizontal scaling. Since the workers are decoupled from the web server, you can scale them independently based on the specific load. If you use Laravel, you can manage these background jobs through its queue system. RabbitMQ is not one of the drivers that ships with the framework, so you will need a community queue driver package to connect the two.

How to move from sync to async

You don't have to rewrite your entire codebase overnight. I usually recommend the "strangler pattern." Pick one slow, non-critical process. Maybe it is the "forgot password" email or an image resize task.

Here is a simplified look at how you might dispatch a job in a modern PHP environment:

// instead of sending the email directly
// $emailService->sendWelcome($user);

// we dispatch a job to RabbitMQ
ProcessWelcomeEmail::dispatch($user)->onQueue('high-priority');

// the user gets a response instantly
return response()->json([
    'message' => 'welcome! check your inbox soon.',
]);

Now, even if your email provider (like SendGrid or Mailgun) is having a bad day, your application stays up. The message stays safely in the RabbitMQ queue until the service is back online.

Building for the future

Moving to a message-broker-first mindset is the first step toward a microservices architecture. Once your monolith starts publishing "events" (like order.placed or user.registered), other services can start listening to those events without you ever changing the original code.

It creates a system that is resilient, observable, and significantly easier to debug. You can look at the RabbitMQ management UI and see exactly how many tasks are pending and how fast they are being processed. No more guessing why the server is slow.

Already running on GCP? The same patterns apply with Google Pub/Sub — pick the broker that matches your hosting stack, not the trend cycle.

Key takeaways for your next build

Don't block the user. If a task takes more than 100ms, it probably belongs in a queue.
Decouple early. Use RabbitMQ to separate your "thinking" (web tier) from your "doing" (workers).
Idempotency is key. Since messages can sometimes be delivered twice, make sure your workers can handle the same task more than once without causing errors.
Monitor your queues. A massive queue is a leading indicator that you need more workers or that a service is failing.

Scaling a SaaS isn't about working harder. It is about working smarter by giving your data room to breathe. RabbitMQ is that breathing room.

What is the slowest part of your application right now? Could it be a background job instead? Tell me — I bet we can move it off the request path.

Mastering event-driven architecture with Google Pub/Sub

Sat, 02 May 2026 00:00:00 GMT

Building a modern web application usually starts simple. You have a request and you send a response. But as your business grows, that simple flow starts to feel heavy. Maybe you need to send a welcome email, update a CRM, and trigger a data warehouse sync all at once. If you do this synchronously, your users are stuck staring at a loading spinner. If one service fails, the whole request dies. Your system becomes a house of cards.

This is the problem of tight coupling. Your application logic is tangled like old headphones in a pocket. Every new feature adds more risk and more latency. You want to scale, but your monolithic approach is holding you back. You need a way to let your services talk without being glued together.

The solution is event-driven architecture (EDA). And in the Google Cloud world, the heart of that architecture is Google Pub/Sub. It is a globally distributed messaging service that decouples the services that produce events from the services that consume them. It allows you to build systems that are truly scalable, resilient, and ready for the future of AI and big data.

Understanding topics and subscriptions

At its core, Google Pub/Sub is built on two main concepts: topics and subscriptions. I like to think of a topic as a radio station. It broadcasts information out into the void. It doesn't care who is listening or what they do with the music. It just plays the hits.

On the other side, you have subscriptions. These are the listeners. A subscription represents a stream of messages from a specific topic. The beauty of this system is the decoupling. The service sending the message (the publisher) only needs to know about the topic. It doesn't need to know if there are ten consumers or zero.

In a typical software development workflow, this is a game changer. When a user signs up on your site, you publish a UserSignedUp event to a topic. Your main app is done. It returns a success message to the user immediately. Meanwhile, various subscribers pick up that event and do their jobs in the background.

The power of fan-out

One of the most effective patterns in Google Pub/Sub is the fan-out. This is where you publish a single message to a topic, but multiple subscriptions receive a copy of that message.

Imagine you are running an e-commerce store. When an order is placed, you might have three different services that need to act:

An inventory service to update stock levels.
A shipping service to generate a label.
An analytics service to track revenue.

Instead of your checkout service calling three different APIs, it sends one message to an order-events topic. Three separate subscriptions (one for inventory, one for shipping, one for analytics) each get their own copy of that order message. They process it at their own pace. If the analytics service is down for maintenance, it doesn't stop the shipping label from being created. The messages just wait in the queue until the service is back online.
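
The key property is that each subscription keeps its own backlog. Here is a toy in-memory model of that behavior. It is not the Pub/Sub client API; the class and method names are invented for the illustration.

```php
<?php
// Toy model of a topic with independent subscriptions: each keeps its own backlog.
final class Topic
{
    private array $backlogs = []; // subscription name => pending messages

    public function subscribe(string $name): void
    {
        $this->backlogs[$name] = [];
    }

    public function publish(array $message): void
    {
        foreach (array_keys($this->backlogs) as $name) {
            $this->backlogs[$name][] = $message; // every subscription gets a copy
        }
    }

    // Pull and acknowledge everything pending for one subscription.
    public function drain(string $name): array
    {
        $pending = $this->backlogs[$name];
        $this->backlogs[$name] = [];
        return $pending;
    }

    public function pending(string $name): int
    {
        return count($this->backlogs[$name]);
    }
}

$orders = new Topic();
foreach (['inventory', 'shipping', 'analytics'] as $subscription) {
    $orders->subscribe($subscription);
}
$orders->publish(['order_id' => 1]);
$orders->publish(['order_id' => 2]);

$shipped = $orders->drain('shipping'); // shipping keeps working
// analytics is "down": its two messages simply wait in its own backlog
```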

Pull vs push delivery

When you set up a subscription, you have to decide how you want to receive messages. Google Pub/Sub gives you two main options: pull and push.

Push subscriptions are great for serverless architectures. Google Cloud will literally "push" the message to a webhook URL you provide. This is perfect for cloud infrastructure built on Cloud Run or Cloud Functions. It scales automatically and you only pay for what you use. However, you have to make sure your endpoint can handle the sudden spikes in traffic.

Pull subscriptions work differently. Your consumer service asks Google Pub/Sub for messages when it is ready. This gives you much more control over backpressure. If your worker is busy, it doesn't ask for more work. This is the preferred method for long-running services or when you are using tools like Laravel's queue workers. Pull delivery is generally more robust for heavy processing tasks where you want to fine-tune concurrency.

Building resilient systems with DLQs

In a distributed system, things will fail. A database might time out or an external API might be down. If a message can't be processed, you don't want to lose it. This is where Dead Letter Queues (DLQs) come in.

A DLQ is just another topic where Google Pub/Sub sends messages that have failed to be acknowledged after a certain number of attempts. Instead of retrying forever and clogging up your main pipeline, the "poison" message is moved aside.

I always recommend setting up a DLQ for every critical subscription. It acts as a safety net. You can then build a separate dashboard or a small script to inspect these failed messages, fix the underlying issue, and replay them. It is a professional approach to error handling that prevents data loss and keeps your system moving.

Integrating Google Pub/Sub with Laravel

For those of us in the PHP and Laravel ecosystem, integrating Google Pub/Sub is incredibly smooth. While Laravel comes with great support for Redis and SQS, using a package like google/cloud-pubsub allows you to tap into GCP's global scale.

You can treat Google Pub/Sub as a custom queue driver. Here is a quick look at how you might publish a message in a typical service class:

use Google\Cloud\PubSub\PubSubClient;

$pubsub = new PubSubClient([
    'projectId' => 'your-gcp-project-id',
]);

$topic = $pubsub->topic('user-events');

$topic->publish([
    'data' => json_encode([
        'user_id' => 123,
        'action' => 'signup',
    ]),
    'attributes' => [
        'event_type' => 'UserSignedUp',
        'priority' => 'high',
    ],
]);

By using attributes, you can even filter messages at the subscription level. This means a subscriber can choose to only listen for messages where event_type is UserSignedUp. This saves compute power and money because your worker never even sees the messages it doesn't care about.

Monitoring and cost management

Monitoring is not an afterthought. It is a requirement. Google Cloud provides deep integration with Cloud Monitoring for Google Pub/Sub. You should keep a close eye on your "unacked message count." If this number is climbing, it means your subscribers can't keep up with the producers.

Cost is another factor to watch. Google Pub/Sub is very cheap for low volumes, but as you scale to millions of messages, those bytes add up. Use batching on the publisher side to reduce the number of API calls. Also, be mindful of message retention. If you don't need to keep messages for seven days, shorten the retention period to save on storage costs.

Wrap up and takeaways

Moving to an event-driven architecture with Google Pub/Sub is a major step toward building senior-level systems. It gives you the flexibility to grow your application without it becoming a tangled mess. It is the backbone of many high-performance web applications I build for clients today.

Here are the key takeaways for your next project:

Start by identifying "facts" in your system (e.g., OrderPlaced) and turn them into events.
Use the fan-out pattern to keep your services decoupled and focused on one task.
Always implement a Dead Letter Queue to handle failures gracefully.
Use message attributes for efficient filtering at the subscription level.
Design your consumers to be idempotent, so that receiving the same message twice doesn't cause errors or double-charges.
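
The last point deserves a concrete shape. Below is a sketch of a consumer that remembers which message IDs it has handled; the class name and the `amount_cents` field are hypothetical, and the in-memory array stands in for a durable store such as a table with a unique key.

```php
<?php
// Idempotent handler: remembers processed message IDs so a redelivery is a no-op.
// The in-memory array stands in for a durable store (e.g. a unique DB column).
final class PaymentConsumer
{
    private array $processed = [];
    public int $totalChargedCents = 0;

    public function handle(string $messageId, array $payload): bool
    {
        if (isset($this->processed[$messageId])) {
            return false; // duplicate delivery: acknowledge and skip
        }
        $this->totalChargedCents += $payload['amount_cents']; // side effect we must not repeat
        $this->processed[$messageId] = true;
        return true;
    }
}

$consumer = new PaymentConsumer();
$firstDelivery  = $consumer->handle('msg-1', ['amount_cents' => 500]);
$secondDelivery = $consumer->handle('msg-1', ['amount_cents' => 500]); // redelivered
```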

Building these kinds of systems takes a bit more planning upfront, but the payoff in stability and scalability is worth every second.

Are you still using synchronous API calls for everything, or have you started moving toward an event-driven flow? Let me know what's stopping you from making the switch.

Vibe coding and the architectural shift to agentic workflows

Sun, 22 Mar 2026 00:00:00 GMT

I've spent the last decade building Laravel applications, managing Docker clusters, and fine-tuning Shopify stores. For most of that time, "coding" meant one thing: translating a business requirement into a specific syntax that a machine could execute. It was a manual, linear process of writing line by line, debugging stack traces, and managing state.

But recently, the ground has shifted. We're moving away from the era of "writing code" and into the era of "orchestrating intent."

This transition — often playfully called vibe coding — is more than just a meme. It represents a fundamental architectural shift in how we build software, moving from sequential instruction to agentic loops powered by protocols like MCP (Model Context Protocol).

The friction of the manual syntax

The traditional development lifecycle is riddled with invisible friction. You have an idea (the "vibe"), you break it down into tasks, and then you spend 80% of your time fighting with syntax, configuration, and boilerplate.

In a standard Laravel environment, even a simple feature — say, an automated reporting tool — requires you to set up routes, controllers, service classes, and database migrations. You are the compiler. You are the architect. You are the labor.

The problem is that our human cognitive load is being consumed by the "how" rather than the "what." We get stuck in the weeds of PHP version compatibility or Docker networking issues, losing sight of the actual user value. This manual micromanagement doesn't scale as fast as the demands of modern business.

The agitation of the "black box" assistant

When AI first entered the scene with basic autocomplete, it felt like a shortcut. But it wasn't a solution. We ended up with what I call "the Copilot paradox": the AI suggests code, but you still have to copy-paste it, test it, find the error, and feed it back to the AI.

It's a broken feedback loop. The AI is a "black box" that doesn't actually know your system. It doesn't know your database schema, your MCP servers, or your deployment status on Coolify. You are still the manual bridge between the AI's logic and your local environment.

This creates a new kind of fatigue. Instead of writing code, you're now a high-speed code reviewer, constantly context-switching between your editor and a chat interface. This isn't "vibe coding" — it's just accelerated manual labor.

The solution: agentic workflows and MCP

True vibe coding isn't about being lazy; it's about shifting your role to that of a high-level system architect. This becomes possible through agentic workflows — systems that don't just "complete text" but "execute tasks in loops."

The breakthrough here is the Model Context Protocol (MCP), an open protocol introduced by Anthropic. MCP acts as the "USB port" for AI. Instead of you manually giving the AI context, the AI application uses an MCP client to talk to MCP servers that expose your tools: your PostgreSQL database, your Slack channels, or your GitHub repositories.

The shift from chains to loops

In a traditional chain, you give a prompt and get a result. In an agentic loop, the architecture looks like this:

Intent. You describe the outcome ("build a Laravel dashboard for my Shopify sales").
Reasoning. The AI (like Claude) determines it needs to see the schema.
Action. It uses an MCP tool to query the database.
Observation. It sees a missing table and decides to create a migration.
Correction. If the migration fails, it reads the error and fixes it itself.
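The steps above can be sketched as a small loop. This is a rough illustration, not a real agent: `scripted_decide` is a fixed policy standing in for the model, and the tool names and the `state` dictionary are hypothetical. No MCP SDK calls are used.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> callable returning an observation.
Tools = dict[str, Callable[[], str]]


def run_agent_loop(
    decide: Callable[[list[str]], str],
    tools: Tools,
    max_steps: int = 5,
) -> list[str]:
    """Reason -> act -> observe until `decide` returns "done" or steps run out."""
    history: list[str] = []
    for _ in range(max_steps):
        action = decide(history)  # reasoning: pick the next tool
        if action == "done":
            break
        try:
            observation = tools[action]()  # action: call the tool
        except Exception as exc:  # correction: feed the error back into the loop
            observation = f"error: {exc}"
        history.append(f"{action} -> {observation}")
    return history


state = {"table_exists": False}


def get_schema() -> str:
    return "orders" if state["table_exists"] else "no tables"


def run_migration() -> str:
    state["table_exists"] = True
    return "migrated"


def scripted_decide(history: list[str]) -> str:
    """Stand-in for the model: a fixed policy, not a real LLM call."""
    if not history:
        return "get_schema"
    if history[-1].endswith("no tables"):
        return "run_migration"
    return "done"


history = run_agent_loop(
    scripted_decide,
    {"get_schema": get_schema, "run_migration": run_migration},
)
```

The `max_steps` cap is a deliberate guardrail: a loop that can correct itself can also spin forever, so it needs a hard stop.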

I call this "intent-based engineering." You aren't writing the migration — you are approving the architectural decision.

Implementing the agentic stack

As an engineer who values quality, I don't just let the "vibe" take over without guardrails. Here is how I'm currently structuring my agentic stack using Laravel and AI.

1. Defined MCP servers

I build small, dedicated MCP servers that expose only the necessary tools to the AI. This keeps the context window clean and the security tight.

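// Conceptual MCP tool definition in a PHP environment.
// The class name is illustrative; this is not the API of a specific MCP SDK.
final class LaravelMcpTools
{
    /** @return array<string, array{description: string, parameters: array<string, string>}> */
    public function defineTools(): array
    {
        return [
            // Read-only: lets the agent inspect table structure before proposing changes.
            'get_database_schema' => [
                'description' => 'Retrieves the structure of the Laravel application tables.',
                'parameters' => [],
            ],
            // Assumption: "safely" means the server checks the command against an allowlist.
            'run_artisan_command' => [
                'description' => 'Executes an artisan command safely.',
                'parameters' => ['command' => 'string'],
            ],
        ];
    }
}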

2. Stateful loops

Instead of one-off chats, I use tools like Cursor, Claude Code, or Windsurf that maintain a stateful connection to my local file system. This allows the agent to "see" the impact of its changes in real time, just like a human developer would.

3. The human-in-the-loop (HITL)

The most important part of the architecture is the review gate. Even with agentic loops, the human architect must sign off on the "plan" before the "action" phase. This ensures the PHP logic follows clean architecture principles rather than just "making it work."
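As a rough sketch of that review gate, the Python below runs a plan's steps only after an approval callback returns true. The `Plan` type, `execute_with_review`, and the lambda reviewers are hypothetical stand-ins; in practice the approval would come from a person reading the plan in their editor or a pull request.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Plan:
    summary: str
    steps: tuple[str, ...]


def execute_with_review(
    plan: Plan,
    approve: Callable[[Plan], bool],
    run_step: Callable[[str], None],
) -> bool:
    """Run the plan's steps only if the reviewer approves the whole plan."""
    if not approve(plan):
        return False  # nothing executes without sign-off
    for step in plan.steps:
        run_step(step)
    return True


executed: list[str] = []
plan = Plan("Add sales dashboard", ("create migration", "add controller"))

rejected = execute_with_review(plan, lambda p: False, executed.append)
after_reject = list(executed)  # snapshot: still empty after a rejection
approved = execute_with_review(plan, lambda p: True, executed.append)
```

The gate sits before the first step, not between steps, which matches the idea of signing off on the plan rather than on each keystroke.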

The takeaway for the modern founder

If you're a founder or a CTO, the takeaway is simple: stop hiring for syntax and start hiring for system design. The technical barrier is collapsing, but the architectural stakes are higher than ever.

Embrace the vibe. Focus on the intent and the user experience.
Invest in infrastructure. Build the MCP connections and the data pipelines that allow AI to be effective.
Think in loops. Design your internal processes so that AI can iterate autonomously, reducing your bottleneck role.

At Ansezz, I'm not just building apps anymore — I'm building agent-ready ecosystems. Whether it's a complex Shopify integration or a custom SaaS, I ensure the architecture is ready for the agentic future.

The code might be generated, but the vision is entirely yours.

Are you ready to stop writing code and start orchestrating your intent? Get in touch — let's design your agent stack together.

From monolith to micro-services: a senior dev's guide to pragmatic scaling

Sun, 22 Feb 2026 00:00:00 GMT

Your monolith is a ticking time bomb and every feature you add makes the explosion more inevitable.

I have seen it happen a dozen times. A startup begins with a clean Laravel or Rails app. It is fast. It is easy. It is productive. Then the team grows. The codebase swells. Suddenly, a simple change to the checkout logic breaks the authentication system. Deployments that used to take five minutes now take forty. You are no longer scaling your business. You are managing technical debt.

This is the point where most developers start dreaming of micro-services. They imagine a world where every service is isolated and deployments are instant. But the reality is often a nightmare. If you do it wrong, you end up with a distributed monolith. You get all the complexity of networking with none of the benefits of isolation.

The solution is not a "big bang" rewrite. It is pragmatic scaling. I use the strangler fig pattern to move from monoliths to micro-services without losing my mind or my job.

The problem with the big bang

When a monolith becomes too heavy, the immediate reaction is to want to scrap it. I have seen companies spend two years on a rewrite only to ship a product that has half the features of the original. The business dies while the engineers play with new toys.

The monolith is not your enemy. It is just a phase. The real problem is coupling. When every part of your app knows too much about every other part, you cannot move. You are stuck in a web of dependencies. If you try to jump straight into micro-services, you will likely just port those dependencies into a network layer. Now, instead of a function call failing, you have a 500 error across a network socket.

I prefer a slower, more deliberate approach. I focus on high-value extractions. I look for the parts of the app that hurt the most. Is the image processing service slowing down the web server? Is the reporting engine locking up the database? Those are your first candidates for micro-services.

The strangler fig pattern in practice

The pattern takes its name from the strangler fig, a tree that grows around another tree. It starts as a small vine and eventually replaces the host entirely. In software, this means building new features as services while the old monolith remains.

The process starts with an API gateway or a load balancer. I use Nginx or a Google Cloud load balancer (with Cloud Armor in front for protection) to route traffic. If a request comes for /api/v1/orders, it goes to the new service. Everything else goes to the old monolith.

This allows me to test the new service in production with real traffic while the monolith acts as a safety net. If the new service fails, I just flip the routing back. I do not have to migrate everything at once. I can migrate one endpoint at a time.
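The routing rule itself is small. Here is a minimal Python sketch of the decision a gateway makes, assuming prefix-based routing and a per-route `enabled` flag as the rollback switch. The upstream URLs, `MIGRATED_ROUTES`, and `choose_upstream` are hypothetical names; in a real setup this logic lives in the Nginx or load balancer configuration rather than in application code.

```python
MONOLITH = "http://monolith.internal"      # hypothetical upstream
ORDERS_SERVICE = "http://orders.internal"  # hypothetical upstream

# Path prefixes that have been migrated; toggling `enabled` is the rollback switch.
MIGRATED_ROUTES = {
    "/api/v1/orders": {"upstream": ORDERS_SERVICE, "enabled": True},
}


def choose_upstream(path: str, routes: dict = MIGRATED_ROUTES) -> str:
    """Return the upstream for a request path, defaulting to the monolith."""
    for prefix, route in routes.items():
        # Match the prefix itself or anything below it, but not look-alikes
        # such as /api/v1/orders-export.
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and route["enabled"]:
            return route["upstream"]
    return MONOLITH
```

Because the monolith is the default, an unknown or disabled route can never strand a request: the worst case is that traffic lands where it always did.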

Containerization with Docker

I would not run micro-services without containers, and Docker is my default. I treat every service as a black box. The monolith might be running on an old version of PHP, while the new service is a lean Go binary or a modern Laravel instance. Docker makes this possible.

I start by containerizing the monolith. Even if it stays as a monolith for another year, putting it in a container forces me to define its environment. It makes the infrastructure reproducible.

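# A simplified example of a service container
FROM php:8.3-fpm

# Install system libraries and the PostgreSQL PDO extension first,
# so this layer stays cached when only application code changes.
RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq-dev \
    && docker-php-ext-install pdo_pgsql \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . /app

# php-fpm listens on port 9000 by default
EXPOSE 9000
CMD ["php-fpm"]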

Once the monolith is containerized, I can deploy it to a platform like Google Kubernetes Engine (GKE). This is where the real power of micro-services comes in. I can scale the order service to fifty instances during a sale while keeping the blog service at two.

Communication and the anti-corruption layer

The hardest part of micro-services is not the code. It is the data. Your monolith has a single database. Your micro-services should each have their own. But how do they talk?

I use an anti-corruption layer (ACL). When I extract a service, I do not let it reach back into the monolith's database. That would be cheating. Instead, I create an interface. If the new service needs user data, it asks the monolith via a private API or a message queue like Google Pub/Sub.

This keeps the new service clean. It does not care about the messy database schema of the legacy app. It only cares about the data it receives through the ACL. Eventually, when the user logic is also migrated, I just update the ACL to point to the new user service.
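A minimal Python sketch of such a layer is below. It assumes the monolith's private API returns a dictionary with legacy field names; those names (`usr_id`, `email_addr`, `status`) and the `"A"` status code are invented for illustration, as are `User` and `from_legacy_user`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """The clean model the new service works with."""

    id: int
    email: str
    is_active: bool


def from_legacy_user(row: dict) -> User:
    """Translate a legacy payload into the new service's model.

    The legacy field names and status code are hypothetical.
    """
    return User(
        id=int(row["usr_id"]),
        email=row["email_addr"].strip().lower(),
        is_active=row["status"] == "A",
    )


user = from_legacy_user(
    {"usr_id": "17", "email_addr": " Ana@Example.com ", "status": "A"}
)
```

All knowledge of the legacy schema is confined to one function. When the user logic moves to its own service, only `from_legacy_user` has to change or disappear.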

Cloud infrastructure and DevOps

Scaling a monolith usually means buying a bigger server. Scaling micro-services means managing a fleet. I rely heavily on cloud-native tools to manage the complexity.

I use Terraform to manage my infrastructure as code. This ensures that my staging and production environments are identical. If I need a new database for a service, I define it in code. I do not click around in a dashboard.

On the DevOps side, I use tools like GitHub Actions or Coolify for deployments. Every service has its own pipeline. If I update the checkout service, I only deploy the checkout service. I do not have to worry about the rest of the system.

The hidden costs of micro-services

I would be lying if I said this was all sunshine and rainbows. Micro-services come with a "complexity tax." You now have to deal with distributed logging, service discovery, and eventual consistency.

I tell my clients that they should only move to micro-services when the pain of the monolith is greater than the cost of the complexity tax. If your team is three people and your app is simple, stay in the monolith. You will move faster.

But if you are hitting walls every day and your developers are afraid to touch the code, it is time to start strangling.

Pragmatic takeaways for your next move

Start with an API gateway to handle routing.
Containerize your monolith first to normalize the environment.
Use the strangler fig pattern to migrate one domain at a time.
Build an anti-corruption layer to keep new services clean.
Invest in infrastructure as code early on.
Only split when the monolith starts to hurt your productivity.

Migration is a marathon, not a sprint. I have spent months on a single extraction just to make sure it was perfect. The goal is not to have micro-services. The goal is to have a system that can grow with your business.

Have you ever tried a "big bang" rewrite only to regret it six months later? Tell me about it — I collect these stories for a reason.

AI integration vs traditional development: which is better for your business in 2026?

Sun, 25 Jan 2026 00:00:00 GMT

Most teams are asking the wrong question.

The real problem is not "AI or traditional development?" — it is what kind of speed, control, and risk your business can actually afford.

I see this mistake a lot. Teams chase AI because it feels faster. Or they reject it because it feels messy. Then they end up with the same problem from both directions. Rushed systems with weak foundations, or polished systems that ship too late.

The better move is to understand where each approach wins, where it breaks, and where a hybrid model gives you the best return.

Both approaches work. They just solve different problems.

AI-powered development: the speed revolution

AI integration changes how I build software. Instead of manually writing every repetitive piece, I can use tools that understand context, generate scaffolding, speed up testing, and remove a lot of the drag from delivery.

The core advantage: speed

This is where AI shines.

For standard workflows, admin panels, CRUD-heavy systems, internal tools, and first-pass prototypes, AI can cut a serious amount of time. What used to take weeks can often be reduced to days if the scope is clear and the review process is tight.

That speed usually comes from a few places:

Automated code generation. Prompts turn into usable boilerplate and feature drafts.
Faster testing. AI can draft test cases and edge-case coverage quickly.
Debugging support. It helps narrow down likely failures faster.
Documentation help. It can turn rough implementation details into clean internal docs.

Who benefits most from AI development

I would lean toward AI-heavy workflows when speed matters more than perfect customization on day one.

That usually means:

Startups trying to reach product-market fit before the runway gets tight.
Small teams that need leverage more than headcount.
Businesses shipping standard features that already follow familiar patterns.
Teams where non-technical stakeholders want to contribute to discovery and prototyping.

In those cases, AI acts like a power tool. It does not replace the builder. It just makes the first cut much faster.

The trade-offs to consider

This is where a lot of teams get burned.

AI is fast at common patterns. It is weaker at deep product nuance, strange business rules, and systems that need careful long-term architecture. If you skip review, you can ship something that looks finished but behaves like a prototype wearing a production costume.

That means I would not treat AI output as truth. I would treat it as a draft.

Traditional development: the control champion

Traditional development is slower, but it gives me tighter control over how the system is shaped.

This is the path I trust most when the business rules are complex, the architecture matters, or the cost of failure is high. Every part of the system is designed with intent instead of inferred from a prompt.

The core advantage: control

Traditional development is better when the software needs precision.

That matters for:

Complex enterprise systems — lots of moving parts and layered business logic.
Regulated industries — where auditability and traceability matter.
Mission-critical applications — where downtime or bad behavior is expensive.
Custom architectures — where the product does not fit common patterns.

The predictability factor

One underrated benefit of traditional development is predictability.

Manual design, explicit code reviews, architecture decisions, and planned testing give me a clearer picture of trade-offs. It is like building with blueprints instead of assembling furniture from a photo.

That slower process often saves time later because fewer assumptions make it into production.

The time investment reality

The downside is obvious.

Manual coding, reviews, debugging, refactoring, and testing take time. You need stronger engineering talent, and you need the discipline to keep standards high when deadlines start squeezing the team.

Traditional development gives more control, but you pay for it in time and cost.

Head-to-head comparison

Factor	AI-Powered Development	Traditional Development
Development speed	30–50% faster completion	Standard industry timelines
Cost structure	Lower long-term expenses	Higher labor costs
Team requirements	Mixed skill levels acceptable	Requires senior expertise
Customization level	Limited by AI training data	Unlimited customization
Quality assurance	Automated testing and fixes	Manual review processes
Risk management	Variable based on AI reliability	Predictable risk factors
Scalability	Rapid scaling through automation	Scales with team growth

Making the right choice for your business

Choose AI integration when

Choose AI when your bottleneck is delivery speed and the work is close to known patterns.

That usually applies when:

Your market window is tight.
You are building standard business apps like portals, dashboards, e-commerce flows, or content systems.
Your team wants quick prototypes before committing engineering time.
Your budget is better spent on iteration than on deep custom engineering from day one.

Choose traditional development when

Choose traditional development when the cost of being wrong is higher than the cost of being slower.

That usually means:

The app needs a unique architecture.
Compliance and audit trails are mandatory.
Reliability matters more than release velocity.
Your team wants direct ownership of code quality and system design.

The hybrid strategy: best of both worlds

This is the option I recommend most often.

The strongest teams do not treat this like a religion. They use AI where speed helps and switch to traditional engineering where judgment matters.

A practical hybrid setup looks like this:

Generate boilerplate and first drafts with AI, then review and reshape manually.
Use AI for prototyping, then rebuild critical paths carefully.
Automate repetitive testing tasks, but keep human review for logic and architecture.
Use AI to accelerate docs and support material, while keeping final technical decisions human-led.

The hybrid model works because it treats AI like a junior accelerator, not like an autopilot.

Implementation guidelines

Starting with AI integration

If I were introducing AI into an existing team, I would start small.

Begin with low-risk features.
Define a review process for all AI-generated code.
Choose tools that fit the current workflow.
Train the team on prompting, verification, and code quality checks.

Maintaining traditional excellence

If the team stays mostly traditional, I would protect the basics.

Invest in strong senior review.
Keep documentation current.
Use clear architecture standards.
Avoid rushing complex work into fragile implementations.

Building hybrid capabilities

If the goal is balance, then the workflow matters more than the tools.

Identify which tasks are repetitive and safe to automate.
Keep humans responsible for architecture and business logic.
Add quality gates before merge and deployment.
Measure outcomes, not just speed.

The future-ready approach

The teams that will win in 2026 are not the ones that blindly choose AI or reject it.

They are the ones that know where speed is enough, where control is non-negotiable, and where a hybrid model gives them leverage without chaos.

That is the real solution.

Use AI to remove friction. Use traditional engineering to protect the parts that matter. Combine both when the business needs speed and reliability at the same time.

Your development strategy should match your business goals, not the trend cycle. If you had to choose today, which matters more for your next product: speed, control, or a hybrid path? Reach out — I'd love to hear which side you're leaning toward.