Anass Ez-zouaine — Senior Backend Engineer · Software Architect · AI Engineer — AI

Claude MCP: why I'm connecting my dev tools to LLMs for real-time context

Wed, 27 May 2026 00:00:00 GMT

Every time I have to build a custom integration for a new tool, a little piece of my developer soul dies. It is a maintenance nightmare that never ends. We have reached a point where building the actual product is often faster than setting up the pipes to make it work with our data.

If you have spent any time building agentic workflows or vibe coding, you know exactly what I am talking about. You have an LLM like Claude that is incredibly smart but essentially locked in a room with no windows. To give it context, you have to manually copy-paste code, export CSV files, or spend three days writing a brittle wrapper for a third-party API just so your assistant can "see" your work.

This fragmentation is the biggest bottleneck in modern software development. We have powerful models, but they are isolated from our local files, our databases, and our production logs. It is like having a world-class architect who isn't allowed to visit the construction site. They are just guessing based on the photos you decide to send them.

Enter the Model Context Protocol — or as I have been calling it: the USB port for LLMs.

The fragmentation tax is killing your productivity

The problem is simple but massive. Every AI application — whether it is Claude Desktop, a custom agent, or an IDE extension — wants to talk to your data. On the other side, every data source — your GitHub repos, your Postgres databases, your Slack channels — has its own specific API and authentication flow.

Without a standard, we are stuck in an M×N problem. If you have 5 AI apps and 10 data sources, you need 50 different integrations. This is why most "AI-powered" tools feel shallow. They only support a few basic integrations, and if you want to use your internal company data, you are back to writing custom glue code.

The frustration is real. We are wasting hours building the same connectors over and over again. We are worried about security because every new integration is another potential leak. And we are frustrated because the "magic" of AI disappears the moment we hit a data silo.

The solution: MCP as a universal standard

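Claude MCP (Model Context Protocol) is the first serious attempt to standardize how AI applications discover and interact with data and tools. Instead of building a specific connector for every model and every tool, you build an MCP server. Each AI app implements the client side once and each data source gets one server, so the M×N problem becomes M+N: 5 apps and 10 data sources need 15 implementations instead of 50.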

This server acts as a translator. It sits between your data and the AI, exposing a consistent interface that any MCP-compliant host (like Claude Desktop) can understand. It is exactly like the USB standard. It doesn't matter if you are plugging in a mouse, a keyboard, or an external drive. The protocol is the same, so it just works.

This shifts the entire paradigm of context-aware agents. Instead of hard-coding logic into the agent, you simply "plug in" the servers you need.

How the architecture actually works

There are three main players in the MCP ecosystem:

The host — this is the environment the user interacts with. It could be Claude Desktop, a terminal, or an IDE like Cursor. The host is responsible for managing the lifecycle of the connection.
The client — this is the part of the host that speaks the protocol. It does the "handshake" with the server to find out what it can do.
The server — this is a lightweight program that provides context (resources), actions (tools), and prompt templates.

For example, if I want Claude to have access to my local project files, I run a local MCP server that exposes those files as "resources." The client inside the host (Claude Desktop) asks the server: "what do you have?" The server replies: "I have these 10 files and a tool to run grep searches."

The model can then decide to call the "grep" tool whenever it needs to find a specific function definition. I didn't have to write a single line of logic inside Claude to make that happen. I just connected the server.
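
To make that exchange concrete, here is a toy, in-memory sketch of the discover-then-call flow. It does not use the real MCP SDK or wire format: `ToyServer`, `listResources`, `listTools`, and `callTool` are hypothetical names chosen for illustration, and the "grep" is a plain substring search over made-up files.

```typescript
// Toy model of the discover-then-call flow. Not the real MCP SDK or wire format.
interface ToyTool {
  toolName: string;
  description: string;
  run: (args: Record<string, string>) => string;
}

class ToyServer {
  private files: Record<string, string>;
  private tools: ToyTool[];

  constructor(files: Record<string, string>) {
    this.files = files;
    this.tools = [
      {
        toolName: "grep",
        description: "return lines containing a pattern, prefixed with the file name",
        run: (args) => this.grep(args["pattern"] ?? ""),
      },
    ];
  }

  // Answers "what do you have?" for resources.
  listResources(): string[] {
    return Object.keys(this.files).sort();
  }

  // Answers "what do you have?" for tools: names and descriptions only.
  listTools(): { toolName: string; description: string }[] {
    return this.tools.map((t) => ({ toolName: t.toolName, description: t.description }));
  }

  // Executes a tool that the model chose by name.
  callTool(toolName: string, args: Record<string, string>): string {
    for (const tool of this.tools) {
      if (tool.toolName === toolName) return tool.run(args);
    }
    throw new Error(`unknown tool: ${toolName}`);
  }

  private grep(pattern: string): string {
    const hits: string[] = [];
    for (const file of this.listResources()) {
      const text = this.files[file] ?? "";
      for (const line of text.split("\n")) {
        if (pattern !== "" && line.indexOf(pattern) !== -1) hits.push(`${file}: ${line}`);
      }
    }
    return hits.join("\n");
  }
}

// Made-up project files for the example.
const toyServer = new ToyServer({
  "src/user.ts": "export function findUser(id: string) {\n  return db.get(id);\n}",
  "src/order.ts": "export function findOrder(id: string) {\n  return db.get(id);\n}",
});

const discovered = toyServer.listTools();
const grepResult = toyServer.callTool("grep", { pattern: "function findUser" });
```

The host never needs to know how `grep` works. It only sees the tool's name and description, which is what lets the same host work with servers it has never seen before.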

Modularity and the MCP server ecosystem

The beauty of this modularity is that once a server is built, anyone can use it. The community has already started building servers for everything you can imagine. I have been using a few in my daily workflow that have completely changed how I code:

Postgres MCP — I can point Claude at a local or remote database. It can inspect schemas and even run queries to help me debug data issues without me leaving the chat.
GitHub MCP — this allows the model to search through my repositories, list issues, and even create pull requests. It is like having a junior dev who actually knows where the code is.
Google Drive MCP — perfect for when I need to cross-reference technical documentation stored in docs with the actual implementation in my IDE.

This also addresses a major pain point in agentic commerce for Shopify. Imagine an agent that can talk directly to your Shopify store via MCP to check inventory levels or update product descriptions in real time, all while maintaining a secure, standardized connection.

Security first: the sandbox model

The biggest question I get when I talk about connecting dev tools to an LLM is: "is it safe?"

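Security is a core concern in the design of MCP, but it is not automatic. The server is a separate process, so the model can only reach what that server chooses to expose. The protocol itself does not sandbox that process: a local server runs with the permissions of the user who launched it, so grant it only the specific resources it needs.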

For local servers, the protocol typically uses stdio (stdin/stdout). This means the host and the server exchange messages through a very narrow pipe. The transport doesn't need open network ports listening for connections, and the server process only exists as long as the host is running it.

For remote servers reached over HTTP, the MCP authorization spec builds on OAuth 2.1. This allows for fine-grained permissions when the server defines scopes for them. You can authorize a GitHub MCP server to only read public repositories, or a database server to only access specific tables.

This is a huge improvement over the "give me your master API key" approach that we have seen in the past. We can now treat AI tools with the same "least privilege" mindset we use for any other service in our stack. This is especially important when you are trying to avoid RAG mistakes in production, where data leakage is a top-tier risk.

Why I am betting on MCP

I have been a developer for over a decade, and I have seen plenty of "standards" come and go. What makes MCP different is its simplicity and its backers. Anthropic has made this open source because they realize that the more context a model has, the more valuable it becomes.

We are moving toward a world of "agentic" software development. In this world, we don't just use AI to write snippets of code. We use AI as an orchestrator that can reach into our cloud infrastructure on GCP, check our Docker logs, and suggest fixes for a failing Laravel app.

Without a protocol like MCP, that vision is impossible to scale. It would be too expensive and too risky to build. But with MCP, we are building a world where tools are plug-and-play.

Practical takeaways for senior engineers

If you are ready to start experimenting with this, here is what I recommend:

Install the Claude Desktop app — it is currently the most mature host for MCP.
Try the filesystem server — this is the easiest way to feel the power. Give Claude access to a specific folder and watch it navigate your codebase.
Don't build, search first — check the official MCP GitHub repository. There are already servers for Brave search, Postgres, Slack, and more.
Think in tools, not just prompts — start thinking about what "tools" your internal systems could expose. If you have a custom admin panel, could it be an MCP server?
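
For the filesystem step, the host needs to be told how to launch the server. Below is a sketch of the `mcpServers` entry that Claude Desktop reads from its `claude_desktop_config.json`, written as a TypeScript object so the shape is explicit. The directory path is a placeholder, and the location of the config file varies by operating system, so check the current Claude Desktop docs before relying on it.

```typescript
// Shape of one stdio server entry in Claude Desktop's `mcpServers` config (sketch).
interface StdioServerEntry {
  command: string;
  args: string[];
}

// Placeholder path: point this at the single folder you want to expose.
const projectDir = "/path/to/your/project";

const desktopConfig: { mcpServers: Record<string, StdioServerEntry> } = {
  mcpServers: {
    filesystem: {
      command: "npx",
      args: ["-y", "@modelcontextprotocol/server-filesystem", projectDir],
    },
  },
};

// The JSON text that would go into the config file.
const desktopConfigJson = JSON.stringify(desktopConfig, null, 2);
```

Listing one narrow directory instead of your home folder is the simplest form of least privilege you can apply here.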

Connecting your dev tools to an LLM isn't just about speed. It is about reducing the cognitive load of switching between tabs, terminals, and documentation. It allows you to stay in the "flow" longer.

Are you ready to stop copy-pasting your code into a chat box and start connecting your tools directly to the brain? What is the one internal tool you wish you could "plug in" to Claude right now? Drop a note via contact — let's figure it out. 🤘

7 mistakes you're making with your production RAG stack (and how to fix them)

Sun, 17 May 2026 00:00:00 GMT

Getting a RAG (retrieval-augmented generation) demo working is easy. You take a few PDFs, throw them into a vector database like Chroma or Pinecone, and ask a question. It feels like magic.

But shipping RAG to production is where the magic dies.

I've seen too many teams launch a feature only to realize that their users are getting irrelevant answers, waiting 10 seconds for a response, or worse, getting hit with "I don't know" for questions that are clearly in the documentation. When the "vibe check" fails at scale, your users lose trust.

You're likely making at least one of these seven structural mistakes that turn a cool demo into a production nightmare. I've spent the last few years building custom web applications and AI systems, and I've had to fix these same leaks in my own stacks.

Here is how to bridge the gap between "it works on my machine" and a production-grade AI system.

1. Naive chunking is killing your context

Most people start with a simple character-based or token-based splitter. You tell the library to "give me chunks of 500 tokens with a 50-token overlap."

This is a mistake.

This "naive chunking" treats your data like raw soup. It might cut a sentence in half, split a table in the middle of a row, or separate a coding example from the explanation that precedes it. If the retriever pulls only one of those halves, the LLM has zero chance of giving a correct answer.

The fix: use semantic or structural chunking.

I always recommend chunking based on the actual structure of the document first. Use headers (H1, H2, H3), paragraphs, or even markdown delimiters to ensure related ideas stay together. If you're working with complex data, consider recursive character splitting that respects newlines and punctuation before falling back to raw token counts.
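
A minimal sketch of structure-first chunking, assuming markdown input. `chunkByStructure` and `SectionChunk` are hypothetical names, and word counts stand in for real token counts. Two known gaps: a `#` line inside a code fence would be mistaken for a heading, and a single paragraph longer than the limit is kept whole rather than cut.

```typescript
interface SectionChunk {
  heading: string; // nearest heading above the text ("" before the first one)
  text: string;
}

// Rough size proxy: whitespace-separated words, not real model tokens.
function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Split on H1-H3 headings first; only oversized sections are split further,
// and then on blank lines so paragraphs stay whole.
function chunkByStructure(markdown: string, maxWords: number): SectionChunk[] {
  const sections: SectionChunk[] = [];
  let heading = "";
  let buffer: string[] = [];

  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text !== "") sections.push({ heading, text });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    if (/^#{1,3}\s/.test(line)) {
      flush();
      heading = line.replace(/^#{1,3}\s+/, "").trim();
    } else {
      buffer.push(line);
    }
  }
  flush();

  const chunks: SectionChunk[] = [];
  for (const section of sections) {
    if (wordCount(section.text) <= maxWords) {
      chunks.push(section);
      continue;
    }
    // Pack whole paragraphs until the next one would exceed the limit.
    let current = "";
    for (const paragraph of section.text.split(/\n\s*\n/)) {
      const candidate = current === "" ? paragraph : `${current}\n\n${paragraph}`;
      if (current !== "" && wordCount(candidate) > maxWords) {
        chunks.push({ heading: section.heading, text: current });
        current = paragraph;
      } else {
        current = candidate;
      }
    }
    if (current !== "") chunks.push({ heading: section.heading, text: current });
  }
  return chunks;
}

const sampleDoc =
  "# Install\nRun the installer.\n\n## Configure\nSet the API key.\n\nRestart the service.";
const sampleChunks = chunkByStructure(sampleDoc, 4);
```

With a 4-word limit, the "Install" section fits in one chunk and "Configure" is split at its paragraph break, with both halves keeping the heading as context.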

2. Skipping the reranker step

Vector search is great at finding "roughly similar" stuff, but it's not a precision instrument. It relies on cosine similarity, which can be easily fooled by documents that share a similar "vibe" but don't actually contain the answer.

If you're just taking the top 5 results from your vector store and shoving them into your LLM prompt, you're leaving quality on the table.

The fix: add a reranking step.

I look at retrieval as a two-stage process. Stage one is the "fast and broad" search where you pull the top 20 to 50 candidates from your vector database. Stage two uses a cross-encoder or a specialized reranking model (like Cohere's Rerank or BGE-Reranker) to score those candidates against the query more accurately.

The reranker acts like a bouncer at a club. It doesn't care if a document looks "okay." It only lets in the ones that are actually relevant to the question.
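
The two stages can be sketched with a pluggable scorer. This is an illustration, not a real reranker: `termOverlapScorer` is a deliberately crude stand-in for a cross-encoder, and `retrieveThenRerank` and the sample data are hypothetical.

```typescript
interface RetrievedCandidate {
  id: string;
  text: string;
  vectorScore: number; // similarity from the first-stage vector search
}

// A scorer sees the query and the passage together, like a cross-encoder does.
type RelevanceScorer = (query: string, passage: string) => number;

// Stand-in scorer: fraction of query terms found in the passage.
// A real system would call a reranking model here instead.
const termOverlapScorer: RelevanceScorer = (query, passage) => {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t !== "");
  if (terms.length === 0) return 0;
  const haystack = passage.toLowerCase();
  const found = terms.filter((t) => haystack.indexOf(t) !== -1).length;
  return found / terms.length;
};

function retrieveThenRerank(
  query: string,
  candidates: RetrievedCandidate[],
  scorer: RelevanceScorer,
  broadK: number,
  finalK: number
): RetrievedCandidate[] {
  // Stage 1: fast and broad, keep the best broadK by vector similarity.
  const broad = candidates
    .slice()
    .sort((a, b) => b.vectorScore - a.vectorScore)
    .slice(0, broadK);

  // Stage 2: slower and more precise, rescore each survivor against the query.
  return broad
    .map((c) => ({ candidate: c, relevance: scorer(query, c.text) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, finalK)
    .map((scored) => scored.candidate);
}

const candidatePool: RetrievedCandidate[] = [
  { id: "general-errors", text: "General error handling and retry guidance", vectorScore: 0.91 },
  { id: "xf-904", text: "Error code XF-904 means the upstream token expired", vectorScore: 0.84 },
  { id: "billing", text: "How invoices are generated each month", vectorScore: 0.4 },
];

const reranked = retrieveThenRerank("error code XF-904", candidatePool, termOverlapScorer, 2, 1);
```

In this sample the vector search prefers the generic document, and the second stage promotes the one that actually mentions the error code.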

3. Ignoring embedding drift and versioning

This is the silent killer. I've seen teams upgrade their embedding model from text-embedding-ada-002 to text-embedding-3-small without re-indexing their entire database.

Suddenly, the vectors being generated for new queries don't "line up" with the vectors stored in the index. The similarity scores go haywire. Even worse is when you change the preprocessing logic (like how you format the chunks) but keep the old vectors.

The fix: pin your models and version your index.

Treat your embedding model like a database schema. If you change the model, you must re-index. I always include the model name and version in the metadata of every index I build. This way, if I need to test a new model, I can run them side-by-side without breaking the production flow. My experience in cloud infrastructure has taught me that consistency is better than a "better" model that doesn't match its data.
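
One way to enforce that rule in code is to store a small manifest next to each index and check it before every query. The sketch below assumes hypothetical names (`IndexManifest`, `assertCompatible`, `versionedIndexName`); the `__` separator is arbitrary, so adapt it to whatever naming rules your vector store imposes.

```typescript
interface IndexManifest {
  indexName: string;
  embeddingModel: string; // e.g. "text-embedding-3-small"
  preprocessingVersion: string; // bump whenever chunk formatting changes
}

// Refuse to query an index that was built with a different model or preprocessing.
function assertCompatible(
  manifest: IndexManifest,
  queryModel: string,
  queryPreprocessing: string
): void {
  if (manifest.embeddingModel !== queryModel) {
    throw new Error(
      `index "${manifest.indexName}" was built with ${manifest.embeddingModel}, ` +
        `but queries use ${queryModel}; re-index first`
    );
  }
  if (manifest.preprocessingVersion !== queryPreprocessing) {
    throw new Error(
      `index "${manifest.indexName}" uses preprocessing ${manifest.preprocessingVersion}, ` +
        `but queries use ${queryPreprocessing}; re-index first`
    );
  }
}

// Encode both versions in the index name so old and new indexes can run side by side.
function versionedIndexName(base: string, model: string, preprocessing: string): string {
  return `${base}__${model}__${preprocessing}`;
}

const prodManifest: IndexManifest = {
  indexName: versionedIndexName("docs", "text-embedding-3-small", "v2"),
  embeddingModel: "text-embedding-3-small",
  preprocessingVersion: "v2",
};
```

Failing loudly at query time is the point: a mismatch between query vectors and stored vectors otherwise shows up only as quietly worse answers.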

4. The "needle in a haystack" latency problem

Everyone wants more context. We see context windows of 128k or even 1M tokens and think, "great, I'll just give the LLM everything!"

This is a trap for two reasons. First, latency — feeding 50k tokens of context into an LLM can make your response time balloon to 20 or 30 seconds. Second, models still struggle with "lost in the middle" problems: they tend to ignore information buried in the center of a massive context window.

The fix: optimize your latency budget.

I start with a "latency budget." If the user expects a response in under 2 seconds, I can't afford to send 20 chunks. I limit retrieval to the top 3–5 high-quality chunks and stream the response as soon as the first token is ready.

If you need more data, consider using a multi-step approach: use a cheaper model to summarize the retrieved chunks before passing the refined info to your main model.
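
A latency budget becomes easier to enforce once it is an explicit cap on chunks and tokens. Here is a small sketch, assuming token counts were measured at indexing time; `selectWithinBudget` and the numbers are illustrative, not a recommendation.

```typescript
interface RankedChunk {
  id: string;
  tokens: number; // token count measured when the chunk was indexed
}

// Walk chunks in relevance order and stop adding once either limit is reached.
// A chunk that would overflow the token budget is skipped, not truncated.
function selectWithinBudget(
  ranked: RankedChunk[],
  maxChunks: number,
  maxTokens: number
): RankedChunk[] {
  const selected: RankedChunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    if (selected.length >= maxChunks) break;
    if (used + chunk.tokens > maxTokens) continue; // too big, try a smaller one
    selected.push(chunk);
    used += chunk.tokens;
  }
  return selected;
}

const rankedChunks: RankedChunk[] = [
  { id: "a", tokens: 400 },
  { id: "b", tokens: 900 },
  { id: "c", tokens: 300 },
  { id: "d", tokens: 200 },
];

// At most 3 chunks and 1000 tokens of context.
const budgeted = selectWithinBudget(rankedChunks, 3, 1000);
```

Skipping an oversized chunk in favor of smaller, lower-ranked ones is a design choice. If rank order matters more than filling the budget, replace `continue` with `break`.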

5. Forgetting about hybrid search

Vector search is terrible at finding specific keywords or unique identifiers. If a user asks for "error code XF-904," a vector search might return documents about "general error handling" because the "vibe" is similar. But it might miss the one specific document that actually mentions "XF-904."

The fix: implement hybrid search.

I always combine dense vector search with traditional sparse search (like BM25). By blending these two results using something like Reciprocal Rank Fusion (RRF), you get the best of both worlds. You get the semantic understanding of vectors and the keyword precision of full-text search. This is non-negotiable for enterprise search or technical documentation.
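
Reciprocal Rank Fusion itself fits in a few lines. The sketch below assumes each retriever returns a list of unique document ids, best match first; the ids and the helper name are made up for the example, and `k = 60` is the commonly used default.

```typescript
// Each input list holds document ids, best match first.
function fuseRankings(rankings: string[][], k = 60): { id: string; score: number }[] {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const scores: Record<string, number> = Object.create(null);

  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // index is 0-based, so rank = index + 1 and the contribution is 1 / (k + rank).
      scores[id] = (scores[id] ?? 0) + 1 / (k + index + 1);
    });
  }

  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] ?? 0 }))
    .sort((a, b) => b.score - a.score);
}

const vectorRanking = ["general-errors", "xf-904", "retry-guide"];
const bm25Ranking = ["xf-904", "changelog"];

const fused = fuseRankings([vectorRanking, bm25Ranking]);
```

Here "xf-904" scores 1/62 + 1/61 because both retrievers found it, which puts it ahead of "general-errors" at 1/61 even though the vector search ranked the generic document first.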

6. Failing to filter by metadata

If your RAG system contains documents for different clients, versions, or dates, pure vector search will betray you. You might ask about "API changes in 2024" and get results from 2022 because the two sit close together in embedding space.

Relying on the LLM to "ignore" the wrong dates in the context is a waste of tokens and a recipe for hallucinations.

The fix: use hard metadata filters.

Before the vector search even happens, apply filters. If you know the user is looking for "v2" of your documentation, filter the vector query to only include chunks with version: '2'. This drastically reduces the search space and improves accuracy. I use this heavily when building Shopify apps where data must be strictly siloed by shop ID.
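
The ordering matters: filter first, score second. The in-memory sketch below shows the idea with hypothetical names and toy two-dimensional embeddings. In a real vector database you would pass the filter to the query so it is applied inside the index rather than in application code.

```typescript
interface StoredChunk {
  id: string;
  embedding: number[];
  metadata: Record<string, string>;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Hard filter first, similarity second: chunks that fail the filter
// are never scored, so they cannot reach the prompt.
function filteredSearch(
  chunks: StoredChunk[],
  queryEmbedding: number[],
  filter: Record<string, string>,
  topK: number
): StoredChunk[] {
  const keys = Object.keys(filter);
  return chunks
    .filter((chunk) => keys.every((key) => chunk.metadata[key] === filter[key]))
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}

const storedChunks: StoredChunk[] = [
  { id: "v1-auth", embedding: [1, 0], metadata: { version: "1" } },
  { id: "v2-auth", embedding: [0.9, 0.1], metadata: { version: "2" } },
  { id: "v2-billing", embedding: [0, 1], metadata: { version: "2" } },
];

const v2Results = filteredSearch(storedChunks, [1, 0], { version: "2" }, 1);
```

The v1 chunk is the closest vector to the query, but it never gets scored, so the v2 chunk is returned.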

7. Vibe-based evaluation

How do you know your RAG stack is getting better? Most devs just ask a few questions, see that the answer looks okay, and ship it.

This is called "vibe-checking," and it doesn't work. When you change a prompt or a chunk size, you might improve one answer while breaking ten others you didn't check.

The fix: build a golden evaluation set.

I use the Ragas framework or simple LLM-as-a-judge patterns to run automated evals. I maintain a "golden set" of 50–100 questions with ground-truth answers. Every time I change the architecture, I run the eval and look for three metrics:

Faithfulness — is the answer actually derived from the context?
Answer relevance — does it answer the user's question?
Context precision — are the retrieved chunks actually useful?
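
A golden set does not need a framework to get started. The sketch below is a simplified, retrieval-only harness and not the Ragas implementation: it measures hit rate and a basic precision figure against hand-labeled chunk ids. Faithfulness and answer relevance need an LLM judge and are out of scope here. All names and data are hypothetical.

```typescript
interface GoldenCase {
  question: string;
  relevantIds: string[]; // chunk ids a correct answer must draw on
}

// Any function that maps a question to retrieved chunk ids.
type Retriever = (question: string) => string[];

interface EvalReport {
  hitRate: number; // share of questions with at least one relevant chunk retrieved
  meanPrecision: number; // average share of retrieved chunks that were relevant
}

function evaluateRetriever(cases: GoldenCase[], retrieve: Retriever): EvalReport {
  let hits = 0;
  let precisionSum = 0;
  for (const goldenCase of cases) {
    const retrieved = retrieve(goldenCase.question);
    const relevant = retrieved.filter((id) => goldenCase.relevantIds.indexOf(id) !== -1);
    if (relevant.length > 0) hits += 1;
    precisionSum += retrieved.length === 0 ? 0 : relevant.length / retrieved.length;
  }
  const total = cases.length;
  return total === 0
    ? { hitRate: 0, meanPrecision: 0 }
    : { hitRate: hits / total, meanPrecision: precisionSum / total };
}

const goldenSet: GoldenCase[] = [
  { question: "How do I rotate an API key?", relevantIds: ["a"] },
  { question: "What does error XF-904 mean?", relevantIds: ["c"] },
];

// Fake retriever with canned results, standing in for the real pipeline.
const cannedResults: Record<string, string[]> = {
  "How do I rotate an API key?": ["a", "b"],
  "What does error XF-904 mean?": ["d", "e"],
};
const fakeRetriever: Retriever = (question) => cannedResults[question] ?? [];

const report = evaluateRetriever(goldenSet, fakeRetriever);
```

Run it before and after every change to chunking, prompts, or models, and compare the two reports instead of eyeballing a handful of answers.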

Practical takeaways for your stack

Start with metadata — don't let the vector database guess. If you have categories or dates, use them as hard filters.
Rerank by default — it's the single biggest quality jump you can make for the lowest effort.
Monitor retrieval, not just generation — if your retriever fails, the best LLM in the world can't save you. Log your top-k retrieval results separately.
Don't over-engineer — sometimes a simple long-context prompt is better than a complex agentic workflow. Measure before you add complexity.

Building production RAG is a game of millimeters. It's about cleaning your data, pinning your models, and actually measuring what's happening under the hood.

I've spent years moving from "it works" to "it's reliable." If you're struggling with a specific part of your AI pipeline, what's the one thing that's currently keeping you from hitting that "deploy" button?

Drop a comment or reach out if you're hitting a wall with your architecture. 🤘

MCP and the future of tool-use: building context-aware agents

Sun, 05 Apr 2026 00:00:00 GMT

The current state of AI agents is a mess of fragmented integrations. Every time I want to give an LLM access to a new data source or a specific tool, I find myself writing custom glue code that breaks the moment an API version changes. It is a frustrating cycle of building brittle wrappers. We are effectively forcing highly intelligent models to peer through a keyhole when they should have a wide-open window into our data ecosystems.

This fragmentation creates massive technical debt for developers. You spend eighty percent of your time on plumbing and maybe twenty percent on the actual intelligence of the agent. Without a unified way to share context, the model often hallucinates because it lacks the grounding of real-time data. It is stuck in a loop of "I don't have access to that" or, worse, "I'll guess what that data looks like," which leads to unreliable outputs and a poor user experience.

The Model Context Protocol (MCP) changes this dynamic entirely. It is an open standard that allows me to build context-aware agents that connect to any data source using a universal language. By standardizing how servers and clients communicate, I can focus on building sophisticated logic rather than managing endless API endpoints. It is the missing link in the agentic workflow.

Why MCP matters for developers

I have spent years building custom web applications, and one of the biggest hurdles has always been data silos. When I work on complex technical challenges, the goal is usually to make data actionable. Traditional tool-use requires the developer to define every schema and every function call manually for the model. MCP flips this script.

MCP acts as a bridge. It defines a clear boundary between the AI application (the client) and the data sources (the servers). This separation of concerns means I can swap out the underlying model without rebuilding the entire data integration layer. If I move from Claude to another model that supports MCP, the tools and resources remain the same.

It also eases the "context window" problem. Instead of stuffing a massive document into the prompt, I can expose it as an MCP resource. Only the parts that are needed get pulled into the context, when they are needed. This is significantly more efficient and cost-effective. It allows me to build agents that are aware of their environment without being overwhelmed by it.

The three pillars: tools, resources, and prompts

To understand how to build with MCP, I look at its three core primitives. These are the building blocks for any context-aware system.

Tools are model-controlled actions. When I give an agent a tool, I am giving it the ability to change the world. This could be writing a file to disk or making a POST request to a Shopify API. The model decides when to call the tool based on the user's intent.
Resources are application-controlled data. Think of these as read-only files or database entries that the agent can inspect. Resources provide the necessary grounding. If I am building a support agent, the documentation for the product would be a resource. The agent can search and read it to provide accurate answers.
Prompts are user-controlled templates. They help guide the interaction. By using MCP prompts, I can standardize how users interact with the agent across different platforms. It ensures consistency in how the model interprets tasks.

Building your first MCP server

I prefer using TypeScript for building MCP servers because of the robust SDK provided by Anthropic. However, the protocol itself is language-agnostic. Here is a simplified look at how I structure a basic server that exposes a weather tool.

import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  {
    name: "weather-server",
    version: "1.0.0",
  },
  {
    capabilities: {
      tools: {},
    },
  },
);

server.setRequestHandler(ListToolsRequestSchema, async () => {
  return {
    tools: [
      {
        name: "get_weather",
        description: "get the current weather for a location",
        inputSchema: {
          type: "object",
          properties: {
            location: { type: "string" },
          },
          required: ["location"],
        },
      },
    ],
  };
});

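server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "get_weather") {
    // arguments arrive as untyped JSON, so check the shape before using it
    const location = request.params.arguments?.location;
    if (typeof location !== "string" || location.length === 0) {
      // isError reports the failure back to the model as a tool result
      return {
        content: [{ type: "text", text: "location must be a non-empty string" }],
        isError: true,
      };
    }
    // logic to fetch weather from an api goes here
    return {
      content: [{ type: "text", text: `it is sunny in ${location}` }],
    };
  }
  throw new Error(`tool not found: ${request.params.name}`);
});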

const transport = new StdioServerTransport();
await server.connect(transport);

This snippet illustrates the simplicity of the protocol. I define the tool and how to handle the call. The MCP client handles the rest. This modular approach is exactly what I look for when managing cloud infrastructure or complex backend systems. It is clean and scalable.

Security and the MCP ecosystem

Security is a major concern when giving an AI agent access to your data. I have seen many implementations where API keys are hardcoded or permissions are too broad. MCP addresses this by using a client-server architecture where the server controls exactly what is exposed.

The server acts as a gatekeeper. I can implement fine-grained access control at the server level. For example, an MCP server connecting to a database can be restricted to only specific tables or read-only queries. This level of control is essential for enterprise-grade applications.
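
One way to build that gate is to never accept raw SQL from the model at all. In the sketch below, the tool takes structured arguments and the server builds the query itself from an allow-list. `ReadPolicy`, `buildReadQuery`, `MAX_ROWS`, and the table names are hypothetical, and this complements a read-only database role rather than replacing it.

```typescript
// Allow-list: table name -> columns the agent may read.
type ReadPolicy = Record<string, string[]>;

interface ReadRequest {
  table: string;
  columns: string[];
  limit: number;
}

const MAX_ROWS = 100;

// Builds a SELECT from structured arguments, or throws if anything is off-policy.
function buildReadQuery(request: ReadRequest, policy: ReadPolicy): string {
  // hasOwnProperty guards against inherited keys such as "constructor".
  const allowedColumns = Object.prototype.hasOwnProperty.call(policy, request.table)
    ? policy[request.table]
    : undefined;
  if (allowedColumns === undefined) {
    throw new Error(`table not exposed: ${request.table}`);
  }
  if (request.columns.length === 0) {
    throw new Error("at least one column is required");
  }
  for (const column of request.columns) {
    if (allowedColumns.indexOf(column) === -1) {
      throw new Error(`column not exposed: ${request.table}.${column}`);
    }
  }
  const limitIsValid =
    Math.floor(request.limit) === request.limit && request.limit >= 1 && request.limit <= MAX_ROWS;
  if (!limitIsValid) {
    throw new Error(`limit must be an integer between 1 and ${MAX_ROWS}`);
  }
  // Safe to interpolate: every identifier matched the allow-list exactly.
  return `SELECT ${request.columns.join(", ")} FROM ${request.table} LIMIT ${request.limit}`;
}

const supportPolicy: ReadPolicy = {
  orders: ["id", "status", "created_at"],
};

const orderQuery = buildReadQuery(
  { table: "orders", columns: ["id", "status"], limit: 10 },
  supportPolicy
);
```

Because the model can only choose from names the server already knows, there is no statement for it to smuggle a write or an extra table into.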

The ecosystem is growing rapidly. We are seeing early adoption from major players in the dev tools space. Tools like Zed, Cursor, and Claude Code are already integrating MCP to help developers write better code by giving their AI assistants better context. This trend will only accelerate as more developers realize the power of standardized context.

Practical steps for getting started

If you are a developer looking to dive into MCP, I recommend following these steps.

Explore the existing MCP servers on GitHub. There are already servers for file system access and SQLite databases. See how they are structured.
Pick a simple data source you use every day. It could be your Obsidian notes or a local directory of markdown files. Build a basic server to expose these as resources.
Use a client like Claude Desktop to test your server. See how the model interacts with your data. Adjust the tool descriptions to make them more intuitive for the AI.
Compose multiple MCP servers once you are comfortable. Imagine an agent that can read your calendar and then write a draft email based on your upcoming meetings.

MCP is more than just a new protocol. It is a fundamental shift in how we build AI applications. It moves us away from the era of "black box" agents and toward a world of transparent and context-aware assistants. I am excited to see how this technology evolves and how it will transform our development workflows.

How are you planning to use MCP in your next project? Drop me a line — happy to swap notes on real-world MCP server design.

Why your RAG implementation is failing in production (and how to fix it)

Sun, 08 Mar 2026 00:00:00 GMT

You built a RAG (Retrieval-Augmented Generation) demo. On a local machine, with a handful of PDF files, it looked convincing. The answers felt coherent. The system appeared capable.

Then you pushed it to production.

That is usually where the illusion breaks.

Users start reporting that the LLM is "hallucinating" when the real issue is retrieval. Obvious answers go missing even though they exist in the documentation. Irrelevant chunks surface because they are semantically adjacent, not actually useful.

If your RAG system feels unreliable in production, you are not dealing with a model problem first. You are dealing with a retrieval design problem. Most production RAG systems fail because they rely too heavily on vector search and confuse a strong demo with a robust system.

I've spent a lot of time building custom AI solutions at Ansezz, and one pattern keeps showing up: a demo proves possibility, but production demands precision.

The "vector noise" trap

The philosophical shift from demo RAG to production RAG is simple: in a demo, semantic resemblance often feels good enough. In production, "good enough" is where failures begin.

Embeddings are useful. They let us map text into vectors and retrieve by meaning rather than exact wording. That is powerful. But semantic similarity is not the same thing as retrieval accuracy.

The problem. Vector search is strong at finding related concepts, but weak at handling specificity.

If a user searches for "Project-X-99 deployment logs," a vector search might return documents about "Project-A deployment" or "logging best practices" because they are semantically close. It can miss the exact identifier "X-99" because that string carries little semantic weight in a high-dimensional space.

The agitation. Once retrieval drifts, the LLM inherits the drift. The model cannot reason its way out of missing or irrelevant context. You end up paying for tokens that produce confident but unhelpful answers, and users lose trust for a reason that often sits one layer below the model itself.

The solution: hybrid search (vector + BM25)

The move from demo RAG to production RAG usually starts with one realization: meaning alone is not enough. You need semantic retrieval and lexical precision working together. This is hybrid search.

What is BM25?

BM25 (Best Matching 25) is the standard lexical ranking method behind classic search systems. It does not try to infer meaning. It rewards exact terms based on how important they are within a document and across the collection.
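
For intuition, here is a compact sketch of BM25 scoring with the usual defaults (`k1 = 1.2`, `b = 0.75`). It is an illustration only: the tokenizer splits on whitespace, so punctuation attached to a word will prevent a match, and a production system would rely on a search engine's implementation instead. The function names and sample documents are made up.

```typescript
// Lowercase and split on whitespace only, so identifiers like "XF-904" stay intact.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t !== "");
}

// Score every document against the query with BM25.
// k1 controls term-frequency saturation; b controls length normalization.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const total = tokenized.length;
  if (total === 0) return [];
  const avgLength = tokenized.reduce((sum, d) => sum + d.length, 0) / total;
  const queryTerms = tokenize(query);

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const termFreq = doc.filter((t) => t === term).length;
      if (termFreq === 0) continue;
      const docFreq = tokenized.filter((d) => d.indexOf(term) !== -1).length;
      // Terms that appear in few documents get a higher inverse document frequency.
      const idf = Math.log(1 + (total - docFreq + 0.5) / (docFreq + 0.5));
      // Longer-than-average documents are penalized slightly.
      const lengthNorm = 1 - b + b * (doc.length / avgLength);
      score += (idf * (termFreq * (k1 + 1))) / (termFreq + k1 * lengthNorm);
    }
    return score;
  });
}

const bm25Docs = [
  "general error handling guide",
  "error code xf-904 means the token expired",
  "monthly billing overview",
];

const bm25Result = bm25Scores("error code XF-904", bm25Docs);
```

The second document wins because it contains the rare exact terms "code" and "xf-904", while the first only matches the more common word "error" and the third matches nothing.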

Why you need both

Vector search handles synonyms, multilingual queries (given a multilingual embedding model), and conceptual matching.
BM25 search handles exact matches, IDs, SKUs, product codes, and technical jargon.

Production systems need both because user questions are rarely pure meaning or pure keyword. They are usually a mix of the two.

Technical insight: reciprocal rank fusion (RRF)

When you run two different retrieval strategies, you also create a new design problem: how should they be combined?

A practical answer is Reciprocal Rank Fusion (RRF). It is simple, reliable, and does not require you to pretend that scores from different retrieval systems are directly comparable.

The logic breakdown:

Assign a score. For every document returned by either search method, calculate a new rank-based score.
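The formula. score = 1 / (rank + k), with rank starting at 1. The k value (often 60) dampens the advantage of the very top positions, so a single first-place hit in one list cannot outweigh a document that ranks well in both.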
Sum it up. If a document appears in both the vector and BM25 result sets, its scores are added together.
Sort. The documents with the highest combined scores are passed to the LLM.

Here's the minimal PHP version I drop into a Laravel service:

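/**
 * Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.
 *
 * @param array<int, array<int, int|string>> $resultSets Ranked ID lists, best match first.
 * @param int $k Smoothing constant; 60 is the commonly used default.
 * @return array<int|string, float> Fused scores keyed by document ID, highest first.
 */
function reciprocalRankFusion(array $resultSets, int $k = 60): array
{
    $scores = [];

    foreach ($resultSets as $results) {
        // array_values() guarantees 0-based positions even if the input keys are not sequential.
        foreach (array_values($results) as $position => $docId) {
            // Positions are 0-based, so rank = position + 1.
            $scores[$docId] = ($scores[$docId] ?? 0.0) + 1 / ($position + 1 + $k);
        }
    }

    // Sort by score, highest first, keeping the document IDs as keys.
    arsort($scores);

    return $scores;
}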

This gives you a cleaner retrieval layer. If a document is semantically relevant and lexically precise, it moves toward the top for a reason.

The "second pass": using re-rankers

Hybrid search is a strong retrieval foundation, but production RAG usually needs one more layer of judgment.

If you want more precise results, add a re-ranker.

A re-ranker such as Cohere Rerank or BGE-Reranker is a cross-encoder model that evaluates the query and the document together. That matters because relevance is relational. It is not just about what a document contains. It is about whether that document answers this question.

Step 1. Retrieve the top 50 results using hybrid search.
Step 2. Pass those 50 results through a re-ranker.
Step 3. Send only the top 5 re-ranked results to your LLM.

This reduces context stuffing and improves the quality of what reaches the model. In practice, it is one of the clearest differences between a RAG demo and a production RAG system that behaves consistently.

Your production RAG checklist

The problem

A RAG system can feel impressive in a demo and still be structurally weak in production.

The agitation

Once real users, messy documents, and ambiguous queries enter the picture, weak retrieval turns the LLM into expensive guesswork. That is when confidence and correctness start drifting apart.

The solution

To move from demo RAG to production RAG, I focus on a few non-negotiables:

Stop relying on vector-only search. Add a BM25 layer.
Implement RRF. Fuse lexical and semantic retrieval without overcomplicating score calibration.
Tune chunking deliberately. If chunks are too small, they lose context. If they are too large, they add noise. I usually find 512–1024 tokens with a 10–15% overlap works well for technical documentation.
Add a re-ranker. Refine the final candidate set before anything reaches the LLM.
Evaluate with Ragas. Measure faithfulness and relevance instead of trusting intuition.
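
For the chunking item above, the overlap arithmetic is easy to get subtly wrong, so here is a small sketch of fixed-size windows with a proportional overlap. `windowChunks` is a hypothetical helper that works on any pre-tokenized array; it assumes tokenization happens elsewhere.

```typescript
// Split a token array into fixed-size windows that share a fraction of their tokens.
function windowChunks<T>(tokens: T[], size: number, overlapRatio: number): T[][] {
  if (Math.floor(size) !== size || size < 1) {
    throw new Error("size must be a positive integer");
  }
  if (!(overlapRatio >= 0 && overlapRatio < 1)) {
    throw new Error("overlapRatio must be at least 0 and below 1");
  }
  const overlap = Math.floor(size * overlapRatio);
  const step = size - overlap; // always >= 1 because overlap < size

  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}

// 10 stand-in tokens, windows of 4 with 25% overlap (1 shared token).
const demoTokens = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
const demoWindows = windowChunks(demoTokens, 4, 0.25);
```

With 512-token windows and a ratio of 0.125, the same function yields 64 shared tokens between neighbors, which sits inside the 10–15% range mentioned above.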

Building AI is easy. Building reliable AI is hard. It requires a deeper understanding of retrieval, ranking, and context design, not just the ability to connect an API.

If you are looking to build a high-performance SaaS or need help modernizing your digital presence with AI that actually works, check out what I do at Ansezz. I specialize in solving these exact types of technical problems.

Where does your own system still behave like a demo when it should be behaving like production? Get in touch — I read every war story.

Picking the right RAG stack: vector databases for AI engineering

Sun, 14 Dec 2025 00:00:00 GMT

You built a cool chatbot. It works great on your local machine until you feed it 50,000 internal documents. Suddenly, it's hallucinating. It's slow. It's pulling data from three years ago when you specifically asked for last week's report.

Building a Retrieval-Augmented Generation (RAG) system sounds like a weekend project. But once you move past the "hello world" stage, you hit the database wall. Choosing the wrong vector store early on is a silent killer. It leads to high latency, soaring cloud costs, and a painful migration six months down the line when your data outgrows your infrastructure.

I've spent over a decade building custom web applications and scaling cloud infrastructure. I've seen teams get paralyzed by the sheer number of options in the AI ecosystem. You don't need a perfect database. You need the right tool for your specific scale and team.

Let's break down the 2026 vector database landscape so you can stop scrolling and start shipping.

Why the database matters in RAG

An LLM like Claude or GPT-4 is a genius without a memory. RAG gives it that memory. Your vector database is the librarian. If the librarian is slow or loses books, the genius can't do its job.

When we talk about RAG stacks, we're looking for three things:

Latency — can it find the right "memory" in under 50ms?
Hybrid search — can it search by meaning (vectors) and exact keywords (full-text)?
Developer experience — how much time are you going to spend on DevOps?

The contenders: which one is yours?

1. pgvector — the "I already have a database" choice

If you are already running Postgres for your web applications, pgvector is usually your first stop. It's not a new database. It's an extension that adds vector support to the database you already trust.

It's perfect if you have under 10 million vectors. You get ACID compliance, easy backups, and your relational data stays right next to your embeddings. No new infra. No new security audits.

Pros

Zero new infrastructure if you use Postgres.
Perfect for joining vector data with user metadata.
Huge ecosystem support (Laravel, Django, Node.js).

Cons

Scaling to 100M+ vectors requires serious server muscle.
Hybrid search requires manual tuning with Postgres full-text search.

2. Pinecone — the "I want zero ops" choice

Pinecone is the gold standard for managed vector search. It's a serverless vector database. You don't manage clusters. You don't tune indexes. You just send vectors and get results.

In 2026, Pinecone is the go-to for teams that want to scale from zero to a billion vectors without hiring a dedicated DevOps engineer. Their serverless architecture means you only pay for what you use.

Pros

Truly managed. Pick a region and go.
World-class performance and low latency.
Great enterprise features like SOC 2 compliance.

Cons

It's a black box. You can't self-host it.
Costs can scale quickly if you have high write/read volume.

3. Weaviate & Qdrant — the hybrid powerhouses

If your RAG app needs to combine semantic search with old-school keyword search, these two are the leaders. Weaviate and Qdrant are built from the ground up for high-performance vector retrieval.

Weaviate excels at "out-of-the-box" hybrid search. Qdrant, written in Rust, is incredibly fast and efficient with memory. Both offer open-source versions and managed cloud options.

Pros

Best-in-class hybrid search (BM25 + Vector).
Flexible hosting (self-hosted Docker or managed cloud).
Highly optimized for filtering (e.g., "find documents from '2025' that talk about 'security'").

Cons

More operational overhead than Pinecone.
Requires learning a new database API.

How to choose: the engineering trade-offs

Picking a database isn't about finding the "best" one. It's about matching the tool to your engineering constraints.

Factor 1: the "billions" problem

Most startups don't have a billion vectors. They have a few thousand PDFs. If you're in the sub-1M range, pgvector is almost always the right answer. It's simple and it works.

If you are building something like a global legal search engine or a massive e-commerce recommendation system, you need the distributed architecture of Pinecone or Milvus (an open-source vector database designed to run as a cluster). Don't build a massive distributed system if you don't have a massive amount of data.
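To put rough numbers on that, here is a back-of-the-envelope sketch of raw embedding storage. The assumptions are mine: 1536-dimensional float32 embeddings, and 50,000 chunks as a stand-in for "a few thousand PDFs". It ignores index and row overhead, so treat the output as a floor, not a capacity plan.

```php
<?php
declare(strict_types=1);

/** Raw bytes needed for float32 embeddings, ignoring index and row overhead. */
function rawVectorBytes(int $vectorCount, int $dimensions, int $bytesPerFloat = 4): int
{
    return $vectorCount * $dimensions * $bytesPerFloat;
}

/** Convert bytes to GiB, rounded to two decimals. */
function toGiB(int $bytes): float
{
    return round($bytes / (1024 ** 3), 2);
}

// Hypothetical workloads, chosen only for illustration.
$small = toGiB(rawVectorBytes(50_000, 1536));        // a few thousand PDFs, chunked
$large = toGiB(rawVectorBytes(1_000_000_000, 1536)); // the "billions" case

printf("small: %.2f GiB, large: %.2f GiB\n", $small, $large);
```

Under those assumptions the small case is roughly 0.3 GiB and the large one is well over 5,000 GiB. The first fits comfortably in the Postgres instance you already run. The second does not fit on one machine's RAM, which is where distributed stores earn their keep.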

Factor 2: hybrid search is non-negotiable

Pure vector search is actually pretty bad at finding specific technical terms. If you search for "PHP 8.4 features," a pure vector search might give you general "PHP" articles. A hybrid search combines the "vibe" of the vector with the "exactness" of a keyword search.

If search quality is your #1 metric, look at Weaviate or Qdrant. They handle the blending of these two search types natively.
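If you want a feel for what "blending" means, one common technique is Reciprocal Rank Fusion (RRF). The sketch below is a generic illustration in plain PHP, not a description of how Weaviate or Qdrant implement hybrid search internally. The document IDs are made up, and k = 60 is simply a commonly used default.

```php
<?php
declare(strict_types=1);

/**
 * Reciprocal Rank Fusion: merge several ranked ID lists into one ranking.
 *
 * @param list<list<string>> $rankings each inner list is ordered best-first
 * @return array<string, float> fused scores keyed by ID, highest first
 */
function reciprocalRankFusion(array $rankings, int $k = 60): array
{
    $scores = [];
    foreach ($rankings as $ranking) {
        foreach (array_values($ranking) as $position => $id) {
            // $position is 0-based, so the 1-based rank is $position + 1.
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $position + 1);
        }
    }
    arsort($scores); // sort by score, descending, keeping the ID keys
    return $scores;
}

// Hypothetical results for the query "PHP 8.4 features".
$vectorHits  = ['doc-php-general', 'doc-php-84', 'doc-laravel'];
$keywordHits = ['doc-php-84', 'doc-php-changelog'];

$fused = reciprocalRankFusion([$vectorHits, $keywordHits]);
print_r(array_keys($fused));
```

The document that shows up in both lists rises to the top even though the vector search ranked a general PHP article first. That is the whole point of hybrid search: agreement between the two signals beats a strong showing in just one.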

Factor 3: the "DevOps" tax

I'm a huge fan of cloud infrastructure and deployment. But I also know that every new piece of infra you add to your stack is another thing that can break at 3 AM.

If you have a small team, lean on managed services like Pinecone or Zilliz (the managed cloud version of Milvus). If you have a strong infra team and want to save on cloud margins at high scale, self-hosting Qdrant on Kubernetes or a platform like Coolify is the move.

Implementing pgvector with Laravel

Since I work a lot with custom web development using Laravel, I want to show you how easy this looks in practice. You don't need a PhD in math to run a vector query.

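// Finding the most relevant document chunks.
// Ai::embed() stands in for your embedding client (e.g. OpenAI); it should return an array of floats.
$embedding = Ai::embed($query);

// pgvector expects a literal like '[0.1,0.2,...]', so serialize the array before binding it.
$vector = '[' . implode(',', $embedding) . ']';

$results = Document::query()
    ->select('content')
    ->orderByRaw('embedding <=> ?::vector', [$vector]) // <=> is pgvector's cosine distance operator
    ->limit(5)
    ->get();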

That snippet is essentially the core of a RAG system. You find the content, send it to the LLM, and get a grounded answer.
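The "send it to the LLM" step is mostly string assembly. Here is a minimal sketch in plain PHP, assuming you already have the retrieved chunks as strings. The function name buildGroundedPrompt, the prompt wording, and the sample chunks are all my own illustration, not part of Laravel or any SDK.

```php
<?php
declare(strict_types=1);

/**
 * Build a prompt that asks the model to answer only from retrieved chunks.
 *
 * @param list<string> $chunks retrieved text, most relevant first
 */
function buildGroundedPrompt(string $question, array $chunks): string
{
    $context = [];
    foreach (array_values($chunks) as $i => $chunk) {
        // Number each chunk so the model can point back to its source.
        $context[] = sprintf('[%d] %s', $i + 1, trim($chunk));
    }

    return implode("\n\n", [
        'Answer the question using only the context below.',
        'If the context does not contain the answer, say you do not know.',
        "Context:\n" . implode("\n", $context),
        'Question: ' . trim($question),
    ]);
}

// Made-up example data.
$prompt = buildGroundedPrompt(
    'What changed in the refund policy?',
    ['Refunds are now processed within 5 days.', 'Store credit never expires.']
);

echo $prompt, "\n";
```

Telling the model what to do when the context is empty or irrelevant is the cheap part of hallucination control. It is worth doing before you touch anything else.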

Three practical tips for your RAG stack

Before you commit to a database, keep these three things in mind. They will save you weeks of refactoring.

1. Index early, but not too early. Vector indexes like HNSW are fast for searching but slow for inserting data. If you are doing a massive initial data load, insert your vectors first, then create the index. It's the difference between minutes and hours.

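2. Normalize your vectors. Make sure your embedding model and your vector database agree on the distance metric. Cosine similarity ignores vector length, but inner product and Euclidean distance do not, so un-normalized vectors can rank differently if the index uses a different metric than the one the model expects. Normalizing every vector to unit length makes all three metrics produce the same ordering, which keeps your results consistent and prevents weird ranking bugs.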

3. Keep the metadata lean. It's tempting to store the entire JSON object of a document inside your vector database. Don't. Store the vector, a simple ID, and only the few fields you actually filter on. Keep the heavy data in your primary database (like Postgres). This keeps your vector index small and fast.
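For the normalization tip, this is roughly what it looks like in plain PHP. It's a sketch with helper names I made up (l2Normalize, dot). Some embedding APIs already return unit-length vectors, so check before adding this step.

```php
<?php
declare(strict_types=1);

/**
 * Scale a vector to unit length (L2 norm of 1).
 *
 * @param list<float> $vector
 * @return list<float>
 */
function l2Normalize(array $vector): array
{
    $norm = sqrt(array_sum(array_map(fn (float $x): float => $x * $x, $vector)));
    if ($norm == 0.0) {
        throw new InvalidArgumentException('Cannot normalize a zero vector.');
    }
    return array_map(fn (float $x): float => $x / $norm, $vector);
}

/** Dot product of two equal-length vectors. */
function dot(array $a, array $b): float
{
    return array_sum(array_map(fn (float $x, float $y): float => $x * $y, $a, $b));
}

$a = l2Normalize([3.0, 4.0]);
$b = l2Normalize([4.0, 3.0]);

// For unit-length vectors the dot product is the cosine similarity.
printf("cosine similarity: %.2f\n", dot($a, $b));
```

Once everything is unit length, you can use the inner product operator where your database offers one and get the same ranking as cosine distance.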

My personal rule of thumb

I've built systems for startups and established businesses. Here is how I usually guide them:

Default to pgvector. It's the path of least resistance for most web apps.
Move to Pinecone if you need high performance and don't want to manage servers.
Choose Weaviate or Qdrant if your application relies heavily on complex hybrid search and metadata filtering.

The "right" stack is the one that lets you ship your AI features today, not the one that looks the best on a benchmark chart.

Are you building a RAG system right now? What's the biggest hurdle you've hit with your data retrieval?

Drop a line or reach out. I'd love to hear your war stories.

Summary takeaways

pgvector is king for teams already on Postgres.
Pinecone is the best zero-ops solution for scaling.
Hybrid search (keyword + vector) is usually better than vector search alone.
Keep your architecture simple. Don't over-engineer for "billions" of vectors if you only have thousands.

Vibe coding: why your next project needs more than just logic

Sun, 30 Nov 2025 00:00:00 GMT

Most developers are obsessed with logic. We spend years mastering syntax, optimizing database queries, and debating the merits of different architectural patterns. We build systems that are technically perfect but somehow feel completely hollow. They work, but they don't sing.

The problem is that your users don't care about your clean code or your clever recursive functions. They care about how the software feels. They care about the "vibe."

If you keep building purely for the machine, you are going to lose. The next generation of successful products won't be the ones with the most features or the tightest algorithms. They will be the ones that master the art of vibe coding.

The logic trap

I've spent over a decade in the trenches of software development. I've built custom web applications for startups and managed complex cloud infrastructure on Google Cloud and AWS. For a long time, I thought my job was to be a logic machine. I thought that if I followed every best practice and wrote the most efficient Laravel code possible, the project would be a success.

I was wrong.

Logic is just the foundation. It is the skeleton that keeps the building from falling down. But nobody wants to live in a skeleton. People want a home with character, warmth, and a specific feeling. In software, that character comes from the vibe.

When we focus purely on logic, we end up with "boring" software. It is the kind of software that does what it says on the tin but leaves the user feeling nothing. Or worse, it feels frustrating because the developer didn't think about the emotional friction of a slow-loading button or a confusing layout.

Entering the era of vibe coding

The term "vibe coding" was coined by Andrej Karpathy in early 2025. It describes a shift in how we build things in the age of AI tools like Cursor and Claude. It is a transition from being a writer of code to being a curator of intent.

Vibe coding is about letting go of the need to micro-manage every semicolon. It is about using natural language to describe the feel and behavior you want, and then letting AI handle the heavy lifting of the implementation.

In this world, your value as an engineer isn't in how fast you can type. It's in your taste. It's in your ability to recognize when a user interface feels "off" and knowing how to steer the AI to fix it. It is about prioritizing the outcome over the output.

I've seen this shift firsthand in my own work at Ansezz. When I'm working on a Shopify store development project, the technical logic of the checkout is important. But the vibe of the checkout — the smooth transitions, the reassuring feedback, the perfect typography — is what actually drives conversions for the business.

Tools that fuel the flow

To embrace vibe coding, you need tools that don't get in your way. You need tools that allow you to stay in a state of flow where the distance between your idea and the execution is as small as possible.

Tools like Cursor have changed the game for me. Instead of spending twenty minutes setting up boilerplate for a new Vue component, I can describe the "vibe" of the component in the chat. I can say, "build me a dashboard widget that feels airy and modern, uses a bento grid layout, and gives the user a sense of calm control over their data."

The AI generates the code. I review it. If the vibe isn't right, I don't fix the code line-by-line. I talk to the model again. I give it feedback on the feeling. "This feels too cramped. Give it more white space and make the shadows softer."

This is the essence of vibe coding. It's a high-level conversation about intent.

The senior developer guardrails

Now, I know what some of you are thinking. "This sounds like a recipe for a messy, unmaintainable codebase."

You are right to be worried. If you just "vibe" your way through a project without any discipline, you will end up with a "big ball of mud." This is where the senior engineer perspective becomes more critical than ever.

Vibe coding isn't about being lazy. It's about shifting your focus. You use your senior-level expertise to build the "robust core" that allows the "vibe layer" to exist.

For me, that core is often built with Laravel and Docker. I use Laravel because it is built for "developer happiness." The framework itself has a vibe of elegance and simplicity. It provides the solid, logical foundation — the authentication, the database migrations, the API structures — that I can trust.

Once that robust core is in place, I can afford to be more exploratory with the frontend and the user experience. I can "vibe code" the top layer because I know the foundation is solid.

Why Shopify and vibe coding are a perfect match

If you work in e-commerce, vibe coding is your secret weapon. Shopify is a platform that already understands the importance of the feel. They have spent years perfecting the checkout flow and the admin experience.

When I do Shopify customization, I'm not just writing Liquid code. I'm trying to match the brand's vibe. A luxury jewelry brand needs a completely different "vibe" than a high-energy fitness store.

One should feel slow, deliberate, and expensive. The other should feel fast, punchy, and motivating. You can't achieve that through logic alone. You achieve it by obsessing over the details that the logic-only dev ignores.

How to start vibe coding today

If you want to move beyond being a logic-only developer, here are some practical steps you can take:

Prioritize your taste. Start looking at software not just as a tool, but as an experience. What apps do you love using? Why? Is it the speed? The animations? The way the buttons click? Start building a "swipe file" of great vibes.
Embrace AI as a partner, not a tool. Stop using Copilot just for autocompletion. Start using tools like Claude or Cursor to brainstorm high-level concepts. Describe the "feel" you want and see what it gives you.
Build a solid core. Don't let the vibe turn into chaos. Use frameworks like Laravel or tools like Docker to keep your infrastructure predictable and clean. The more you trust your foundation, the more you can play with the surface.
Iterate on the feeling. Instead of trying to get the code perfect the first time, get the "vibe" right first. Build a messy prototype that feels great, and then use your technical skills to refactor and harden it.
Focus on user empathy. Every time you write a piece of logic, ask yourself: "how will this make the user feel?" If the answer is "nothing," you have more work to do.

The future is felt, not just calculated

We are entering a time where "coding" as we knew it is becoming a commodity. Anyone can generate a function to sort an array. But not everyone can create an experience that moves people.

The future of software development belongs to the engineers who can bridge the gap between the machine and the human heart. It belongs to the people who understand that the best code is the code you don't even notice because you're too busy enjoying the vibe.

I've seen the results of this approach in my own projects and for the clients I work with. When you stop fighting the logic and start leaning into the flow, everything becomes easier. The work becomes more fun, and the results become more impactful.

Are you ready to stop just writing logic and start building vibes?

What is the one app you use that just "feels" right, and what can you steal from its vibe for your next project?