Project: Bioinfo Bot

Trained on ~15,000 bioinformatics papers from bioRxiv. Ask it anything...

e.g. "explain genome assembly like I'm 5" or "compare BLAST, HMMer, MMSeqs2"

Powered by Claude Sonnet, Voyage AI embeddings, and pgvector

What

Bioinfo Bot is a proof of concept that demonstrates how an AI with knowledge of bioinformatics papers from bioRxiv can speed up ideation and literature reviews.

How

The abstract of each paper was embedded (i.e. turned into a numerical vector with 1024 dimensions) using the Voyage AI voyage-4-large model. These embeddings are stored in PostgreSQL using pgvector. When you submit a message, the question is embedded with the same model and compared against the database via cosine similarity, and the top 5 most similar paper abstracts are fed to Claude Sonnet with a custom prompt to synthesize a response.
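In production the similarity search runs inside PostgreSQL via pgvector, but the ranking logic itself is simple to sketch. Below is a minimal pure-Python illustration of the retrieval step, using toy 4-dimensional vectors in place of the 1024-dimensional Voyage embeddings; the function names and example data are illustrative, not the project's actual code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, papers, k=5):
    """Return the titles of the k abstracts most similar to the query."""
    ranked = sorted(papers, key=lambda p: cosine_similarity(query_vec, p[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

# Toy "abstract embeddings" and a query vector.
papers = [
    ("Genome assembly review", [0.9, 0.1, 0.0, 0.0]),
    ("BLAST vs MMseqs2 benchmark", [0.0, 0.9, 0.1, 0.0]),
    ("CRISPR off-target effects", [0.0, 0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0, 0.0]
print(top_k(query, papers, k=2))
```

pgvector exposes the same computation as a cosine-distance operator in SQL, so in the real app this ranking happens in the database query rather than in Python.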

Why

The bot helps researchers keep up with, and analyse across, an ever-expanding literature corpus. To read this dataset alone, a researcher would need to get through 3 papers every day for nearly 14 years.
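A quick sanity check of that reading-rate figure:

```python
papers = 15_000
per_day = 3
days = papers / per_day   # 5,000 days of reading
years = days / 365        # just under 14 years
print(round(years, 1))
```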

v1 vs v2

|            | v1 (2023)                                    | v2 (2026)                              |
|------------|----------------------------------------------|----------------------------------------|
| LLM        | GPT-4 8k                                     | Claude Sonnet                          |
| Embeddings | OpenAI text-embedding-ada-002 (1536d)        | Voyage AI voyage-4-large (1024d)       |
| Vector DB  | Pinecone (managed)                           | pgvector on Cloud SQL PostgreSQL       |
| Tokeniser  | tiktoken (client-side truncation)            | Native API token limits                |
| Chat state | Global Python list (shared across all users) | Django sessions (per-user)             |
| References | None                                         | Top 5 bioRxiv papers linked per response |
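The chat-state change deserves a note: v1 kept history in a module-level Python list, so every visitor shared one conversation, while v2 keys history on each user's Django session. A minimal illustration of the per-user pattern, using plain dicts to stand in for `request.session` (names here are illustrative, not the project's actual code):

```python
def append_message(session, role, content, key="chat_history"):
    """Append a chat turn to this user's session-scoped history."""
    history = session.get(key, [])
    history.append({"role": role, "content": content})
    session[key] = history  # reassign so Django would mark the session modified
    return history

# Two independent "sessions" no longer share state, unlike a global list.
alice, bob = {}, {}
append_message(alice, "user", "explain genome assembly like I'm 5")
append_message(bob, "user", "compare BLAST, HMMer, MMSeqs2")
print(len(alice["chat_history"]), len(bob["chat_history"]))
```

In a real Django view, `request.session` behaves like the dict above but persists per-user across requests via the session cookie.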
More details

blog post here