Project: Bioinfo Bot

Trained on ~15,000 bioinformatics papers from bioRxiv. Ask it anything...

e.g. "explain genome assembly like I'm 5" or "compare BLAST, HMMer, MMSeqs2"

Powered by Claude Sonnet, Voyage AI embeddings, and pgvector

What

Bioinfo Bot is a proof of concept that demonstrates how an AI with knowledge of bioinformatics papers from bioRxiv can speed up ideation and literature reviews.

How

The abstract of each paper was embedded (i.e. turned into a numerical vector with 1024 dimensions) using the Voyage AI voyage-4-large model. These embeddings are stored in PostgreSQL using pgvector. When you submit a message, the question is embedded with the same model and compared against the database via cosine similarity, and the top 5 most similar paper abstracts are fed to Claude Sonnet with a custom prompt to synthesize a response.
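In production the similarity search runs inside PostgreSQL via pgvector, but the ranking logic itself is simple to sketch. Below is a minimal pure-Python illustration of the retrieval step, using toy 4-dimensional vectors in place of the 1024-dimensional Voyage embeddings; the function names and example data are illustrative, not the project's actual code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, papers, k=5):
    """Return the titles of the k abstracts most similar to the query."""
    ranked = sorted(papers, key=lambda p: cosine_similarity(query_vec, p[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

# Toy "abstract embeddings" and a query vector.
papers = [
    ("Genome assembly review", [0.9, 0.1, 0.0, 0.0]),
    ("BLAST vs MMseqs2 benchmark", [0.0, 0.9, 0.1, 0.0]),
    ("CRISPR off-target effects", [0.0, 0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0, 0.0]
print(top_k(query, papers, k=2))
```

pgvector exposes the same computation as a cosine-distance operator in SQL, so in the real app this ranking happens in the database query rather than in Python.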

Why

The bot helps researchers keep up with, and analyse across, an ever-expanding literature corpus. To read this dataset alone, a researcher would need to get through 3 papers every day for nearly 14 years.
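A quick sanity check of that reading-rate figure:

```python
papers = 15_000
per_day = 3
days = papers / per_day   # 5,000 days of reading
years = days / 365        # just under 14 years
print(round(years, 1))
```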

v1 vs v2

|            | v1 (2023)                                    | v2 (2026)                              |
|------------|----------------------------------------------|----------------------------------------|
| LLM        | GPT-4 8k                                     | Claude Sonnet                          |
| Embeddings | OpenAI text-embedding-ada-002 (1536d)        | Voyage AI voyage-4-large (1024d)       |
| Vector DB  | Pinecone (managed)                           | pgvector on Cloud SQL PostgreSQL       |
| Tokeniser  | tiktoken (client-side truncation)            | Native API token limits                |
| Chat state | Global Python list (shared across all users) | Django sessions (per-user)             |
| References | None                                         | Top 5 bioRxiv papers linked per response |
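The chat-state change deserves a note: v1 kept history in a module-level Python list, so every visitor shared one conversation, while v2 keys history on each user's Django session. A minimal illustration of the per-user pattern, using plain dicts to stand in for `request.session` (names here are illustrative, not the project's actual code):

```python
def append_message(session, role, content, key="chat_history"):
    """Append a chat turn to this user's session-scoped history."""
    history = session.get(key, [])
    history.append({"role": role, "content": content})
    session[key] = history  # reassign so Django would mark the session modified
    return history

# Two independent "sessions" no longer share state, unlike a global list.
alice, bob = {}, {}
append_message(alice, "user", "explain genome assembly like I'm 5")
append_message(bob, "user", "compare BLAST, HMMer, MMSeqs2")
print(len(alice["chat_history"]), len(bob["chat_history"]))
```

In a real Django view, `request.session` behaves like the dict above but persists per-user across requests via the session cookie.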
More details

blog post here