e.g. "explain genome assembly like I'm 5" or "compare BLAST, HMMer, MMSeqs2"
What
Bioinfo bot is a proof of concept that demonstrates how an AI with knowledge of bioinformatics papers from bioRxiv could speed up ideation and literature reviews.
How
The abstract of each paper was embedded (i.e. turned into a 1024-dimensional numerical vector) using the Voyage AI voyage-4-large model. These embeddings are stored in PostgreSQL using pgvector. When you submit a message, the question is embedded with the same model and compared against the stored embeddings via cosine similarity. The five most similar paper abstracts are then passed to Claude Sonnet with a custom prompt to synthesize a response.
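The retrieval step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the bot's actual code: the table name `papers`, the column name `embedding`, and the helper names are hypothetical, and the embedding call itself is left out so the example runs without API keys. The pure-Python ranking mirrors what pgvector's cosine-distance operator (`<=>`) does inside the database.

```python
import math

# Hypothetical schema: a `papers` table with an `embedding` vector column.
# `<=>` is pgvector's cosine-distance operator (smaller means more similar).
# The query embedding is passed as a string such as '[0.1, 0.2, ...]'.
TOP_K_SQL = """
SELECT title, abstract
FROM papers
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT 5;
"""


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_abstracts(query_vec, papers, k=5):
    """Return the titles of the k papers most similar to the query.

    `papers` is a list of (title, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]


# Toy 2-dimensional vectors standing in for 1024-dimensional embeddings.
toy_papers = [
    ("assembly", [1.0, 0.0]),
    ("alignment", [0.0, 1.0]),
    ("both", [1.0, 1.0]),
]
print(top_k_abstracts([1.0, 0.1], toy_papers, k=2))
```

In a deployment like the one described, the ranking would run inside PostgreSQL via the SQL above rather than in Python, so only the top five rows travel back to the application.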
Why
The bot helps researchers keep up with, and analyse across and within, an ever-expanding literature corpus. To keep up with this dataset alone, a researcher would need to read 3 papers every day for 15 years.
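As a back-of-the-envelope check, the reading-rate figure above implies a corpus size. The snippet below only multiplies the numbers given in the sentence; it assumes 365-day years and is not an official count of the dataset.

```python
PAPERS_PER_DAY = 3
YEARS = 15
DAYS_PER_YEAR = 365  # assumption: leap days ignored

# Number of papers read at that rate over the whole period.
implied_papers = PAPERS_PER_DAY * DAYS_PER_YEAR * YEARS
print(implied_papers)  # 16425
```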
v1 vs v2
| | v1 (2023) | v2 (2026) |
| --- | --- | --- |
| LLM | GPT-4 8k | Claude Sonnet |
| Embeddings | OpenAI text-embedding-ada-002 (1536d) | Voyage AI voyage-4-large (1024d) |
| Vector DB | Pinecone (managed) | pgvector on Cloud SQL PostgreSQL |
| Tokeniser | tiktoken (client-side truncation) | Native API token limits |
| Chat state | Global Python list (shared across all users) | Django sessions (per-user) |
| References | None | Top 5 bioRxiv papers linked per response |
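The "Chat state" change is the one most likely to affect behaviour, so a small sketch may help. It assumes a hypothetical helper `append_turn` and a hypothetical session key `chat_history`. Plain dicts stand in for Django's dict-like `request.session` so the example runs without Django; it is not the bot's actual view code.

```python
MAX_TURNS = 10  # hypothetical cap on stored turns per user


def append_turn(session, role, text, max_turns=MAX_TURNS):
    """Append one chat turn to a per-user, dict-like session.

    In a Django view, `session` would be `request.session`. Reassigning
    the key (rather than mutating the list in place) ensures Django
    marks the session as modified and saves it.
    """
    history = list(session.get("chat_history", []))
    history.append({"role": role, "content": text})
    session["chat_history"] = history[-max_turns:]
    return session["chat_history"]


# Two users, two separate sessions: histories no longer leak between them,
# unlike a single module-level list shared by every request.
alice_session, bob_session = {}, {}
append_turn(alice_session, "user", "explain genome assembly like I'm 5")
append_turn(bob_session, "user", "compare BLAST, HMMer, MMSeqs2")
append_turn(alice_session, "assistant", "Imagine a shredded book...")
print(len(alice_session["chat_history"]), len(bob_session["chat_history"]))
```

With a global list, the second user's question would have been appended to the first user's conversation; keying the history to the session keeps each conversation separate.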
More details
blog post here