Washing your dirty RAGs
‘Retrieval’ is the hardest part of retrieval augmented generation. Plus: we made an Enneagram app
Hello! This is our first newsletter of the year — welcome to 2024. In case you missed it, I sent out an essay earlier in the month about hallucination; expect more random essays like this from me. As usual, if someone forwarded this to you, please sign up so you get all future issues straight to your inbox!
Why retrieval augmented generation doesn’t always work as expected
Firstly, in case you are unaware of what retrieval augmented generation (RAG) is, it’s possible you’ve already done it in some form when using AI tools: it’s when you give the model some information or content from your end, and then you use the model to query that information. So an obvious use-case might be to upload a 100-page report (or a 1,000-page book… anything that doesn’t make sense to put into your prompt) and then ask questions about it (MyAskAI has a great wrapper for a simple version of this if you want to try it out).
Essentially, RAG is when your generated outputs are augmented by information retrieved from a specific source, provided by the user. The hard part of this process is not the generation part — it’s the retrieval part. In this context, ‘retrieval’ just means ‘search’, and search is notoriously difficult to do well.
Right now, the default with RAG is to semantically match the user’s query with what’s in the database of provided content, and this is pretty unreliable. Let’s say your database is filled with jokes, and you ask the model “what’s the best joke?” — the model will try and find a joke with words similar to ‘best’ or even ‘joke’ in it, and likely fail. Perhaps the best joke in the database was ‘why did the chicken cross the road?’ but none of the words in that joke are semantically similar to ‘best joke’. In this case, the model may not be able to really ‘retrieve’ anything, so it will make something up. This isn’t great when you’re trying to ask it questions that you don’t know the answer to!
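To make the joke example concrete, here’s a minimal sketch of naive semantic retrieval using the sentence-transformers library. The model name and joke texts are just assumptions for illustration — this isn’t anyone’s production setup — but it shows why ranking by embedding similarity alone has no notion of what the ‘best’ joke is:

```python
# Sketch: naive semantic retrieval over a tiny 'database' of jokes.
# Model choice and example texts are assumptions, purely for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

jokes = [
    "Why did the chicken cross the road? To get to the other side.",
    "I told my wife she was drawing her eyebrows too high. She looked surprised.",
    "Parallel lines have so much in common. It's a shame they'll never meet.",
]
query = "What's the best joke?"

# Embed the query and the documents, then rank by cosine similarity.
joke_embeddings = model.encode(jokes, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, joke_embeddings)[0]

# The ranking reflects overlap with the words 'best' and 'joke' in embedding
# space -- not which joke is actually any good.
for joke, score in sorted(zip(jokes, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {joke}")
```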
There are some language-model-based solutions emerging, like HyDE. HyDE is a methodology where, instead of getting the LLM to do a straight semantic search on your documents, it first generates an intermediary doc that may not accurately answer your query, but it “encapsulates patterns and nuances that resonate with similar documents in a reliable knowledge base”. Then, content from the generated document is used to retrieve information from your documents. This does often work (and we use it in our projects), but it’s still kind of a hack and doesn’t fix everything. For the foreseeable future, good retrieval methods will be specifically crafted for the kind of data and queries expected of a system.
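Here’s a rough sketch of the HyDE idea as I understand it: ask the LLM to write a hypothetical answer first, embed that instead of the raw query, and use it to pull in real documents. The model names and the `hyde_retrieve` helper are assumptions for illustration, not the actual implementation we use:

```python
# Sketch of the HyDE pattern: embed a generated 'hypothetical document'
# rather than the raw query. Model names and helper are illustrative assumptions.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    # 1. Generate a hypothetical passage that *could* answer the query.
    completion = client.chat.completions.create(
        model="gpt-4",  # assumed model choice
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {query}",
        }],
    )
    hypothetical_doc = completion.choices[0].message.content

    # 2. Embed the hypothetical document instead of the raw query.
    doc_embeddings = embedder.encode(documents, convert_to_tensor=True)
    hyde_embedding = embedder.encode(hypothetical_doc, convert_to_tensor=True)

    # 3. Return the real documents closest to the hypothetical one.
    scores = util.cos_sim(hyde_embedding, doc_embeddings)[0]
    ranked = sorted(zip(documents, scores.tolist()), key=lambda x: -x[1])
    return [doc for doc, _ in ranked[:top_k]]
```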
For retrieval to work well, the parameters and architecture of the system need to be carefully crafted by a human developer. It’s great that sometimes you can just throw a bunch of random document chunks into GPT-4 and it will come up with good answers, but with retrieval the LLM isn’t going to ‘do it for you’ in this way. As a practitioner, it’s important to understand where the method’s edges are, where it will fail, and how you can plan for that failure.
Panda Smith from Elicit put it very well when they said “build a search engine, not a vector database”. Just remember, there’s a company out there that does search at scale really well, and it’s not OpenAI…
We made a small web app to help with tough conversations
It’s called Sisu and I’d love for you to try it out and tell me what you think. It helps you work through a tough conversation you want to have with someone, based on your respective Enneagram types. I would say using it takes around 5-15 minutes, depending on how much depth you want to go into with it.
I made this with my friend Gina Gutierrez. She is currently between roles after leaving a company she founded called Dipsea, which produces original audio stories to help listeners discover and understand their sexuality through narrative.
Gina is much more into Enneagram than I am, but I think that solid frameworks like these are 100% the best content to use when building generative AI apps. Enneagram is so easy to work with because it provides a clear set of ‘types’ of people, and then you can get your model to extrapolate outputs from those types. I’ve been using Enneagram this way since 2020: one of the first things I ever made with GPT-3 was a custom Enneagram generator that produced a new Enneagram type based on someone’s name.
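If you want a feel for the general pattern — hand the model a well-defined ‘type’ and let it extrapolate — here’s a hypothetical sketch. This is not how Sisu is actually built; the type descriptions, prompt, and model choice are all made up for illustration:

```python
# Hypothetical sketch: use a framework's 'types' as structured context
# for generation. Not Sisu's real implementation.
from openai import OpenAI

client = OpenAI()

ENNEAGRAM_TYPES = {
    "Type 2": "The Helper: warm, generous, people-pleasing; fears being unwanted.",
    "Type 8": "The Challenger: decisive, assertive, confrontational; fears being controlled.",
}

def conversation_advice(your_type: str, their_type: str, topic: str) -> str:
    # Inject both type descriptions so the model extrapolates from the framework.
    prompt = (
        f"I am an Enneagram {your_type} ({ENNEAGRAM_TYPES[your_type]}).\n"
        f"The other person is an Enneagram {their_type} ({ENNEAGRAM_TYPES[their_type]}).\n"
        f"Suggest how I should open a difficult conversation about: {topic}"
    )
    completion = client.chat.completions.create(
        model="gpt-4",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

print(conversation_advice("Type 2", "Type 8", "splitting household chores"))
```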
One of the main reasons Gina and I made this was to learn more about how AI systems work — the best way to do this is to build things. A secondary reason for me was that as an organization, we’re really interested in how generative AI and design will impact each other. Gina is a great designer, so it made sense to team up and make something that looks nice and is easy to use, while gently experimenting with how gen AI apps can look and feel.
I would also say that making little tools like this pushes you to think about how the functionality could be used in other contexts — this tool doesn’t have to use Enneagram, or necessarily be about tough conversations; this could also be used for personal assessments in a business environment. That’s the beauty of gen AI. You can quite easily throw together a tool or app for fun, and then later down the line realize its practical applications.
Finally, I just want to point out that technically, most of what Sisu is doing is ‘hallucinating’, which is a feature, not a bug (of gen AI in general). I released a short essay on how hallucination is good a couple of weeks ago. Read that to get more of an idea of what I mean.
Gen AI seems to be ensuring that ‘personalization’ actually means… personalization
—By Georgia Iacovou
…but it’s kinda hard to tell if this is on purpose. Over the last couple of weeks I’ve been playing around with GPTs in the new OpenAI GPT marketplace. I mostly use AI chat interfaces for brainstorming and working through conceptual problems, so I thought I’d look for a decent ‘world builder’ GPT — a bot that can help you shape and define a fictional world in which to set a story.
Not to be rude, but they all suck — and this is not necessarily because they are badly made. It’s that they are either too generic and open, with no mechanisms to challenge my ideas, or they are way too specific and not what I need at all. So, I’ve now set about making my own world-builder GPT; I’m designing it to take the user through very specific steps so that they can systematically create a world with deep, consistent lore, which will serve as a setting for stories that are congruous with that world, and therefore perhaps have far fewer plot holes. It’s very easy, when you’re creating a fantasy, to break the rules of that fantasy — and once you do that, the story kind of starts to fall apart.
So anyway, I’m very excited to finish my world builder and then use it (above is a preview). This whole process has made me kind of question the need for a ‘GPT store’. If it’s so easy to put together a simple AI app that’s perfectly suited to my needs, why would I ever need to shop around for one that someone else has made? OpenAI has a wide breadth of resources, and an approach to app design that makes things ‘just work’. This means that the GPT builder is extremely powerful, and relatively straightforward to use; anyone can sit down and whip up something vaguely useful for themselves in a few minutes.
This, really, is true personalization. If you can create online experiences that are perfectly catered to your tastes, then a marketplace suddenly seems kind of unappealing. Even the boring stuff, like a travel planning assistant (which is one of the top featured GPTs right now), is something you can make on your own, to your own specifications, if you really care to. I think that OpenAI have inadvertently created a weird tension where their community marketplace provides a variety of generically ‘useful’ GPTs, while the very visitors of that marketplace are perfectly equipped to create their own GPTs which are useful to no one but themselves.