Understanding the limitations of natural language
A calculator which is 98% accurate is 150% useless
Hello, this is The Secret Handshake, a free resource for those wanting to understand generative AI better, and embed it into their work. If someone shared this issue with you, welcome! Please sign up to get this in your inbox every month.
Here’s something that’s harder to understand than generative AI right now: the corporate drama at OpenAI. Just so you know, this newsletter is not where you can read about that. There are plenty of other publications out there that are having a nice time speculating about the board’s decision — The Secret Handshake will always just be about potential applications of the technology itself.
You still can’t use natural language for quantitative analysis
Often people ask me how they might set up a quantitative backend database for their company, and then query it with natural language to get insights. Then, they kind of refuse to believe me when I tell them that right now, this isn’t really possible.
When I have these conversations, people come back to me with “okay but what about Code Interpreter?” or “What about Llamaindex!?” and it’s like: yes, these tools exist, but they are not yet capable of reliable quantitative analysis. There are organisations, like Akkio, working on this problem, but I’d say we’re about 6-12 months away from it truly becoming a thing. This is true for a few reasons — and it’s important to bear these in mind now that OpenAI have made it very easy to create virtual assistants with Code Interpreter built right in. The last thing you want is to share a custom chatbot with your users when it’s not fully accurate.
Firstly, generative AI models are stochastic. Their outputs are unpredictable. This is kind of the whole point — they are meant to replicate human language, and you can never predict exactly what a fellow human is going to say. There is pleasure and whimsy in the randomness of a model’s outputs; it’s a feature, not a bug. But if you’re doing quantitative analysis you want the outputs to be predictable and repeatable, because there will only ever be one correct answer to a query. The bar for precision is much higher in quantitative analysis: a blog post that is 90% accurate will still be somewhat useful, but if the outputs from, say, a calculator are only correct 98% of the time, it is 150% useless.
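To make that concrete, here is a rough sketch of the kind of check you end up having to run: ask a model the same quantitative question a few times and compare every answer to the exact figure computed deterministically. The CSV file, the column name, and the model choice are just stand-ins for the illustration, not anything from a real project.

```python
# Toy illustration: compare a model's answers to a ground-truth calculation.
# Assumes the openai Python client (v1+) and an OPENAI_API_KEY in the environment.
# "sales.csv" and its "revenue" column are invented for the example.
import pandas as pd
from openai import OpenAI

client = OpenAI()

df = pd.read_csv("sales.csv")
exact_answer = df["revenue"].sum()  # the one correct answer

question = (
    "Here is a CSV of sales data:\n"
    + df.to_csv(index=False)
    + "\nWhat is the total revenue? Reply with the number only."
)

answers = []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": question}],
        temperature=0,  # as deterministic as the API allows, but still not guaranteed
    )
    answers.append(response.choices[0].message.content.strip())

# Nothing guarantees that every answer matches exact_answer,
# which is why each output still has to be checked.
print(exact_answer, answers)
```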
You absolutely do not want to be creating apps or services where you have to manually check whether every output is correct — or worse, make these systems available to users who have no idea that the outputs might be wrong. That doesn’t mean you shouldn’t be trying this stuff out and planning for features that use natural language to do this kind of data analysis; there is a great need for this, so investors are pouring a lot of resources into making it a reality. I think a sensible timeline for embedding these kinds of features into your products would be 6-12 months from now. The capability should be there by then.
So the first point was the unpredictability of model outputs. The second point is the unpredictability of user inputs. With a natural language interface, users can literally structure their queries however they want — and then the model has to try and make sense of them. Obviously, there are things you can do to mitigate this, such as adapting your system to what you think will be the most frequently asked questions, and then constraining the interface so that these are the only questions that users can ask.
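If you did go down that constrained route, it might look something like the sketch below: the model is only used to pick from a fixed menu of pre-written, vetted queries, and never writes any SQL itself. The table, the query names, and the database file are invented for the illustration.

```python
# Sketch of constraining natural-language input to a fixed menu of vetted queries.
# The table, queries, and database file are hypothetical; assumes the openai
# Python client (v1+) and an OPENAI_API_KEY in the environment.
import sqlite3
from openai import OpenAI

client = OpenAI()

# Hand-written, pre-checked SQL: the only questions the system can actually answer
ALLOWED_QUERIES = {
    "total_revenue": "SELECT SUM(revenue) FROM sales;",
    "orders_last_month": "SELECT COUNT(*) FROM sales WHERE order_date >= date('now', '-1 month');",
}

def route(user_question: str) -> str:
    """Use the model only to pick a query name; it never writes SQL itself."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Which of these query names best matches the question below? "
                f"Options: {', '.join(ALLOWED_QUERIES)}. Reply with one name, or 'none'.\n\n"
                f"Question: {user_question}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

name = route("How much money have we made in total?")
if name in ALLOWED_QUERIES:
    with sqlite3.connect("company.db") as conn:
        print(conn.execute(ALLOWED_QUERIES[name]).fetchone())
else:
    print("Sorry, I can only answer a fixed set of questions right now.")
```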
But this goes against the whole purpose of systems like this. The impressive and useful thing about them is not that they can do certain calculations for you — we’ve had software like that for decades. The impressive and useful thing is that they can do these calculations with instructions given in natural language. If you’re just going to constrain what users can ask to a predetermined set of questions, you may as well stick with Excel or Tableau. A knowledge system that cannot reliably convert natural language into quantitative analysis is only going to frustrate users at this point. We’re just not there yet.
If you’re interested in this stuff, I made a video which goes a little deeper, where I explore the capabilities and limitations of Llamaindex with a simple toy database. As usual if you have questions, or want me to share the Colab notebook I use in the video, feel free to reply to this email.
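For a flavour of what that Llamaindex setup looks like, here’s a rough sketch of its natural-language-to-SQL query engine pointed at a toy SQLite database. Import paths and class names shift between LlamaIndex releases, and the database, table, and question are invented, so treat this as an outline rather than copy-paste code.

```python
# Rough sketch of querying a toy SQLite database with LlamaIndex's text-to-SQL engine.
# Import paths and class names vary between LlamaIndex versions; the database,
# table, and question here are invented. Assumes an OPENAI_API_KEY is set.
from sqlalchemy import create_engine
from llama_index import SQLDatabase
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

engine = create_engine("sqlite:///toy.db")
sql_database = SQLDatabase(engine, include_tables=["sales"])

query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["sales"])
response = query_engine.query("What was the total revenue in March?")

# The answer (and the SQL the model wrote to get it) still needs checking by hand.
print(response)
```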
Hello, welcome to our lab
As you may have already seen, OpenAI announced a bunch of new products and features earlier this month.
The playground now lets you create your own Assistants, with Code Interpreter and retrieval of your own data built right in
GPT-4 Turbo, which is faster and cheaper than GPT-4, and has a context window of 128k tokens — this means you can give it a lot of information to work with (like a whole book’s worth)
ChatGPT Plus is also multi-modal now; you can use images as inputs, and ask it to create images with DALL-E 3
At Handshake we’ve had a lot of fun getting to know these new tools. Lev very thoughtfully tested out the capabilities and limitations of GPT-4 Vision and DALL-E 3 in this video:
…and here are Georgia’s observations on the Assistants API
So, I had a play around with OpenAI’s Assistants, mostly testing out the Retrieval feature, where you can give it files containing your own data, and essentially ask it questions about those files. I feel like an obvious use-case for this would be to give it a 100-page report that you don’t have time to read, and then ask specific questions about its contents.
This is the kind of thing I see in demos all the time, so I really wanted to try a more atypical use-case. In my spare time, I like to write short speculative fiction, so I thought I’d create a character sheet of biographical information about one of my characters and give that to the assistant to refer to. Then, in the instructions, I told it that it was Ellery Christofi, a woman living in the near-future and working a boring job in customer service, and that it must only answer the user as Ellery. I also had to tell it — repeatedly — to not behave like an assistant. I didn’t want it to be helpful or polite; I just wanted it to talk to me as if it was Ellery, and slowly reveal details of her life to me.
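For anyone curious, a setup like the one described above looks roughly like this in code, using the Assistants API as it shipped this month. The file name, the instruction wording, and the test question below are stand-ins for illustration rather than the exact ones used in the experiment.

```python
# Sketch of an Assistants API setup like the one described above.
# The file name, instructions, and question are stand-ins; assumes the openai
# Python client (v1+) and an OPENAI_API_KEY in the environment.
import time
from openai import OpenAI

client = OpenAI()

# Upload the character sheet so the Retrieval tool can search it
sheet = client.files.create(file=open("ellery_character_sheet.txt", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="Ellery",
    model="gpt-4-1106-preview",
    instructions=(
        "You are Ellery Christofi, a woman living in the near future who works a boring "
        "customer service job. Always answer as Ellery, in the first person. Do not behave "
        "like an assistant: don't offer help, don't apologise, just talk."
    ),
    tools=[{"type": "retrieval"}],
    file_ids=[sheet.id],
)

# Conversations happen inside threads
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Ellery is an unusual name. Where does it come from?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then read the latest (assistant) message
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```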
Here’s what worked well:
It actually didn’t take as much effort as I thought it would to ensure that the model did not attempt to help or assist me — it pretty much responded like we were having a normal conversation.
Using a mixture of the instructions (which I guess is like the ‘prompt’) and the info in the character sheet, it managed to extrapolate quite a lot about Ellery’s general viewpoints: she dislikes the status quo, she’s clearly pretty left-wing, she was constantly making jokes about the horrific drudgery of working in late-stage capitalism. This definitely matches her character.
Here’s what did not work well:
You can’t actually tell it when to retrieve data from the document(s) you provide it. This meant that when I was asking questions about Ellery’s life, it only looked at the character sheet when it thought it necessary.
This resulted in outputs like the one above, where I asked it about the name Ellery and it initially made something up. Then I asked about her surname sounding Greek, and it finally looked in the document and gave me what I wanted (you can see the reference mark represented as [1]).
The general tone, temperament, and flavor of its banter was pretty plain, inoffensive, and kind of cringey at times. This was sort of to be expected; these models are primarily designed to be polite and helpful. I guess I would need to fine-tune the model to drastically alter its tone.
This was actually an extremely useful exercise, both for discovering what this tool can and can’t do, and for my writing. We’re tool-agnostic at Handshake, so I thought I’d try something similar with a totally different model and interface. I used pi.ai (by Inflection) to have a general chat about my story and its characters. The way you interact with this model feels a lot less like you’re giving instructions, and more like trying to work through a problem together. You can also tell it to adjust its behavior as you go, like this:
Just FYI: I spent a good two hours talking to this bot about my ideas; I highly recommend it if you’re trying to work through a problem. It’s like a collaborator who never gets tired.