Niranjan
Essay · IX · Jul 2023 · 4 min read

Behind on AI

Why we shipped a GPT-4 feature on a weekend a few months back, and the prompt I'm still not proud of.

We shipped our first AI feature a few months back. Every product I followed was launching something. Most of our competitors had said the word "AI" in a press release. We had not. I felt left behind, and I built something on a weekend so we'd have a thing to point at.

The thing I built was a button. The user records a few interactions in our recorder, clicks the button, and the API ships those interactions to GPT-4 with a prompt asking for up to 10 additional test cases extrapolated from what was just recorded. Positive and negative cases. Streaming response back to the dashboard. The mechanic was good. The recorder already captured one happy path; GPT-4 turning that into ten was a small leap on a feature people wanted anyway.
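The mechanic in code, roughly. None of this is lifted from the repo; the interaction type and the serializer below are stand-ins for whatever the recorder actually captures, just to show the shape of what got sent alongside the prompt.

```ts
// Hypothetical shape of a recorded interaction; the recorder's real
// schema isn't reproduced here.
interface RecordedInteraction {
  action: string;   // e.g. "click", "type", "navigate"
  selector: string; // the element the user touched
  value?: string;   // text typed, option chosen, etc.
}

// Turn the recording into the user message that rides along with the
// fourteen-objective prompt.
function serializeRecording(interactions: RecordedInteraction[]): string {
  return interactions
    .map((i, n) => `${n + 1}. ${i.action} ${i.selector}${i.value ? ` "${i.value}"` : ""}`)
    .join("\n");
}
```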

The prompt I shipped is in a file called prompts.ts. One constant, fourteen numbered objectives. I am looking at it as I type this.

1. Generate upto 10 possible positive and negative useful regression test cases...
2. Generate separate tests cases for each component...
...
7. Each test case should be
8. Focus on testing single components or functions...

Item seven is the whole sentence. "Each test case should be." I never finished it. The prompt also says "tests cases" instead of "test cases" in two places. The output format was a custom CSV-ish thing with a TEST_CASES section and a STEPS section, IDs joined by hyphens, because I had tried JSON first and GPT-4 at the time could not be trusted to close a brace consistently. CSVs are more forgiving to parse.
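The exact column layout isn't in the post, so the parser below is a guess at what "CSV-ish with two sections and hyphen-joined IDs" might look like in practice; the section names are real, everything else is illustrative.

```ts
// Hypothetical parser for the CSV-ish response. Assumes output like:
//
//   TEST_CASES
//   TC-1,Submit form with valid data,positive
//   TC-2,Submit form with empty email,negative
//   STEPS
//   TC-1-1,Click #email
//   TC-1-2,Type "user@example.com"
//
interface ParsedTestCase {
  id: string;
  title: string;
  kind: "positive" | "negative";
  steps: string[];
}

function parseModelOutput(raw: string): ParsedTestCase[] {
  const cases = new Map<string, ParsedTestCase>();
  let section: "TEST_CASES" | "STEPS" | null = null;

  for (const line of raw.split("\n").map((l) => l.trim()).filter(Boolean)) {
    if (line === "TEST_CASES" || line === "STEPS") {
      section = line;
      continue;
    }
    if (section === "TEST_CASES") {
      // Naive comma split; forgiving is the whole point of the format.
      const [id, title, kind] = line.split(",");
      cases.set(id, {
        id,
        title,
        kind: kind === "negative" ? "negative" : "positive",
        steps: [],
      });
    } else if (section === "STEPS") {
      // Step IDs are the case ID plus a step index, joined by hyphens: TC-1-2.
      const [stepId, ...rest] = line.split(",");
      const caseId = stepId.split("-").slice(0, 2).join("-");
      cases.get(caseId)?.steps.push(rest.join(","));
    }
  }
  return [...cases.values()];
}
```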

The whole feature was about 157 lines. Raw OpenAIApi client, streaming via openai-ext. The directory was just services/gpt/. No provider interface, no model abstraction. Rate limiting was whatever GPT-4 itself enforced.
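For a sense of what those 157 lines boil down to, here is a sketch of the streaming call. The real code went through openai-ext; this version uses the raw v3 SDK's axios stream option instead, and the Express response handle is an assumption about how the dashboard connection looked.

```ts
import { Configuration, OpenAIApi } from "openai";
import type { Response } from "express";

// One client, one streaming call, deltas forwarded to the dashboard.
const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

export async function streamTestCases(
  prompt: string,    // the constant from prompts.ts
  recording: string, // serialized interactions
  res: Response      // dashboard connection to stream into
) {
  const completion = await openai.createChatCompletion(
    {
      model: "gpt-4",
      messages: [
        { role: "system", content: prompt },
        { role: "user", content: recording },
      ],
      stream: true,
    },
    { responseType: "stream" } // axios option: hand back the SSE stream
  );

  const stream = completion.data as unknown as NodeJS.ReadableStream;
  stream.on("data", (chunk: Buffer) => {
    // Each chunk carries one or more "data: {...}" SSE lines.
    // (A real parser would buffer across chunk boundaries; this one doesn't.)
    for (const line of chunk.toString().split("\n")) {
      const payload = line.replace(/^data:\s*/, "").trim();
      if (!payload || payload === "[DONE]") continue;
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) res.write(delta);
    }
  });
  stream.on("end", () => res.end());
}
```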

It worked. It worked the way LLM features in 2023 worked, which is to say it generated useful test cases most of the time and occasionally produced something that referenced an element the user had not recorded, or hallucinated an action verb that did not exist in our supported list. We shipped it anyway. Customers liked the demo.

Three months on, I went back and paid the debt. The single prompts.ts file got split into a prompts/ directory because we were adding a second AI feature (generating validations from a plain-English description) and one constant in one file was no longer the shape of the problem. The validations feature shipped end of May.

The thing the cleanup didn't touch was the prompt itself. Item seven still trails off and the typos are still in there. By the time I went back to clean up the surrounding code, the prompt was the part that worked. I wasn't going to be the one to break it for the sake of a typo.

If I had to name what I learned, it's that LLM features are weirdly forgiving of code quality, in a way nothing else in our codebase is. A typo in a prompt was a non-event, and item seven trailing off did not stop the model from understanding what was wanted. The intuitions I had built up over a decade of writing software, where every typo and every malformed line was a potential bug, did not transfer to the inside of a prompt. Half my engineering instincts were now operating on a thing that would shrug them off.

I shipped the feature out of fear. It is the worst reason to ship anything. I'd like to say I learned to ignore the FOMO after that. I did not.

Thanks for reading. Questions, disagreements, or corrections are all welcome.