Opinionated SEO - Digital Marketing News

AI LLM Prompting Tests - My Results on Prompt Engineering

July 23, 2023 Philip Mastroianni

I discuss my experience testing prompts across different AI systems, including Google Bard, OpenAI GPT-4 / GPT-3.5, Anthropic Claude 2, Meta Llama 2, and Jasper, to generate location-specific content. Most of this is based on the last 18 months of building out prompts, now tested against models released over the last 4-6 weeks.


Google Bard

  • Released major update on July 13, 2023
  • Prompt strategy: long-form paragraphs, numbered tasks, multiple refinement iterations (see the sketch below)
  • Couldn't produce high-quality content without heavy editing
  • Had trouble following instructions and needed repeated reminders
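
For illustration, here's roughly the prompt shape that worked best with Bard, written as a minimal Python sketch. The task wording and reminder text are hypothetical stand-ins, not the exact prompts from my testing.

    # Shape of the Bard prompt: long-form context paragraphs, then numbered
    # tasks, then a reminder prompt for each refinement iteration.
    # All wording here is illustrative.
    context = (
        "You are writing overview content for a service page. The business "
        "serves <city>, and the content must combine the location details "
        "and product information below..."  # long-form paragraphs go here
    )

    tasks = """
    1. Write three paragraphs of overview content for the location above.
    2. Keep the tone helpful and informational, not promotional.
    3. Wrap each paragraph in <p> tags.
    4. Do not invent facts about the location.
    """

    first_prompt = context + tasks

    # Bard kept dropping requirements, so each follow-up iteration re-states
    # them and asks it to verify:
    reminder = "Verify that tasks 1-4 were carried out, and fix anything missed."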


OpenAI GPT-4

  • Works well with a conversational, transcribed-style prompt (see the sketch below)
  • Follows directions and produces high-quality content in a single response
  • No shot prompting needed; zero-shot works
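
A minimal sketch of that single conversational request, assuming the openai Python SDK as it existed in mid-2023 (the pre-1.0 ChatCompletion interface); the prompt text is an illustrative stand-in.

    import openai

    openai.api_key = "sk-..."  # your API key

    # One conversational, zero-shot prompt: no examples, just a request
    # written the way you'd brief a content writer.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "I need overview content for a service page. Here's the "
                "location background: ... and the product details: ... "
                "Write it the way a content writer would: wrap each "
                "paragraph in <p> tags and keep the tone helpful, not salesy."
            ),
        }],
    )
    print(response.choices[0].message.content)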


OpenAI GPT-3.5

  • Uses the revised GPT-4 prompt plus a follow-up prompt to enforce formatting (see the chained sketch below)
  • Content is production-ready after the second prompt
  • Quality is close to GPT-4 when provided with additional background data/content
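
Here's a minimal sketch of that two-prompt chain, again assuming the mid-2023 openai SDK; the follow-up wording and the {{city_name}} CMS placeholder are hypothetical examples.

    import openai

    # First prompt: the revised GPT-4 prompt plus background data.
    messages = [{"role": "user", "content": "<revised GPT-4 prompt + data>"}]
    draft = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    messages.append({"role": "assistant",
                     "content": draft.choices[0].message.content})

    # Second prompt: re-enforce the formatting rules the draft usually drops.
    messages.append({
        "role": "user",
        "content": (
            "Revise the content above: wrap every paragraph in <p> tags, "
            "leave placeholders like {{city_name}} untouched so the CMS can "
            "pull real-time data, and do not add headings."
        ),
    })
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    print(final.choices[0].message.content)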


Anthropic Claude 2

  • No API access yet; tested via the text interface
  • Required significantly revising the prompt structure
  • XML tagging of data types improves context (see the sketch below)
  • Built-in prompt diagnosis/suggestions are helpful
  • A single prompt can produce high-quality output
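
Since I'm working through the text interface, the sketch here is just the prompt itself, shown as a Python string. The tag names are illustrative; the pattern of wrapping data in XML tags and referencing those tags in the instructions follows Anthropic's prompting documentation.

    # XML-tagged prompt structure for Claude 2: data goes inside tags that
    # the instructions can then reference by name.
    claude_prompt = """
    <location_data>
    ...background facts about the city or service area...
    </location_data>

    <product_info>
    ...details about the product or service...
    </product_info>

    <example>
    ...one sample paragraph showing the desired format (a one-shot)...
    </example>

    Using only the facts in <location_data> and <product_info>, write overview
    content in the format shown in <example>, without copying its wording.
    """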


Meta Llama 2

  • Free for commercial use if you have the hardware to run it (see the sketch below)
  • Went in expecting behavior similar to GPT-3.5
  • The GPT-4 prompt worked well
  • Quality closer to GPT-3.5, but with better privacy
  • Could be refined with prompt chaining
  • Had issues following instructions precisely
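
A sketch of running the 13B chat model with Hugging Face transformers, roughly matching the setup described in the episode (a multi-GPU AWS endpoint); it assumes you've been granted access to the meta-llama weights on the Hub, and the generation settings are illustrative.

    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="meta-llama/Llama-2-13b-chat-hf",
        device_map="auto",  # spread the model across available GPUs
    )

    # The same GPT-4-style prompt, trimmed to fit the token limit.
    out = generator(
        "<GPT-4-style prompt + background data>",
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
    )
    print(out[0]["generated_text"])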


Jasper API

  • API access is useful for building AI tools (see the hypothetical sketch below)
  • Accepts long prompts (up to ~6,000 characters)
  • Appears to use GPT-4 or a variant of it
  • Zero-shot performs as well as GPT-4
  • Produces high-quality content easily
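
A hypothetical sketch of hitting that command endpoint with Python's requests library; the URL, auth header, and payload field names are all assumptions for illustration, so check Jasper's API documentation for the real request shape.

    import requests

    resp = requests.post(
        "https://api.jasper.ai/v1/command",        # endpoint path: assumption
        headers={"X-API-Key": "your-jasper-key"},  # auth header: assumption
        json={
            # The command endpoint takes a long prompt (up to ~6,000
            # characters), so context, instructions, and format requests
            # can all fit in one zero-shot prompt.
            "command": "<context + instructions + format requests>",
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json())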


Conclusion

  • GPT-4 and Jasper produce quality results most easily
  • Pleasantly surprised by Claude 2's quality and the prompt formatting it's trained on
  • Llama 2 needs refinement to reach GPT-4 level
  • Curious which prompt strategies carry across models

Full show notes: https://opinionatedseo.com/2023/07/ai-prompting/

Transcript

Phil:

Welcome to the Opinionated SEO. I wanted to talk a little bit about some of the things I've been working on over the last 18 months, but really more focused in the last few months, and especially the last few weeks. I've been spending a lot of time working with some of the new large language models and how to best utilize them, and the nuances of each when it comes to prompting. So I've been using the following: Google Bard, OpenAI GPT-4 and GPT-3.5, Anthropic Claude 2, Llama 2, and Jasper. I'll go into a little bit about each one: some overview, what I think is exciting about them, what my prompt strategy has been, and what my response quality has been.

So let's talk about what my overall task has been, and I'm going to keep it a little bit general, but the client wanted overview content for specific pages that take into account a unique location for a product or service.

Phil:

This means combining location information with product or service content to create something useful, in the form of content for the end user.

So let's start with Google Bard. Google actually had a major release update on July 13th, 2023. There's a lot more available, there are a lot more languages, and they've really enhanced a lot of the coding side. I wanted to give it a shot and see how it compared to when the beta first opened. My prompt strategy, after a lot of testing: the prompt seemed to work best when I utilized long-form paragraphs, numbered tasks, and refining techniques using multiple prompt iterations. As for response quality, I was not able to get a production-ready response that I would feel comfortable putting on a website without serious editing. By production-ready, I mean content I can copy and paste onto the website, feeling confident that end users seeing it would find it well written and helpful, according to Google. Bard could not follow instructions, and I found that it needed me to remind it of at least four different requirements that it kept missing. It would say, "You're right, I apologize for the mistake." That was very common when I asked it to verify the instructions were carried out. Bard does not do a good job of giving prompt advice either, so it was a lot of experimentation, and I found that it just didn't really take to a lot of the refinements.

The next is OpenAI GPT-4. On prompt strategy: my original prompt for this task was actually created for GPT-3.5, and I've since adapted it for 4. GPT-4 chat, especially since mid-June, works very well when I give it a conversational request, almost as if a conversation with a content writer was transcribed. I'm able to get away with no shot prompting, and it follows all of the directions I give it. From a response quality standpoint, I can get a production-ready piece of content in a single response. I can run a list of commands as a follow-up to ensure it's followed my request exactly, but oftentimes it doesn't need to make any changes.

Let's talk about OpenAI's GPT-3.5. My prompt strategy here is really my revised GPT-4 prompt, and once it finishes, I have a follow-up prompt which forces specific requests that the response typically lacks. This includes specific formatting, and not replacing some text with variables that my CMS will use to pull real-time data. So it requires two prompts chained together to get a quality response. On quality: my response can be production quality only after running my secondary prompt to clean up and reinforce specific rules and formatting. The quality of the content is very close to GPT-4. I do provide it with a large amount of background data and content that it utilizes, so it's really synthesizing that, and I feel like that's why it can get to that quality.

The next is Anthropic Claude 2. I don't have access to the API, so I'm only using the text interface. This is one of my favorite conversational large language models, and they have some good documentation on ways to present data in the prompt to help with context and giving it data to utilize. So, my prompt strategy: Claude 2 actually had me completely revise my existing prompt. Even though it can handle 100,000 tokens, I found that it required much stricter structuring of the prompt compared to OpenAI.
That's not necessarily a negative thing; it just made me tag things in a certain way. So I followed their documentation: I utilized XML tagging of data types and surrounded contextual and additional data in specific XML tags that I was able to reference in other parts of the prompt. I also found that including an example, like a one-shot, helped solidify the format, and it didn't overly utilize the phrasing. Often, GPT-4 will use such a similar sentence structure that the two pieces of content feel just too similar; it borrows too many of the words from my one-shot example. Claude 2 has the best prompt diagnosis and suggestions built into the large language model. I was able to fine-tune the prompt using the system itself, allowing me to get it to a single-prompt response. So let's talk about that quality. I was able to get a production-ready response with a single prompt. The quality of the output was as good as, or in some cases better than, GPT-4. However, I don't have API access, so it's not as easy for me to test across more content types. Typically, I'm going to run this 10, 15, 20 times and look at the nuances and differences between the outputs. Without the API, it's a lot harder for me to include all of those different variations.

The next is Llama 2. It's free. It's legal for commercial use. Did I mention it's free? Well, actually, it's free if you have the hardware to run it. My system, though it's very capable, doesn't quite meet that requirement, and I wanted to use at least the middle model. So I was able to spin up an AWS endpoint with four NVIDIA Tesla T4 GPUs, which was able to run the 13B model. I'm most excited about this model, as it would allow for completely local text generation with complete control over the model, hardware, and privacy. So let's talk about the prompt strategy. I didn't really work a lot with Llama v1, as it was really a research-only version, or there were some leaked versions or derivatives. But I did go in expecting it to work similar to GPT-3.5, and I feel like it does. I am limited in tokens due to some hardware limitations and just some general settings, but I found that my GPT-4 prompt worked pretty well with the Llama 2 LLM. So let's talk about that quality. I would put the quality closer to what I normally get from GPT-3.5 before running that second refinement prompt. With the positive privacy aspects, I feel like there's a lot of opportunity to do prompt chaining to get the response refined. It had some issues following the instructions; it really tended to exaggerate what I wanted done. For example, I wanted it to put in some HTML code, so P tags around each paragraph, but then it decided to also create headings and heading tags. That's not something that I wanted, and it added it anyway. There would be a lot of things that I would have to tell it not to do, and it tended to just keep building as opposed to sticking straight to the script. Overall, I did like the way it worded the content, though it did feel a bit too salesy and overly positive compared to the other models, using the same tone requirements. Now, I did test the chat version of the large language model and had less success with the non-chat versions. In fact, I couldn't get them to come back with anything that I felt was cohesive; they're going to require some model training that I just haven't put the time into yet.

I do think that with some minor refinements, this could be as good as GPT-4, but it's going to require a lot more work to get there. It could come at an overall lower cost, though, as a fully private, locally run large language model.

The last one is Jasper's API. Having access to this has been really great, as I've been building out AI tools that need multiple models, and I really like their command endpoint through their API. It allows for up to 6,000 characters in my prompt, and it really lets me push the size. Talking about strategy: the prompt's focus has been nearly identical to that of GPT-4. I actually suspect it's using GPT-4, or some kind of variation of it, as part of the API. Sections of the content with context and instructions, along with format requests, really make up this prompt, and I found that zero-shot worked just as well as GPT-4. Talking about response quality: I was able to get production-quality responses without needing to make any adjustments.

So far, it looks like Jasper and GPT-4 are fairly easy to get quality results from. I was pleasantly surprised by Anthropic Claude 2, and I like the formatting they've trained their model on. I'm hoping to get access to their API so I can really put it to the test. Llama 2 wasn't bad, but it couldn't quite get me production-quality content, so I'll have to look into training the model to align closer to what I'm looking to get as a response. I'm curious how many of you have been creating a prompt library aligned with different LLMs, and whether you've found a prompt style that works across all of them. I will say this was 100% written by hand; no AI wrote any portion of this content. But does that make this a better article? I'd love to hear your thoughts on that too. Thanks for taking the time to listen. This is Phil, the Opinionated SEO, and I guess the AI guy at this point, because that seems to be a lot of what we're all doing. Have a great day, and talk to you all again soon.