
W&B Tone of Voice Experiments

I've been on quite an adventure creating an LLM-powered application! Here's the lowdown on how I brought this project to life.
Created on July 21 | Last edited on August 31
Check out the finished app at this link: https://wandb-tone-of-voice.streamlit.app




What I aimed to achieve with this project

  • Deepen my understanding of how our product functions in a prompt engineering workflow.
  • Execute a complete cycle of testing, constructing, launching, and monitoring an LLM app.
  • Bring a bit more harmony to our application's tone of voice and communication. Basically, make sure our app sounds like us, consistently.
  • Give our designers a bit of a break. By generating text they can trust, we're saving them time and effort that they can use elsewhere.


Exploring the symphony of tone variations in our app

Empty states



Settings descriptions


Learning: Let's conduct a review of our app's text and make sure it's lining up nicely with key use cases like these.
💡

Creating a realistic set of user prompts

My first step was to develop a set of prompts mirroring realistic asks. These ranged from everyday questions to more complex inquiries, ensuring a comprehensive representation of possible user interactions. The objective was to create a robust foundation that would encourage our LLM to generate a wide array of plausible responses.

Initial user prompts (my dataset)

"write a short sentence for a tool tip that explains what adding a version to the model registry means",

"brainstorm 5 options for a tagline for our brand",

"write a tweet announcing our new produciton monitoring product for LLM's",

"write 2 sentences to display on an empty page of runs that explains what weights and biases runs are and how to create them",

"summarize these bullet points into a single header for a section of the website - Visualize live metrics, datasets, logs, code, and system stats in a centralized location. - Analyze collaboratively across your team to uncover key insights.- Compare side-by-side to debug easily, and build iteratively",

"write a tool tip that explain what an automation is for a model registry",

"write a short placeholder for a form field that is a description for an automation to encourage users to fill it out",

"how do i create a report?",

"write 2 sentences to highlight why a user should creat a report",

"write 1 sentence to encourage users to learn more about reports with our videos and docs",

"rewrite this Want others to see your report? Click the lock icon beside your Projects name in the navbar to make it public.",

"write me three bullet points that tout selling points for weights and biases to enterprise companies"


Question: Should I have logged these as artifacts?
Answer: Yes, I decided to log them so I can reuse them in notebooks and apps
💡
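For reference, here's a minimal sketch of how the prompt set can be logged as a dataset artifact. The project, file, and artifact names below are placeholders rather than the ones I actually used:

```python
import json
import wandb

# The user prompts listed above, written to disk so they can be versioned.
user_prompts = [
    "write a short sentence for a tool tip that explains what adding a version to the model registry means",
    "brainstorm 5 options for a tagline for our brand",
    # ... the rest of the prompt set from above
]

with open("user_prompts.json", "w") as f:
    json.dump(user_prompts, f, indent=2)

# Log the prompt set as a dataset artifact so notebooks and the app
# can all pull the same version later.
run = wandb.init(project="tone-of-voice", job_type="log-prompts")
artifact = wandb.Artifact("user-prompts", type="dataset")
artifact.add_file("user_prompts.json")
run.log_artifact(artifact)
run.finish()
```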
Question: Is this set of prompts too broad? Should I narrow my goal to a more focused set of use cases, like documentation and in-app messaging?
💡

Testing and iterating with ChatGPT

Article from OpenAI on best practices for prompt engineering: https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
💡
I then jumped into a cycle of rapid testing and iterating with ChatGPT, tweaking and refining the prompts based on the responses I received. It was a phase of trial and error, and I learned a lot about how to phrase and structure the prompts to get the desired output.
However, I faced challenges in achieving exactly the tone I wanted. It was a bit of a struggle to get ChatGPT to understand and replicate the fine balance between professional and informal tones that we were aiming for. Knowing we needed to be more specific, I then asked ChatGPT to define a tone that falls somewhere between professional and informal.


Narrowing down to key system prompts

To help guide the model, I added some context from early exploratory work. These insights from the initial attempts at defining tone and developing prompts were invaluable in guiding the model towards the desired output.
After a few iterations, I managed to narrow things down to a couple of key system prompts that consistently produced the desired tone. These prompts became the basis for all future iteration and refinement.
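To illustrate the shape of the final setup: a system prompt that pins the tone, plus a user prompt pulled from the dataset. The system prompt text, model name, and helper function below are simplified stand-ins rather than the exact ones in the app:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# Simplified stand-in for the real system prompt; the actual one folds in
# examples and notes from the early tone explorations.
SYSTEM_PROMPT = (
    "You write product copy for Weights & Biases. "
    "Use a tone that sits between professional and informal: clear, friendly, "
    "and concise, without jargon or hype."
)

def generate_copy(user_prompt: str, temperature: float = 0.7) -> str:
    """Run one user prompt through the tone-of-voice system prompt."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # placeholder; use whichever chat model you prefer
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]
```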



Experimenting with temperature

With the prompts finalized, I started experimenting with the 'temperature' setting in the model. This parameter controls the randomness of the model's output, allowing us to fine-tune the balance between creativity and consistency in the responses. In the end, a temperature of 0.7 seemed to convey the right amount of creativity, while still providing relevant output.
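A sweep like this doesn't need to be fancy: run the same prompt at a few temperature values and compare the outputs by eye. A minimal sketch, using the same stand-in system prompt and model as above:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = "You write product copy for Weights & Biases, in a tone between professional and informal."
prompt = "write 2 sentences to highlight why a user should create a report"

# Same prompt at a few temperatures; lower = more deterministic, higher = more creative.
for temperature in [0.2, 0.7, 1.0]:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    print(f"--- temperature={temperature} ---")
    print(response["choices"][0]["message"]["content"])
```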

Evaluation of initial responses

The evaluation process is complicated by our current tools, which lack functionality built specifically for reviewing prompt outputs.


Our existing tables do not support the views needed to efficiently compare multiple prompt outputs. They also lack the crucial ability to add notes or feedback, which limits how quickly I can iterate.
💡
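As a stopgap, logging each prompt, its response, and the settings that produced it into a Table at least keeps everything in one place for manual review. A rough sketch of that, with placeholder project and column names:

```python
import openai
import wandb

SYSTEM_PROMPT = "You write product copy for Weights & Biases, in a tone between professional and informal."
user_prompts = [
    "brainstorm 5 options for a tagline for our brand",
    "how do i create a report?",
    # ... the rest of the prompt set
]

run = wandb.init(project="tone-of-voice", job_type="evaluate-prompts")
results = wandb.Table(columns=["system_prompt", "user_prompt", "temperature", "response"])

for user_prompt in user_prompts:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    # One row per prompt so the outputs can be scanned side by side in the UI.
    results.add_data(
        SYSTEM_PROMPT,
        user_prompt,
        0.7,
        response["choices"][0]["message"]["content"],
    )

run.log({"prompt_responses": results})
run.finish()
```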

Building and launching a Streamlit app with the help of ChatGPT

The final step of my process was using ChatGPT to help build a Streamlit application. With its help, I integrated the carefully crafted prompts into a user-friendly app designed to make interacting with the model seamless.
Finally, after all the development, testing, and refining, I launched the application! With a friendly, professional tone and the ability to handle a diverse range of prompts, I'm proud to say the LLM-powered application is ready to interact with users in a way that's both engaging and respectful.
Check out the finished app at this link: https://wandb-tone-of-voice.streamlit.app
Prompt example: write a tweet announcing our new produciton monitoring product for LLM's
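The app itself is a thin wrapper around the pieces above. Here's a trimmed-down sketch of what such an app can look like; the labels, model name, and system prompt are illustrative rather than the production values:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment
import streamlit as st

# Same stand-in system prompt as in the earlier sketches.
SYSTEM_PROMPT = "You write product copy for Weights & Biases, in a tone between professional and informal."

st.title("W&B Tone of Voice Generator")
user_prompt = st.text_area("What would you like to write?")

if st.button("Generate") and user_prompt:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    st.write(response["choices"][0]["message"]["content"])
```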


Generating useful text with the app

Using the app to create this report

The culmination of my hard work was using my very own LLM-powered app to generate this very report. With the well-crafted prompts and the immense capabilities of the LLM, I was able to produce a comprehensive, articulate, and insightful report. It's living proof of what an LLM-powered app can do. Here you can see the user prompt I used to generate all this text:

can you expand the following bullet points into a report that aims to tell the process of how I developed prompts and built an llm powered application. please include headers for each section:

- developed a diverse set of prompts of realistic asks
- worked with chatgpt to ask it how best to describe tone of voice
- did a bunch of quick testing and iterating with chatgpt
- not quite getting where i wanted
- wound up asking chatgpt to define a tone between professional an informal
- added some of my context from early explorations
- after a few iterations narrowed it down to a couple systems prompts
- started experimenting with temperature
- built a streamlit app using chatgpt 
- launched app
- we used the app to generate text to create a report
- for next steps i want to implement production monitoring and use that data to add new user prompts to my test set.  i also want to further refine the system prompt to fine tune the tone
- i ran into key scenarios that are lacking in our product.  first it is hard to identify the paramaters that led to the best answers.  second as powerful as tables are they lack some key functionality like annotating results. third is several more minor ux issues with tables that made my work harder

I noticed that the report sometimes took longer to publish than expected, without any status updates... The interface also had a few hiccups and crashed a couple of times while saving. 🛠️
💡

Improving our email copy

We utilized the app to make our email content more concise. It effectively helped us reduce verbosity and improve clarity.
Initial text:

Text from our tone of voice generator:


Adding production monitoring to the app

Wanting to effectively monitor usage and gather prompts for dataset enrichment, I implemented a system to track submissions to my application. This was achieved by leveraging the capabilities of a Stream Table. This approach not only provided real-time insights into app usage but also facilitated a continuous influx of valuable data points for my dataset.
I'm excited to be able to show weave panels in reports!
💡
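The tracking behind this only needs a couple of lines in the app. A sketch, assuming the weave.monitoring StreamTable API and a placeholder entity/project path:

```python
from weave.monitoring import StreamTable

# Placeholder entity/project/table path; swap in your own.
submissions = StreamTable("my-entity/tone-of-voice/app_submissions")

def log_submission(user_prompt: str, response_text: str, temperature: float) -> None:
    """Record one generation from the Streamlit app for production monitoring."""
    submissions.log({
        "user_prompt": user_prompt,
        "response": response_text,
        "temperature": temperature,
    })
```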

Learnings: Identifying workflow scenario gaps

During the process, I encountered several key scenarios where our product falls short.
  • The first major challenge was the difficulty in identifying the parameters that led to the best answers. This is an essential aspect of machine learning, as it allows us to understand what works best and why.
  • Tables are a powerful tool for organizing and displaying data, but we've found that they lack some key functionality. For instance, the ability to annotate results is currently missing; such a feature would make it easier for users to understand and interpret the data presented.
  • There were also some minor UX issues with tables that made the work a bit harder. We believe that addressing these issues is crucial for improving the overall user experience, and we are working to fix them and enhance the functionality of tables in our system.

Conclusion

In conclusion, the journey of developing prompts and building an LLM-powered application has been both challenging and rewarding. I identified areas for improvement and am already working on implementing changes. I believe that by refining our system and overcoming the identified challenges, we will be able to provide a more efficient and user-friendly experience for our users.