Gathering feedback and iterating
Check out the feedback app at this link: https://wandb-tone-of-voice-prompt-feedback.streamlit.app/
Gathering user feedback on performance
The initial phase of the project involved incorporating a feedback mechanism into the application so that users could rate responses.

Adding user feedback to production monitoring
Building on my existing production monitoring Weave board, we added this user feedback. The board provided a centralized platform for collecting and reviewing feedback, proving to be a valuable addition to the feedback system.
Production monitoring weave board: https://weave.wandb.ai/?exp=get%28%0A++++%22wandb-artifact%3A%2F%2F%2Fwandb-designers%2Ftone-of-voice-wandb-writer%2FApp-production-monitoring%3Alatest%2Fobj%22%29
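Under the hood, each rating lands as a row in a stream table that the board reads from. Below is a minimal sketch of that logging, assuming the weave.monitoring StreamTable interface; the table name and columns (prompt_version, question, response, rating) are illustrative rather than the production schema:

```python
from datetime import datetime, timezone

from weave.monitoring import StreamTable

# One stream table backs the production monitoring board.
# Table path and column names are illustrative, not the exact production schema.
feedback_table = StreamTable("wandb-designers/tone-of-voice-wandb-writer/user_feedback")

def log_feedback(prompt_version: str, question: str, response: str, rating: int) -> None:
    """Append one user rating as a row to the stream table."""
    feedback_table.log({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
        "question": question,
        "response": response,
        "rating": rating,  # e.g., 1 (poor) to 5 (great)
    })

# Example: a user rates a generated answer 4/5.
log_feedback("prod-v1", "How do I log a table?", "You can use wandb.Table(...)", 4)
```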
This addition was crucial, as there had been minimal evaluation of the system prompts prior to this, a gap underscored by the feedback scores received. The feedback data revealed that the current production prompt was performing poorly.

Feedback scores with initial prompt
System prompt assessment
Enhancing system prompt evaluation
Knowing that the initial prompt was performing poorly led me to focus on getting feedback on the system prompts. To further fine-tune the evaluation of system prompt experiments, I decided to leverage crowdsourcing. Instead of limiting the assessment to a single perspective, the feedback was expanded to include a wider range of users. An application was developed that allows users to easily rate responses from the various system prompts.
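The rating app itself follows a simple Streamlit pattern: show a response without revealing which system prompt produced it, collect a score, and log it. A simplified sketch, with placeholder prompt variants, canned responses, and a stubbed logger standing in for the production code:

```python
import random

import streamlit as st

# Stand-in responses keyed by system prompt variant; in the real app these
# come from the LLM, one answer per candidate system prompt.
RESPONSES = {
    "prompt-a": "Response generated with system prompt A...",
    "prompt-b": "Response generated with system prompt B...",
}

def log_feedback(prompt_id: str, rating: int) -> None:
    # Stub: the real app would write the rating to the stream table,
    # as in the logging sketch earlier in this report.
    print(f"{prompt_id}: {rating}")

st.title("Rate the tone of voice")

# Pick one response at random and keep it fixed across reruns until it is rated,
# so the rater never sees which prompt produced it.
if "current" not in st.session_state:
    st.session_state.current = random.choice(list(RESPONSES))

prompt_id = st.session_state.current
st.write(RESPONSES[prompt_id])

rating = st.radio("How would you rate this response?", [1, 2, 3, 4, 5], horizontal=True)

if st.button("Submit rating"):
    log_feedback(prompt_id, rating)
    st.session_state.current = random.choice(list(RESPONSES))  # queue the next response
    st.success("Thanks! Your rating was recorded.")
```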

Our product team jumped in to help by submitting feedback through this rating app. The original prompt was found to be the least effective, with clear alternatives emerging as potential replacements.

Even though we had relatively few data points, the combination of this feedback and the existing user feedback was enough to justify updating the system prompt in production.
Following the replacement, the average rating score for the application has improved considerably.

Limitations of analyzing with Weave
There are many questions I would like to ask that are still difficult to answer with Weave boards. Some stem from missing functionality, such as:
- What are the scores for specific model versions?
- How many questions/responses are being submitted per day?
- What's the average feedback score per day?
Some of these questions could be answered, but only with very specific data logging. I tried to normalize my stream tables, and without join functionality it was impossible to create certain valuable charts; a local workaround is sketched below.
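As a stopgap, those questions can be answered off-board by exporting the logged rows and joining and aggregating them in pandas. A minimal sketch, assuming the feedback and response rows have already been pulled into dataframes; the column names and sample values are hypothetical:

```python
import pandas as pd

# Hypothetical exports of the two (normalized) stream tables.
feedback = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-08-28 10:00", "2023-08-28 15:30", "2023-08-29 09:10"]),
    "response_id": ["r1", "r2", "r3"],
    "rating": [2, 4, 5],
})
responses = pd.DataFrame({
    "response_id": ["r1", "r2", "r3"],
    "prompt_version": ["prod-v1", "prod-v1", "prod-v2"],
})

# The join the boards can't express yet: scores per model/prompt version.
joined = feedback.merge(responses, on="response_id")
print(joined.groupby("prompt_version")["rating"].mean())

# The time-series questions: volume per day and average feedback score per day.
daily = joined.set_index("timestamp").resample("D")["rating"]
print(daily.count())  # questions/responses submitted per day
print(daily.mean())   # average feedback score per day
```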
💡 Joining functionality is in the works; however, time-series charts are still not easy to create.
Data Limitations and Stream Table Overhaul
The feedback data turned out not to be comprehensive enough to answer specific questions about the rated prompts. Because row and column name edits are not yet possible, this limitation necessitated a complete revamp of the stream table. During this process, other issues, such as inconsistent column names, were also rectified.
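Since edits aren't possible in place, the overhaul amounted to re-logging every row into a fresh table with a corrected schema. A rough sketch of that migration, assuming the old rows can first be exported to a dataframe and reusing the assumed StreamTable interface from earlier; the column names are illustrative:

```python
import pandas as pd
from weave.monitoring import StreamTable

# Old rows exported to a dataframe (note the inconsistent column names).
old_rows = pd.DataFrame({
    "promptVersion": ["prod-v1", "prod-v2"],
    "score": [2, 5],
})

# Normalize the schema once, up front.
renamed = old_rows.rename(columns={"promptVersion": "prompt_version", "score": "rating"})

# Re-log every row to a brand-new stream table with the corrected schema.
new_table = StreamTable("wandb-designers/tone-of-voice-wandb-writer/user_feedback_v2")
for row in renamed.to_dict(orient="records"):
    new_table.log(row)
```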
Rescoping system prompt goals
Additionally, the scope of the project was refined to concentrate solely on documentation and the application, leaving out marketing.