Translating Weights & Biases' Documentation with GPT-4

In this article, we explore how to create an automated translation tool powered by LangChain and GPT-4 to help bring your website to international audiences.
Overview of the translation pipeline


Introduction

We recently started a project of translating W&B docs into other languages using large language models (LLMs). The idea was to feed a documentation page and ask ChatGPT to translate it into another language. Our initial use case was Japanese!
This naive approach worked surprisingly well, though we made some domain errors and translated some words we probably should've kept in English. We decided to move forward and tune our translation pipeline so we could release more up-to-date translations of our website.
Why is this important? Because documentation sites change constantly as new features are released, and keeping the translated pages current is very time-consuming, so they often lag behind the English version. This is why an automated translation workflow is so beneficial. Also, our documentation website has more than 250 pages!
You can find the Colab associated with this report here.

Markdown Files, Everywhere!

A Markdown file is just a plain text file with a specific syntax that adds structure. For instance, you can define headers with # and code blocks with backticks. It also supports quotes, tables, bold, italics, and other fairly standard text features. This rather minimalistic syntax, coupled with rendering engines, enables people to publish beautiful documentation websites.

What Does a Markdown (.MD) File Look Like?
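Here is a minimal, hypothetical example of the kind of page we're dealing with (not an actual W&B docs file): a front-matter block, a header, some prose, a fenced code block, and a list.

---
description: A hypothetical quickstart page
---

# Quickstart

Install the library and log your first run:

```python
import wandb

wandb.init(project="my-project")
```

- Works in scripts and notebooks
- Results show up in the W&B UI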



As you can see, the file has a relatively simple structure, and we can split it at specific points without breaking the syntax. For this, we can use LangChain's ingestion tools to break down the input files nicely. Here's the code:
from pathlib import Path
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# grab all markdown files recursively using `rglob`
docs_path = Path("docs/")
md_files = list(docs_path.rglob("*.md"))
md_files.sort()

# load one file as raw text
one_file = md_files[0]
loader = TextLoader(str(one_file))
docs = loader.load()
Let's analyze this:
  • We grab all .md files from the docs/ folder recursively and, for simplicity, sort the list of files alphabetically.
  • For debugging and illustration purposes, I grab the first file and load it with TextLoader.
Why not use MarkdownTextLoader? Because that loader breaks the Markdown syntax: I want to load the file as raw as possible, without stripping out the "special" Markdown characters that create the structure.
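To double-check that nothing was stripped out, you can peek at the raw content of the loaded document (a quick sketch using the docs list from above):
# the loader returns a list with a single Document holding the raw Markdown text
print(one_file)
print(docs[0].page_content[:300])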

Split: Breaking the File Into Chunks

We will break the files into chunks of text to feed the model. We have access to GPT-4, but sending requests with long contexts has proven challenging: they often time out (and you're billed anyway). So, for safety, we will use chunks far smaller than the context window so our requests return reliably.
This is not ideal. I expect OpenAI to scale its inference infrastructure in the near future so that one can pass chunks close to the full context window; you get a better, more consistent translation when you pass a whole documentation page instead of chunks. Once gpt-4-32k becomes available, we shouldn't need to do any chunking at all.
OK, let's continue splitting our text. We can do this using RecursiveCharacterTextSplitter:
markdown_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " "],
    chunk_size=2000,
    chunk_overlap=0)
split_docs = markdown_splitter.split_documents(docs)
This splitter will first try to split at double line breaks, then at single line breaks, and finally at spaces (not great). You might be saying, "Ah, but wait! There is a MarkdownTextSplitter built into LangChain." Well, yes and no. That splitter also strips out headers and other relevant syntax, so your output file formatting would differ from the input! What worked best for me was the vanilla RecursiveCharacterTextSplitter.
The output is a list of text chunks whose number depends on the input text and the chunk_size. For GPT-4, I have found that 2,000 characters is a reasonable tradeoff.
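A quick sanity check (a small sketch over the split_docs list from above) is to look at the chunk lengths and confirm nothing exceeds the target size:
# inspect the resulting chunk sizes
lengths = [len(d.page_content) for d in split_docs]
print(f"{len(lengths)} chunks, min {min(lengths)} / max {max(lengths)} characters")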

Translating the Chunks One by One

Now that we have our Markdown file split into chunks, we can start translating them. The naive way would be to just prompt the model with something like:
You are a translation assistant, here you have some Markdown file, please translate to {output_language}. Don't break the syntax.
This simple prompt got us about 75% of the way there (as roughly estimated by an internal metric), but we can do better:
  • We can give the model clearer instructions.
  • We can pass a dictionary of words with their respective translations, which is very useful for technical content.
  • We can instruct it on which parts to translate and how to keep the syntax intact.

Please Show Me the Prompt!

system_template = """You are a translation assistant
from {input_language} to {output_language}. Some rules to remember:
- Do not add extra blank lines.
- The results must be valid markdown
- It is important to maintain the accuracy of the contents but
we don't want the output to read like it's been translated.
So instead of translating word by word, prioritize naturalness
and ease of communication.
- In code blocks, just translate the comments and leave the code as is.

Here is the translation dictionary for domain specific words:
- Use the dictionary where you see appropriate.
<Dictionary start>
{input_language}: {output_language}
{dictionary}
<End of Dictionary>
"""
For our English -> Japanese translation, the head of the dictionary looks like this (you get the idea):
dictionary="""\
access: アクセス
accuracy plot: 精度図
address: アドレス
alias: エイリアス
analysis: 分析
artifact: artifact
Artifact: Artifact
...
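If you maintain the glossary as a Python dict, rendering it into this one-pair-per-line string is straightforward (a small sketch with a hypothetical eng_ja dict holding the full glossary):
# hypothetical Python dict holding the glossary
eng_ja = {"access": "アクセス", "accuracy plot": "精度図", "artifact": "artifact"}
# render it into the one-pair-per-line string used in the prompt
dictionary = "\n".join(f"{en}: {ja}" for en, ja in eng_ja.items())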

Configuring the LLM and the Prompt

GPT-3.5 and GPT-4 chat models support three message types: System, Human (user), and AI. We can create a template for our prompt using LangChain's built-in ChatPromptTemplate:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

human_template = ("Here is a chunk of Markdown text to translate. "
                  "Return the translated text only, without adding anything else.\n"
                  "Text: \n{text}")
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

# putting everything together
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt,
                                                human_message_prompt])
This enables us to inject the values for input_language, output_language, and dictionary as we see fit:
for chunk in split_docs:
    prompt = chat_prompt.format_prompt(
        input_language="English",
        output_language="Japanese",
        dictionary=dictionary,
        text=chunk.page_content)
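If you want to inspect exactly what the model will receive, the formatted prompt can be converted to messages (a quick sketch, printed here for the last chunk in the loop):
messages = prompt.to_messages()    # a SystemMessage followed by a HumanMessage
print(messages[0].content[:200])   # rendered system prompt, including the dictionary
print(messages[1].content[:200])   # rendered human prompt with the chunk text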

Putting Everything Together Into a Chain

A chain is an incredibly generic concept that refers to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case. The most commonly used type of chain is an LLMChain, which combines a PromptTemplate, a Model, and Guardrails to take user input, format it accordingly, pass it to the model, get a response, and then validate and fix (if necessary) the model output.
We are going to create an LLMChain to pipe our chunks of text with the right prompt for each chunk; we will also add the WandbTracer so we can inspect how our pipeline is doing!
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from wandb.integration.langchain import WandbTracer

# set up a W&B project to log our pipeline traces
WandbTracer.init({"project": "docs_translate"})

# define the chat model, in our case GPT-4 (lower temperature, fewer hallucinations)
chat = ChatOpenAI(model_name="gpt-4", temperature=0.5)

# build the chain from the model and the prompt template
chain = LLMChain(llm=chat, prompt=chat_prompt)
We have defined the blocks; now we have to iterate over the chunks and call the chain:
translated_docs = []
for i, chunk in enumerate(split_docs):
    print(f"translating chunk {i+1}/{len(split_docs)}...")
    chain_out = chain.run(
        input_language="English",
        output_language="Japanese",
        dictionary=dictionary,
        text=chunk.page_content)
    translated_docs.append(chain_out)

out_md = "\n".join(translated_docs)
We end up with the concatenation of the translated chunks, joined by a simple line break, which is also not ideal. There's room for improvement here.
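To persist the result, a minimal sketch (assuming a hypothetical docs_ja/ output folder mirroring the docs/ tree) could look like this:
# write the translated page to the same relative path under a hypothetical docs_ja/ folder
out_path = Path("docs_ja") / one_file.relative_to(docs_path)
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(out_md, encoding="utf-8")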

Traces

We can explore single chain calls to understand how our pipeline is working. This is extremely valuable when errors occur or when debugging unexpected behavior. Ours is a fairly simple, single-chain pipeline, but when you stack multiple levels, things can become very complex!
I also provided an English-to-Spanish trace for you to check; not everyone speaks Japanese!

en->ja
en->es


Results

Let's take a look at this one-page translation. The headers and lists are there, and the titles are in the right place. Another relevant feature is that the English words we wanted to keep in English are still in English; most of them are product names, so we don't want them translated anyway: for instance, Google Colab, Weights & Biases, Sweeps, Artifacts, etc.
🤣 I don't speak Japanese, but my colleague Akira told me it's a good translation!



Conclusion

Having access to tools like this with minimal code is fantastic!
There's still a lot of room for improvement on the processing side of things, but it may not be necessary as the models get faster and more reliable: we will simply pass the complete files at once!
Another trick you can use is to filter the dictionary against each chunk (you only need the entries whose words actually appear in the chunk). This saves you some tokens and makes the request faster.
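A minimal sketch of that filtering, assuming the dictionary string from above with one english: japanese pair per line:
def filter_dictionary(dictionary: str, chunk_text: str) -> str:
    """Keep only the glossary lines whose English term appears in the chunk."""
    kept = []
    for line in dictionary.splitlines():
        term = line.split(":", 1)[0].strip()
        if term and term.lower() in chunk_text.lower():
            kept.append(line)
    return "\n".join(kept)
You would then pass filter_dictionary(dictionary, chunk.page_content) instead of the full dictionary in the chain.run call.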
I am working on improving the splitting, and I was not the only one to notice that we need a syntax-preserving Markdown splitter; there is an open issue in the LangChain GitHub repo.
Once we get access to larger context windows, one may be able to pass a full translated page into the prompt as an example; that way, you could steer the model's style with a translation from a professional translator. Right now, this is too long, and the requests time out.



