
Train a Custom Text Classifier and NER Model using HuggingFace and SpaCy

Learn how to build a custom text classifier that can assist disaster relief workers by extracting important entities from distress messages in chat apps using named entity recognition.
In this article, we will explore the development process of the ReliefNer app, inspired by the remarkable work of Merve Noyan, Alara Dirik, and their team in supporting the Turkey-Syria earthquake disaster relief operations. The ReliefNer app uses a custom text classifier to filter distress messages from chat applications and named entity recognition (NER) to extract the essential entities and organize them in a structured format.
Throughout this article, we will delve into the process of training a custom text classifier and Named Entity Recognition (NER) model using HuggingFace and SpaCy. Additionally, we will utilize Gradio as our web framework and leverage the Telegram Bot API to gather distress help messages.
By the end of this article, you will have a comprehensive understanding of how to build a disaster relief assistant application empowered by named entity recognition.


Note: This article complements the Named Entity Recognition for Beginners article in which we explore various NER techniques, applications, challenges, an example NewsTrackr app, and more! The main aim of this piece is to demonstrate how one can leverage Hugging Face and SpaCy in developing NER-powered applications.

Problem Statement and Workflow

During a disaster, effective communication and coordination among disaster relief teams, volunteers, and affected individuals are crucial. Chat apps, such as messaging platforms or social media channels, play a significant role in facilitating communication and sharing information in real-time. However, extracting relevant and critical information from these chat apps can be challenging, especially when dealing with large volumes of messages.
One of the key challenges disaster relief teams face is the manual extraction of important information entities, such as names, phone numbers, and addresses, from the chat messages. This process is time-consuming, error-prone, and often hinders the prompt response and efficient allocation of resources.
To address this challenge, the afetharita team developed an OCR-NER-based disaster relief assistant application to scan images and extract important entities. Inspired by that work, the primary objective of the ReliefNer application is to automate the extraction of crucial information from chat apps and store it in a structured format.
By automating the extraction of important entities, the application reduces the burden of manual data processing, allowing teams to focus more on critical tasks and improve their overall response capabilities. The extracted information can be easily accessed, organized, and utilized to support decision-making, resource allocation, and coordination efforts during disaster response and recovery operations.
The ultimate goal of this application is to enhance the effectiveness and efficiency of disaster relief operations by empowering teams with valuable and timely information.
Let's dive right in and begin constructing our ReliefNer app! Below is the architecture for the application:
ReliefNer App Architecture, Image by Author

Data Collection

Data collection marks any ML project's initial phase, and the quality of the data gathered directly influences the model's performance.
While existing disaster-assist chat channels may seem like a valuable data source, it is essential to consider the privacy and ethical implications of collecting sensitive personal information. These channels often contain personal details such as names, phone numbers, and addresses of individuals seeking help. Obtaining consent from each user to use their data for training the model would be a complex and time-consuming process. An alternative approach was necessary for data collection to respect user privacy and avoid potential legal and ethical issues.
Since my primary focus was developing a working prototype ASAP, I used ChatGPT to generate synthetic disaster distress messages and some random texts.
The inclusion of random texts was necessary because global chat channels often contain unrelated messages, such as advertisements. We can focus on extracting important entities from the distress help messages by filtering out random messages using a text classifier.
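The article only shows how to retrieve the versioned dataset, so here is a minimal sketch of how the synthetic messages could have been logged to W&B as a dataset artifact in the first place. It assumes the messages were saved to a CSV file; the file name is illustrative.
import wandb

# Log the synthetic messages as a versioned dataset artifact.
# "balanced_data.csv" is an illustrative file name.
run = wandb.init(project="Named_Entity_Recognition")
artifact = wandb.Artifact("Balanced_data_artifact", type="dataset")
artifact.add_file("balanced_data.csv")
run.log_artifact(artifact)
run.finish()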


You can use the code below to retrieve the dataset artifact:
import wandb
run = wandb.init()
artifact = run.use_artifact('madhana/Named_Entity_Recognition/Balanced_data_artifact:v0', type='dataset')
artifact_dir = artifact.download()

Training our Text Classifier Model

In this section, we'll learn how to use HuggingFace to fine-tune the distilbert-base-uncased model for our 'Disaster Vs Random message' text classifier.
Let's first install the necessary libraries and use the notebook_login() function from the huggingface_hub library to authenticate and log in to the HuggingFace model hub. This allows us to access and work with models and datasets from the Hub directly in our notebook.
!pip install transformers datasets evaluate
from huggingface_hub import notebook_login

notebook_login()
Next, we load the pandas DataFrame containing our synthetic messages as a HuggingFace Dataset and perform train-test splits.
from datasets import Dataset, DatasetDict

hf_msges = Dataset.from_pandas(df)

# 80% train, 20% test + validation
train_testvalid = hf_msges.train_test_split(test_size=0.2)
# Split the 20% test + valid in half test, half valid
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather everything into a single DatasetDict
train_test_valid_dataset = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
Next, we define a preprocess_function that takes examples as input and uses the tokenizer to tokenize the "sentence" column of the examples with truncation enabled. This function will be applied to the train_test_valid_dataset using the map method, resulting in a tokenized dataset version.
Then, we create a DataCollatorWithPadding object, passing the tokenizer as an argument. This data collator will batch and pad the tokenized data during training.
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["sentence"], truncation=True)

tokenized_train_test_valid_dataset = train_test_valid_dataset.map(preprocess_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Next, we define two dictionaries: id2label, which maps numeric class IDs to string labels, and label2id, which maps string labels back to numeric IDs.
id2label = {0: "RANDOM", 1: "DISASTER"}
label2id = {"RANDOM": 0, "DISASTER": 1}
In the code below, we configure the training settings for our model. We use the TrainingArguments class from the Transformers library to define various training parameters such as the output directory, learning rate, batch sizes, number of epochs, and evaluation strategy, and we set push_to_hub to True so we can reuse the model later.
We then instantiate a Trainer object, passing the model, training arguments, train and evaluation datasets, tokenizer, data collator, and a custom compute_metrics function (a minimal sketch of it follows the training snippet). Finally, we call the trainer.train() method to train our model using the specified settings.
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

training_args = TrainingArguments(
    output_dir="disaster_msges_classifier_v1",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model='precision',
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_test_valid_dataset["train"],
    eval_dataset=tokenized_train_test_valid_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
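The compute_metrics function passed to the Trainer isn't defined in the snippet above. Here's a minimal sketch of what it might look like, assuming we report precision with the evaluate library to match metric_for_best_model='precision':
import numpy as np
import evaluate

# Precision metric, matching metric_for_best_model="precision" above.
precision = evaluate.load("precision")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert raw logits into predicted class IDs.
    predictions = np.argmax(logits, axis=-1)
    return precision.compute(predictions=predictions, references=labels)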
Once we have successfully trained our text classifier, we can use the pipeline function to perform inference on new input data. The pipeline function handles all the necessary preprocessing steps, model inference, and post-processing. It abstracts away the complexities of loading the model, tokenizing the input, and generating the predictions.
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="Madhana/disaster_msges_classifier_v1")

# text1 is any incoming message; this example is illustrative.
text1 = "We are trapped on the second floor and the water is rising. Please send help."
classifier(text1)



Data Annotation

With our disaster distress messages classified, we can now focus on constructing our named entity recognition model. However, an important step is to annotate our disaster distress messages before we can build it.
In the following code, we'll learn how to use SpaCy models with Argilla to log predictions from the en_core_web_trf base model. We can then use the Argilla Space on Hugging Face Spaces to review and correct these annotations.
Let's start with importing and initializing Argilla in our notebook.
import argilla as rg

# Replace api_url with your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="https://madhana-argillalearn.hf.space",
    api_key="admin.apikey"
)
Then, we use Argilla and SpaCy to perform named entity recognition on our dataset. The code below loads the pre-trained SpaCy model, iterates over the dataset, extracts entity annotations, and logs the results using Argilla.
import spacy

# Using Argilla to log our results
nlp = spacy.load("en_core_web_trf")  # /content/model-best

# Creating an empty record list to save all the records
records = []

# Iterate over the examples of our balanced messages dataset
for index, row in balanced_df.iterrows():
    # We only need the text of each instance
    text = row['sentence']

    # spaCy Doc creation
    doc = nlp(text)

    # Entity annotations
    entities = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Argilla TokenClassificationRecord list
    records.append(
        rg.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="best-model",
        )
    )

rg.log(records=records, name="disaster_msg_ner")
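Once the predictions have been reviewed and corrected in the Argilla UI, we need to get the annotations back out. Below is a minimal sketch of one way to do that, assuming the dataset is named disaster_msg_ner; the exact shape of the annotations.json file used in this project may differ.
# Pull the human-reviewed records back from Argilla.
annotated = rg.load("disaster_msg_ner")

# Convert to a pandas DataFrame and save it so it can be versioned,
# e.g. as a W&B artifact (the export format here is illustrative).
annotated_df = annotated.to_pandas()
annotated_df.to_json("annotations.json", orient="records")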
Use the code below to retrieve the annotations.json artifact:
import wandb
run = wandb.init()
artifact = run.use_artifact('madhana/Named_Entity_Recognition/Annotation_artifact:v0', type='json')
artifact_dir = artifact.download()

Training our Named Entity Recognition Model

Now that our training data is ready, let's train our NER model using SpaCy.
To fine-tune our NER model, we first need to convert the annotated json file into SpaCy's DocBin object.
import json
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

annotation_json_path = '/kaggle/input/spacy-annotation-congif/annotations.json'
with open(annotation_json_path, 'r') as f:
    data = json.load(f)

nlp = spacy.blank("en")  # load a new, blank spacy model
doc_bin = DocBin()  # create a DocBin object
The code below extracts the text and entity labels from each example, creates a SpaCy document, checks if the named entity spans are valid, and appends them to a list. The list of entity spans is then filtered to remove overlapping or conflicting spans.
Finally, the processed documents are saved to a disk as a SpaCy binary file for training purposes.
from spacy.util import filter_spans

# data_modified holds the list of annotated examples (each with a
# "text" and an "entities" field) prepared from `data`.
for training_example in tqdm(data_modified):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("training_data.spacy")  # save the DocBin object
The most effective method to train our spaCy pipelines is using the spacy train command in the command line. This command requires a single config.cfg file that contains all the necessary settings and hyperparameters.
We can also modify settings via the command line and incorporate custom functions and architectures through a Python file.
To facilitate this process, the quickstart widget generates a basic configuration file with recommended settings tailored to your specific use case.
All we need to do is select the language, components, hardware type, and model size. After generating the config file with the quickstart widget, we can download it and save it in the same folder as the annotations.json file.
Image from SpaCy Training Models Docs
We can also add the WandbLogger to the spaCy config file to track our spaCy model's training metrics and save and version the models. More on Wandb - SpaCy integration here.
[training.logger]
@loggers = "spacy.WandbLogger.v3"
project_name = "Named_Entity_Recognition"
remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
run_name = "spacy-flood-ner-v01"
Once you have saved the initial configuration as base_config.cfg, you can utilize the init fill-config command to populate the remaining default values.
!python -m spacy init fill-config /kaggle/input/spacy-annotation-congif/base_config.cfg config.cfg
Before we start our model fine-tuning process, make sure to download the base model.
The code below downloads the en_core_web_trf model.
!pip install -U 'spacy[cuda-autodetect,transformers,lookups]'
!python -m spacy download en_core_web_trf
With the binary data (training_data.spacy) and the model configuration file (config.cfg), we are ready to initiate the model training process using the following command.
!python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy --gpu-id 0
Once the training process is over, we can push the best model to the HuggingFace Hub as follows:
  1. Create a "model-best-package" directory and package the trained model using SpaCy's command-line tool.
  2. Save the packaged model as a wheel file for easy distribution and installation.
  3. Finally, push the packaged model to the HuggingFace Hub.
!mkdir /content/model-best-package
!python -m spacy package /content/model-best /content/model-best-package --build wheel
!cd /content/model-best-package/en_pipeline-0.0.0/dist
!python -m spacy huggingface-hub push /content/model-best-package/en_pipeline-0.0.0/dist/en_pipeline-0.0.0-py3-none-any.whl
We can then access our model using the code below.
!pip install https://huggingface.co/Madhana/en_pipeline/resolve/main/en_pipeline-any-py3-none-any.whl
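Once installed, the packaged pipeline can be loaded like any other spaCy model (en_pipeline is the package name we built above; the example message is illustrative):
import spacy

# Load the fine-tuned pipeline installed from the wheel above.
nlp = spacy.load("en_pipeline")

doc = nlp("Ali needs drinking water near the old bridge, call 5551234567.")
print([(ent.text, ent.label_) for ent in doc.ents])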

Inference Pipeline

With our text classifier and NER model saved in HuggingFace Hub, we can now concentrate on our ReliefNer inference pipeline.
ReliefNer Inference Pipeline, Image by Author

Telegram API Setup

To set up our Telegram bot for collecting distress messages and extracting entities, we first need to create the bot and add it as an administrator to the disaster relief channel. Make sure to obtain the Telegram Bot API token for your bot. Let's go through the steps to configure our Telegram bot in this section.

Steps to Create a Telegram Bot

  1. Download the Telegram app on your device or use the web version.
  2. Open the app and search for the "BotFather" bot.
  3. Start a chat with the BotFather bot by clicking on the "START" button.
  4. Type "/newbot" and follow the on-screen instructions to create a new bot.
  5. Choose a name and username for your bot.
  6. Once your bot is created, the BotFather will give you a unique API token.
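Before wiring the token into our app, it's worth a quick sanity check that the token works. Below is a minimal sketch using the Bot API's getMe endpoint; replace the placeholder with your own token.
import requests

BOT_TOKEN = "<YOUR_BOT_TOKEN>"  # placeholder: paste the token from BotFather

# getMe returns basic information about the bot if the token is valid.
response = requests.get(f"https://api.telegram.org/bot{BOT_TOKEN}/getMe")
print(response.json())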

Steps to add your Telegram bot to your channel as an administrator:

  1. Create a new channel or choose an existing one that you want to use the bot in.
  2. Add your bot to the channel as an administrator. To do this, go to the channel settings, click on "Administrators", and then click on "Add Administrator". Search for your bot and add it to the channel.
  3. Now you can send commands to the bot in the channel by mentioning the bot using the "@" symbol followed by the bot's username. For example, "@my_bot help" will send the "help" command to the bot.

Step 1: Get Messages from Telegram

Let's now create the get_data function that interacts with the Telegram Bot API to retrieve text data from a Telegram bot. Here's an explanation of how the code works:
  1. The code begins by initializing a variable named `offset` to `None`. This variable will keep track of the last update ID processed.
  2. The get_data function takes a parameter bot_token, which represents the API token of the Telegram bot used for authentication.
  3. Inside the function, there is a try-except block to handle potential errors that may occur during the API request.
  4. The code checks the value of offset to determine whether it has been initialized or not. If offset is None, it means no updates have been processed yet. In this case, the code sends a GET request to the Telegram API's getUpdates endpoint, passing the bot_token. The response is stored in the response variable.
  5. The JSON response is parsed using the json.loads function, and the ID of the last update is extracted from the response JSON. The last_update_id is converted to an integer, and offset is updated to last_update_id + 1 to ensure that future requests retrieve only new updates.
  6. If offset is not None, meaning updates have been processed before, the code sends another GET request to the getUpdates endpoint, but this time includes the offset as a parameter. This ensures that only new updates are retrieved. The response is stored in response, and the process is similar to the previous case.
  7. After obtaining the response JSON and the updated offset, the code extracts the text of the channel posts from the JSON and stores them in a list called text_list.
  8. Finally, the function returns text_list, which contains the extracted text data from the channel posts.
  9. If any error occurs during the API request, such as a KeyError or any other exception, the function catches the error, generates an appropriate error message, and returns it in a list format.
import json
import requests

offset = None

def get_data(bot_token):
    global offset
    try:
        if offset is None:
            response = requests.get("https://api.telegram.org/bot{}/getUpdates".format(bot_token))
            response_json = json.loads(response.text)
            last_update_id = int(response_json['result'][-1]['update_id'])
            # without 'last_update_id + 1' there will be duplicate results
            offset = last_update_id + 1
        else:
            response = requests.get('https://api.telegram.org/bot{}/getUpdates?offset={}'.format(bot_token, offset))
            response_json = json.loads(response.text)
            last_update_id = int(response_json['result'][-1]['update_id'])
            # without 'last_update_id + 1' there will be duplicate results
            offset = last_update_id + 1
        text_list = [r['channel_post']['text'] for r in response_json['result']]
        return text_list
    except KeyError:
        error_list = ['An error occurred. Possibly empty request result or your Telegram Bot Token is incorrect.']
        return error_list
    except Exception:
        error_list = ['An error occurred. Possibly empty request result or your Telegram Bot Token is incorrect.']
        return error_list


Step 2: Classify Disaster vs Random Messages

We can now create our classify_message function that utilizes the get_data function to fetch messages from a Telegram bot and classify them using our fine-tuned text classifier. Here's an explanation of how the code works:
  1. An empty list named disaster_docs is created to store messages classified as "DISASTER".
  2. The code uses the HuggingFace library's pipeline function to create the classifier function. The text classifier is loaded with our pre-trained model "disaster_msges_classifier_v1".
  3. A results list is initialized to store tuples containing the original message and its corresponding label.
  4. The code iterates over the messages obtained from the get_data function, which fetches messages using the provided bot token.
  5. If an error message is encountered during message retrieval, indicated by the comparison data == error_msg[0], the function immediately returns the error_msg list.
  6. For each message, the code applies the classifier. The classifier function returns a list of dictionaries, and the label of the first dictionary (classification[0]['label']) is extracted as the label.
  7. The message and its label are appended as a tuple to the results list.
  8. If the label is determined as "DISASTER", the message is appended to the disaster_docs list.
  9. Finally, the function returns the disaster_docs list, which contains messages classified as "DISASTER". This provides a way to filter and extract disaster-related messages from the fetched data.
  10. If any error occurs during the API request or classification process, the error_msg list is returned, indicating a possible issue with the request or bot token.
def classify_message(bot_token):
    error_msg = ['An error occurred. Possibly empty request result or your Telegram Bot Token is incorrect.']
    disaster_docs = []
    classifier = pipeline("sentiment-analysis", model="Madhana/disaster_msges_classifier_v1")
    results = []
    for data in get_data(bot_token):
        if data == error_msg[0]:
            return error_msg
        classification = classifier(data)
        label = classification[0]['label']
        results.append((data, label))
        if label == 'DISASTER':
            disaster_docs.append(data)
    return disaster_docs


Step 3: Perform Named Entity Recognition

In this section, we will explore the implementation of Hybrid Named Entity Recognition, combining Rule-based and Fine-tuned models, in our ReliefNer app. You might be wondering, why choose Hybrid NER?
Introducing a new entity using a fine-tuned model requires significant time. We must gather examples, annotate the data, and then retrain the model using the new data. However, by leveraging rule-based named entity recognition, we can effortlessly incorporate simple entities like phone numbers and addresses, which follow clear patterns and are well-suited for rule-based identification.
This approach not only streamlines the entity addition process but also proves valuable during a disaster where time is of the essence.
Let's now see how the below code works:
  1. The @spacy.Language.component decorator registers the function disaster_ner as a custom component in the spaCy language pipeline.
  2. Inside the disaster_ner function, a PhraseMatcher object is created and initialized with the vocabulary of the input doc.
  3. The Tamil_words list contains Tamil words or phrases and is tokenized using the nlp.tokenizer.pipe method and converted into a list of patterns.
  4. The PhraseMatcher object is then updated with the patterns using the matcher.add method, assigning the label "Tamil_words" to the matches.
  5. The matcher is applied to the input doc using the matcher method, and the matches are stored in the matches variable.
  6. A list comprehension is used to create spans from the matches. Each span represents an entity and is created using the Span class, specifying the doc, start, end, and label attributes. In this case, the label "YO!" is used for all the matched entities.
  7. The doc.ents attribute is updated with the spans, assigning the detected entities to the document.
  8. Finally, the modified doc is returned by the disaster_ner function.
  9. The Tamil_words list is defined with a single element, representing the name "மதனா பாலா" in Tamil. This can be used as an example for adding new entity labels using rule-based named entity recognition (NER) in a shorter time frame.
  10. The Spacy language model is loaded using spacy.load("en_pipeline")
  11. The "disaster_ner" component is added to the pipeline using nlp.add_pipe, specifying the component name as "disaster_ner" and the position before the default NER component (ner).
You may have noticed that we used Tamil text as an example for the rule-based method above. But why did we choose Tamil?
The reason is simple. Not everyone uses English as their primary communication language in a diverse world. We encounter various languages, and sometimes even hybrid languages like Thanglish, being used in different contexts. We face a challenge to address this language diversity: how can we effectively process and understand non-English text in our application?
To tackle this issue, we have two options. One approach is to employ a multilingual model like mBERT, which is trained on multiple languages and can handle a wide range of text inputs. This allows us to achieve language-agnostic processing and analysis.
Another approach is to use multiple monolingual models, which can potentially show better performance than multilingual models but at the cost of significant computational power and resources.
We highlight the need to accommodate different languages and language variations within our named entity recognition systems by incorporating Tamil text as an example. It emphasizes the importance of building inclusive and adaptable solutions that handle diverse linguistic contexts.
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

@spacy.Language.component("disaster_ner")
def disaster_ner(doc):
    matcher = PhraseMatcher(doc.vocab)
    patterns = list(nlp.tokenizer.pipe(Tamil_words))
    matcher.add("Tamil_words", patterns)
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="YO!") for match_id, start, end in matches]
    doc.ents = spans
    return doc

Tamil_words = ['மதனா பாலா']  # That's my name in Tamil; use rule-based NER to add new entity labels in a short time.

nlp = spacy.load("en_pipeline")
nlp.add_pipe("disaster_ner", name="disaster_ner", before='ner')
Let's now define a function to create the address using [STREET, NEIGHBORHOOD, CITY] columns and another function to geocode the address using geopy, returning the latitude and longitude coordinates of the location if successful.
from geopy.geocoders import Nominatim

def create_address(row):
    return f"{row['STREET']}, {row['NEIGHBORHOOD']}, {row['CITY']}"

geolocator = Nominatim(user_agent="disaster-ner-app")

def geocode_address(address):
    try:
        location = geolocator.geocode(address)
        return (location.latitude, location.longitude)
    except Exception:
        return None

Step 4: Use Gradio for User Interface

We then create the get_classifier_ner function, which calls the classify_message function to get the disaster messages, extracts the ["NAME", "STREET", "NEIGHBORHOOD", "CITY", "PHONE NUMBER", "YO!"] entities, applies the create_address function to build an Address column, and uses the geocode_address function to retrieve the coordinates for each address. Finally, it returns a DataFrame with the processed data.
"""# With Classifier"""

def get_classifier_ner(bot_token):
data = classify_message(bot_token)
entity_types = ["NAME", "STREET", "NEIGHBORHOOD", "CITY", "PHONE NUMBER","YO!"]
df = pd.DataFrame(columns=["Text"] + entity_types)

for text in data:
doc = nlp(text)
row = [text]
entities = {ent.label_: ent.text for ent in doc.ents}
for entity_type in entity_types:
row.append(entities.get(entity_type, "")

num_cols = len(df.columns)
while len(row) < num_cols:
row.append("")

df.loc[len(df)] = row
df['Address'] = df.apply(create_address, axis=1)
df['Coordinates'] = df['Address'].apply(geocode_address)

return df
Finally, we use Gradio to build the user interface for our ReliefNer app:
"""# Gradio"""

def process_ner_data(your_bot_token):
return get_ner(your_bot_token)

def process_classifier_ner_data(your_bot_token):
return get_classifier_ner(your_bot_token)

demo = gr.Blocks()

with demo:
gr.Markdown("Telegram Disaster Recovery Assistant")
with gr.Tabs():
with gr.TabItem("Structured Telegram Messages"):
with gr.Row():
your_bot_token = gr.Textbox(type='password', label="Enter your Bot Token")
ner_df = gr.Dataframe(headers=["NAME", "STREET", "NEIGHBORHOOD", "CITY", "PHONE NUMBER","YO!"])
classifier_ner_button = gr.Button("Get Classifier-NER Output")
ner_button = gr.Button("Get NER Output")
clear = gr.Button("Clear")
ner_button.click(process_ner_data,inputs=your_bot_token, outputs=ner_df)
classifier_ner_button.click(process_classifier_ner_data,inputs=your_bot_token, outputs=ner_df)
clear.click(lambda: None, None, ner_df, queue=True)
demo.queue(concurrency_count=3)
demo.launch() # share=True, debug=True

Here's a short demo of our ReliefNer app:



My Evaluation and Future Work for ReliefNer

While ReliefNer holds promise in aiding disaster relief efforts, it currently falls short of meeting production standards. The primary reason for this limitation lies in the quality of the training data, which was generated using ChatGPT and a limited set of prompt templates. As a result, the diversity of disaster messages is limited, impacting the performance of the language models. As the saying goes, "garbage in, garbage out," once again reinforcing that data quality is crucial in any ML system.
Having said that, I do believe that this app can be of great use in difficult times and can be enhanced in the following ways:
  1. Improve performance for production-level usage.
  2. Add multilingual support.
  3. Expand named entity recognition to include more relevant entities.
  4. Optimize code for efficiency and maintainability.
  5. Refine user interface and experience with user feedback and more.
  6. Continuously train and improve the language models using Weights & Biases and Argilla.
By addressing these areas, ReliefNer can evolve into a powerful tool for disaster relief, providing effective support during challenging times.
I hope this article has provided valuable insights into developing NER-powered applications using Hugging Face and SpaCy. Through the ReliefNer app, we've seen how NER can contribute to disaster relief by extracting and categorizing vital information from unstructured text data. If you have any questions regarding this piece, feel free to reach out. Happy Learning!