
🚀 Getting Started

Usage Demo on Colab (v0.9.12)

  • Refer to the documentation below for updated instructions and guides.

Prerequisites

  • Sign up to HuggingFace and get your token keys using this guide. Make sure the token has write permissions.

  • Sign up to Weights and Biases and get your token keys using this guide. You can load both tokens in a notebook without hardcoding them, as shown in the sketch below.
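
A minimal sketch for supplying the tokens at runtime in a Colab/Jupyter session so they never live in the notebook itself; the prompts and variable names are illustrative, not part of the library:

from getpass import getpass

# Prompt for secrets at runtime instead of pasting them into cells
huggingface_token = getpass("HuggingFace token (write access): ")
wandb_api_key = getpass("Weights and Biases API key: ")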


Colab

Step 1: Installation

!pip install --upgrade pip
!pip install africanwhisper[training]    # If you want to train and test the model on a notebook

# !pip install africanwhisper[all]      # If you want to train and deploy an endpoint.

# !pip install africanwhisper[deployment]      # If you want to deploy an endpoint.

# If you're on Colab, restart the session after installing due to an issue with the numpy installation on Colab.

Step 2: Set Parameters

# Set the parameters (refer to the 'Usage on VM' section for more details)
huggingface_token = " "                               # Make sure the token has write permissions
dataset_name = "mozilla-foundation/common_voice_16_1"
language_abbr = []                                    # Example: ["ti", "yi"]. See abbreviations at https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1
model_id = "model-id"                                 # Example: openai/whisper-small, openai/whisper-medium
processing_task = "translate"                         # "translate" or "transcribe"
wandb_api_key = " "
use_peft = True                                       # Note: PEFT only works on a notebook with GPU support.
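
For instance, a filled-in configuration might look like this (the Swahili abbreviation "sw" and openai/whisper-small are illustrative choices drawn from the examples above, not requirements):

huggingface_token = "hf_..."                          # your HuggingFace write token
dataset_name = "mozilla-foundation/common_voice_16_1"
language_abbr = ["sw"]                                # Swahili, as an example
model_id = "openai/whisper-small"
processing_task = "transcribe"
wandb_api_key = "..."                                 # your Weights and Biases API key
use_peft = True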

Step 3: Prepare the Model

from training.data_prep import DataPrep

# Initialize the DataPrep class and prepare the model
process = DataPrep(
    huggingface_token,
    dataset_name,
    language_abbr,
    model_id,
    processing_task,
    use_peft
)
tokenizer, feature_extractor, feature_processor, model = process.prepare_model()

Step 4: Preprocess the Dataset

# Load and preprocess the dataset
processed_dataset = process.load_dataset(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    processor=feature_processor,
    streaming=True,
    train_num_samples=None,       # Optional: int - number of samples to load into the training set; None (default) loads the whole training split.
    test_num_samples=None)        # Optional: int - number of samples to load into the test set; None (default) loads the whole test split.
                                  # If more than one language is specified, train_num_samples/test_num_samples apply to each one,
                                  # e.g. language_abbr=["af", "ti"] with train_num_samples=100 returns 100 samples per language.

Step 5: Train the Model

from training.model_trainer import Trainer

# Initialize the Trainer class and train the model
trainer = Trainer(
    huggingface_token=huggingface_token,
    model_id=model_id,
    dataset=processed_dataset,
    model=model,
    feature_processor=feature_processor,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    wandb_api_key=wandb_api_key,
    use_peft=use_peft,
    processing_task=processing_task,
    language=language_abbr
)
trainer.train(
    warmup_steps=10,
    max_steps=500,
    learning_rate=0.0001,
    lr_scheduler_type="constant_with_warmup",
    per_device_train_batch_size=32,              # Adjust based on available RAM; increase if more RAM is available
    per_device_eval_batch_size=32,               # Adjust based on available RAM; increase if more RAM is available
    optim="adamw_bnb_8bit",
    save_steps=100,
    logging_steps=100,
    eval_steps=100,
    gradient_checkpointing=True,
)

# Optional parameters for training:
#     max_steps (int): The maximum number of training steps (default is 100).
#     learning_rate (float): The learning rate for training (default is 1e-5).
#     per_device_train_batch_size (int): The batch size per GPU for training (default is 8).
#     per_device_eval_batch_size (int): The batch size per GPU for evaluation (default is 8).
#     optim (str): The optimizer used for training (default is "adamw_bnb_8bit").
# See more configurable parameters at https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments
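
Given those defaults, a minimal call could rely on them entirely; a sketch, assuming the remaining arguments shown in Step 5 are likewise optional:

trainer.train()   # falls back to the defaults listed above (100 steps, lr 1e-5, batch size 8, adamw_bnb_8bit)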

Step 6: Merge LoRA Weights (if PEFT was used)

from training.merge_lora import Merger

# Merge PEFT fine-tuned model weights with the base model weights
Merger.merge_lora_weights(
    hf_model_id="your-finetuned-model-name-on-huggingface-hub",
    huggingface_token=" "
)

Step 7: Test Model using an Audio File

from deployment.speech_inference import SpeechTranscriptionPipeline, ModelOptimization

model_name = "your-finetuned-model-name-on-huggingface-hub"   # e.g., "KevinKibe/whisper-small-af"
huggingface_token = " "
task = "desired-task"                                         # either 'translate' or 'transcribe'
audiofile_dir = "location-of-audio-file"                      # filetype should be .mp3 or .wav

# Optimize model for better results
model_optimizer = ModelOptimization(model_name=model_name)
model_optimizer.convert_model_to_optimized_format()
model = model_optimizer.load_transcription_model()  # For v3 or v3-turbo models, or a fine-tuned version of them, pass is_v3_architecture=True
                                                    # Example:
                                                    # model = model_optimizer.load_transcription_model(is_v3_architecture=True)

                                                    # Optional language parameter; otherwise the model will detect the language automatically.
                                                    # Example:
                                                    # model = model_optimizer.load_transcription_model(language='en')

# Initialize the transcription pipeline
inference = SpeechTranscriptionPipeline(
    audio_file_path=audiofile_dir,
    task=task,
    huggingface_token=huggingface_token
)

# To get transcriptions
transcription = inference.transcribe_audio(model=model)
print(transcription)

# To get transcriptions with speaker labels
alignment_result = inference.align_transcription(transcription)  # Optional parameter alignment_model: use it if the default wav2vec alignment
                                                                 # model is unavailable for your language, e.g. thinkKenya/wav2vec2-large-xls-r-300m-sw
diarization_result = inference.diarize_audio(alignment_result)
print(diarization_result)

# To generate subtitles (.srt format); the file will be saved in the root directory
inference.generate_subtitles(transcription, alignment_result, diarization_result)
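
For example, to align with a language-specific wav2vec model instead of the default, pass it via alignment_model (the model name below is the one mentioned in the comment above, used purely as an illustration):

alignment_result = inference.align_transcription(
    transcription,
    alignment_model="thinkKenya/wav2vec2-large-xls-r-300m-sw"
)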

🖥️ Using the CLI

Step 1: Clone and Install Dependencies

  • Clone the Repository: Clone or download the application code to your local machine.

    git clone https://github.com/KevKibe/African-Whisper.git
    

  • Create a virtual environment for the project and activate it.

    python3 -m venv venv
    source venv/bin/activate
    

  • Install dependencies by running this command

    pip install -r requirements.txt
    

  • Navigate to the src directory:
    cd src
    

Step 2: Finetune the Model

  • To start training, use the following command:
    python -m training.main \
        --huggingface_token YOUR_HUGGING_FACE_WRITE_TOKEN_HERE \
        --dataset_name AUDIO_DATASET_NAME \
        --train_num_samples SAMPLE_SIZE \
        --test_num_samples SAMPLE_SIZE \
        --language_abbr LANGUAGE_ABBREVIATION \
        --model_id MODEL_ID \
        --processing_task PROCESSING_TASK \
        --wandb_api_key YOUR_WANDB_API_KEY_HERE \
        --use_peft \
        --max_steps NUMBER_OF_TRAINING_STEPS \
        --train_batch_size TRAINING_BATCH_SIZE \
        --eval_batch_size EVALUATION_BATCH_SIZE \
        --save_eval_logging_steps SAVE_EVAL_AND_LOGGING_STEPS

  • Run python -m training.main --help to see the flag descriptions.
  • Find a description of these commands here, or see the filled-in example below.
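
As a concrete illustration, here is the same command with example values filled in (the token placeholders, the Swahili abbreviation, and the step counts are illustrative, mirroring the notebook example above):

python -m training.main \
    --huggingface_token hf_... \
    --dataset_name mozilla-foundation/common_voice_16_1 \
    --train_num_samples 100 \
    --test_num_samples 100 \
    --language_abbr sw \
    --model_id openai/whisper-small \
    --processing_task transcribe \
    --wandb_api_key your-wandb-api-key \
    --use_peft \
    --max_steps 500 \
    --train_batch_size 32 \
    --eval_batch_size 32 \
    --save_eval_logging_steps 100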

Step 3: Merge the Model Weights (if PEFT Fine-tuned)

python -m training.merge_lora --hf_model_id MODEL-ID-ON-HF --huggingface_write_token HF-WRITE_TOKEN

Step 4: Get Inference

Install ffmpeg

  • To get inference from your fine-tuned model, follow these steps:

  • Ensure that ffmpeg is installed by running the following commands:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
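
You can then confirm the installation succeeded:

ffmpeg -version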

To get inference on the CLI locally

cd src/deployment

  • Create a .env file (e.g. with nano .env), add these keys, and save the file:

MODEL_NAME = "your-finetuned-model"
HUGGINGFACE_TOKEN = "huggingface-token"

  • To perform transcriptions and translations:

python -m deployment.speech_inference_cli --audio_file FILENAME --task TASK --perform_diarization --perform_alignment

  • Run python -m deployment.speech_inference_cli --help to see the flag descriptions.
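
For example, to transcribe a local file with alignment and diarization (the filename is illustrative):

python -m deployment.speech_inference_cli --audio_file sample.wav --task transcribe --perform_diarization --perform_alignment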