Real-Time Sign Language Translation System

Translates sign language gestures captured by flex sensors into speech in real time, with 25 ms inference latency and 120 ms end-to-end response time.

This project, named "Akatsuki," was built for Vihaan 9.0, a hackathon organized by IEEE DTU.

Motivation

Over 70 million people worldwide use sign language as their primary means of communication. Yet the gap between sign language users and those who don't understand it remains wide — most real-time translation systems are either too slow, too expensive, or too dependent on cameras and controlled lighting.

This project explores a different approach: flex sensors on a glove that capture finger bend angles directly, combined with an on-device Random Forest classifier and a streaming inference pipeline that translates gestures into speech in real time.

System Architecture

The pipeline flows in one direction with minimal latency at each stage:

Flex Sensors (Arduino)
       ↓
  Serial Stream
       ↓
 Sliding Window Buffer
       ↓
Random Forest Classifier
       ↓
 Smoothing + Confidence Filter
       ↓
 Gesture → Sentence Mapping
       ↓
   TTS Output + Streamlit UI

Hardware Layer — Arduino + Flex Sensors

Five flex sensors are mounted on a glove, one per finger. As the hand forms a gesture, each sensor's resistance changes roughly in proportion to its bend angle. The Arduino reads these analog values at high frequency and streams them over serial to the host machine.

Key challenge: Raw sensor data is noisy. Small vibrations, finger tremors, and cable flex introduce jitter that — left unfiltered — causes spurious predictions.
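On the host side, each serial line has to be parsed into numbers before it reaches the sliding window, and partial or corrupted lines should be dropped rather than passed on. The sketch below assumes the Arduino prints one comma-separated line of five values per sample (for example `512,498,610,700,455`); that line format, the `parse_reading` and `readings` helpers, and the 0 to 1023 range check (the 10-bit ADC of boards such as the Arduino Uno) are assumptions for illustration.

```python
from typing import Iterable, Iterator, List, Optional

NUM_SENSORS = 5   # one flex sensor per finger
ADC_MAX = 1023    # assumed 10-bit ADC range (e.g. Arduino Uno)


def parse_reading(line: str) -> Optional[List[int]]:
    """Parse one hypothetical 'v1,v2,v3,v4,v5' serial line.

    Returns None for malformed or partial lines, which are common when the
    host starts reading in the middle of a transmission.
    """
    parts = line.strip().split(",")
    if len(parts) != NUM_SENSORS:
        return None
    try:
        values = [int(p) for p in parts]
    except ValueError:
        return None
    # Reject values outside the assumed ADC range (line noise, garbled bytes).
    if any(v < 0 or v > ADC_MAX for v in values):
        return None
    return values


def readings(lines: Iterable[str]) -> Iterator[List[int]]:
    """Yield only well-formed readings from a stream of text lines."""
    for line in lines:
        values = parse_reading(line)
        if values is not None:
            yield values
```

With pySerial, the lines could come from `serial.Serial(port, baudrate).readline()`, decoded with `.decode("ascii", errors="ignore")`; the port name and baud rate depend on the actual setup.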

Signal Processing — Sliding Window

Instead of classifying individual sensor readings, the system buffers the last N timesteps into a sliding window. This captures the temporal shape of a gesture — the motion arc matters as much as the final position.

Window parameters were tuned empirically to balance responsiveness and noise rejection.
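A fixed-length buffer that discards its oldest sample on every new reading is a natural fit for this. The sketch below uses Python's `collections.deque` with `maxlen`; the `SlidingWindow` class, the window size passed in, and the time-major flattening into a single feature vector are assumptions, since the tuned window parameters are not given here.

```python
from collections import deque
from typing import Deque, List, Optional


class SlidingWindow:
    """Keeps the most recent `size` readings (hypothetical helper).

    Each reading is a list of flex-sensor values; once the buffer is full,
    every new reading yields one flattened window for the classifier.
    """

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        # maxlen makes the deque drop its oldest reading automatically.
        self._buf: Deque[List[int]] = deque(maxlen=size)

    def push(self, reading: List[int]) -> Optional[List[int]]:
        """Add a reading; return the flattened window once it is full, else None."""
        self._buf.append(reading)
        if len(self._buf) < self.size:
            return None
        # Time-major flattening: [t0 fingers..., t1 fingers..., ...]
        return [v for r in self._buf for v in r]
```

With a stride of one reading, a new window (and so a new prediction) is available on every sample once the buffer has filled.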

Model — Random Forest Classifier

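A Random Forest classifier, an ensemble of decision trees, is trained on windowed sensor sequences for each gesture class. Because a random forest expects a fixed-length feature vector, each window is presented to it as one vector of sensor values. Random forests are well-suited here because:

With the whole window as input, the shape of the gesture arc, not just the final pose, can inform the prediction
They tolerate variable-speed executions of the same gesture when the training data includes that variation
Averaging the votes of many trees reduces variance, which helps them generalize across noise and slight sensor drift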

Training data was collected across multiple sessions to capture natural variation in gesture speed and hand size.
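The sketch below shows how flattened windows can be fed to scikit-learn's `RandomForestClassifier`. It trains on synthetic data only: the two gesture names, their finger trajectories, the noise level, the window length, and `n_estimators=100` are all made up for illustration and are not the project's dataset or settings.

```python
import random
from typing import List, Sequence

from sklearn.ensemble import RandomForestClassifier

WINDOW = 10       # timesteps per window (hypothetical)
NUM_SENSORS = 5   # one flex sensor per finger


def make_window(start: Sequence[int], end: Sequence[int], rng: random.Random) -> List[float]:
    """Synthetic window: each finger moves linearly from start to end, plus jitter.

    Flattened time-major, so the vector has WINDOW * NUM_SENSORS entries.
    """
    window = []
    for t in range(WINDOW):
        frac = t / (WINDOW - 1)
        for s, e in zip(start, end):
            window.append(s + (e - s) * frac + rng.gauss(0, 15))
    return window


OPEN = [200, 200, 200, 200, 200]    # hypothetical readings, open hand
CLOSED = [800, 800, 800, 800, 200]  # hypothetical readings, four fingers bent
# Two made-up gestures that pass through the same poses in opposite order.
POSES = {"hello": (OPEN, CLOSED), "thanks": (CLOSED, OPEN)}

rng = random.Random(0)
X, y = [], []
for label, (start, end) in POSES.items():
    for _ in range(50):  # 50 noisy samples per gesture
        X.append(make_window(start, end, rng))
        y.append(label)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

sample = make_window(*POSES["thanks"], rng)
probs = clf.predict_proba([sample])[0]  # one score per class, ordered as clf.classes_
best = probs.argmax()
print(clf.classes_[best], probs[best])
```

Because the two synthetic gestures differ only in the order of the same finger positions, it is the time-major layout of the window that separates them; this is how a forest, which has no built-in notion of sequence, can still use the gesture arc.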

Post-Processing — Smoothing & Confidence Thresholds

Raw classifier outputs (per-class confidence scores) are passed through a smoothing layer, a running average over recent predictions, and then a confidence threshold filter. Only predictions whose smoothed confidence exceeds a set threshold trigger output, which eliminates low-confidence flickering between gesture classes.
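A minimal sketch of the two stages, assuming each prediction arrives as a list of per-class scores in a fixed class order. The `PredictionSmoother` class, the history length of 5, and the 0.8 threshold are hypothetical; the project's tuned values are not stated.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence


class PredictionSmoother:
    """Running average of recent score vectors followed by a confidence gate."""

    def __init__(self, classes: Sequence[str], history: int = 5, threshold: float = 0.8) -> None:
        self.classes = list(classes)
        self.threshold = threshold
        # Only the last `history` score vectors contribute to the average.
        self._recent: Deque[List[float]] = deque(maxlen=history)

    def update(self, probs: Sequence[float]) -> Optional[str]:
        """Add one score vector; return a label only if the averaged
        top score reaches the threshold, otherwise None."""
        if len(probs) != len(self.classes):
            raise ValueError("score vector does not match class list")
        self._recent.append(list(probs))
        n = len(self._recent)
        avg = [sum(p[i] for p in self._recent) / n for i in range(len(self.classes))]
        best = max(range(len(avg)), key=avg.__getitem__)
        return self.classes[best] if avg[best] >= self.threshold else None
```

In a live loop, `probs` could be the row that `predict_proba` returns for the newest window, with `classes` set to the classifier's `classes_` so that the indices line up. A single stray window then shifts the average only slightly instead of flipping the output.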

Output — TTS + Streamlit UI

Confirmed gesture predictions are mapped to words or phrases and passed to a text-to-speech engine. A Streamlit dashboard displays the live prediction stream, confidence scores, and assembled sentences.
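The mapping step can be sketched as a lookup table plus a small amount of state, so that a sign held across many consecutive windows is spoken once rather than on every window. The `PHRASES` vocabulary, the `SentenceBuilder` class, and the repeat-suppression rule are assumptions for illustration; the TTS engine is passed in as a plain callback.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical gesture-to-phrase table; the real vocabulary is not listed here.
PHRASES: Dict[str, str] = {
    "hello": "Hello",
    "thanks": "Thank you",
    "help": "I need help",
}


class SentenceBuilder:
    """Turns confirmed gesture labels into phrases and hands each to a TTS callback."""

    def __init__(self, speak: Callable[[str], None]) -> None:
        self.speak = speak
        self.sentence: List[str] = []
        self._last: Optional[str] = None

    def add(self, label: Optional[str]) -> None:
        # None means "no confident gesture"; it also allows the same sign to repeat later.
        if label is None:
            self._last = None
            return
        # Skip the same label arriving on consecutive windows while a sign is held.
        if label == self._last:
            return
        self._last = label
        phrase = PHRASES.get(label)
        if phrase is not None:  # labels without a phrase are ignored
            self.sentence.append(phrase)
            self.speak(phrase)

    def text(self) -> str:
        """The assembled sentence so far, e.g. for display in the dashboard."""
        return " ".join(self.sentence)
```

With pyttsx3, for instance, `speak` could wrap `engine.say(text)` followed by `engine.runAndWait()`; the TTS engine the project uses is not named here.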

A FastAPI backend exposes real-time prediction endpoints — making the system extensible to mobile clients or web interfaces.

Real-Time Translation Display