Webcam gets stuck when audio is playing during hand detection in Python

  mediapipe, opencv, python-3.x, pyttsx3, webcam

I am working on a vision-based American Sign Language converter application.
The first step is to perform static hand detection using MediaPipe and OpenCV. I am also using pyttsx3 for audio output.
The problem is that while the audio is playing, the webcam feed gets stuck. Because this happens inside the running while loop, the feed never becomes smooth during hand detection.
I searched on different platforms and came across the idea of multithreading, but I don't know how to apply it to my code. I am also unsure whether the 'audio playing' block should sit inside the webcam while loop or outside it.
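As a rough sketch of how the multithreading idea could look (an illustration under assumptions, not a tested fix for this exact setup): the webcam loop only puts text on a queue, and a separate worker thread does the blocking speech call. The names `speech_worker` and `make_pyttsx3_speak` are hypothetical helpers, not part of pyttsx3. The engine is created inside the worker thread on the assumption that keeping all pyttsx3 calls on one thread is safer.

```python
import queue
import threading


def speech_worker(texts, make_speak):
    """Take strings from the queue `texts` and speak them one at a time.

    `make_speak` is called once inside this thread and must return a
    blocking function speak(text). Putting None on the queue stops the worker.
    """
    speak = make_speak()
    while True:
        text = texts.get()        # waits here until the webcam loop sends text
        if text is None:          # sentinel value: shut the worker down
            break
        speak(text)               # blocks only this thread, not the webcam loop


def make_pyttsx3_speak(rate=125):
    """Hypothetical helper: build a speak() function backed by pyttsx3."""
    import pyttsx3                # imported lazily so the sketch loads without it

    engine = pyttsx3.init()
    engine.setProperty('rate', rate)

    def speak(text):
        engine.say(text)
        engine.runAndWait()       # blocking call, now off the main thread

    return speak


# Possible use around the webcam loop (assumed wiring):
#   texts = queue.Queue()
#   threading.Thread(target=speech_worker,
#                    args=(texts, make_pyttsx3_speak), daemon=True).start()
#   ... inside the while loop:  texts.put(letterName)
#   ... after the loop:         texts.put(None)
```

With this layout the 'audio playing' block leaves the while loop entirely. The loop only calls `texts.put(...)`, which returns immediately.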

Note: I have only uploaded a selected piece of the code, since the complete code uses an external tflite file (a trained model for hand gestures), which I think I am not allowed to upload under Stack Overflow policy.

import mediapipe as mp
import cv2
import numpy as np
import pyttsx3

mp_draw = mp.solutions.drawing_utils
mp_hand = mp.solutions.hands

video = cv2.VideoCapture(0)

with mp_hand.Hands(max_num_hands = 1,
                min_detection_confidence = 0.7,
                min_tracking_confidence = 0.5) as hands:

    while True:
        ret, image = video.read()
        if not ret:               # stop if no frame could be read
            break
        image = cv2.flip(image, 1)
        image.flags.writeable = False
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = hands.process(image)
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        letterName = ''

        if results.multi_hand_landmarks:
            landmarks = []
            for hand_landmark in results.multi_hand_landmarks:
                for lm in hand_landmark.landmark:
                    landmarks.append(lm)    # collect each landmark for the model input
                mp_draw.draw_landmarks(image, hand_landmark, mp_hand.HAND_CONNECTIONS)

                ### Code for extracting the hand coordinates, modifying them, and giving them to the tflite model as input will come here ###

                letterID = np.argmax(output_data)      ### Getting the index of the most probable gesture
                letterName = letterNames[letterID]

                ### letterNames is a list of strings corresponding to different gestures
                ### letterName is the output gesture text that will be displayed on the webcam screen

                ### AUDIO PLAYING PART ###
                engine = pyttsx3.init()
                engine.setProperty('rate', 125)
                engine.say(letterName)    # Speaks the text corresponding to the gesture
                engine.runAndWait()       # Blocks until the speech has finished

        cv2.putText(image, letterName, (10, 50), cv2.FONT_HERSHEY_SIMPLEX,
                    1, (0, 0, 255), 2, cv2.LINE_AA)
        cv2.imshow('Frame', image)
        k = cv2.waitKey(1)
        if k == ord('q'):
            break

video.release()
cv2.destroyAllWindows()


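A related point is that the loop above would try to speak on every frame in which a hand is visible, so the same letter could be requested many times per second. One possible way to limit this, shown only as a sketch, is to pass a letter on to the audio part when it has been stable for a few frames and differs from the last letter spoken. The class name `LetterGate` and the `hold` parameter are assumptions made for illustration.

```python
class LetterGate:
    """Let a letter through only once it has been seen on `hold`
    consecutive frames and differs from the last letter let through."""

    def __init__(self, hold=5):
        self.hold = hold
        self.candidate = ''       # letter seen on the most recent frames
        self.count = 0            # how many frames in a row it has been seen
        self.last_spoken = ''     # last letter returned to the caller

    def update(self, letter):
        """Feed the letter detected on the current frame ('' for no hand).
        Returns the letter if it should be spoken now, otherwise None."""
        if letter == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = letter, 1
        if self.count < self.hold:
            return None           # not stable for long enough yet
        if not letter:
            self.last_spoken = '' # hand gone: the same letter may be spoken again
            return None
        if letter == self.last_spoken:
            return None           # already spoken, do not repeat it
        self.last_spoken = letter
        return letter
```

Inside the while loop this could be used as `to_speak = gate.update(letterName)`, with the audio part running only when `to_speak` is not None.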