I am working on a Vision-Based American Sign Language Converter application.
The first step is to perform static hand detection using MediaPipe and OpenCV. I am also using pyttsx3 for text-to-speech audio.
The problem is that while the audio is playing, the webcam feed freezes: the while loop that drives the webcam is blocked until the speech finishes, so the video is never smooth during hand detection.
I searched on different platforms and came across the idea of multithreading, but I don't know how to apply it to my code. I am also unsure where the audio-playing block belongs: inside the webcam's while loop or outside it.
Note: I have only uploaded a selected piece of the code, since the complete code uses an external .tflite file (a trained model for hand gestures), which I believe I am not allowed to upload under Stack Overflow policy.
import mediapipe as mp
import cv2
import numpy as np   # needed for np.argmax below
import pyttsx3       # needed for the audio-playing part below

mp_draw = mp.solutions.drawing_utils
mp_hand = mp.solutions.hands

video = cv2.VideoCapture(0)

with mp_hand.Hands(max_num_hands=1,
                   min_detection_confidence=0.7,
                   min_tracking_confidence=0.5) as hands:
    while True:
        ret, image = video.read()
        image = cv2.flip(image, 1)

        # MediaPipe expects RGB; marking the frame read-only speeds up processing
        image.flags.writeable = False
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = hands.process(image)
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        letterName = ''
        if results.multi_hand_landmarks:
            landmarks = []
            for hand_landmark in results.multi_hand_landmarks:
                for lm in hand_landmark.landmark:
                    landmarks.append(lm.x)
                    landmarks.append(lm.y)
                mp_draw.draw_landmarks(image, hand_landmark, mp_hand.HAND_CONNECTIONS)

            ### Code for extracting the hand coordinates, modifying them, and
            ### feeding them to the tflite model as input will come here ###
            letterID = np.argmax(output_data)   # index of the most probable gesture
            letterName = letterNames[letterID]  # letterNames is a list of strings corresponding to the gestures
            # letterName is the output gesture text that will be displayed on the webcam screen

            ### AUDIO PLAYING PART ###
            engine = pyttsx3.init()
            engine.setProperty('rate', 125)
            engine.say(letterName)   # speaks the text corresponding to the gesture
            engine.runAndWait()      # this call blocks until the audio finishes
            engine.stop()

        cv2.putText(image, letterName, (10, 50), cv2.FONT_HERSHEY_SIMPLEX,
                    1, (0, 0, 255), 2, cv2.LINE_AA)
        cv2.imshow('Frame', image)

        k = cv2.waitKey(1)
        if k == ord('q'):
            break

video.release()
cv2.destroyAllWindows()
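From what I gathered, the threading idea would be to push the recognized letterName onto a queue and let a dedicated worker thread do the speaking, so that engine.runAndWait() blocks only that thread and not the webcam loop. Below is my rough sketch of that (the names speech_queue and speech_worker are my own, and I have not verified that pyttsx3 behaves reliably when driven from a background thread on every platform):

import queue
import threading

import pyttsx3

speech_queue = queue.Queue()

def speech_worker():
    # This thread owns its own engine, since pyttsx3 engines should not
    # be shared across threads.
    engine = pyttsx3.init()
    engine.setProperty('rate', 125)
    while True:
        text = speech_queue.get()
        if text is None:          # sentinel value shuts the worker down
            break
        engine.say(text)
        engine.runAndWait()       # blocks only this worker thread

threading.Thread(target=speech_worker, daemon=True).start()

The audio-playing part inside the while loop would then shrink to something like:

if letterName and speech_queue.empty():   # skip if the last phrase is still queued
    speech_queue.put(letterName)

If that is right, the engine setup and runAndWait move out of the while loop entirely, and only the queue.put call stays inside it; one caveat I can see is that speech_queue.empty() only checks the queue, not whether the worker is still speaking the previous phrase.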