Spotting user-defined flexible keywords in real time is challenging because the keywords are specified as text rather than audio. In this work, we propose a novel architecture that efficiently detects flexible keywords, based on the following ideas. We construct a representative acoustic embedding of each keyword using grapheme-to-phone conversion. The phone-to-embedding conversion is done by looking up an embedding dictionary, which is built during training by averaging the audio-encoder embeddings that correspond to each phone. The key benefit of our approach is that the text embedding and the audio embedding lie in the same space; hence their comparison is semantically more accurate than when an independent text encoder is employed. We therefore adopt nearest-neighbor search in the embedding space to find the most probable keyword in the user-defined flexible keyword list.
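The pipeline described above can be illustrated with a minimal sketch. It is not the implementation used in this work. The toy lexicon `TOY_G2P`, the function names, the mean pooling of phone embeddings into a single keyword embedding, and the cosine similarity measure are all assumptions made for illustration; the abstract fixes only the per-phone averaging, the dictionary lookup, and the nearest-neighbor search.

```python
import numpy as np

# Hypothetical grapheme-to-phone lexicon; a real system would use a G2P model.
TOY_G2P = {"hey": ["HH", "EY"], "go": ["G", "OW"]}

def build_phone_dictionary(frame_embeddings, frame_phones):
    """Average the audio-encoder embeddings aligned to each phone."""
    sums, counts = {}, {}
    for emb, phone in zip(frame_embeddings, frame_phones):
        sums[phone] = sums.get(phone, 0.0) + np.asarray(emb, dtype=float)
        counts[phone] = counts.get(phone, 0) + 1
    return {p: sums[p] / counts[p] for p in sums}

def keyword_embedding(phones, phone_dict):
    """Look up each phone and mean-pool the results (pooling is an assumption)."""
    return np.mean([phone_dict[p] for p in phones], axis=0)

def nearest_keyword(audio_embedding, keyword_embeddings):
    """Return the keyword with the highest cosine similarity (assumed metric)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(keyword_embeddings,
               key=lambda k: cosine(audio_embedding, keyword_embeddings[k]))

# Toy stand-ins for audio-encoder outputs with phone alignments from training.
frame_embeddings = [
    [1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0],   # frames aligned to HH
    [0.0, 1.0, 0.0, 0.0], [0.0, 0.8, 0.2, 0.0],   # frames aligned to EY
    [0.0, 0.0, 1.0, 0.0],                         # frame aligned to G
    [0.0, 0.0, 0.0, 1.0],                         # frame aligned to OW
]
frame_phones = ["HH", "HH", "EY", "EY", "G", "OW"]

phone_dict = build_phone_dictionary(frame_embeddings, frame_phones)
keyword_embeddings = {
    word: keyword_embedding(phones, phone_dict)
    for word, phones in TOY_G2P.items()
}

# A made-up audio embedding for an incoming utterance.
query = np.array([0.5, 0.4, 0.1, 0.0])
best = nearest_keyword(query, keyword_embeddings)
print(best)  # hey
```

Because the keyword embeddings are assembled from averaged audio-encoder outputs, the query and the candidates live in the same space, so adding a keyword only requires a dictionary lookup rather than a separate text encoder. Mean pooling discards phone order, so a practical system would likely need an order-aware representation.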