The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interaction. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to the VA without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, limited scalability, and induced human biases. In this work, we propose a neural network based audio-gesture multimodal fusion system that (1) better understands the temporal correlation between audio and gesture data, leading to precise invocations, (2) generalizes to a wide range of environments and scenarios, (3) is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times, and (4) improves productivity in asset development processes.
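To make the fusion idea concrete, the following is a minimal sketch of one plausible realization: a small late-fusion network with a recurrent encoder per modality and a joint invocation head. The class name, feature dimensions, encoder choice, and late-fusion design are all illustrative assumptions, not the system described above.

```python
# A minimal late-fusion sketch (not the proposed system) of neural
# audio-gesture invocation detection. All names and dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

class AudioGestureFusion(nn.Module):
    def __init__(self, audio_dim=40, gesture_dim=6, hidden=32):
        super().__init__()
        # Small recurrent encoders keep the model lightweight enough
        # for an always-on, low-power device such as a smartwatch.
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.gesture_enc = nn.GRU(gesture_dim, hidden, batch_first=True)
        # Joint head scores the fused representation: invoke vs. not.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, audio, gesture):
        # audio: (batch, T_audio, audio_dim), e.g. log-mel frames
        # gesture: (batch, T_gesture, gesture_dim), e.g. accel + gyro
        _, audio_state = self.audio_enc(audio)
        _, gesture_state = self.gesture_enc(gesture)
        fused = torch.cat([audio_state[-1], gesture_state[-1]], dim=-1)
        return torch.sigmoid(self.head(fused))  # invocation probability

model = AudioGestureFusion()
prob = model(torch.randn(1, 100, 40), torch.randn(1, 100, 6))
```

A learned model like this replaces hand-engineered state-machine rules with parameters fit to data, which is what would let it adapt to new environments; richer cross-modal temporal modeling (e.g., attention across the two streams) would be a natural extension of this sketch.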