Voice assistants help users make phone calls, send messages, create events, navigate, and do much more. However, assistants have limited capacity to understand their users’ context. In this work, we aim to take a step in this direction. Our work explores a new experience in which users refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens. Our focus lies in reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, similar to visual grounding. We collect a dataset and propose a lightweight general-purpose model for this novel experience. Due to the high cost of consuming pixels directly, our system is designed to rely on the text extracted from the UI. Our model is modular, thus offering flexibility, improved interpretability, and efficient runtime memory utilization.