Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Speech models, such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden-unit BERT (HuBERT), have enabled the generation of lexical and acoustic representations that benefit speech recognition applications. We investigated the use of pre-trained model representations for estimating dimensional emotions, such as activation, valence, and dominance, from speech. We observed that while valence may rely heavily on lexical representations, activation and dominance depend primarily on acoustic information. In this work, we used multi-modal fusion representations from pre-trained models to achieve state-of-the-art speech emotion estimation, and we showed 100% and 30% relative improvements in concordance correlation coefficient (CCC) on valence estimation compared to standard acoustic and lexical baselines. Finally, we investigated the robustness of pre-trained model representations to noise and reverberation degradation and noticed that lexical and acoustic representations are impacted differently. We found that lexical representations are more robust to distortions than acoustic representations, and demonstrated that knowledge distillation from a multi-modal model helps improve the noise robustness of acoustic-based models.
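As an illustration of the evaluation metric referenced above, the following is a minimal sketch of the concordance correlation coefficient, which measures agreement between predicted and reference emotion values; the function name and example values are hypothetical and shown only for clarity, not taken from this work.

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Compute CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    # Population covariance between reference labels and predictions
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    return 2 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Illustrative usage: agreement between valence annotations and model outputs
valence_labels = [0.2, 0.5, 0.7, 0.9]
valence_preds = [0.25, 0.40, 0.75, 0.80]
print(concordance_correlation_coefficient(valence_labels, valence_preds))
```

Unlike Pearson correlation, CCC also penalizes differences in mean and scale between predictions and labels, which is why it is the standard metric for dimensional emotion estimation.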