• ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

    Large Language Models (LLMs) with billions of parameters have revolutionized AI but are computationally intensive. This study advocates using the ReLU activation in LLMs: it minimally affects performance while reducing computation and weight transfer, because the exact zeros it produces can be skipped. Alternative activation functions such as GELU or SiLU are popular but more computationally demanding.
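
    The argument lends itself to a small sketch. Below is a minimal NumPy illustration (an assumed setup, not the paper's code) of why ReLU's exact zeros matter: the rows of the next weight matrix belonging to dead neurons can be skipped entirely, which the dense outputs of GELU/SiLU do not allow.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's code): ReLU yields exact
# zeros in the hidden activations, so the next matmul can skip the weight
# rows belonging to dead neurons.
rng = np.random.default_rng(0)

x = rng.standard_normal((4, 1024))        # hypothetical pre-activations
h = np.maximum(x, 0.0)                    # ReLU: roughly half become zero
sparsity = float((h == 0).mean())

W = rng.standard_normal((1024, 256))
dense = h @ W                             # full product
sparse = np.zeros_like(dense)
for i in range(h.shape[0]):
    idx = np.nonzero(h[i])[0]             # indices of active neurons only
    sparse[i] = h[i, idx] @ W[idx]        # skips ~half the weight rows

print(f"activation sparsity: {sparsity:.2f}")
```

    The two products agree exactly, but the sparse version touches only the active rows of `W`, which is the source of the compute and weight-transfer savings.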

  • Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

    This work proposes a novel architecture to detect user-defined flexible keywords in real time. The approach constructs acoustic embeddings of keywords via grapheme-to-phoneme conversion, then maps each phoneme to an embedding by looking it up in a dictionary built during training. The key benefit is the incorporation of both text and audio embeddings.
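
    A hypothetical sketch of the lookup-and-match step (the phoneme table, pooling, and scoring here are illustrative stand-ins, not the paper's method): phonemes are looked up in an embedding dictionary, pooled into a keyword embedding, and compared against an audio embedding.

```python
import numpy as np

# Hypothetical sketch: phonemes from grapheme-to-phoneme conversion are
# looked up in an embedding dictionary built during training, pooled into
# one keyword embedding, and matched against an audio embedding.
rng = np.random.default_rng(0)
phoneme_table = {p: rng.standard_normal(64) for p in ["HH", "EH", "L", "OW"]}

def keyword_embedding(phonemes):
    vecs = np.stack([phoneme_table[p] for p in phonemes])
    return vecs.mean(axis=0)                      # simple mean pooling

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

kw = keyword_embedding(["HH", "EH", "L", "OW"])   # e.g. "hello"
audio_emb = kw + 0.05 * rng.standard_normal(64)   # stand-in audio embedding
print(f"similarity: {cosine(kw, audio_emb):.3f}") # high score -> detection
```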

  • Automating Behavioral Testing in Machine Translation

    Behavioral testing in NLP evaluates system capabilities by analyzing input-output behavior. However, current behavioral tests for Machine Translation (MT) are limited and manually created. To overcome this, the authors propose using Large Language Models (LLMs) to generate diverse source sentences for testing MT model behavior across various scenarios, with a verification step to ensure the expected behavior.

  • How to Scale Your EMA

    Preserving training dynamics across batch sizes is important for practical machine learning. One tool for achieving this is scaling the learning rate linearly with the batch size. Another is a model EMA (Exponential Moving Average), which maintains a functional copy of a target model whose parameters gradually move towards those of the target model using…
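
    The EMA update, and the scaling rule the paper proposes for it, can be sketched as follows (a minimal sketch assuming linear learning-rate scaling; `rho` is the EMA momentum, and the rule raises it to the power of the batch-size scaling factor `kappa`):

```python
# Sketch of a model EMA step and the EMA Scaling Rule: when the batch size
# grows by a factor kappa (with the learning rate scaled linearly), the EMA
# momentum rho is raised to the power kappa so the EMA tracks the target
# model at the same effective rate.
def ema_update(ema, target, rho):
    """One EMA step per parameter: ema <- rho * ema + (1 - rho) * target."""
    return [rho * e + (1.0 - rho) * t for e, t in zip(ema, target)]

def scaled_momentum(rho, kappa):
    """Momentum to use after scaling the batch size by kappa."""
    return rho ** kappa

ema = ema_update([0.0], [1.0], rho=0.999)
print(f"scaled momentum for 8x batch: {scaled_momentum(0.999, 8):.5f}")
```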

  • Diffusion Models as Masked Audio-Video Learners

    Recently, a paper on the use of audio-visual synchronization for learning audio-visual representations was accepted at the Machine Learning for Audio Workshop at NeurIPS 2023. The paper discusses the effectiveness of unsupervised training frameworks, particularly the Masked Audio-Video Learners (MAViL) framework, which combines contrastive learning with masked autoencoding.

  • Agnostically Learning Single-Index Models using Omnipredictors

    This text introduces a new approach to agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations. Unlike previous methods, it does not rely on predetermined settings or knowledge of the activation function. Additionally, it only requires the marginal to have bounded second moments, instead of stronger distributional assumptions. The algorithm is based on…

  • PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model

    Autoregressive models for text generation often produce repetitive and low-quality output because errors accumulate during generation. Exposure bias, the mismatch between training and inference conditions, is commonly blamed for this. Denoising diffusion models offer an alternative by allowing a model to revise its output, but they are computationally expensive and less fluent for longer text.

  • Improving Vision-inspired Keyword Spotting Using a Streaming Conformer Encoder With Input-dependent Dynamic Depth

    This text proposes an architecture capable of processing streaming audio using a vision-inspired keyword spotting framework. By extending a Conformer encoder with trainable binary gates, the approach improves detection and localization accuracy on continuous speech while maintaining a small memory footprint. The inclusion of gates also reduces the average amount of processing without affecting performance.
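
    A toy sketch of input-dependent dynamic depth (the gate and block functions below are stand-ins, not the paper's Conformer): a per-layer binary gate inspects the input and decides whether to run the block or pass it through as an identity, so the average amount of processing per input drops.

```python
import numpy as np

# Toy sketch of input-dependent dynamic depth with per-layer binary gates.
# The "block" and gate parameterization are hypothetical stand-ins.
rng = np.random.default_rng(0)
DIM, N_LAYERS = 32, 6
layers = [rng.standard_normal((DIM, DIM)) / DIM**0.5 for _ in range(N_LAYERS)]
gates = [rng.standard_normal(DIM) for _ in range(N_LAYERS)]

def forward(x):
    """Run the stack, skipping layers whose binary gate is closed."""
    executed = 0
    for W, g in zip(layers, gates):
        score = 1.0 / (1.0 + np.exp(-float(x @ g)))  # input-dependent gate
        if score > 0.5:                               # gate open: run block
            x = np.tanh(x @ W)                        # stand-in for a block
            executed += 1
        # gate closed: identity skip, no compute for this layer
    return x, executed

x = rng.standard_normal(DIM)
out, executed = forward(x)
print(f"layers executed: {executed}/{N_LAYERS}")
```

    At training time such hard gates are typically learned with a relaxation (e.g. a straight-through estimator), but at inference the closed gates translate directly into skipped computation.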

  • Realistic talking faces created from only an audio clip and a person’s photo

    Researchers have created a program called DIRFA that generates realistic videos by combining audio and a face photo. The program uses artificial intelligence to create 3D videos that accurately show the person’s facial expressions and head movements.

  • YouTube continues foray into AI with upcoming creative tools

    YouTube is introducing new AI-powered features that allow users to compose music using the voices of popular artists and convert hummed melodies into songs. One feature, called “Dream Track,” allows users to generate songs in the styles of licensed artists, while another tool, “Music AI Tools,” supports musicians in their creative processes. These innovations are…