Artificial Intelligence
Runway’s Gen-2 is a groundbreaking video editing tool that simplifies the video generation process. It introduces the Motion Brush feature, which lets users control the movement of generated content by painting simple brush strokes over the regions they want to animate. This eliminates the need for complex text inputs and extensive editing, making video creation more intuitive and accessible. Gen-2 faithfully restores…
Project Open Se Cura is an open-source framework introduced by Google to enhance the development of secure and efficient AI systems. It aims to bridge the gap between hardware breakthroughs and advances in machine learning models and software development. The collaborative effort with partners like VeriSilicon, Antmicro, and lowRISC focuses on creating open-source design tools…
NetEase Youdao has released an open-source text-to-speech (TTS) engine called EmotiVoice (“Yi Mo Sheng”). It offers web and script interfaces and supports batch result generation, making it suitable for applications requiring emotional voice synthesis. The engine supports more than 2,000 voices, both Chinese and English, and includes a distinctive emotion-synthesis feature. Another competitor in the…
A recent research paper presents a deep learning-based classifier for age-related macular degeneration (AMD) stages using retinal optical coherence tomography (OCT) scans. The model accurately classifies macula-centered 3D volumes into Normal, early/intermediate AMD (iAMD), atrophic (GA), and neovascular (nAMD) stages. The study highlights the significance of accurate AMD staging for timely treatment initiation and emphasizes…
Researchers from MIT investigated the scaling behavior of large chemical language models, including generative pre-trained transformers (GPT) for chemistry and graph neural network force fields (GNNs). They introduced the concept of neural scaling, examining the impact of model and data size on pre-training loss. The study also explored hyperparameter optimization using a technique called Training…
Dynamic view synthesis is a technique used in computer vision and graphics to reconstruct dynamic 3D scenes from videos. Traditional methods have limitations in terms of rendering speed and quality. However, a new approach called 4K4D has been introduced, which utilizes a 4D point cloud representation and a hybrid appearance model to achieve faster rendering…
A team of researchers from Xi’an Jiaotong University, Peking University, and Microsoft has developed a method called LeMa that improves the mathematical reasoning abilities of large language models (LLMs) by teaching them to learn from mistakes. They fine-tune the LLMs on mistake-correction data pairs generated by GPT-4. LeMa consistently improves performance across various LLMs and tasks,…
In this research, a Gaussian Mixture Model (GMM) is proposed as a reverse transition operator in the Denoising Diffusion Implicit Models (DDIM) framework. By constraining the GMM parameters to match the first and second order central moments of the forward marginals, samples of equal or better quality than the original DDIM with Gaussian kernels can…
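The moment-matching idea can be written compactly. Assuming a K-component mixture with weights w_k, means μ_k, and covariances Σ_k, matching the target Gaussian kernel’s mean μ and covariance Σ amounts to the following constraints (a sketch of the general principle, not necessarily the paper’s exact parameterization):

```latex
\sum_{k=1}^{K} w_k \mu_k = \mu,
\qquad
\sum_{k=1}^{K} w_k \left( \Sigma_k + \mu_k \mu_k^{\top} \right) = \Sigma + \mu \mu^{\top}
```

The first equation fixes the mixture mean; the second fixes its second moment (since E[xxᵀ] = Σ + μμᵀ), leaving the remaining GMM degrees of freedom free.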
Large Language Models (LLMs) with billions of parameters have revolutionized AI but are computationally intensive. This study supports the use of ReLU activation in LLMs as it minimally affects performance but reduces computation and weight transfer. Alternative activation functions like GELU or SiLU are popular but more computationally demanding.
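The computational argument for ReLU rests on the exact zeros it produces: for roughly half of zero-mean inputs the output is exactly 0, so the corresponding rows of the next weight matrix never need to be loaded or multiplied. The sketch below illustrates this with NumPy (the 50% sparsity figure assumes zero-mean activations; real LLM activation statistics differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU; smooth, so it almost never outputs exact zeros
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = rng.standard_normal((4, 1024))  # stand-in for hidden activations
h_relu, h_gelu = relu(x), gelu(x)

# Fraction of exact zeros: ~50% for ReLU on zero-mean inputs, ~0% for GELU.
relu_sparsity = float(np.mean(h_relu == 0.0))
gelu_sparsity = float(np.mean(h_gelu == 0.0))
print(f"ReLU zeros: {relu_sparsity:.0%}, GELU zeros: {gelu_sparsity:.0%}")

# Skipping zero activations in the down-projection gives an identical result,
# which is where the compute and weight-transfer savings come from.
W = rng.standard_normal((1024, 256))
mask = h_relu[0] != 0.0
full = h_relu[0] @ W
sparse = h_relu[0][mask] @ W[mask]
assert np.allclose(full, sparse)
```

With GELU or SiLU no such exact sparsity exists, so every weight row participates in every forward pass.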
This work proposes a novel architecture to detect user-defined flexible keywords in real time. The approach constructs acoustic embeddings of keywords via grapheme-to-phoneme conversion, then maps phonemes to embeddings by looking them up in a dictionary built during training. The key benefit is the incorporation of both text and audio embeddings.
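The text-side pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper’s architecture: the phone set, the embedding table, the mean-pooling step, and the cosine-similarity detector are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8

# Hypothetical phone-level embedding table, learned during training in the real system.
PHONE_TABLE = {p: rng.standard_normal(EMB_DIM) for p in ["HH", "EH", "L", "OW"]}

def grapheme_to_phoneme(word):
    # Stand-in for a real grapheme-to-phoneme converter.
    return {"hello": ["HH", "EH", "L", "OW"]}[word]

def text_embedding(word):
    # Keyword embedding: look up each phone's vector, then pool.
    phones = grapheme_to_phoneme(word)
    return np.mean([PHONE_TABLE[p] for p in phones], axis=0)

def detect(audio_embedding, keyword, threshold=0.7):
    # Cosine similarity between the audio embedding and the text-derived
    # keyword embedding decides whether the keyword fired.
    t = text_embedding(keyword)
    sim = audio_embedding @ t / (np.linalg.norm(audio_embedding) * np.linalg.norm(t))
    return bool(sim >= threshold)
```

Because the keyword embedding is computed purely from text, users can define new keywords at enrollment time without retraining the acoustic model.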
Behavioral testing in NLP evaluates system capabilities by analyzing input-output behavior. However, current behavioral tests for machine translation are limited in scope and created manually. To overcome this, the authors propose using large language models (LLMs) to generate diverse source sentences for testing MT model behavior across many scenarios, with a verification step to confirm that each generated test elicits the expected behavior.
Preserving training dynamics across batch sizes is important for practical machine learning. One tool for achieving this is scaling the learning rate linearly with the batch size. Another tool is the use of model EMA, which creates a functional copy of a target model that gradually moves towards the parameters of the target model using…
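The two scaling rules mentioned above can be sketched together. The linear learning-rate rule is standard; the EMA rule shown (raising the momentum to the power of the batch-size multiplier, so the average still spans the same number of samples) is one published proposal and may differ from the scheme this particular work uses:

```python
def scale_hyperparams(lr, ema_beta, kappa):
    """Adjust hyperparameters when the batch size is multiplied by kappa.

    - Linear scaling rule: multiply the learning rate by kappa.
    - EMA scaling rule (assumed form): raise the EMA momentum to the
      power kappa, since each step now covers kappa times more data.
    """
    return lr * kappa, ema_beta ** kappa

def ema_update(ema_params, model_params, beta):
    # The EMA copy gradually moves toward the target model's parameters:
    # ema <- beta * ema + (1 - beta) * theta
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema_params, model_params)]
```

For example, doubling the batch size (kappa = 2) turns lr = 0.1 into 0.2 and an EMA momentum of 0.999 into 0.999², keeping the effective averaging horizon, measured in samples, unchanged.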
Recently, a paper on the use of audio-visual synchronization for learning audio-visual representations was accepted at the Machine Learning for Audio Workshop at NeurIPS 2023. The paper discusses the effectiveness of unsupervised training frameworks, particularly the Masked Audio-Video Learners (MAViL) framework, which combines contrastive learning with masked autoencoding.
This text introduces a new approach to agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations. Unlike previous methods, it does not rely on predetermined settings or knowledge of the activation function. Additionally, it only requires the marginal to have bounded second moments, instead of stronger distributional assumptions. The algorithm is based on…
Autoregressive models for text generation often produce repetitive, low-quality output because errors accumulate during generation. Exposure bias, the mismatch between training (where the model conditions on gold prefixes) and inference (where it conditions on its own outputs), is commonly blamed for this. Denoising diffusion models offer an alternative by allowing a model to revise its output, but they are computationally expensive and less fluent for longer text.
This text proposes an architecture capable of processing streaming audio using a vision-inspired keyword spotting framework. By extending a Conformer encoder with trainable binary gates, the approach improves detection and localization accuracy on continuous speech while maintaining a small memory footprint. The inclusion of gates also reduces the average amount of processing without affecting performance.
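The gating idea can be illustrated with a toy residual block. This is a minimal sketch of a binary gate skipping a block on the residual path, not the paper’s Conformer implementation; in training the gates would be relaxed to be differentiable, which is omitted here.

```python
def hard_gate(logit):
    # Binarized trainable gate: at inference it is simply open (1.0) or closed (0.0).
    return 1.0 if logit > 0 else 0.0

def gated_block(x, block_fn, gate_logit):
    # Residual block guarded by a gate. When the gate closes, block_fn is
    # never evaluated, which is where the compute savings come from;
    # the residual path passes x through unchanged.
    g = hard_gate(gate_logit)
    return x + g * block_fn(x) if g else x
```

A block whose gate has learned to close contributes no computation for that input, which reduces the average processing cost while the open gates preserve accuracy.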
Researchers have created a program called DIRFA that generates realistic videos by combining audio and a face photo. The program uses artificial intelligence to create 3D videos that accurately show the person’s facial expressions and head movements.
YouTube is introducing new AI-powered features that allow users to compose music using the voices of popular artists and convert hummed melodies into songs. One feature, called “Dream Track,” allows users to generate songs in the styles of licensed artists, while another tool, “Music AI Tools,” supports musicians in their creative processes. These innovations are…
Microsoft has introduced its first custom AI chips, the Microsoft Azure Maia 100 AI Accelerator and the Microsoft Azure Cobalt 100 CPU. These chips are designed for AI and cloud computing applications and will be used in Microsoft’s data centers to power Bing AI chatbot, Copilot, and Azure OpenAI. The goal is to meet the…
Data organizations often overlook the responsibilities of data consumers in data contracts. To maximize the value of data, data contracts should spell out the consumer’s obligations in analyzing and applying the data. Neglecting consumer commitments can reduce the business impact of data contracts. Consumer commitments should go beyond compliance and focus on value creation. Structured approaches,…