Researchers from MIT investigated the scaling behavior of large chemical language models, including generative pre-trained transformers (GPTs) for chemistry and graph neural network (GNN) force fields. They examined neural scaling, i.e., how model size and dataset size affect pre-training loss. The study also explored hyperparameter optimization using a technique called Training…
Dynamic view synthesis is a technique used in computer vision and graphics to reconstruct dynamic 3D scenes from videos. Traditional methods have limitations in terms of rendering speed and quality. However, a new approach called 4K4D has been introduced, which utilizes a 4D point cloud representation and a hybrid appearance model to achieve faster rendering…
A team of researchers from Jiaotong University, Peking University, and Microsoft have developed a method called LeMa that improves the mathematical reasoning abilities of large language models (LLMs) by teaching them to learn from mistakes. They fine-tune the LLMs using mistake-correction data pairs generated by GPT-4. LeMa consistently improves performance across various LLMs and tasks,…
In this research, a Gaussian Mixture Model (GMM) is proposed as the reverse transition operator in the Denoising Diffusion Implicit Models (DDIM) framework. By constraining the GMM parameters to match the first- and second-order central moments of the forward marginals, samples of equal or better quality than those of the original DDIM with Gaussian kernels can…
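As an illustrative sketch of the moment-matching idea (not the paper's construction), a two-component, equal-weight GMM in one dimension can match a target mean and variance: place the components at mu ± delta and shrink their variance by delta² to offset the spread between the means.

```python
import numpy as np

def moment_matched_gmm(mu, var, delta):
    """Two equal-weight Gaussian components at mu +/- delta whose mixture
    mean and variance equal the target (mu, var). The component variance
    is reduced by delta**2 to compensate for the spread between means."""
    comp_var = var - delta**2
    assert comp_var > 0, "delta too large for the target variance"
    means = np.array([mu + delta, mu - delta])
    return means, np.array([comp_var, comp_var])

means, cvars = moment_matched_gmm(mu=0.0, var=1.0, delta=0.5)
# Mixture moments under equal weights:
mix_mean = means.mean()
mix_var = (cvars + means**2).mean() - mix_mean**2
print(mix_mean, mix_var)  # matches the target (0.0, 1.0)
```

The same bookkeeping generalizes to more components and unequal weights, at the cost of solving a small system for the constrained parameters.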
Large Language Models (LLMs) with billions of parameters have revolutionized AI but are computationally intensive. This study argues for using ReLU activation in LLMs: it has a negligible impact on performance while reducing computation and weight transfer. Popular alternatives such as GELU and SiLU are more computationally demanding.
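The computational appeal hinges on ReLU producing exact zeros, which lets inference skip the corresponding weights, while smooth activations like GELU almost never output an exact zero. A small NumPy sketch illustrates the difference (the sample size and the tanh-based GELU approximation are illustrative choices):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Common tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)  # pre-activations, roughly zero-mean

relu_sparsity = np.mean(relu(x) == 0.0)  # roughly half the outputs are exact zeros
gelu_sparsity = np.mean(gelu(x) == 0.0)  # essentially no exact zeros

print(f"ReLU exact-zero fraction: {relu_sparsity:.2f}")
print(f"GELU exact-zero fraction: {gelu_sparsity:.2f}")
```

Exact zeros mean the multiplications against the next layer's weights can be skipped entirely, which is where the compute and weight-transfer savings come from.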
This work proposes a novel architecture to detect user-defined flexible keywords in real time. The approach constructs acoustic embeddings of keywords via grapheme-to-phone conversion, then maps phones to embeddings by looking them up in an embedding dictionary built during training. The key benefit is the incorporation of both text and audio embeddings.
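A toy sketch of the text-side pipeline described above (the g2p table, phone inventory, embedding values, and mean pooling are all hypothetical placeholders, not the paper's):

```python
# Toy grapheme-to-phone table and phone->embedding dictionary; in the paper
# the embedding dictionary is built during training.
G2P = {"hey": ["HH", "EY"], "siri": ["S", "IH", "R", "IY"]}
PHONE_EMB = {p: [float(i), float(i) % 2]
             for i, p in enumerate(["HH", "EY", "S", "IH", "R", "IY"])}

def keyword_embedding(keyword):
    """Text -> phones -> per-phone embeddings -> pooled keyword embedding."""
    phones = [p for word in keyword.lower().split() for p in G2P[word]]
    vecs = [PHONE_EMB[p] for p in phones]  # dictionary lookup per phone
    # Mean-pool the phone embeddings into one keyword embedding.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

emb = keyword_embedding("hey siri")
```

Because the embedding is built from text alone, a user-defined keyword can be enrolled without any audio recordings of it.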
Behavioral testing in NLP evaluates system capabilities by analyzing input-output behavior. However, current behavioral tests for machine translation are limited and manually created. To overcome this, the authors propose using Large Language Models (LLMs) to generate diverse source sentences that probe MT model behavior across scenarios, with a verification step to check that the models behave as expected.
Preserving training dynamics across batch sizes is important for practical machine learning. One tool for achieving this is scaling the learning rate linearly with the batch size. Another is model EMA, which maintains a functional copy of a target model whose parameters gradually move toward those of the target model using…
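A minimal, framework-free sketch of both tools (the base learning rate, base batch size, decay, and toy parameter lists are illustrative constants):

```python
BASE_LR, BASE_BATCH = 0.1, 256

def scaled_lr(batch_size):
    """Linear scaling rule: the learning rate grows in proportion
    to the batch size relative to a reference configuration."""
    return BASE_LR * batch_size / BASE_BATCH

def ema_update(ema_params, target_params, decay=0.999):
    """Move EMA parameters a small step toward the target's parameters."""
    return [decay * e + (1.0 - decay) * t
            for e, t in zip(ema_params, target_params)]

# Usage: a 4x larger batch gets a 4x larger learning rate.
lr = scaled_lr(1024)  # 0.4

# The EMA copy drifts toward the target model's parameters over many steps.
ema, target = [0.0, 0.0], [1.0, -1.0]
for _ in range(1000):
    ema = ema_update(ema, target)
```

Note that when the batch size (and hence the learning rate) changes, the effective EMA horizon in optimizer steps changes too, which is part of why these two tools interact.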
Recently, a paper on the use of audio-visual synchronization for learning audio-visual representations was accepted at the Machine Learning for Audio Workshop at NeurIPS 2023. The paper discusses the effectiveness of unsupervised training frameworks, particularly the Masked Audio-Video Learners (MAViL) framework, which combines contrastive learning with masked autoencoding.
This text introduces a new approach to agnostically learning Single-Index Models (SIMs) with arbitrary monotone and Lipschitz activations. Unlike previous methods, it does not rely on predetermined settings or knowledge of the activation function. Additionally, it only requires the marginal to have bounded second moments, instead of stronger distributional assumptions. The algorithm is based on…
Autoregressive models for text generation often produce repetitive and low-quality output because errors accumulate during generation. Exposure bias, the mismatch between training and inference conditions, is often blamed for this. Denoising diffusion models offer an alternative by allowing a model to revise its output, but they are computationally expensive and less fluent for longer text.
This text proposes an architecture capable of processing streaming audio using a vision-inspired keyword spotting framework. By extending a Conformer encoder with trainable binary gates, the approach improves detection and localization accuracy on continuous speech while maintaining a small memory footprint. The inclusion of gates also reduces the average amount of processing without affecting performance.
Researchers have created a program called DIRFA that generates realistic videos by combining audio and a face photo. The program uses artificial intelligence to create 3D videos that accurately show the person’s facial expressions and head movements.
YouTube is introducing new AI-powered features that allow users to compose music using the voices of popular artists and convert hummed melodies into songs. One feature, called “Dream Track,” allows users to generate songs in the styles of licensed artists, while another tool, “Music AI Tools,” supports musicians in their creative processes. These innovations are…
Microsoft has introduced its first custom AI chips, the Microsoft Azure Maia 100 AI Accelerator and the Microsoft Azure Cobalt 100 CPU. These chips are designed for AI and cloud computing applications and will be used in Microsoft’s data centers to power the Bing AI chatbot, Copilot, and Azure OpenAI. The goal is to meet the…
Data organisations often overlook the responsibilities of data consumers in data contracts. To maximize the value of data, data contracts should outline the consumer’s obligations in analyzing and applying the data. Neglecting consumer commitments can reduce the business impact of data contracts. Consumer commitments should go beyond compliance and focus on value creation. Structured approaches,…
This text discusses the semantics of slowly changing dimension type 2 (SCD2) techniques in dimensional modeling. It covers the importance of choosing appropriate reference dates and the impact of different row-versioning methods on access patterns. Three options for reference dates are discussed: extract timestamps, source system timestamps, and business timestamps. Additionally, the format of valid_to…
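A minimal Python sketch of one common SCD2 convention (an exclusive valid_to with a high-date sentinel; the table layout and helper names are illustrative, not from the article):

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # sentinel for the open-ended current row

def scd2_update(rows, key, new_attrs, change_ts):
    """Close the current version of `key` and append a new one (SCD2).
    valid_to is stored exclusive, so versions form half-open intervals
    [valid_from, valid_to) with no gaps or overlaps."""
    out = []
    for r in rows:
        if r["key"] == key and r["valid_to"] == HIGH_DATE:
            r = {**r, "valid_to": change_ts}  # close the current version
        out.append(r)
    out.append({"key": key, **new_attrs,
                "valid_from": change_ts, "valid_to": HIGH_DATE})
    return out

def as_of(rows, key, ts):
    """Point-in-time lookup over the half-open validity intervals."""
    return next(r for r in rows
                if r["key"] == key and r["valid_from"] <= ts < r["valid_to"])

rows = [{"key": "c1", "city": "Basel",
         "valid_from": datetime(2020, 1, 1), "valid_to": HIGH_DATE}]
rows = scd2_update(rows, "c1", {"city": "Zurich"}, datetime(2023, 6, 1))
```

The exclusive valid_to makes the as-of predicate a simple `<=`/`<` pair; an inclusive or NULL-terminated valid_to would change both the closing logic and every point-in-time query.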
Researchers at the Clinic of Radiology and Nuclear Medicine at University Hospital Basel have developed a deep learning model called TotalSegmentator that can automatically segment anatomical structures in CT images. The model has been trained on a large dataset and can accurately segment a wide range of organs with minimal user input. The researchers have…
OpenAI’s DevDay showcased a range of new features and capabilities; our article reviews the announcements and the opportunities they open up in artificial intelligence.
Grounding Large Multimodal Model (GLaMM) is introduced as a novel model for visually grounded conversations. GLaMM allows for natural language replies combined with object segmentation masks, providing improved user engagement. The researchers also introduce the Grounded Conversation Generation (GCG) task and the Grounding-anything Dataset (GranD) to aid in model training and evaluation.