MLabs AI is an independent AI research lab focused on state-of-the-art LLM research.
We operate as a high-throughput research engine. We continuously generate and test unconventional training approaches across multiple domains, failing fast and often, and refining those that expose unnecessary rigidity in current methods.
Our advantage is not a single method, but the rate at which we can explore the design space.
Through our process we have developed several new algorithms and mathematical constructions, including cases where parts of MLP training can be performed without gradient descent while improving accuracy and efficiency in specific regimes.
Our Focus & Expertise
Our research and development efforts are concentrated on key areas that drive the efficiency and capability of large language models.
Linguistic Factorization
Explicitly factoring the machine learning requirements for syntax, semantics, pragmatics and discourse.
One-Shot and Closed Form Training
Developing advanced methods for minimizing the loss function for the model or specific layers without the need for gradient descent.
Gradient Descent Optimization
Maximizing learning effectiveness through novel techniques for gradient signal and condition number optimization.
Architectural Efficiency
Innovating architectures to reduce actual or effective parameter counts while retaining or enhancing modelling capability.
Increased Prediction Accuracy
Designing more effective methods for token prediction by optimizing over an extended prediction horizon.
Dynamic Context Compression
Extending the effective context window while maximizing the information density and reducing computational load.
Past Innovations
One-Shot and Closed Form Training
Classifiers
One-shot construction of MLP classifier layers for continuous and discrete input data
Material efficiency gains (orders of magnitude) in constrained settings with improved accuracy
We intend to test it for constructing the MLP blocks of LLMs (fact retention)
Function Approximators
One-shot construction of MLP function approximator layers
Material efficiency gains (one order of magnitude) in constrained settings with improved accuracy
We intend to test it for constructing the MLP blocks of LLMs (fact retention)
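The details of our one-shot construction are not published here; purely as a rough illustration of the general idea behind closed-form layer fitting (no gradient descent), the sketch below draws random hidden weights and solves the readout weights by ridge-regularized least squares. All names and parameters are illustrative assumptions, not our algorithm.

    # Illustrative sketch only (not MLabs' construction): a one-shot MLP layer.
    # Hidden weights are random; the readout is solved in closed form.
    import numpy as np

    def fit_one_shot_mlp(X, Y, hidden_dim=256, ridge=1e-3, seed=0):
        # X: (n, d) inputs. Y: (n, k) targets, e.g. one-hot rows for a classifier
        # or real-valued rows for a function approximator.
        rng = np.random.default_rng(seed)
        W_in = rng.normal(scale=1.0 / np.sqrt(X.shape[1]), size=(X.shape[1], hidden_dim))
        b_in = rng.normal(size=hidden_dim)
        H = np.tanh(X @ W_in + b_in)                     # random nonlinear features
        # Closed-form readout: (H^T H + ridge I)^-1 H^T Y
        W_out = np.linalg.solve(H.T @ H + ridge * np.eye(hidden_dim), H.T @ Y)
        return W_in, b_in, W_out

    def predict(X, W_in, b_in, W_out):
        return np.tanh(X @ W_in + b_in) @ W_out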
Feature Encoders
Attempted to construct the MLP layers of feature encoders using techniques learned from building the classifier layers
Did not work because of instability in the clustering algorithm
Moved on to a more foundational mathematical solution for Auto Encoders
Auto Encoders
One-shot construction of MLP Auto Encoders
Does not work well on small problems, because of dimensionality scaling
We intend to test it to construct projection subspaces in the attention blocks of LLMs, as sparse layers for built-in explainability and post-training interpretability
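As a hedged point of reference rather than our actual construction, the simplest one-shot autoencoder is linear: its encoder and decoder fall out of a truncated SVD of the centered data in a single step.

    # Illustrative closed-form baseline (not MLabs' method): a linear autoencoder
    # built in one shot from a truncated SVD; no gradient descent involved.
    import numpy as np

    def one_shot_linear_autoencoder(X, latent_dim):
        mean = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        W_enc = Vt[:latent_dim].T      # (d, k) encoder
        W_dec = Vt[:latent_dim]        # (k, d) decoder with tied weights
        return mean, W_enc, W_dec

    def reconstruct(X, mean, W_enc, W_dec):
        return (X - mean) @ W_enc @ W_dec + mean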
Neural Monte-Carlo Tree Search
Differentiable Monte-Carlo Tree Search
Fully trainable tree search
Embedded into the neural network
Works very well on self-contained search problems; beats AlphaGo Zero at playing Go
We intend to test it for optimizing prediction sequences in LLMs
Large Language Models
Attention Blocks
Variations of tree attention and recursive attention blocks
Preliminary empirical signals suggest solid performance improvements in constrained settings
We intend to evaluate more radical variations on large scale language modelling tasks
State Space Models
We have tried incremental variations of Mamba
Early results suggest that further exploration is warranted
Low-Rank Latent Space
Exploiting the Johnson-Lindenstrauss lemma to develop compute-efficient block architectures
Able to revert to full-rank mid-training
Works on some smaller LLMs; we intend to test it on larger ones
Low-Rank Approximation by SVD
Able to speed up inference as well as training
Can switch from full-rank to low-rank (and back) in the middle of training
We intend to test it on LLMs
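A minimal sketch of the general mechanism (illustrative only): factor a trained weight matrix with a truncated SVD to run in low rank, and multiply the factors back together to resume full-rank training.

    # Hedged sketch: switching a dense layer between full-rank and low-rank forms.
    import numpy as np

    def to_low_rank(W, rank):
        # W: (d_out, d_in). Keep the top singular directions only.
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * S[:rank]     # (d_out, rank)
        B = Vt[:rank]                  # (rank, d_in); W is approximated by A @ B
        return A, B

    def to_full_rank(A, B):
        return A @ B                   # resume training with a dense matrix

    # Low-rank forward pass: y = (x @ B.T) @ A.T, which costs
    # O(rank * (d_in + d_out)) instead of O(d_in * d_out) per token.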
Factorizing Semantic Graphs
Novel approach to scale-efficient factorization of very large semantic graphs
Orders of magnitude faster than standard factorization techniques
Currently testing LLM-scale semantic graph factorization
Embedding
Construction of vector embeddings from structured linguistic knowledge
Demonstrates explicit subspaces, and improves explainability
Did not work as well as expected due to the curse of dimensionality
Repurposed to make the token embeddings into a compute-efficient DAG
Retrieval Augmented Generation
Stores the internal state of the LLM instead of text or embeddings to speed up inference
Unable to improve efficiency over standard techniques due to the inability to isolate compact state information
Further developing the approach to apply it to linguistic factorization techniques and context compression
Tokenization
We tried incremental variations around unigram and morpheme-boundary-aware tokenization
Unable to achieve any real improvements at scale
Rapid Development Language
Rapid development language for quick and efficient experimentation with different LLM innovations
Image Processing Models
CNNs
Discovery of new mathematical underpinnings for CNN kernels
Developed closed-form image tokenization algorithm
Worked well enough on a number of small problems
Unable to translate the theoretical innovations into experimental improvements at scale
Current and Future Innovations
Our innovation pipeline has 50+ research directions to explore. We're continuously reassessing which of them are worth pursuing; a selection is listed below.
Linguistic Factorization
Retained Fact Learning
Dense Layer Fact Editing
Fact updating with a second dense layer
Fact updating through MLP fine-tuning
Standalone Fact-Training
Using gradient descent
Using closed-form solutions
Hybrid Semantic Graph
Edge propagation fact retention
Context sensitive
Lookup Table Dense Layers
Associative fact lookup tables
Context free
Skip-Gram Conditional Memory
N-gram and skip-gram fact gating
Could deter hallucinations
Mixture of Memories
MoE architecture for dense layers
Gates the most relevant expert for the discourse
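A hedged sketch of how such a block could look, assuming a standard top-1-gated mixture of experts over the dense (MLP) layers; the class name, dimensions and gating rule are illustrative assumptions, not our design.

    import torch
    import torch.nn as nn

    class MixtureOfMemories(nn.Module):
        def __init__(self, d_model, d_hidden, n_experts):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                     # x: (batch, seq, d_model)
            scores = self.gate(x)                 # relevance of each expert
            top = scores.argmax(dim=-1)           # top-1 expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top == i
                if mask.any():
                    out[mask] = expert(x[mask])   # route tokens to their expert
            return out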
Syntax, Semantics and Pragmatics
Fact-Free Attention Models
Pure syntax and semantic models
Facts could be added separately
Semantic Priming Models
Priming with decaying priors
Simple and very compact long range context
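One possible realization (our assumption, not a settled design): keep a decaying activation over the vocabulary and add it to the logits, so recently primed concepts remain cheaply available over long ranges.

    import numpy as np

    class SemanticPrimer:
        # Decaying priors over the vocabulary; primed tokens get a fading logit bonus.
        def __init__(self, vocab_size, decay=0.9, strength=0.5):
            self.activation = np.zeros(vocab_size)
            self.decay, self.strength = decay, strength

        def observe(self, related_ids):
            self.activation *= self.decay         # earlier primes fade
            self.activation[related_ids] += 1.0   # newly primed (related) tokens

        def apply(self, logits):
            return logits + self.strength * self.activation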
N-Gram Syntax Models
Captures short-range phrasing
Could bypass expensive computation when possible
Context Sensitive Grammars
Captures more nuanced phrasing
Could leverage semantic priming
Constrained Linguistic Embedding
Relevance-constrained Embedding
Scales embedding vector magnitude by specificity
Provides context compression scores
Subspace-constrained Embedding
Shares embedding subspaces between tokens
Could promote rapid learning of relevant subspaces
Suffix Graph Embedding
Treats the token set as a directed acyclic graph
Shares embedding optimization along edges
Semantic Graph Embedding
Treats the concept space as an embedded graph
Would enable fast inclusion of new concepts
Linguistic Tokenization
Byte-Tuple Encoding Tokenization
Extends BPE to tuples
Maximal corpus compression
Information rich
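A hedged sketch of one way the tuple extension could work: each merge step may fuse the most frequent contiguous k-tuple rather than only a pair, with the tuple length and merge schedule left as open design choices.

    from collections import Counter

    def tuple_merges(sequences, n_merges=10, max_k=3):
        # sequences: lists of symbols (bytes/characters as strings)
        merges = []
        for _ in range(n_merges):
            counts = Counter()
            for seq in sequences:
                for k in range(2, max_k + 1):
                    for i in range(len(seq) - k + 1):
                        counts[tuple(seq[i:i + k])] += 1
            if not counts:
                break
            best = max(counts, key=counts.get)    # most frequent contiguous tuple
            merges.append(best)
            fused = "".join(best)
            new_sequences = []
            for seq in sequences:
                out, i = [], 0
                while i < len(seq):
                    if tuple(seq[i:i + len(best)]) == best:
                        out.append(fused)
                        i += len(best)
                    else:
                        out.append(seq[i])
                        i += 1
                new_sequences.append(out)
            sequences = new_sequences
        return merges, sequences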
Morphological Tokenization
Bootstraps tuples using morphemes
Promotes linguistically meaningful tokenization
Phrase-level Tokenization
Word-level N-grams treated as single segments
Moves common phrasing from attention blocks to tokenization
Gradient Descent Optimization
Corpus Boosting
Synonym Boosting
Strengthens gradient signal by introducing phantom targets
Better quality gradient information from smaller training corpus
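A minimal sketch of the phantom-target idea, assuming the phantom targets are synonyms of the gold token: a small amount of target mass is spread over the synonyms so a single example yields gradient signal for several plausible continuations.

    import torch
    import torch.nn.functional as F

    def synonym_boosted_loss(logits, target_id, synonym_ids, phantom_mass=0.1):
        # logits: (vocab,) for one position; synonym_ids excludes the target itself.
        soft = torch.zeros_like(logits)
        soft[target_id] = 1.0 - phantom_mass
        if synonym_ids:
            soft[torch.tensor(synonym_ids)] = phantom_mass / len(synonym_ids)
        return -(soft * F.log_softmax(logits, dim=-1)).sum()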
Inflection Boosting
Gradient sharing along known linguistic subspaces
One backpropagation generates multiple gradients
Subspaces could be learned more effectively
Embedding Boosting
Strengthens gradient signal by treating the target as a sample from a distribution in embedding space
Better quality gradient information from smaller training corpus
Topographic Boosting
Connects embeddings which are adjacent along identified subspaces
Gradient is shared along edges
Condition Number Management
Parametric Nonlinearities
Parameterizes the degree of nonlinearity in activation functions
Matches linearity to stage of learning
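A minimal sketch of one such parameterization (illustrative, not our specific functional form): a single parameter blends a linear pass-through with a GELU, so a layer can start nearly linear, and hence better conditioned, and become more nonlinear as training progresses.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BlendedGELU(nn.Module):
        def __init__(self, initial_linearity=3.0):
            super().__init__()
            self.linearity = nn.Parameter(torch.tensor(initial_linearity))

        def forward(self, x):
            a = torch.sigmoid(self.linearity)     # blend weight in (0, 1)
            return a * x + (1.0 - a) * F.gelu(x)  # a near 1 means nearly linear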
Localization of Hyperplanes
Reduces sensitivity of hyperplanes by limiting the range of their effect
Matches range of effect to stage of learning
Layer-wise Loss Functions
Introduces additional components of the loss function to partially linearize the search space
Bilinear Activation Functions
Reduces model complexity while retaining nonlinearity
Could offer improvement in training efficiency
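A hedged sketch in the spirit of the bilinear GLU variant: the elementwise product of two linear projections supplies the only nonlinearity in the block.

    import torch.nn as nn

    class BilinearFFN(nn.Module):
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.w = nn.Linear(d_model, d_hidden, bias=False)
            self.v = nn.Linear(d_model, d_hidden, bias=False)
            self.out = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x):
            # the product term is the only source of nonlinearity
            return self.out(self.w(x) * self.v(x))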
Training Corpus Linearization
Similar to curriculum learning
Schedules training data so that linear relationships are learned first
Architectural Efficiency
Low Rank Approximations and Parameter Sharing
Random Projection Parameter Sharing
Projects dense layers into low rank subspaces and back out again
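A minimal sketch of the idea, with all names and shapes as assumptions: fixed random projections, which can be shared across layers, map into and out of a small subspace in the Johnson-Lindenstrauss spirit, and only the small core matrix is trained.

    import torch
    import torch.nn as nn

    class RandomProjectionLinear(nn.Module):
        def __init__(self, d_in, d_out, d_sub, proj_in=None, proj_out=None):
            super().__init__()
            # Fixed (untrained) random projections; pass the same tensors to
            # several layers to share parameters across them.
            if proj_in is None:
                proj_in = nn.Parameter(torch.randn(d_in, d_sub) / d_sub ** 0.5,
                                       requires_grad=False)
            if proj_out is None:
                proj_out = nn.Parameter(torch.randn(d_sub, d_out) / d_sub ** 0.5,
                                        requires_grad=False)
            self.proj_in, self.proj_out = proj_in, proj_out
            # The only trained weights live in the small d_sub x d_sub core.
            self.core = nn.Parameter(torch.randn(d_sub, d_sub) / d_sub ** 0.5)

        def forward(self, x):
            return x @ self.proj_in @ self.core @ self.proj_out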