MLabs AI
MLabs AI is an independent AI research lab focused on state-of-the-art LLM research.
We operate as a high-throughput research engine. We continuously generate and test unconventional training approaches across multiple domains, failing fast and often, and refining those that expose unnecessary rigidity in current methods.
Our advantage is not a single method, but the rate at which we can explore the design space.
Through our process we have developed several new algorithms and mathematical constructions, including cases where parts of MLP training can be performed without gradient descent while improving accuracy and efficiency in specific regimes.
Our Focus & Expertise
Our research and development efforts are concentrated on key areas that drive the efficiency and capability of large language models.
Linguistic Factorization
Explicitly factoring the machine learning requirements for syntax, semantics, pragmatics and discourse.
One-Shot and Closed Form Training
Developing advanced methods for minimizing the loss function for the model or specific layers without the need for gradient descent.
Gradient Descent Optimization
Maximizing learning effectiveness through novel techniques for gradient signal and condition number optimization.
Architectural Efficiency
Innovating architectures to reduce actual or effective parameter counts while retaining or enhancing modelling capability.
Increased Prediction Accuracy
Designing more effective methods for token prediction by optimizing over an extended prediction horizon.
Dynamic Context Compression
Extending the effective context window while maximizing the information density and reducing computational load.
Past Innovations
One-Shot and Closed Form Training
Classifiers
  • One-shot construction of MLP classifier layers for continuous and discrete input data
  • Material efficiency gains (orders of magnitude) in constrained settings with improved accuracy
  • We intend to test it for constructing the MLP blocks of LLMs (fact retention)
Function Approximators
  • One-shot construction of MLP function approximator layers
  • Material efficiency gains (one order of magnitude) in constrained settings with improved accuracy
  • We intend to test it for constructing the MLP blocks of LLMs (fact retention); see the sketch below
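The closed-form constructions above are MLabs' own and are not detailed here. As a generic, hedged illustration of fitting an MLP layer without gradient descent, the sketch below fixes a random hidden layer and solves the readout weights by ridge regression (an extreme-learning-machine-style baseline, not the method referenced above; all names and sizes are placeholders).

    # Illustrative only: fit an MLP readout layer in closed form.
    # A random hidden layer is fixed and the output weights are obtained
    # by ridge regression rather than gradient descent.
    import numpy as np

    def fit_readout_closed_form(X, Y, hidden_dim=256, ridge=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        W_in = rng.normal(scale=1.0 / np.sqrt(X.shape[1]), size=(X.shape[1], hidden_dim))
        b_in = rng.normal(size=hidden_dim)
        H = np.tanh(X @ W_in + b_in)                     # fixed random features
        A = H.T @ H + ridge * np.eye(hidden_dim)         # (H^T H + ridge * I)
        W_out = np.linalg.solve(A, H.T @ Y)              # closed-form least squares
        return W_in, b_in, W_out

    def predict(X, W_in, b_in, W_out):
        return np.tanh(X @ W_in + b_in) @ W_out

    # Toy regression check
    X = np.random.default_rng(1).uniform(-1, 1, size=(500, 2))
    Y = np.sin(3 * X[:, :1]) + X[:, 1:]
    params = fit_readout_closed_form(X, Y)
    print(np.mean((predict(X, *params) - Y) ** 2))       # small training error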
Feature Encoders
  • Tried to construct the MLP layers of feature encoders using techniques we learned from building the classifier layers
  • Did not work due to instability of the clustering algorithm
  • Moved on to a more foundational mathematical solution for Auto Encoders
Auto Encoders
  • One-shot construction of MLP Auto Encoders
  • Does not work well on small problems because of dimensionality scaling
  • We intend to test it to construct projection subspaces in the attention blocks of LLMs, as sparse layers for built-in explainability and post-training interpretability
Neural Monte-Carlo Tree Search
Differentiable Monte-Carlo Tree Search
  • Fully trainable tree search
  • Embedded into the neural network
  • Works very well on self-contained search problems; beats AlphaGo Zero at playing Go
  • We intend to test it for optimizing prediction sequences in LLMs
Large Language Models
Attention Blocks
  • Variations of tree attention and recursive attention blocks
  • Preliminary empirical signals suggest solid performance improvements in constrained settings
  • We intend to evaluate more radical variations on large scale language modelling tasks
State Space Models
  • Incremental variations of Mamba have been tried
  • Early results suggest that further exploration is warranted
Low-Rank Latent Space
  • Exploiting Johnson-Lindenstrauss lemma to develop compute-efficient block architectures
  • Able to revert to full rank mid-training
  • Works on some smaller LLMs; we intend to test it on larger models (see the sketch below)
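The Johnson-Lindenstrauss construction itself is not published here. The sketch below only illustrates the general pattern under stated assumptions: train a dense layer through a fixed random projection, then materialize the product to revert to a full-rank matrix mid-training. All dimensions and names are placeholders.

    # Low-rank dense layer through a fixed random (JL-style) projection,
    # with the option to revert to a full-rank matrix by materializing the
    # product mid-training. Illustrative sketch only.
    import numpy as np

    d_in, d_out, r = 512, 512, 32
    rng = np.random.default_rng(0)
    P = rng.normal(scale=1.0 / np.sqrt(r), size=(d_in, r))   # fixed random projection
    W_low = rng.normal(scale=0.02, size=(r, d_out))          # trainable low-rank core

    def forward_low_rank(x):
        return (x @ P) @ W_low               # only r * d_out weights are trained

    def revert_to_full_rank():
        # Materialize the same linear map as a dense matrix and continue
        # training all d_in * d_out entries from this initialization.
        return P @ W_low

    x = rng.normal(size=(4, d_in))
    W_full = revert_to_full_rank()
    print(np.allclose(forward_low_rank(x), x @ W_full))      # True: same function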
Low-Rank Approximation by SVD
  • Able to speed up inference as well as training
  • Can switch from full-rank to low-rank (and back) in the middle of training
  • We intend to test it on LLMs (see the sketch below)
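Switching a trained dense layer between full-rank and low-rank forms via a truncated SVD is a standard construction; a small sketch follows. The synthetic weight matrix and the rank are placeholders, not MLabs' settings.

    # Switch a dense layer between full-rank and low-rank forms with a
    # truncated SVD. Illustrative sketch only.
    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 768, 64
    W = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / np.sqrt(d)   # nearly low-rank
    W += 0.01 * rng.normal(size=(d, d))                                  # plus full-rank noise

    def to_low_rank(W, rank):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * s[:rank]          # (d_in, rank)
        B = Vt[:rank, :]                    # (rank, d_out)
        return A, B                         # x @ W  ~=  (x @ A) @ B

    def to_full_rank(A, B):
        return A @ B                        # densify and resume full-rank training

    A, B = to_low_rank(W, rank=r)
    x = rng.normal(size=(8, d))
    print(d * d, d * r + r * d)             # multiply-adds per row: full vs. low rank
    print(np.linalg.norm(x @ W - (x @ A) @ B) / np.linalg.norm(x @ W))   # small error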
Factorizing Semantic Graphs
  • Novel approach to scale-efficient factorization of very large semantic graphs
  • Orders of magnitude faster than standard factorization techniques
  • Currently testing LLM-scale semantic graph factorization
Embedding
  • Construction of vector embeddings from structured linguistic knowledge
  • Demonstrates explicit subspaces and improves explainability
  • Didn't work as well as expected due to the curse of dimensionality
  • Repurposed to make the token embeddings into a compute-efficient DAG
Retrieval Augmented Generation
  • Stores the internal state of the LLM instead of text/embeddings to speed up inference
  • Unable to improve efficiency over standard techniques due to the inability to isolate compact state information
  • We are further developing the approach to apply it to linguistic factorization techniques and context compression
Tokenization
  • We tried incremental variations on unigram and morpheme-boundary-aware tokenization
  • Unable to achieve meaningful improvements at scale
Rapid Development Language
  • A rapid development language for quick and efficient experimentation with different LLM innovations
Image Processing Models
CNNs
  • Discovery of new mathematical underpinnings for CNN kernels
  • Developed closed-form image tokenization algorithm
  • Worked well enough on a number of small problems
  • Unable to translate the theoretical innovations into experimental improvements at scale
Current and Future Innovations
Our innovation pipeline holds 50+ research directions to explore. We’re continuously reassessing which are worth pursuing; a selection is listed below.
Linguistic Factorization
Retained Fact Learning
Dense Layer Fact Editing
  • Fact updating with a second dense layer
  • Fact updating through MLP fine-tuning
Standalone Fact-Training
  • Using gradient descent
  • Using closed-form solutions
Hybrid Semantic Graph
  • Edge propagation fact retention
  • Context sensitive
Lookup Table Dense Layers
  • Associative fact lookup tables
  • Context free
Skip-Gram Conditional Memory
  • N-gram and skip-gram fact gating
  • Could deter hallucinations
Mixture of Memories
  • MOE architecture for dense layers
  • Gates the most relevant expert for the discourse
Syntax, Semantics and Pragmatics
Fact-Free Attention Models
  • Pure syntax and semantic models
  • Facts could be added separately
Semantic Priming Models
  • Priming with decaying priors
  • Simple and very compact long range context
N-Gram Syntax Models
  • Captures short-range phrasing
  • Could bypass expensive computation when possible
Context Sensitive Grammars
  • Captures more nuanced phrasing
  • Could leverage semantic priming
Constrained Linguistic Embedding
Relevance-constrained Embedding
  • Scales embedding vector magnitude by specificity
  • Provides context compression scores
Subspace-constrained Embedding
  • Shares embedding subspaces between tokens
  • Could promote rapid learning of relevant subspaces
Suffix Graph Embedding
  • Treats the token set as a directed acyclic graph
  • Shares embedding optimization along edges
Semantic Graph Embedding
  • Treats the concept space as an embedded graph
  • Would enable fast inclusion of new concepts
Linguistic Tokenization
Byte-Tuple Encoding Tokenization
  • Extends byte-pair encoding (BPE) to tuples; the standard BPE merge loop is sketched below
  • Maximal corpus compression
  • Information rich
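The byte-tuple extension itself is not specified here. For grounding, the sketch below shows the standard byte-pair-encoding merge loop that it reportedly extends, run on the classic toy corpus.

    # Standard byte-pair-encoding merge loop on a toy corpus, shown only to
    # ground "extends BPE to tuples"; the tuple extension is not specified here.
    from collections import Counter

    def get_pair_counts(vocab):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        new_vocab = {}
        for word, freq in vocab.items():
            symbols, out, i = word.split(), [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[" ".join(out)] = new_vocab.get(" ".join(out), 0) + freq
        return new_vocab

    # Words are space-separated symbol sequences with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(6):
        pairs = get_pair_counts(vocab)
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        print(best)                           # merges learned, in order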
Morphological Tokenization
  • Bootstraps tuples using morphemes
  • Promotes linguistically meaningful tokenization
Phrase-level Tokenization
  • Word-level N-grams treated as single segments
  • Moves common phrasing from attention blocks to tokenization
Gradient Descent Optimization
Corpus Boosting
Synonym Boosting
  • Strengthens the gradient signal by introducing phantom targets
  • Higher-quality gradient information from a smaller training corpus (one possible reading is sketched below)
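The exact form of these phantom targets is not described above. One plausible reading, shown purely as an assumption, is a soft label that spreads a small amount of target mass over synonyms of the observed token, so each example back-propagates signal to several related outputs.

    # One *assumed* reading of "phantom targets": a soft label that places
    # most mass on the observed token and a little on its synonyms, so one
    # example provides gradient signal for several related outputs.
    import numpy as np

    def soft_synonym_targets(target_id, synonym_ids, vocab_size, synonym_mass=0.1):
        t = np.zeros(vocab_size)
        t[target_id] = 1.0 - synonym_mass
        if synonym_ids:
            t[synonym_ids] = synonym_mass / len(synonym_ids)
        return t

    def cross_entropy(logits, soft_target):
        log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax
        return -np.sum(soft_target * log_probs)

    vocab_size = 10
    logits = np.random.default_rng(0).normal(size=vocab_size)
    hard = soft_synonym_targets(3, [], vocab_size, synonym_mass=0.0)
    soft = soft_synonym_targets(3, [5, 7], vocab_size)        # 5, 7: assumed synonym ids
    print(cross_entropy(logits, hard), cross_entropy(logits, soft))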
Inflection Boosting
  • Gradient sharing along known linguistic subspaces
  • One backpropagation generates multiple gradients
  • Subspaces could be learned more effectively
Embedding Boosting
  • Strengthens the gradient signal by treating the target as a sample from a distribution in embedding space
  • Higher-quality gradient information from a smaller training corpus
Topographic Boosting
  • Connects embeddings which are adjacent along identified subspaces
  • Gradient is shared along edges
Condition Number Management
Parametric Nonlinearities
  • Parameterizes the degree of nonlinearity in activation functions
  • Matches linearity to stage of learning
Localization of Hyperplanes
  • Reduces sensitivity of hyperplanes by limiting the range of their effect
  • Matches range of effect to stage of learning
Layer-wise Loss Functions
  • Introduces additional components of the loss function to partially linearize the search space
Bilinear Activation Functions
  • Reduces model complexity while retaining nonlinearity
  • Could offer improvement in training efficiency
Training Corpus Linearization
  • Similar to curriculum learning
  • Schedules training data so that linear relationships are learned first
Architectural Efficiency
Low Rank Approximations and Parameter Sharing
Random Projection Parameter Sharing
  • Projects dense layers into low rank subspaces and back out again
  • Reduces number of trainable parameters
Progressive Projection Parameter Sharing
  • Uses aggressively low-dimensional subspaces initially
  • Adds dimensions while training
Deep Random Projection Parameter Sharing
  • Multilayer version of other low rank techniques
  • Offers richer nonlinearity of compression and expansion of dimensions
Nystrom Parameter Sharing
  • Retains sensitive portions of the weight matrices exactly
  • Approximates non-sensitive weights in a low-rank subspace (see the sketch below)
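As a generic illustration of keeping selected rows and columns of a weight matrix exact while approximating the remainder through a low-rank intersection block, here is a CUR/Nystrom-style sketch. The norm-based choice of "sensitive" indices is a placeholder, not MLabs' criterion.

    # CUR / Nystrom-style sketch: keep a subset of rows and columns of W
    # exactly and reconstruct the rest through their intersection block.
    # Index selection by norm is a placeholder, not MLabs' criterion.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256)) * np.linspace(2.0, 0.01, 256)   # decaying column scales

    k = 64
    col_idx = np.argsort(-np.linalg.norm(W, axis=0))[:k]    # "sensitive" columns
    row_idx = np.argsort(-np.linalg.norm(W, axis=1))[:k]    # "sensitive" rows

    C = W[:, col_idx]                  # kept exactly
    R = W[row_idx, :]                  # kept exactly
    U = W[np.ix_(row_idx, col_idx)]    # intersection block, shape (k, k)

    W_approx = C @ np.linalg.pinv(U) @ R
    print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))   # relative error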
Model Growing and Shrinking
  • Grows layer size and model depth when training stalls
  • Shrinks back to minimize the final parameter count (a function-preserving growth step is sketched below)
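The growth and shrink criteria are not given above. As a generic sketch of the growing step, the Net2Net-style construction below duplicates a hidden unit and halves its outgoing weights, so the widened layer computes exactly the same function before training continues; this is an illustration, not MLabs' method.

    # Function-preserving width growth (Net2Net-style), shown as a generic
    # sketch: duplicate a hidden unit and halve its outgoing weights so the
    # wider network computes the same function before training continues.
    import numpy as np

    def widen(W1, b1, W2, unit):
        W1 = np.concatenate([W1, W1[:, unit:unit + 1]], axis=1)   # copy incoming weights
        b1 = np.concatenate([b1, b1[unit:unit + 1]])
        W2 = W2.copy()
        W2[unit, :] *= 0.5                                         # split outgoing weights
        W2 = np.concatenate([W2, W2[unit:unit + 1, :]], axis=0)    # between old and new unit
        return W1, b1, W2

    rng = np.random.default_rng(0)
    W1, b1, W2 = rng.normal(size=(8, 16)), rng.normal(size=16), rng.normal(size=(16, 4))
    x = rng.normal(size=(5, 8))
    y_before = np.maximum(x @ W1 + b1, 0.0) @ W2
    W1g, b1g, W2g = widen(W1, b1, W2, unit=3)
    y_after = np.maximum(x @ W1g + b1g, 0.0) @ W2g
    print(np.allclose(y_before, y_after))                          # True: same function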
Parameter Block Diagonalization
  • Identifies parallel streams of information processing
  • Decomposes dense layers
  • Reduces computational load from unrelated inputs
Factorized Models of Language
Recursive Shared Attention Model
  • Shares weights between stages of attention
  • Shared in sequential blocks
Low Rank Projection Gating
  • Mixture of projections with compute efficient gating
  • Context dependent
Parallel Ladder Attention
  • Factorizes short- and long-range context and processes them in parallel
  • Computationally efficient
Sequential Ladder Attention
  • Factorizes short- and long-range context and processes them sequentially
  • Computationally efficient
Recursive Parallel Ladder Model
  • Recursive version of Parallel Ladder Attention
  • Further reduces trainable parameter count
Bootstrap Recursive Attention Model
  • Enforces soft weight sharing between attention layers
  • Relaxes constraint later in training
State Space Models
Pre-trained Latent Space Attention
  • Uses Mamba-like SSM to prepare relevant context for attention modules
  • Trained as a standalone LLM
Co-trained Latent Space Attention
  • Uses Mamba-like SSM to prepare relevant context for attention modules
  • Trained co-operatively with the attention modules
Increased Prediction Accuracy
Improved Prediction Search
Multi-token Autoregression
  • Predicts a sequence of tokens
  • Could enhance LLM output efficiency and coherence (see the sketch below)
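The specific multi-token scheme is not described above. A minimal, assumption-labelled sketch: attach k independent output heads to the same hidden state so the model proposes the next k tokens in a single forward pass. All shapes and names are placeholders.

    # Minimal multi-token prediction sketch (assumed form): k independent
    # output heads read the same hidden state and propose the next k tokens
    # in one forward pass. All shapes are placeholders.
    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    d_model, vocab, k = 64, 1000, 4
    heads = [rng.normal(scale=0.02, size=(d_model, vocab)) for _ in range(k)]

    def predict_next_k(hidden_state):
        # One distribution per future position t+1 ... t+k.
        return [softmax(hidden_state @ H) for H in heads]

    h = rng.normal(size=d_model)
    proposals = [int(np.argmax(p)) for p in predict_next_k(h)]
    print(proposals)                       # greedy guess for the next k tokens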
Auxiliary Future State Prediction Model
  • Designed to anticipate the future context trajectory in order to provide prediction foresight
  • Measures sensitivity
Propagation of the Prediction Distribution
  • A form of soft decoding
  • Autoregression has access to auxiliary information
Twin Model Speculative Decoding
  • Uses a simple model to speculate many futures
  • A more complex model selects the best and longest overall prediction sequence (a toy loop is sketched below)
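A toy version of the draft-and-verify loop behind twin-model speculative decoding is sketched below. The two stand-in functions replace the real draft and target networks, and the acceptance rule here is simple greedy prefix agreement; production systems typically use a probabilistic acceptance test.

    # Toy draft-and-verify loop for twin-model speculative decoding. The two
    # stand-in functions replace the real draft and target networks; acceptance
    # here is greedy prefix agreement rather than probabilistic acceptance.
    def draft_model(seq):
        return (seq[-1] + 1) % 50                          # cheap, sometimes-wrong guesser

    def target_model(seq):
        return (seq[-1] + 1) % 50 if seq[-1] % 7 else 0    # the model we trust

    def speculative_step(seq, k=4):
        draft = []
        for _ in range(k):                                 # draft proposes k tokens
            draft.append(draft_model(seq + draft))
        accepted = []
        for tok in draft:                                  # target verifies the block
            if target_model(seq + accepted) == tok:
                accepted.append(tok)                       # agree: keep the cheap token
            else:
                accepted.append(target_model(seq + accepted))   # disagree: use target's token
                break
        return seq + accepted

    seq = [3]
    for _ in range(5):
        seq = speculative_step(seq)
    print(seq)    # several draft tokens are accepted per step when the models agree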
Look-ahead Decoding Model
  • Uses non-causal information while training
  • Could speed up convergence
  • Uses surrogate non-causal information for inference
Neural Monte-Carlo Tree Search
  • Treats the sequence prediction problem as a tree search
  • Leverages our work on neural search
Dynamic Context Compression
Context Filtering
Statistical Relevance Weighting
  • Uses a per-token score based on the occurrence pattern of tokens in the training corpus
  • Weights more specific tokens more highly than generic tokens (an IDF-style stand-in is sketched below)
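The precise occurrence-pattern score is not given above. A standard stand-in, shown as an assumption, is an inverse-document-frequency-style weight that rates tokens concentrated in few documents as more specific than tokens that appear everywhere.

    # Assumed stand-in for an occurrence-pattern relevance score: an IDF-style
    # weight that rates tokens appearing in few documents as more specific than
    # tokens appearing everywhere.
    import math
    from collections import Counter

    docs = [
        "the model predicts the next token",
        "the quarterly revenue of the company grew",
        "token embeddings compress the context",
    ]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))

    n_docs = len(docs)
    relevance = {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}
    for tok, score in sorted(relevance.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{tok:12s}{score:.3f}")            # specific tokens score highest
    print(f"{'the':12s}{relevance['the']:.3f}")   # generic token scores 0.0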
Learnable Relevance Weighting
  • Uses a trainable neural scoring layer to weight the tokens
  • Weights depend on how specific or generic each token is in the current discourse
Fixed Weighted Compression
  • Decimates the context into shingles
  • Compresses each shingle according to relevance
  • Performs attention on compressed shingles rather than raw tokens
Variable Weighted Compression
  • Segments the context into carts of consistent information content
  • Fills the carts with the most relevant information from the segment
  • Performs attention on carts rather than raw tokens
Latent Space Context
Pre-trained State Space Context Models
  • Uses a pre-trained SSM to compress the context
  • SSM is trained to store information most useful to its own predictive accuracy
  • State space is used as context for the LLM
Co-trained State Space Context Models
  • Trains the SSM to compress the context in parallel with the LLM using the context
  • SSM learns to store the information most useful to the LLM's predictive accuracy
For Research Partnerships
Research Access
Gain real-time access to our current and past research, innovations and experiments, including valuable insights from negative results.
First Rights
First right to acquire or license new innovations before they reach the wider market.
Research Prioritisation
Influence our research roadmap to align with your strategic goals, ensuring our work addresses your most pressing challenges.
Monthly Updates
A monthly update call on our progress, discoveries and results.
Bespoke Insights & Feedback
Receive tailored research insights and direct feedback on your specific areas of interest from our leading AI scientists.
To discuss research partnerships, please reach out to info@mlabsai.com.