MLabs AI
MLabs AI is an independent AI Research Lab with a focus on state-of-the-art LLM research.
Have a look below to see what we are working on.
Our Focus & Expertise
Our research and development efforts are concentrated on key areas that drive the efficiency and capability of large language models.
Linguistic Factorization
Explicitly factoring the machine learning requirements for syntax, semantics, pragmatics and discourse.
One-Shot and Closed Form Training
Developing advanced methods for minimizing the loss of the full model, or of specific layers, without the need for gradient descent.
Gradient Descent Optimization
Maximizing learning effectiveness through novel techniques for gradient signal and condition number optimization.
Architectural Efficiency
Innovating architectures to reduce actual or effective parameter counts while retaining or improving modelling capability.
Increased Prediction Accuracy
Designing more effective methods for token prediction by optimizing over an extended prediction horizon.
Dynamic Context Compression
Extending the effective context window while maximizing the information density and reducing computational load.
Past Innovations
One-Shot and Closed Form Training
Classifiers
  • one-shot construction of MLP classifier layers for continuous and discrete input data
  • tests on classifier benchmarks indicate training is over 100x more efficient, with improved accuracy
  • can be used to construct the MLP blocks of LLMs (fact retention); see the sketch below
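For illustration only, the sketch below shows the general flavour of closed-form layer fitting: the output weights of a small MLP classifier are obtained by ridge regression over fixed hidden features, with no gradient descent. The random hidden layer, dimensions and regularization are assumptions for the example, not our actual construction.

    # Illustrative closed-form fit of an MLP classifier's output layer.
    import numpy as np

    def fit_classifier_closed_form(X, y, hidden_dim=256, ridge=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        n_classes = int(y.max()) + 1

        # Fixed random hidden layer: a stand-in for a one-shot hidden construction.
        W1 = rng.standard_normal((d, hidden_dim)) / np.sqrt(d)
        b1 = rng.standard_normal(hidden_dim)
        H = np.tanh(X @ W1 + b1)

        # One-hot targets; the output weights come from a closed-form ridge solution.
        Y = np.eye(n_classes)[y]
        W2 = np.linalg.solve(H.T @ H + ridge * np.eye(hidden_dim), H.T @ Y)

        def predict(X_new):
            return np.argmax(np.tanh(X_new @ W1 + b1) @ W2, axis=1)

        return predict

    # Usage (hypothetical arrays):
    # predict = fit_classifier_closed_form(X_train, y_train)
    # accuracy = (predict(X_test) == y_test).mean()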
Function Approximators
  • one-shot construction of MLP function approximator layers
  • tests on function approximator benchmarks indicate training is over 10x more efficient, with improved accuracy
  • TBD: how to apply this to LLMs
Auto Encoders
  • one-shot construction of MLP Auto Encoders
  • potentially works well on large problems; TBD
  • can be used to construct projection subspaces in the attention blocks of LLMs
Neural Monte-Carlo Tree Search
Differentiable Monte-Carlo Tree Search
  • Embedded into the neural network
  • Fully trainable tree search
  • Works very well, beating AlphaGo Zero at Go
  • Can be used to optimize prediction sequences in LLMs
Large Language Models
Attention Blocks
  • Variations of tree attention
  • Variations of recursive attention blocks
  • Results show a 2x speed improvement with a slight accuracy trade-off
State Space Models
  • Variations of Mamba
Low-Rank Latent Space
  • Exploiting the Johnson-Lindenstrauss lemma to develop compute-efficient block architectures
  • Able to revert to full rank mid-training
Low-Rank Approximation by SVD
  • Able to speed up inference as well as training
  • Can switch from full-rank to low-rank (and back) in the middle of training
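A minimal sketch of the SVD route, with arbitrary shapes and rank: the weight matrix of a dense layer is truncated to rank r for cheaper training and inference, and the factors are multiplied back out whenever full-rank training should resume.

    # Illustrative low-rank approximation of a dense layer by truncated SVD.
    import numpy as np

    def to_low_rank(W, r):
        """Truncated SVD: W (out x in) -> factors A (out x r), B (r x in)."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return U[:, :r] * s[:r], Vt[:r, :]

    def to_full_rank(A, B):
        """Multiply the factors back out to resume full-rank training."""
        return A @ B

    W = np.random.randn(1024, 1024)
    A, B = to_low_rank(W, r=64)                  # ~8x fewer parameters at rank 64
    y = np.random.randn(8, 1024) @ B.T @ A.T     # low-rank forward pass
    W_full = to_full_rank(A, B)                  # switch back to full rank mid-training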
Factorizing Semantic Graphs
  • Novel approach to scale-efficient factorization of very large semantic graphs
  • Orders of magnitude faster than standard factorization techniques
  • Able to make LLM-scale semantic graph factorization feasible
Embedding
  • Construction of vector embeddings from structured linguistic knowledge
  • Demonstrates explicit subspaces and improves explainability
  • Repurposed to make the token embeddings into a compute-efficient DAG
Retrieval Augmented Generation
  • Stores the internal state of the LLM instead of text/embeddings to speed up inference (sketched below)
  • Being further developed to work with our linguistic factorization and context compression techniques
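A minimal sketch of the underlying idea, assuming toy projection matrices in place of a real attention block: each retrieved document's keys and values are cached once offline, then concatenated with the live query's keys and values at inference time, so the document text is never re-encoded.

    # Illustrative state caching for retrieval: store keys/values, not text.
    import numpy as np

    d = 64
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    def cache_document(doc_embeddings):
        """Precompute and store a document's keys/values once, offline."""
        return doc_embeddings @ Wk, doc_embeddings @ Wv

    def attend_with_cache(query_embeddings, cached):
        """Attend over cached document state plus the live query tokens."""
        K = np.concatenate([k for k, _ in cached] + [query_embeddings @ Wk])
        V = np.concatenate([v for _, v in cached] + [query_embeddings @ Wv])
        Q = query_embeddings @ Wq
        scores = Q @ K.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    docs = [rng.standard_normal((200, d)) for _ in range(3)]  # toy "documents"
    cache = [cache_document(e) for e in docs]                 # built offline
    query = rng.standard_normal((16, d))
    out = attend_with_cache(query, cache)                     # (16, d)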
Tokenization
  • Variations around unigram tokenization
  • Variations around morpheme-boundary aware tokenization
Rapid Development Language
  • Rapid development language for quick and efficient experimentation with different LLM innovations
Image Processing Models
CNNs
  • Discovery of new mathematical underpinnings for CNN kernels
  • Developed closed-form image tokenization algorithm
Current and Future Innovations
Our innovation pipeline has 50+ research directions to explore, all triangulated on our Innovation Matrix and prioritized accordingly.
Linguistic Factorization
Retained Fact Learning
Dense Layer Fact Editing
  • Fact updating with a second dense layer
  • Fact updating through MLP fine-tuning
Standalone Fact-Training
  • Using gradient descent
  • Using closed-form solutions
Hybrid Semantic Graph
  • Edge propagation fact retention
  • Context sensitive
Lookup Table Dense Layers
  • Associative fact lookup tables
  • Context free
Skip-Gram Conditional Memory
  • N-gram and skip-gram fact gating
  • Deters hallucinations
Mixture of Memories
  • MOE architecture for dense layers
  • Gates the most relevant expert for the discourse
Syntax, Semantics and Pragmatics
Fact-Free Attention Models
  • Pure syntax and semantic models
  • Facts added separately
Semantic Priming Models
  • Priming with decaying priors
  • Simple and very compact long range context
N-Gram Syntax Models
  • Captures short-range phrasing
  • Can bypass expensive computation when possible
Context Sensitive Grammars
  • Captures more nuanced phrasing
  • Can leverage semantic priming
Constrained Linguistic Embedding
Relevance-constrained Embedding
  • Scales embedding vector magnitude by specificity
  • Provides context compression scores
Subspace-constrained Embedding
  • Shares embedding subspaces between tokens
  • Promotes rapid learning of relevant subspaces
Suffix Graph Embedding
  • Treats the token set as a directed acyclic graph
  • Shares embedding optimization along edges
Semantic Graph Embedding
  • Treats the concept space as an embedded graph
  • Enables fast inclusion of new concepts
Linguistic Tokenization
Byte-Tuple Encoding Tokenization
  • Extends BPE to tuples
  • Maximal corpus compression
  • Information rich
Morphological Tokenization
  • Bootstraps tuples using morphemes
  • Promotes linguistically meaningful tokenization
Phrase-level Tokenization
  • Word-level N-grams treated as single segments
  • Moves common phrasing from attention blocks to tokenization
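As a toy illustration of phrase-level tokenization (the corpus and frequency threshold are placeholders): word bigrams that occur often enough are promoted to single segments before any further tokenization.

    # Illustrative phrase-level tokenization: merge frequent word bigrams.
    from collections import Counter

    def build_phrase_table(corpus, min_count=2):
        bigrams = Counter()
        for text in corpus:
            words = text.split()
            bigrams.update(zip(words, words[1:]))
        return {bg for bg, c in bigrams.items() if c >= min_count}

    def tokenize(text, phrases):
        words, out, i = text.split(), [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
                out.append(words[i] + " " + words[i + 1])  # one segment
                i += 2
            else:
                out.append(words[i])
                i += 1
        return out

    corpus = ["large language models", "training large language models well"]
    phrases = build_phrase_table(corpus)
    print(tokenize("large language models are improving", phrases))
    # ['large language', 'models', 'are', 'improving']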
Gradient Descent Optimization
Corpus Boosting
Synonym Boosting
  • Strengthens gradient signal by introducing phantom targets
  • Better quality gradient information from smaller training corpus
Inflection Boosting
  • Gradient sharing along known linguistic subspaces
  • One backpropagation generates multiple gradients
  • Subspaces are learned more effectively
Embedding Boosting
  • Strengthens gradient signal by treating the target as a sample from a distribution in embedding space
  • Better quality gradient information from smaller training corpus
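A minimal sketch of the embedding-boosting idea, with a toy vocabulary and embedding table: the one-hot target is replaced by a distribution over tokens that are close to it in embedding space, which densifies the gradient signal.

    # Illustrative soft targets built from embedding-space similarity.
    import numpy as np

    def soft_targets(target_id, embeddings, temperature=0.1):
        """Distribution over the vocabulary, peaked at tokens near the target."""
        e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        logits = (e @ e[target_id]) / temperature      # cosine similarity / T
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

    def boosted_cross_entropy(model_log_probs, target_id, embeddings):
        """Cross-entropy against the soft target instead of a one-hot label."""
        q = soft_targets(target_id, embeddings)
        return -np.sum(q * model_log_probs)

    vocab, dim = 1_000, 64
    rng = np.random.default_rng(0)
    embeddings = rng.standard_normal((vocab, dim))
    log_probs = np.log(np.full(vocab, 1.0 / vocab))    # a uniform toy prediction
    loss = boosted_cross_entropy(log_probs, target_id=42, embeddings=embeddings)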
Topographic Boosting
  • Connects embeddings which are adjacent along identified subspaces
  • Gradient is shared along edges
Condition Number Management
Parametric Nonlinearities
  • Parameterizes the degree of nonlinearity in activation functions
  • Matches linearity to stage of learning
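A minimal sketch, assuming a tanh nonlinearity and a linear warm-up schedule: a mixing coefficient interpolates between a linear map and the full activation, so early training sees a better-conditioned, nearly linear layer.

    # Illustrative parametric nonlinearity with a scheduled mixing coefficient.
    import numpy as np

    def parametric_activation(x, alpha):
        """alpha = 0 -> purely linear, alpha = 1 -> fully nonlinear."""
        return (1.0 - alpha) * x + alpha * np.tanh(x)

    def alpha_schedule(step, warmup_steps=10_000):
        return min(1.0, step / warmup_steps)           # ramp nonlinearity in

    x = np.linspace(-3, 3, 7)
    for step in (0, 5_000, 10_000):
        y = parametric_activation(x, alpha_schedule(step))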
Localization of Hyperplanes
  • Reduces sensitivity of hyperplanes by limiting the range of their effect
  • Matches range of effect to stage of learning
Layer-wise Loss Functions
  • Introduces additional components of the loss function to partially linearize the search space
Bilinear Activation Functions
  • Reduces model complexity while retaining nonlinearity
  • Offers improvement in training efficiency
Training Corpus Linearization
  • Similar to curriculum learning
  • Schedules training data so that linear relationships are learned first
Architectural Efficiency
Low Rank Approximations and Parameter Sharing
Random Projection Parameter Sharing
  • Projects dense layers into low rank subspaces and back out again
  • Reduces number of trainable parameters
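A minimal sketch with illustrative shapes: a large weight matrix is expressed as a small trainable core mapped through fixed random (Johnson-Lindenstrauss style) projections, so only the core contributes trainable parameters.

    # Illustrative random-projection parameter sharing for a dense layer.
    import numpy as np

    rng = np.random.default_rng(0)
    d_out, d_in, k = 4096, 4096, 256

    # Fixed, non-trainable random projections (never updated).
    P_out = rng.standard_normal((d_out, k)) / np.sqrt(k)
    P_in = rng.standard_normal((k, d_in)) / np.sqrt(k)

    # The only trainable parameters: a k x k core instead of d_out x d_in.
    core = rng.standard_normal((k, k)) * 0.02

    def dense_forward(x):                              # x: (batch, d_in)
        # Equivalent to x @ (P_out @ core @ P_in).T without materialising it.
        return ((x @ P_in.T) @ core.T) @ P_out.T

    y = dense_forward(rng.standard_normal((8, d_in)))  # (8, d_out)
    # Trainable params: k*k = 65,536 vs d_out*d_in = 16,777,216 (~256x fewer).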
Progressive Projection Parameter Sharing
  • Uses aggressively low-dimensional subspaces initially
  • Adds dimensions while training
Deep Random Projection Parameter Sharing
  • Multilayer version of other low rank techniques
  • Offers richer nonlinearity in the compression and expansion of dimensions
Nyström Parameter Sharing
  • Retains sensitive portions of the weight matrices exactly
  • Approximates non-sensitive weights using low-rank subspace
Model Growing and Shrinking
  • Grows layer size and model depth when training stalls
  • Shrinks back to minimize final parameter count
Parameter Block Diagonalization
  • Identifies parallel streams of information processing
  • Decomposes dense layers
  • Reduces computational load from unrelated inputs
Factorized Models of Language
Recursive Shared Attention Model
  • Shares weights between stages of attention
  • Shared in sequential blocks
Low Rank Projection Gating
  • Mixture of projections with compute efficient gating
  • Context dependent
Parallel Ladder Attention
  • Factorizes short- and long-range context and processes them in parallel
  • Computationally efficient
Sequential Ladder Attention
  • Factorizes short- and long-range context and processes them sequentially
  • Computationally efficient
Recursive Parallel Ladder Model
  • Recursive version of Parallel Ladder Attention
  • Further reduces trainable parameter count
Bootstrap Recursive Attention Model
  • Enforces soft weight sharing between attention layers
  • Relaxes constraint later in training
State Space Models
Pre-trained Latent Space Attention
  • Uses Mamba-like SSM to prepare relevant context for attention modules
  • Trained as a standalone LLM
Co-trained Latent Space Attention
  • Uses Mamba-like SSM to prepare relevant context for attention modules
  • Trained co-operatively with the attention modules
Increased Prediction Accuracy
Improved Prediction Search
Multi-token Autoregression
  • Predicts a sequence of tokens at each step
  • Enhances LLM output efficiency and coherence (see the sketch below)
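A minimal sketch with hypothetical linear heads and a toy vocabulary: several output heads predict the next few tokens from the same hidden state, so decoding advances multiple tokens per forward pass.

    # Illustrative multi-token prediction via one head per future position.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, vocab, horizon = 512, 1_000, 4            # toy sizes

    heads = [rng.standard_normal((d_model, vocab)) * 0.02 for _ in range(horizon)]

    def predict_block(hidden_state):
        """Greedy prediction of the next `horizon` tokens from one hidden state."""
        return [int(np.argmax(hidden_state @ W)) for W in heads]

    h = rng.standard_normal(d_model)     # stand-in for an encoder's hidden state
    tokens = predict_block(h)            # 4 tokens per forward pass instead of 1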
Auxiliary Future State Prediction Model
  • Anticipates future context trajectory
  • Provides prediction foresight
  • Measures sensitivity
Propagation of the Prediction Distribution
  • A form of soft decoding
  • Autoregression has access to auxiliary information
Twin Model Speculative Decoding
  • Uses simple model to speculate many futures
  • More complex model selects best and longest overall prediction sequence
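A minimal sketch of the decoding loop; draft_next and verify_greedy are hypothetical stand-ins for the small and large models.

    # Illustrative draft-and-verify step for twin-model speculative decoding.
    def speculative_step(prefix, draft_next, verify_greedy, k=8):
        """prefix: list of token ids; returns the extended prefix."""
        # 1. The cheap model speculates k tokens ahead.
        draft, ctx = [], list(prefix)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. The expensive model scores all k positions in a single pass and
        #    returns its own greedy choice at each position.
        verified = verify_greedy(prefix, draft)

        # 3. Accept the longest prefix on which both models agree, plus one
        #    corrected token from the large model.
        accepted = []
        for d, v in zip(draft, verified):
            if d == v:
                accepted.append(d)
            else:
                accepted.append(v)
                break
        return prefix + accepted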
Look-ahead Decoding Model
  • Uses non-causal information while training
  • Speeds up convergence
  • Uses surrogate non-causal information for inference
Neural Monte-Carlo Tree Search
  • Treats the sequence prediction problem as a tree search
  • Leverages our earlier work on neural Monte-Carlo tree search
Dynamic Context Compression
Context Filtering
Statistical Relevance Weighting
  • Uses a per-token score based on the occurrence patterns of tokens in the training corpus
  • Weights more specific tokens more highly than generic tokens
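A minimal sketch using an IDF-style score (the corpus and smoothing are illustrative): tokens that appear everywhere score low, specific tokens score high.

    # Illustrative statistical relevance scores from corpus occurrence patterns.
    import math
    from collections import Counter

    def relevance_scores(corpus_token_lists):
        n_docs = len(corpus_token_lists)
        doc_freq = Counter()
        for tokens in corpus_token_lists:
            doc_freq.update(set(tokens))
        return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                for t, df in doc_freq.items()}

    corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
    scores = relevance_scores(corpus)
    # "the" appears everywhere -> low score; "dog" is specific -> high score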
Learnable Relevance Weighting
  • Uses a trainable neural scoring layer to weight tokens
  • Weights reflect how specific or generic each token is in the current discourse
Fixed Weighted Compression
  • Decimates the context into shingles
  • Compresses each shingle according to relevance
  • Performs attention on compressed shingles rather than raw tokens
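A minimal sketch with an arbitrary shingle length: each fixed-length shingle is pooled into a single vector using per-token relevance weights, and attention then runs over the pooled vectors.

    # Illustrative fixed weighted compression of a long context into shingles.
    import numpy as np

    def compress_context(token_vectors, relevance, shingle=8):
        n, d = token_vectors.shape
        pooled = []
        for start in range(0, n, shingle):
            v = token_vectors[start:start + shingle]
            w = relevance[start:start + shingle]
            w = w / (w.sum() + 1e-9)
            pooled.append(w @ v)                 # relevance-weighted average
        return np.stack(pooled)                  # (ceil(n/shingle), d)

    n, d = 1024, 64
    ctx = np.random.randn(n, d)
    rel = np.random.rand(n)
    compressed = compress_context(ctx, rel)      # 128 vectors instead of 1024
    # Attention now runs over 128 pooled vectors rather than 1024 raw tokens.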
Variable Weighted Compression
  • Segments the context into consistent information content carts
  • Fills the carts with the most relevant information from the segment
  • Performs attention on carts rather than raw tokens
Latent Space Context
Pre-trained State Space Context Models
  • Uses a pre-trained SSM to compress the context
  • SSM is trained to store information most useful to its own predictive accuracy
  • State space is used as context for the LLM
Co-trained State Space Context Models
  • Trains the SSM to compress the context in parallel with the LLM using the context
  • SSM learns to store information most useful to the LLM predictive accuracy
For Research Partnerships
Research Access
Gain (real-time) access to our current and past research, innovations and experiments, including valuable insights from negative results.
First Rights
First right to acquire or license new innovations before they reach the wider market.
Research Prioritisation
Influence our research roadmap to align with your strategic goals, ensuring our work addresses your most pressing challenges.
Monthly Updates
Monthly update call on our progress, discoveries and results.
Bespoke Insights & Feedback
Receive tailored research insights and direct feedback on your specific areas of interest from our leading AI scientists.
To discuss research partnerships please reach out to info@mlabsai.com