MLabs AI
MLabs AI is an independent AI research lab with a focus on state-of-the-art LLM research.
Have a look below to see what we are working on.
Our Focus & Expertise
Our research and development efforts are concentrated on key areas that drive the efficiency and capability of large language models.
Linguistic Factorization
Explicitly factoring the machine learning requirements for syntax, semantics, pragmatics and discourse.
One-Shot and Closed Form Training
Developing advanced methods for minimizing the loss function for the model or specific layers without the need for gradient descent.
Gradient Descent Optimization
Maximizing learning effectiveness through novel techniques for gradient signal and condition number optimization.
Architectural Efficiency
Innovating architectures to reduce actual or effective parameter counts while retaining or even improving modelling capability.
Increased Prediction Accuracy
Designing more effective methods for token prediction by optimizing over an extended prediction horizon.
Dynamic Context Compression
Extending the effective context window while maximizing the information density and reducing computational load.
Past Innovations
One-Shot and Closed Form Training
Classifiers
One-shot construction of MLP classifier layers for continuous and discrete input data (sketched below)
Benchmarks on standard classification tasks indicate training is over 100x more efficient, with improved accuracy
Can be used to construct the MLP blocks of LLMs (fact retention)
Function Approximators
One-shot construction of MLP function approximator layers
Benchmarks on function approximation tasks indicate training is over 10x more efficient, with improved accuracy
TBD: how to apply this to LLMs
Auto Encoders
One-shot construction of MLP Auto Encoders
Potentially works well on large problems; TBD
Can be used to construct projection subspaces in the attention blocks of LLMs
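Our closed-form methods themselves are not detailed on this page, but as a rough, generic illustration of fitting a classifier layer without gradient descent, here is a sketch in the spirit of a fixed random hidden layer with a ridge-regression readout solved in closed form. All function names and sizes below are illustrative, not our actual implementation.

```python
# Illustrative only: gradient-free fitting of a one-hidden-layer MLP classifier
# via a closed-form ridge-regression readout on fixed random features.
import numpy as np

def fit_readout_closed_form(X, Y, hidden_dim=256, ridge=1e-3, seed=0):
    """Fit the readout of a one-hidden-layer MLP without gradient descent."""
    rng = np.random.default_rng(seed)
    W_in = rng.standard_normal((X.shape[1], hidden_dim)) / np.sqrt(X.shape[1])
    H = np.tanh(X @ W_in)                      # hidden activations
    # Closed-form ridge solution for the readout: (H^T H + lambda*I)^-1 H^T Y
    W_out = np.linalg.solve(H.T @ H + ridge * np.eye(hidden_dim), H.T @ Y)
    return W_in, W_out

def predict(X, W_in, W_out):
    return np.tanh(X @ W_in) @ W_out           # class scores

# Usage with one-hot targets Y of shape (n_samples, n_classes)
X = np.random.randn(500, 20)
Y = np.eye(3)[np.random.randint(0, 3, 500)]
W_in, W_out = fit_readout_closed_form(X, Y)
labels = predict(X, W_in, W_out).argmax(axis=1)
```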
Neural Monte-Carlo Tree Search
Differentiable Monte-Carlo Tree Search
Embedded into the neural network
Fully trainable tree search
Works very well, beats AlphaGo Zero at playing Go
Can be used to optimize prediction sequences in LLMs
Large Language Models
Attention Blocks
Variations of tree attention
Variations of recursive attention blocks
Results show a 2x speed improvement with a slight accuracy trade-off
State Space Models
Variations of Mamba
Low-Rank Latent Space
Exploiting the Johnson-Lindenstrauss lemma to develop compute-efficient block architectures
Able to revert to full rank mid-training
Low-Rank Approximation by SVD
Able to speed up inference as well as training
Can switch from full-rank to low-rank (and back) in the middle of training
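As a minimal, illustrative sketch of the low-rank/full-rank switch (not our production code; function names are ours), a dense weight matrix can be factorised by truncated SVD mid-training and later recombined:

```python
# Illustrative sketch: rank-r SVD factorisation of a dense weight matrix,
# with the ability to recombine it and resume full-rank training.
import numpy as np

def to_low_rank(W, r):
    """Factor W (d_out x d_in) into A (d_out x r) and B (r x d_in) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B

def to_full_rank(A, B):
    """Recombine the factors into a single dense matrix for full-rank training."""
    return A @ B

W = np.random.randn(1024, 1024)
A, B = to_low_rank(W, r=64)       # forward pass becomes x @ B.T @ A.T: ~2*1024*64 mults per row
W_approx = to_full_rank(A, B)     # switch back to full rank later in training
```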
Factorizing Semantic Graphs
Novel approach to scale-efficient factorization of very large semantic graphs
Orders of magnitude faster than standard factorization techniques
Able to make LLM-scale semantic graph factorization feasible
Embedding
Construction of vector embeddings from structured linguistic knowledge
Demonstrates explicit subspaces and improves explainability
Repurposed to make the token embeddings into a compute-efficient DAG
Retrieval Augmented Generation
Stores the internal state of the LLM instead of text/embeddings to speed up inference
Further developing the approach to apply to linguistic factorization techniques and context compression
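As a simplified illustration of storing internal state rather than text, the sketch below pre-computes per-document key/value activations for a toy single-head attention and reuses them at query time. Names and shapes are illustrative, not our production mechanism.

```python
# Illustrative only: per-document key/value activations are computed once (offline)
# and spliced into attention at query time, so retrieved text is never re-encoded.
import numpy as np

d = 64
W_k, W_v, W_q = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))

def build_cache(doc_hidden_states):
    """Offline: turn a document's hidden states (T x d) into a stored KV cache."""
    return doc_hidden_states @ W_k, doc_hidden_states @ W_v

def attend_with_cache(query_hidden, cached_k, cached_v):
    """Online: attend over the pre-computed cache; no document re-encoding needed."""
    q = query_hidden @ W_q                              # (d,)
    scores = cached_k @ q / np.sqrt(d)                  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cached_v                           # (d,)

doc_states = np.random.randn(128, d)                    # encoded once, stored
k_cache, v_cache = build_cache(doc_states)
context_vec = attend_with_cache(np.random.randn(d), k_cache, v_cache)
```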
Tokenization
Variations around unigram tokenization
Variations around morpheme-boundary aware tokenization
Rapid Development Language
Rapid development language for quick and efficient experimentation with different LLM innovations
Image Processing Models
CNNs
Discovery of new mathematical underpinnings for CNN kernels
Developed closed-form image tokenization algorithm
Current and Future Innovations
Our innovation pipeline has 50+ research directions to explore, all triangulated on our Innovation Matrix and prioritized accordingly.
Linguistic Factorization
Retained Fact Learning
Dense Layer Fact Editing
Fact updating with a second dense layer
Fact updating through MLP fine-tuning
Standalone Fact-Training
Using gradient descent
Using closed-form solutions
Hybrid Semantic Graph
Edge propagation fact retention
Context sensitive
Lookup Table Dense Layers
Associative fact lookup tables
Context free
Skip-Gram Conditional Memory
N-gram and skip-gram fact gating
Deters hallucinations
Mixture of Memories
MOE architecture for dense layers
Gates the most relevant expert for the discourse
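A rough sketch of the mixture-of-memories idea (illustrative only; the gating scheme, table sizes and names are our stand-ins): a gating network picks the memory expert most relevant to the current hidden state, and each expert is an associative key-value fact table.

```python
# Illustrative mixture-of-memories layer: softmax gating over memory experts,
# each expert an associative key-value table matched against the hidden state.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, facts_per_expert = 64, 4, 256
W_gate = rng.standard_normal((d, n_experts)) / np.sqrt(d)
keys = rng.standard_normal((n_experts, facts_per_expert, d))
values = rng.standard_normal((n_experts, facts_per_expert, d))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixture_of_memories(h):
    """h: (d,) hidden state -> (d,) retrieved memory, gated by discourse relevance."""
    gate = softmax(h @ W_gate)                 # which expert fits the current discourse
    e = int(np.argmax(gate))                   # hard top-1 gating for efficiency
    lookup = softmax(keys[e] @ h)              # associative fact lookup in that expert
    return gate[e] * (lookup @ values[e])

out = mixture_of_memories(rng.standard_normal(d))
# Editing a fact is then a local write: values[e][i] = new_value
```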
Syntax, Semantics and Pragmatics
Fact-Free Attention Models
Pure syntax and semantic models
Facts added separately
Semantic Priming Models
Priming with decaying priors (see the sketch below)
Simple and very compact long range context
N-Gram Syntax Models
Captures short-range phrasing
Can bypass expensive computation when possible
Context Sensitive Grammars
Captures more nuanced phrasing
Can leverage semantic priming
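A minimal sketch of priming with decaying priors (our actual models are more involved; the decay, strength and names below are illustrative): recently seen tokens keep an exponentially decaying activation that biases the next-token logits, giving a very compact long-range context signal.

```python
# Illustrative only: an exponentially decaying "priming" vector over the vocabulary,
# added to the next-token logits as a lightweight long-range prior.
import numpy as np

vocab_size, decay, strength = 50_000, 0.98, 0.5
priming = np.zeros(vocab_size)

def step(logits, observed_token):
    """Update the decaying prior with the observed token and bias the next prediction."""
    global priming
    priming *= decay                      # earlier tokens fade out
    priming[observed_token] += 1.0        # the freshly seen token is strongly primed
    return logits + strength * np.log1p(priming)

logits = np.random.randn(vocab_size)
biased = step(logits, observed_token=1234)
```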
Constrained Linguistic Embedding
Relevance-constrained Embedding
Scales embedding vector magnitude by specificity (sketched below)
Provides context compression scores
Subspace-constrained Embedding
Shares embedding subspaces between tokens
Promotes rapid learning of relevant subspaces
Suffix Graph Embedding
Treats the token set as a directed acyclic graph
Shares embedding optimization along edges
Semantic Graph Embedding
Treats the concept space as an embedded graph
Enables fast inclusion of new concepts
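A minimal sketch of relevance-constrained embedding, using an IDF-style statistic as a stand-in specificity score (the real scoring is a design choice of ours and is not specified here): token vectors are rescaled by specificity, and the scores double as context compression weights.

```python
# Illustrative only: scale each token embedding by a specificity score so that
# generic tokens carry small magnitude; the scores also serve as compression weights.
import numpy as np

def specificity_scores(doc_freq, n_docs):
    """IDF-style specificity: rare (specific) tokens score higher than common ones."""
    return np.log((n_docs + 1) / (doc_freq + 1))

def scale_embeddings(E, doc_freq, n_docs):
    s = specificity_scores(doc_freq, n_docs)
    E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E_unit * s[:, None], s          # scaled table + per-token compression scores

E = np.random.randn(50_000, 512)
doc_freq = np.random.randint(1, 10_000, size=50_000)
E_scaled, scores = scale_embeddings(E, doc_freq, n_docs=10_000)
```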
Linguistic Tokenization
Byte-Tuple Encoding Tokenization
Extends BPE to tuples
Maximal corpus compression
Information rich
Morphological Tokenization
Bootstraps tuples using morphemes
Promotes linguistically meaningful tokenization
Phrase-level Tokenization
Word-level N-grams treated as single segments
Moves common phrasing from attention blocks to tokenization
Gradient Descent Optimization
Corpus Boosting
Synonym Boosting
Strengthens gradient signal by introducing phantom targets (see the sketch below)
Better quality gradient information from smaller training corpus
Inflection Boosting
Gradient sharing along known linguistic subspaces
One backpropagation generates multiple gradients
Subspaces are learned more effectively
Embedding Boosting
Strengthens gradient signal by treating the target as a sample from a distribution in embedding space
Better quality gradient information from smaller training corpus
Topographic Boosting
Connects embeddings which are adjacent along identified subspaces
Gradient is shared along edges
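A minimal sketch of synonym boosting with phantom targets (the synonym table and mass split are illustrative): the one-hot target is replaced by a soft target that shares probability mass across a synonym set, so a single example yields gradient signal for several related tokens.

```python
# Illustrative only: a soft "phantom target" distribution over a synonym set,
# used with standard cross-entropy to strengthen the gradient signal.
import numpy as np

def boosted_target(target_id, synonyms, vocab_size, phantom_mass=0.3):
    """Return a soft target: (1 - phantom_mass) on the true token, rest on synonyms."""
    t = np.zeros(vocab_size)
    t[target_id] = 1.0 - phantom_mass
    if synonyms:
        t[synonyms] = phantom_mass / len(synonyms)
    return t

def cross_entropy(logits, soft_target):
    logp = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
    return -(soft_target * logp).sum()

vocab_size = 50_000
logits = np.random.randn(vocab_size)
loss = cross_entropy(logits, boosted_target(7, synonyms=[11, 42], vocab_size=vocab_size))
```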
Condition Number Management
Parametric Nonlinearities
Parameterizes the degree of nonlinearity in activation functions (sketched below)
Matches linearity to stage of learning
Localization of Hyperplanes
Reduces sensitivity of hyperplanes by limiting the range of their effect
Matches range of effect to stage of learning
Layer-wise Loss Functions
Introduces additional components of the loss function to partially linearize the search space
Bilinear Activation Functions
Reduces model complexity while retaining nonlinearity
Offers improvement in training efficiency
Training Corpus Linearization
Similar to curriculum learning
Schedules training data so that linear relationships are learned first
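A minimal sketch of a parametric nonlinearity (the schedule and names are illustrative, not our chosen parameterization): the activation interpolates between the identity, which keeps early optimization nearly linear and well conditioned, and tanh as training progresses.

```python
# Illustrative only: an activation whose nonlinearity is ramped in over training.
import numpy as np

def parametric_activation(x, alpha):
    """alpha = 0 -> identity (linear regime); alpha = 1 -> fully nonlinear tanh."""
    return (1.0 - alpha) * x + alpha * np.tanh(x)

def alpha_schedule(step, warmup_steps=10_000):
    """Ramp the nonlinearity in as training stabilises."""
    return min(1.0, step / warmup_steps)

x = np.linspace(-3, 3, 7)
early = parametric_activation(x, alpha_schedule(0))        # behaves like a linear layer
late = parametric_activation(x, alpha_schedule(20_000))    # behaves like tanh
```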
Architectural Efficiency
Low Rank Approximations and Parameter Sharing
Random Projection Parameter Sharing
Projects dense layers into low rank subspaces and back out again (see the sketch below)
Reduces number of trainable parameters
Progressive Projection Parameter Sharing
Uses aggressively low-dimensional subspaces initially
Adds dimensions while training
Deep Random Projection Parameter Sharing
Multilayer version of other low rank techniques
Offers richer nonlinearity in the compression and expansion of dimensions
Nystrom Parameter Sharing
Retains sensitive portions of the weight matrices exactly
Approximates non-sensitive weights using low-rank subspace
Model Growing and Shrinking
Grows layer size and model depth when training stalls
Shrinks back to minimize final parameter count
Parameter Block Diagonalization
Identifies parallel streams of information processing
Decomposes dense layers
Reduces computational load from unrelated inputs
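A minimal sketch of random projection parameter sharing (sizes and names are illustrative): fixed random projections map a wide dense layer into and out of a low-dimensional subspace, Johnson-Lindenstrauss style, leaving only a small core matrix to train.

```python
# Illustrative only: a d x d dense layer replaced by fixed random down/up projections
# around a small trainable r x r core.
import numpy as np

d, r = 1024, 64
rng = np.random.default_rng(0)
P_down = rng.standard_normal((d, r)) / np.sqrt(d)   # fixed, not trained
P_up = rng.standard_normal((r, d)) / np.sqrt(r)     # fixed, not trained
core = rng.standard_normal((r, r)) / np.sqrt(r)     # the only trainable parameters

def layer(x, core):
    """Effective weight is P_down @ core @ P_up: d x d behaviour from r x r parameters."""
    return (x @ P_down) @ core @ P_up

x = rng.standard_normal((8, d))
y = layer(x, core)      # trainable parameter count: r*r = 4,096 instead of d*d ~ 1M
```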
Factorized Models of Language
Recursive Shared Attention Model
Shares weights between stages of attention (sketched below)
Shared in sequential blocks
Low Rank Projection Gating
Mixture of projections with compute efficient gating
Context dependent
Parallel Ladder Attention
Factorizes short- and long-range context and processes them in parallel
Computationally efficient
Sequential Ladder Attention
Factorizes short- and long-range context and processes them sequentially
Computationally efficient
Recursive Parallel Ladder Model
Recursive version of Parallel Ladder Attention
Further reduces trainable parameter count
Bootstrap Recursive Attention Model
Enforces soft weight sharing between attention layers
Relaxes constraint later in training
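A minimal sketch of recursive shared attention (a toy single-head block, illustrative only): one set of attention weights is applied for several sequential stages, so effective depth grows without adding parameters.

```python
# Illustrative only: the same attention block's weights are reused across stages.
import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

def shared_attention_block(X):
    """One self-attention block (T x d in, T x d out); the same weights serve every stage."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return X + (weights @ V) @ W_o          # residual connection

X = rng.standard_normal((16, d))
for _ in range(6):                          # six stages, one set of weights
    X = shared_attention_block(X)
```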
State Space Models
Pre-trained Latent Space Attention
Uses Mamba-like SSM to prepare relevant context for attention modules
Trained as a standalone LLM
Co-trained Latent Space Attention
Uses Mamba-like SSM to prepare relevant context for attention modules
Trained co-operatively with the attention modules
Increased Prediction Accuracy
Improved Prediction Search
Multi-token Autoregression
Predicts sequence of tokens
Enhances LLM output efficiency and coherence
Auxiliary Future State Prediction Model
Anticipates future context trajectory
Provides prediction foresight
Measures sensitivity
Propagation of the Prediction Distribution
A form of soft decoding
Autoregression has access to auxiliary information
Twin Model Speculative Decoding
Uses a simple model to speculate many futures (see the sketch below)
A more complex model selects the best and longest overall prediction sequence
Look-ahead Decoding Model
Uses non-causal information while training
Speeds up convergence
Uses surrogate non-causal information for inference
Neural Monte-Carlo Tree Search
Treats the sequence prediction problem as a tree search
Leverages our work on neural search
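A minimal, greedy-only sketch of twin-model speculative decoding (the toy "models" below are random stand-ins, and a real implementation verifies all drafted positions in one batched pass): a cheap draft model proposes several tokens, and the large model keeps the longest prefix it agrees with, substituting its own token at the first disagreement.

```python
# Illustrative only: draft-and-verify decoding with two stand-in models.
import numpy as np

rng = np.random.default_rng(0)
V = 100
def draft_model(ctx):  return rng.random(V)          # stand-in for a cheap LLM
def target_model(ctx): return rng.random(V)          # stand-in for the large LLM

def speculate(ctx, k=4):
    """Draft k tokens greedily, then let the target model verify them."""
    proposed, c = [], list(ctx)
    for _ in range(k):
        t = int(np.argmax(draft_model(c)))
        proposed.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in proposed:
        best = int(np.argmax(target_model(c)))       # in practice: one batched pass
        if best != t:
            accepted.append(best)                     # replace at first disagreement
            break
        accepted.append(t)
        c.append(t)
    return ctx + accepted

out = speculate([1, 2, 3])
```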
Dynamic Context Compression
Context Filtering
Statistical Relevance Weighting
Uses a per-token score based on the occurrence pattern of tokens in the training corpus
Weights specific tokens more highly than generic tokens
Learnable Relevance Weighting
Uses a trainable neural scoring layer to weight the tokens
Depending on how specific or generic they are in the context of the current discourse
Fixed Weighted Compression
Decimates the context into shingles
Compresses each shingle according to relevance
Performs attention on compressed shingles rather than raw tokens (sketched below)
Variable Weighted Compression
Segments the context into "carts" of consistent information content
Fills the carts with the most relevant information from the segment
Performs attention on carts rather than raw tokens
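A minimal sketch of fixed weighted compression, with an IDF-style score standing in for the relevance weighting above (illustrative only; shingle length and names are our choices): the context is cut into fixed-length shingles, each pooled into one relevance-weighted vector, and attention then runs over the pooled shingles instead of the raw tokens.

```python
# Illustrative only: relevance-weighted pooling of fixed-size shingles.
import numpy as np

def relevance(doc_freq, n_docs):
    return np.log((n_docs + 1) / (doc_freq + 1))      # specific tokens score higher

def compress(token_vecs, token_doc_freq, shingle_len=8, n_docs=10_000):
    """token_vecs: (T, d) -> (T // shingle_len, d) relevance-weighted shingle vectors."""
    scores = relevance(token_doc_freq, n_docs)
    out = []
    for i in range(0, len(token_vecs) - shingle_len + 1, shingle_len):
        v, s = token_vecs[i:i + shingle_len], scores[i:i + shingle_len]
        out.append((s[:, None] * v).sum(axis=0) / s.sum())
    return np.stack(out)

T, d = 4096, 64
compressed = compress(np.random.randn(T, d), np.random.randint(1, 10_000, T))
# Attention now sees T // 8 = 512 compressed positions instead of 4096 raw tokens.
```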
Latent Space Context
Pre-trained State Space Context Models
Uses a pre-trained SSM to compress the context
SSM is trained to store information most useful to its own predictive accuracy
State space is used as context for the LLM
Co-trained State Space Context Models
Trains the SSM to compress the context in parallel with the LLM using the context
SSM learns to store information most useful to the LLM's predictive accuracy
For Research Partnerships
Research Access
Gain (real-time) access to our current and past research, innovations and experiments, including valuable insights from negative results.
First Rights
First right to acquire or license new innovations before they reach the wider market.
Research Prioritisation
Influence our research roadmap to align with your strategic goals, ensuring our work addresses your most pressing challenges.
Monthly Updates
Monthly update call on our progress, discoveries and results.
Bespoke Insights & Feedback
Receive tailored research insights and direct feedback on your specific areas of interest from our leading AI scientists.
To discuss research partnerships, please reach out to
info@mlabsai.com