MLabs AI
MLabs AI is an independent AI research lab focused on state-of-the-art LLM research.
We operate as a high-throughput research engine. We continuously generate and test unconventional training approaches across multiple domains, failing fast and often, and refining those that expose unnecessary rigidity in current methods.
Our advantage is not a single method, but the rate at which we can explore the design space.
Through our process we have developed several new algorithms and mathematical constructions, including cases where parts of MLP training can be performed without gradient descent while improving accuracy and efficiency in specific regimes.
Our Focus & Expertise
Our research and development efforts are concentrated on key areas that drive the efficiency and capability of large language models.
Linguistic Factorization
Explicitly factoring the machine learning requirements for syntax, semantics, pragmatics and discourse.
One-Shot and Closed Form Training
Developing advanced methods for minimizing the loss function for the model or specific layers without the need for gradient descent.
Gradient Descent Optimization
Maximizing learning effectiveness through novel techniques for gradient signal and condition number optimization.
Architectural Efficiency
Innovating architectures to reduce actual or effective parameter counts while retaining or enhancing modelling capability.
Increased Prediction Accuracy
Designing more effective methods for token prediction by optimizing over an extended prediction horizon.
Dynamic Context Compression
Extending the effective context window while maximizing the information density and reducing computational load.
Past Innovations
One-Shot and Closed Form Training
Classifiers
  • One-shot construction of MLP classifier layers for continuous and discrete input data
  • Material efficiency gains (orders of magnitude) in constrained settings with improved accuracy
  • We intend to test it for constructing the MLP blocks of LLMs (fact retention)
Function Approximators
  • One-shot construction of MLP function approximator layers
  • Material efficiency gains (one order of magnitude) in constrained settings with improved accuracy
  • We intend to test it for constructing the MLP blocks of LLMs (fact retention); see the sketch below
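The closed-form constructions above are MLabs' own and are not detailed here. As a generic, hedged illustration of fitting an MLP layer without gradient descent, the sketch below fixes a random hidden layer and solves the readout weights by ridge regression (an extreme-learning-machine-style baseline, not the method referenced above; all names and sizes are placeholders).

    # Illustrative only: fit an MLP readout layer in closed form.
    # A random hidden layer is fixed and the output weights are obtained
    # by ridge regression rather than gradient descent.
    import numpy as np

    def fit_readout_closed_form(X, Y, hidden_dim=256, ridge=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        W_in = rng.normal(scale=1.0 / np.sqrt(X.shape[1]), size=(X.shape[1], hidden_dim))
        b_in = rng.normal(size=hidden_dim)
        H = np.tanh(X @ W_in + b_in)                     # fixed random features
        A = H.T @ H + ridge * np.eye(hidden_dim)         # (H^T H + ridge * I)
        W_out = np.linalg.solve(A, H.T @ Y)              # closed-form least squares
        return W_in, b_in, W_out

    def predict(X, W_in, b_in, W_out):
        return np.tanh(X @ W_in + b_in) @ W_out

    # Toy regression check
    X = np.random.default_rng(1).uniform(-1, 1, size=(500, 2))
    Y = np.sin(3 * X[:, :1]) + X[:, 1:]
    params = fit_readout_closed_form(X, Y)
    print(np.mean((predict(X, *params) - Y) ** 2))       # small training error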
Feature Encoders
  • Tried to construct the MLP layers of feature encoders using techniques we learned from building the classifier layers
  • Did not work due to instability of the clustering algorithm
  • Moved on to a more foundational mathematical solution for Auto Encoders
Auto Encoders
  • One-shot construction of MLP Auto Encoders
  • Does not work well on small problems because of dimensionality scaling
  • We intend to test it to construct projection subspaces in the attention blocks of LLMs, as sparse layers for built-in explainability and post-training interpretability
Neural Monte-Carlo Tree Search
Differentiable Monte-Carlo Tree Search
  • Fully trainable tree search
  • Embedded into the neural network
  • Works very well on self-contained search problems; beats AlphaGo Zero at playing Go
  • We intend to test it for optimizing prediction sequences in LLMs
Large Language Models
Attention Blocks
  • Variations of tree attention and recursive attention blocks
  • Preliminary empirical signals suggest solid performance improvements in constrained settings
  • We intend to evaluate more radical variations on large scale language modelling tasks
State Space Models
  • Incremental variations of Mamba have been tried
  • Early results suggest that further exploration is warranted
Low-Rank Latent Space
  • Exploiting Johnson-Lindenstrauss lemma to develop compute-efficient block architectures
  • Able to revert to full rank mid-training
  • Works on some smaller LLMs; we intend to test it on larger models (see the sketch below)
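The Johnson-Lindenstrauss construction itself is not published here. The sketch below only illustrates the general pattern under stated assumptions: train a dense layer through a fixed random projection, then materialize the product to revert to a full-rank matrix mid-training. All dimensions and names are placeholders.

    # Low-rank dense layer through a fixed random (JL-style) projection,
    # with the option to revert to a full-rank matrix by materializing the
    # product mid-training. Illustrative sketch only.
    import numpy as np

    d_in, d_out, r = 512, 512, 32
    rng = np.random.default_rng(0)
    P = rng.normal(scale=1.0 / np.sqrt(r), size=(d_in, r))   # fixed random projection
    W_low = rng.normal(scale=0.02, size=(r, d_out))          # trainable low-rank core

    def forward_low_rank(x):
        return (x @ P) @ W_low               # only r * d_out weights are trained

    def revert_to_full_rank():
        # Materialize the same linear map as a dense matrix and continue
        # training all d_in * d_out entries from this initialization.
        return P @ W_low

    x = rng.normal(size=(4, d_in))
    W_full = revert_to_full_rank()
    print(np.allclose(forward_low_rank(x), x @ W_full))      # True: same function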
Low-Rank Approximation by SVD
  • Able to speed up inference as well as training
  • Can switch from full-rank to low-rank (and back) in the middle of training
  • We intend to test it on LLMs (see the sketch below)
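Switching a trained dense layer between full-rank and low-rank forms via a truncated SVD is a standard construction; a small sketch follows. The synthetic weight matrix and the rank are placeholders, not MLabs' settings.

    # Switch a dense layer between full-rank and low-rank forms with a
    # truncated SVD. Illustrative sketch only.
    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 768, 64
    W = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / np.sqrt(d)   # nearly low-rank
    W += 0.01 * rng.normal(size=(d, d))                                  # plus full-rank noise

    def to_low_rank(W, rank):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * s[:rank]          # (d_in, rank)
        B = Vt[:rank, :]                    # (rank, d_out)
        return A, B                         # x @ W  ~=  (x @ A) @ B

    def to_full_rank(A, B):
        return A @ B                        # densify and resume full-rank training

    A, B = to_low_rank(W, rank=r)
    x = rng.normal(size=(8, d))
    print(d * d, d * r + r * d)             # multiply-adds per row: full vs. low rank
    print(np.linalg.norm(x @ W - (x @ A) @ B) / np.linalg.norm(x @ W))   # small error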
Factorizing Semantic Graphs
  • Novel approach to scale-efficient factorization of very large semantic graphs
  • Orders of magnitude faster than standard factorization techniques
  • Currently testing LLM-scale semantic graph factorization
Embedding
  • Construction of vector embeddings from structured linguistic knowledge
  • Demonstrates explicit subspaces and improves explainability
  • Didn't work as well as expected due to the curse of dimensionality
  • Repurposed to make the token embeddings into a compute-efficient DAG
Retrieval Augmented Generation
  • Stores the internal state of the LLM instead of text/embeddings to speed up inference
  • Unable to improve efficiency over standard techniques due to the inability to isolate compact state information
  • We are further developing the approach to apply it to linguistic factorization techniques and context compression
Tokenization
  • We tried incremental variations on unigram and morpheme-boundary-aware tokenization
  • Unable to achieve meaningful improvements at scale
Rapid Development Language
  • A rapid development language for quick and efficient experimentation with different LLM innovations
Image Processing Models
CNNs
  • Discovery of new mathematical underpinnings for CNN kernels
  • Developed closed-form image tokenization algorithm
  • Worked well enough on a number of small problems
  • Unable to translate the theoretical innovations into experimental improvements at scale
Current and Future Innovations
Our innovation pipeline holds 50+ research directions to explore. We’re continuously reassessing which are worth pursuing; a selection is listed below.
Linguistic Factorization
Retained Fact Learning
Dense Layer Fact Editing
  • Fact updating with a second dense layer
  • Fact updating through MLP fine-tuning
Standalone Fact-Training
  • Using gradient descent
  • Using closed-form solutions
Hybrid Semantic Graph
  • Edge propagation fact retention
  • Context sensitive
Lookup Table Dense Layers
  • Associative fact lookup tables
  • Context free
Skip-Gram Conditional Memory
  • N-gram and skip-gram fact gating
  • Could deter hallucinations
Mixture of Memories
  • MOE architecture for dense layers
  • Gates the most relevant expert for the discourse
Syntax, Semantics and Pragmatics
Fact-Free Attention Models
  • Pure syntax and semantic models
  • Facts could be added separately
Semantic Priming Models
  • Priming with decaying priors
  • Simple and very compact long range context
N-Gram Syntax Models
  • Captures short-range phrasing
  • Could bypass expensive computation when possible
Context Sensitive Grammars
  • Captures more nuanced phrasing
  • Could leverage semantic priming
Constrained Linguistic Embedding
Relevance-constrained Embedding
  • Scales embedding vector magnitude by specificity
  • Provides context compression scores
Subspace-constrained Embedding
  • Shares embedding subspaces between tokens
  • Could promote rapid learning of relevant subspaces
Suffix Graph Embedding
  • Treats the token set as a directed acyclic graph
  • Shares embedding optimization along edges
Semantic Graph Embedding
  • Treats the concept space as an embedded graph
  • Would enable fast inclusion of new concepts
Linguistic Tokenization
Byte-Tuple Encoding Tokenization
  • Extends byte-pair encoding (BPE) to tuples; the standard BPE merge loop is sketched below
  • Maximal corpus compression
  • Information rich
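The byte-tuple extension itself is not specified here. For grounding, the sketch below shows the standard byte-pair-encoding merge loop that it reportedly extends, run on the classic toy corpus.

    # Standard byte-pair-encoding merge loop on a toy corpus, shown only to
    # ground "extends BPE to tuples"; the tuple extension is not specified here.
    from collections import Counter

    def get_pair_counts(vocab):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        new_vocab = {}
        for word, freq in vocab.items():
            symbols, out, i = word.split(), [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[" ".join(out)] = new_vocab.get(" ".join(out), 0) + freq
        return new_vocab

    # Words are space-separated symbol sequences with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(6):
        pairs = get_pair_counts(vocab)
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        print(best)                           # merges learned, in order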
Morphological Tokenization
  • Bootstraps tuples using morphemes
  • Promotes linguistically meaningful tokenization
Phrase-level Tokenization
  • Word-level N-grams treated as single segments
  • Moves common phrasing from attention blocks to tokenization
Gradient Descent Optimization
Corpus Boosting
Synonym Boosting
  • Strengthens the gradient signal by introducing phantom targets
  • Higher-quality gradient information from a smaller training corpus (one possible reading is sketched below)
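The exact form of these phantom targets is not described above. One plausible reading, shown purely as an assumption, is a soft label that spreads a small amount of target mass over synonyms of the observed token, so each example back-propagates signal to several related outputs.

    # One *assumed* reading of "phantom targets": a soft label that places
    # most mass on the observed token and a little on its synonyms, so one
    # example provides gradient signal for several related outputs.
    import numpy as np

    def soft_synonym_targets(target_id, synonym_ids, vocab_size, synonym_mass=0.1):
        t = np.zeros(vocab_size)
        t[target_id] = 1.0 - synonym_mass
        if synonym_ids:
            t[synonym_ids] = synonym_mass / len(synonym_ids)
        return t

    def cross_entropy(logits, soft_target):
        log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax
        return -np.sum(soft_target * log_probs)

    vocab_size = 10
    logits = np.random.default_rng(0).normal(size=vocab_size)
    hard = soft_synonym_targets(3, [], vocab_size, synonym_mass=0.0)
    soft = soft_synonym_targets(3, [5, 7], vocab_size)        # 5, 7: assumed synonym ids
    print(cross_entropy(logits, hard), cross_entropy(logits, soft))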
Inflection Boosting
  • Gradient sharing along known linguistic subspaces
  • One backpropagation generates multiple gradients
  • Subspaces could be learned more effectively
Embedding Boosting
  • Strengthens the gradient signal by treating the target as a sample from a distribution in embedding space
  • Higher-quality gradient information from a smaller training corpus
Topographic Boosting
  • Connects embeddings which are adjacent along identified subspaces
  • Gradient is shared along edges
Condition Number Management
Parametric Nonlinearities
  • Parameterizes the degree of nonlinearity in activation functions
  • Matches linearity to stage of learning
Localization of Hyperplanes
  • Reduces sensitivity of hyperplanes by limiting the range of their effect
  • Matches range of effect to stage of learning
Layer-wise Loss Functions
  • Introduces additional components of the loss function to partially linearize the search space
Bilinear Activation Functions
  • Reduces model complexity while retaining nonlinearity
  • Could offer improvement in training efficiency
Training Corpus Linearization
  • Similar to curriculum learning
  • Schedules training data so that linear relationships are learned first
Architectural Efficiency
Low Rank Approximations and Parameter Sharing
Random Projection Parameter Sharing
  • Projects dense layers into low rank subspaces and back out again
  • Reduces number of trainable parameters
Progressive Projection Parameter Sharing
  • Uses aggressively low-dimensional subspaces initially
  • Adds dimensions while training
Deep Random Projection Parameter Sharing
  • Multilayer version of other low rank techniques
  • Offers richer nonlinearity of compression and expansion of dimensions
Nystrom Parameter Sharing
  • Retains sensitive portions of the weight matrices exactly
  • Approximates non-sensitive weights in a low-rank subspace (see the sketch below)
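As a generic illustration of keeping selected rows and columns of a weight matrix exact while approximating the remainder through a low-rank intersection block, here is a CUR/Nystrom-style sketch. The norm-based choice of "sensitive" indices is a placeholder, not MLabs' criterion.

    # CUR / Nystrom-style sketch: keep a subset of rows and columns of W
    # exactly and reconstruct the rest through their intersection block.
    # Index selection by norm is a placeholder, not MLabs' criterion.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256)) * np.linspace(2.0, 0.01, 256)   # decaying column scales

    k = 64
    col_idx = np.argsort(-np.linalg.norm(W, axis=0))[:k]    # "sensitive" columns
    row_idx = np.argsort(-np.linalg.norm(W, axis=1))[:k]    # "sensitive" rows

    C = W[:, col_idx]                  # kept exactly
    R = W[row_idx, :]                  # kept exactly
    U = W[np.ix_(row_idx, col_idx)]    # intersection block, shape (k, k)

    W_approx = C @ np.linalg.pinv(U) @ R
    print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))   # relative error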
Model Growing and Shrinking
  • Grows layer size and model depth when training stalls
  • Shrinks back to minimize the final parameter count (a function-preserving growth step is sketched below)
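The growth and shrink criteria are not given above. As a generic sketch of the growing step, the Net2Net-style construction below duplicates a hidden unit and halves its outgoing weights, so the widened layer computes exactly the same function before training continues; this is an illustration, not MLabs' method.

    # Function-preserving width growth (Net2Net-style), shown as a generic
    # sketch: duplicate a hidden unit and halve its outgoing weights so the
    # wider network computes the same function before training continues.
    import numpy as np

    def widen(W1, b1, W2, unit):
        W1 = np.concatenate([W1, W1[:, unit:unit + 1]], axis=1)   # copy incoming weights
        b1 = np.concatenate([b1, b1[unit:unit + 1]])
        W2 = W2.copy()
        W2[unit, :] *= 0.5                                         # split outgoing weights
        W2 = np.concatenate([W2, W2[unit:unit + 1, :]], axis=0)    # between old and new unit
        return W1, b1, W2

    rng = np.random.default_rng(0)
    W1, b1, W2 = rng.normal(size=(8, 16)), rng.normal(size=16), rng.normal(size=(16, 4))
    x = rng.normal(size=(5, 8))
    y_before = np.maximum(x @ W1 + b1, 0.0) @ W2
    W1g, b1g, W2g = widen(W1, b1, W2, unit=3)
    y_after = np.maximum(x @ W1g + b1g, 0.0) @ W2g
    print(np.allclose(y_before, y_after))                          # True: same function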
Parameter Block Diagonalization
  • Identifies parallel streams of information processing
  • Decomposes dense layers
  • Reduces computational load from unrelated inputs
Factorized Models of Language
Recursive Shared Attention Model
  • Shares weights between stages of attention
  • Shared in sequential blocks
Low Rank Projection Gating
  • Mixture of projections with compute efficient gating
  • Context dependent
Parallel Ladder Attention
  • Factorizes short- and long-range context and processes them in parallel
  • Computationally efficient
Sequential Ladder Attention
  • Factorizes short- and long-range context and processes them sequentially
  • Computationally efficient
Recursive Parallel Ladder Model
  • Recursive version of Parallel Ladder Attention
  • Further reduces trainable parameter count
Bootstrap Recursive Attention Model
  • Enforces soft weight sharing between attention layers
  • Relaxes constraint later in training
State Space Models
Pre-trained Latent Space Attention
  • Uses Mamba-like SSM to prepare relevant context for attention modules
  • Trained as a standalone LLM
Co-trained Latent Space Attention
  • Uses Mamba-like SSM to prepare relevant context for attention modules
  • Trained co-operatively with the attention modules
Increased Prediction Accuracy
Improved Prediction Search
Multi-token Autoregression
  • Predicts a sequence of tokens
  • Could enhance LLM output efficiency and coherence (see the sketch below)
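The specific multi-token scheme is not described above. A minimal, assumption-labelled sketch: attach k independent output heads to the same hidden state so the model proposes the next k tokens in a single forward pass. All shapes and names are placeholders.

    # Minimal multi-token prediction sketch (assumed form): k independent
    # output heads read the same hidden state and propose the next k tokens
    # in one forward pass. All shapes are placeholders.
    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    d_model, vocab, k = 64, 1000, 4
    heads = [rng.normal(scale=0.02, size=(d_model, vocab)) for _ in range(k)]

    def predict_next_k(hidden_state):
        # One distribution per future position t+1 ... t+k.
        return [softmax(hidden_state @ H) for H in heads]

    h = rng.normal(size=d_model)
    proposals = [int(np.argmax(p)) for p in predict_next_k(h)]
    print(proposals)                       # greedy guess for the next k tokens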
Auxiliary Future State Prediction Model
  • Designed to anticipate the future context trajectory in order to provide prediction foresight
  • Measures sensitivity
Propagation of the Prediction Distribution
  • A form of soft decoding
  • Autoregression has access to auxiliary information
Twin Model Speculative Decoding
  • Uses a simple model to speculate many futures
  • A more complex model selects the best and longest overall prediction sequence (a toy loop is sketched below)
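A toy version of the draft-and-verify loop behind twin-model speculative decoding is sketched below. The two stand-in functions replace the real draft and target networks, and the acceptance rule here is simple greedy prefix agreement; production systems typically use a probabilistic acceptance test.

    # Toy draft-and-verify loop for twin-model speculative decoding. The two
    # stand-in functions replace the real draft and target networks; acceptance
    # here is greedy prefix agreement rather than probabilistic acceptance.
    def draft_model(seq):
        return (seq[-1] + 1) % 50                          # cheap, sometimes-wrong guesser

    def target_model(seq):
        return (seq[-1] + 1) % 50 if seq[-1] % 7 else 0    # the model we trust

    def speculative_step(seq, k=4):
        draft = []
        for _ in range(k):                                 # draft proposes k tokens
            draft.append(draft_model(seq + draft))
        accepted = []
        for tok in draft:                                  # target verifies the block
            if target_model(seq + accepted) == tok:
                accepted.append(tok)                       # agree: keep the cheap token
            else:
                accepted.append(target_model(seq + accepted))   # disagree: use target's token
                break
        return seq + accepted

    seq = [3]
    for _ in range(5):
        seq = speculative_step(seq)
    print(seq)    # several draft tokens are accepted per step when the models agree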
Look-ahead Decoding Model
  • Uses non-causal information while training
  • Could speed up convergence
  • Uses surrogate non-causal information for inference
Neural Monte-Carlo Tree Search
  • Treats the sequence prediction problem as a tree search
  • Leverages our work on neural search
Dynamic Context Compression
Context Filtering
Statistical Relevance Weighting
  • Uses a per-token score based on the occurrence pattern of tokens in the training corpus
  • Weights more specific tokens more highly than generic tokens (an IDF-style stand-in is sketched below)
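The precise occurrence-pattern score is not given above. A standard stand-in, shown as an assumption, is an inverse-document-frequency-style weight that rates tokens concentrated in few documents as more specific than tokens that appear everywhere.

    # Assumed stand-in for an occurrence-pattern relevance score: an IDF-style
    # weight that rates tokens appearing in few documents as more specific than
    # tokens appearing everywhere.
    import math
    from collections import Counter

    docs = [
        "the model predicts the next token",
        "the quarterly revenue of the company grew",
        "token embeddings compress the context",
    ]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))

    n_docs = len(docs)
    relevance = {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}
    for tok, score in sorted(relevance.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{tok:12s}{score:.3f}")            # specific tokens score highest
    print(f"{'the':12s}{relevance['the']:.3f}")   # generic token scores 0.0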
Learnable Relevance Weighting
  • Uses a trainable neural scoring layer to weight the tokens
  • Weights depend on how specific or generic each token is in the current discourse
Fixed Weighted Compression
  • Decimates the context into shingles
  • Compresses each shingle according to relevance
  • Performs attention on compressed shingles rather than raw tokens
Variable Weighted Compression
  • Segments the context into carts of consistent information content
  • Fills the carts with the most relevant information from the segment
  • Performs attention on carts rather than raw tokens
Latent Space Context
Pre-trained State Space Context Models
  • Uses a pre-trained SSM to compress the context
  • SSM is trained to store information most useful to its own predictive accuracy
  • State space is used as context for the LLM
Co-trained State Space Context Models
  • Trains the SSM to compress the context in parallel with the LLM using the context
  • SSM learns to store the information most useful to the LLM's predictive accuracy
For Research Partnerships
Research Access
Gain real-time access to our current and past research, innovations and experiments, including valuable insights from negative results.
First Rights
First right to acquire or license new innovations before they reach the wider market.
Research Prioritisation
Influence our research roadmap to align with your strategic goals, ensuring our work addresses your most pressing challenges.
Monthly Updates
A monthly update call on our progress, discoveries and results.
Bespoke Insights & Feedback
Receive tailored research insights and direct feedback on your specific areas of interest from our leading AI scientists.
To discuss research partnerships, please reach out to info@mlabsai.com.