Question regarding the two-layer linear (FNN) projection structure in EmbeddingGemma-300M

#33
by dragonkue

Hello,

First of all, thank you for your excellent work in releasing EmbeddingGemma-300M — it’s an outstanding embedding model and has already proven very valuable for my retrieval / semantic-search workflows.

I have been reviewing the model architecture and documentation and have a technical question regarding the head/projection portion (after pooling) of the network, which I hope you might clarify.

From my understanding, the model appears to use the following structure after the token-level Transformer and pooling step:

  • A Dense (linear) layer from dimension 768 → 3072, with activation set to identity
  • Followed by a second Dense (linear) layer from dimension 3072 → 768, also with activation identity
  • Then (presumably) a normalization (e.g., L2-norm) to produce the final 768-dim embedding

Because both layers use identity activations, mathematically this is equivalent to a single linear transformation from 768 → 768 (i.e., the two weight matrices multiply into one). However, the choice to expand to 3072 and then project back to 768 suggests a deliberate architectural decision. I’d appreciate if you could share any insight on the motivations behind this two-layer linear projection design.
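
To make sure I am describing the structure correctly, here is a minimal PyTorch sketch of the head as I understand it (768 → 3072 → 768 with identity activations, then L2 normalization), along with the algebraic collapse into a single 768 → 768 map. This is only my own reconstruction for illustration; the class and attribute names are hypothetical, not taken from the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Hypothetical reconstruction of the post-pooling head: two Dense
    layers with identity activation, followed by L2 normalization."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 3072):
        super().__init__()
        self.expand = nn.Linear(embed_dim, hidden_dim)   # 768 -> 3072, identity activation
        self.project = nn.Linear(hidden_dim, embed_dim)  # 3072 -> 768, identity activation

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.project(self.expand(pooled)), p=2, dim=-1)

head = ProjectionHead()
pooled = torch.randn(4, 768)  # stand-in for mean-pooled token representations

# With no non-linearity in between, the two weight matrices compose into a
# single 768x768 transformation plus a combined bias: W2 @ W1 and W2 @ b1 + b2.
W1, b1 = head.expand.weight, head.expand.bias     # shapes (3072, 768), (3072,)
W2, b2 = head.project.weight, head.project.bias   # shapes (768, 3072), (768,)
collapsed = pooled @ (W2 @ W1).T + (W2 @ b1 + b2)
two_layer = head.project(head.expand(pooled))     # before normalization
print(torch.allclose(collapsed, two_layer, atol=1e-4))  # True
```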

In particular, I’m very curious about:

  1. Was the 768 → 3072 → 768 expansion-then-projection design chosen primarily to increase the model’s expressive capacity, perhaps allowing an internal “wider” representation space (even without a non-linearity) before compressing back to embedding space?
  2. Given that the activation functions are identity, what practical benefit comes from first expanding and then projecting rather than using a direct 768→768 linear layer? For example, does it help with training stability, initialization dynamics, internal regularisation, or quantisation/truncation (e.g., supporting the Matryoshka Representation Learning scheme that allows embeddings to be truncated from 768 to 512, 256, or 128 dimensions; see the truncation sketch after this list), or any other downstream benefit?
  3. Does this structure specifically help when applying embedding truncation, quantisation (int4/int8), or on-device deployment scenarios (e.g., mobile/edge) by providing a “buffer” internal representation dimension?
  4. Or, alternatively, was the two-layer structure simply an implementation/engineering convenience (for example, to keep the expansion and projection weights separate, enable better hardware/TPU kernel fusion, simplify ONNX export, or satisfy internal library constraints) rather than purely a modelling/expressiveness choice?
  5. Finally, do you anticipate that future versions of the model might insert a non-linearity (or skip-connection) between these two projection layers, or are they intentionally kept linear for specific reasons?
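
To make question 2 concrete, this is the kind of Matryoshka-style truncation I have in mind downstream: keep the leading dimensions of the 768-dim output and re-normalize. A minimal sketch under that assumption (the helper name is mine, not an official API):

```python
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Matryoshka-style truncation: keep the first `dim` components and
    re-apply L2 normalization so cosine similarity remains meaningful."""
    return F.normalize(emb[..., :dim], p=2, dim=-1)

full = F.normalize(torch.randn(2, 768), p=2, dim=-1)  # stand-in for model output
for dim in (512, 256, 128):                           # the truncation sizes mentioned above
    print(dim, truncate_embedding(full, dim).shape)
```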

Your insights would be extremely helpful for researchers and practitioners (including myself) in better understanding the embedding architecture and designing fine-tuning/quantisation/truncation pipelines around it.

Thank you very much for your time and for sharing your outstanding work.

Hi
Personally, I think that the primary motivation for this design lies in the Embedding Matching Loss used during the distillation process.

According to the technical report [1], EmbeddingGemma is distilled from the Gemini Embedding model.
Note that the native embedding dimension of the Gemini Embedding teacher model is 3072 [2].

Therefore, the expansion from 768 to 3072 likely serves to align the student's representation with the teacher's output space.
This allows the model to calculate the embedding matching loss (as proposed in EmbedDistill [3]) directly against the teacher's 3072-dimensional vectors during training.
The subsequent projection back to 768 then learns to compress this teacher-aligned representation into the target size.
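
As a rough sketch of this conjecture (not the actual training code; the variable names and the MSE formulation are my own assumptions), the expanded 3072-dimensional activation could be matched directly against the teacher's embedding before the final projection back to 768:

```python
import torch
import torch.nn.functional as F

batch, d_student, d_teacher = 8, 768, 3072

expand = torch.nn.Linear(d_student, d_teacher)   # student head: 768 -> 3072
project = torch.nn.Linear(d_teacher, d_student)  # student head: 3072 -> 768

pooled = torch.randn(batch, d_student)           # stand-in for the student's pooled output
teacher = torch.randn(batch, d_teacher)          # stand-in for Gemini Embedding teacher vectors

expanded = expand(pooled)                        # lives in the teacher's 3072-dim space
# Embedding matching loss in the spirit of EmbedDistill [3]: pull the
# student's expanded representation toward the teacher's embedding.
match_loss = F.mse_loss(F.normalize(expanded, dim=-1),
                        F.normalize(teacher, dim=-1))

student_emb = F.normalize(project(expanded), p=2, dim=-1)  # final 768-dim embedding
print(match_loss.item(), student_emb.shape)
```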

Please note that this is purely my personal speculation.

References
[1] EmbeddingGemma: https://arxiv.org/abs/2509.20354
[2] Gemini Embedding: https://arxiv.org/abs/2503.07891
[3] EmbedDistill: https://arxiv.org/abs/2301.12005
