Text and Code Embeddings by Contrastive Pre-Training

Arvind Neelakantan * 1 Tao Xu * 1 Raul Puri 1 Alec Radford 1 Jesse Michael Han 1 Jerry Tworek 1
Qiming Yuan 1 Nikolas Tezak 1 Jong Wook Kim 1 Chris Hallacy 1 Johannes Heidecke 1 Pranav Shyam 1
Boris Power 1 Tyna Eloundou Nekoul 1 Girish Sastry 1 Gretchen Krueger 1 David Schnurr 1
Felipe Petroski Such 1 Kenny Hsu 1 Madeleine Thompson 1 Tabarak Khan 1 Toki Sherbakov 1 Joanne Jang 1
Peter Welinder 1 Lilian Weng 1
arXiv:2201.10005v1 [cs.CL] 24 Jan 2022

Abstract

Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over the previous best unsupervised and supervised text embedding models respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on the MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. Similarly to text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over prior best work on code search.

Figure 1. Average performance of unsupervised cpt-text models of different sizes across 22 tasks consisting of linear-probe classification, text search, and sentence similarity tasks. (Plot: average performance versus model size for the S-300M, M-1.2B, L-6B and XL-175B models.)

1. Introduction

Deep unsupervised learning with generative and embedding models has seen dramatic success in the past few years. Generative models (Peters et al., 2018; Raffel et al., 2019; van den Oord et al., 2016; Ramesh et al., 2021; Brown et al., 2020; Chen et al., 2021) are trained to maximize the likelihood of observed data while embedding models are trained to distinguish observed data from noise (Sohn, 2016; van den Oord et al., 2018; Radford et al., 2021; Jia et al., 2021; Gao et al., 2021; Izacard et al., 2021). Generative models have been shown to produce realistic content and benefit many downstream applications, reducing the need for labeled training datasets. In generative models, the information about the input is typically distributed over multiple hidden states of the model. While some generative models (Kingma & Welling, 2014; Kiros et al., 2015) can learn a single representation of the input, most autoregressive Transformer (Vaswani et al., 2017) models do not (Raffel et al., 2019; Brown et al., 2020; Chen et al., 2021; Ramesh et al., 2021). However, learning such a representation (or embedding) is necessary for many tasks. Systems that search over millions or billions of items require each entry to be embedded as a dense representation and an index built in advance to save computational costs at query time. These embeddings are useful features for classification tasks and can also enable data visualization applications via techniques such as clustering. Embedding models are explicitly optimized to learn a low-dimensional representation that captures the semantic meaning of the input (Radford et al., 2021; Jia et al., 2021; Giorgi et al., 2020; Gao et al., 2021; Izacard et al., 2021).

* Equal contribution. 1 OpenAI. Correspondence to: Arvind Neelakantan <[email protected]>.

In this work, we train embedding models using a contrastive learning objective with in-batch negatives (Sohn, 2016; Yih et al., 2011) on unlabeled data. The input is encoded with a Transformer encoder (Vaswani et al., 2017) and we leverage naturally occurring paired data to construct training data with no explicit labels. Text embedding models are trained on paired text data where we consider neighboring pieces of text on the Internet as positive pairs. Code embedding models treat the top-level docstring in a function along with its implementation as a (text, code) pair. The training signal of the contrastive objective on its own is not sufficient to learn useful representations, and we overcome this by initializing our models with other pre-trained models (Brown et al., 2020; Chen et al., 2021). Finally, we find that it is critical to use a sufficiently large batch to achieve optimal performance. We show that this simple recipe, combining pre-trained model initialization, large-batch contrastive learning and training at scale, can produce text and code embeddings that possess a broad range of capabilities.

We train a series of unsupervised text embedding models (cpt-text) of different sizes, ranging from 300M to 175B parameters, and observe a consistent performance improvement with increasing model size (Figure 1). On classification accuracy averaged across 7 linear-probe classification tasks in SentEval (Conneau & Kiela, 2018), our largest unsupervised model achieves new state-of-the-art results with a relative improvement of 4% and 1.8% over the previous best unsupervised (Giorgi et al., 2020) and supervised (Gao et al., 2021) text embedding models, respectively.

Text embedding in previous work was studied under different domains, varying in data, training objective and model architecture. Specifically, sentence embedding (Reimers & Gurevych, 2019; Gao et al., 2021; Giorgi et al., 2020) and neural information retrieval (Lee et al.; Guu et al., 2020; Karpukhin et al., 2020a; Sachan et al., 2021; Izacard et al., 2021) have remained different research topics evaluated on distinct benchmarks, even though both aim to learn high-quality text representations. However, we find that the same model that achieves good performance on sentence embedding benchmarks, as discussed above, is also able to obtain impressive results on large-scale information retrieval. When evaluated on the MSMARCO passage ranking task (Nguyen et al., 2016) to search over 4M passages, cpt-text gets a relative improvement of 23.4% over the previous best unsupervised method (Robertson, 2009). On the task of searching over 21M documents from Wikipedia, cpt-text obtains a relative improvement of 14.7% and 10.6% over previous unsupervised methods (Izacard et al., 2021) for Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), respectively. On TriviaQA, our unsupervised method is even competitive with fine-tuned models.

Next, we train code embedding models (cpt-code) using the same recipe. Our models learn via (text, code) pairs extracted from open source code. We evaluate our model on CodeSearchNet (Husain et al., 2020), a commonly used code search benchmark, where the task is to find the most relevant code snippet given a natural language query. Our models achieve new state-of-the-art results with a 20.8% relative improvement over the previous best result (Guo et al., 2021). Unlike text embedding models, we observe no performance improvement on code search when increasing the number of parameters of cpt-code from 300M to 1.2B.

Finally, we experiment with fine-tuning our models on several supervised datasets and study the transfer learning performance. When fine-tuned on NLI (Natural Language Inference) datasets, we see a further boost in linear-probe classification, outperforming the previous best transfer method (Gao et al., 2021) by 2.2%. On SST-2 sentiment classification (Socher et al., 2013), we find that our representations are sufficiently descriptive that even a simple k-NN classifier achieves results comparable to a linear-probe classifier. Interestingly, zero-shot performance with our embeddings outperforms the supervised neural network models introduced along with the release of the SST-2 dataset. We also fine-tune the unsupervised model on MSMARCO and evaluate it on a suite of zero-shot search tasks in the BEIR benchmark (Thakur et al., 2021). In the transfer setting, our models achieve a 5.2% relative improvement over previous methods (Izacard et al., 2021) and are comparable even with methods (Santhanam et al., 2021; Formal et al., 2021; Wang et al., 2020) that demand substantially more computation at test time.

2. Approach

Our models are trained with a contrastive objective on paired data. In this section, we present more details on the model architecture and the training objective. The training set consists of paired samples {(x_i, y_i)}_{i=1}^{N}, where (x_i, y_i) corresponds to a positive example pair, indicating that x_i and y_i are semantically similar or contextually relevant.

2.1. Model

Given a training pair (x, y), a Transformer (Vaswani et al., 2017) encoder E is used to process x and y independently. The encoder maps the input to a dense vector representation or embedding (Figure 2). We insert two special token delimiters, [SOS] and [EOS], at the start and end of the input sequence respectively. The hidden state from the last layer corresponding to the special token [EOS] is considered the embedding of the input sequence.

Figure 2. The encoder E maps input x to embedding v_x. Special tokens, [SOS] and [EOS], are appended to the start and end of the input sequence respectively. The last layer hidden state corresponding to the token [EOS] is extracted as the embedding of the input sequence.

Figure 3. The encoder E maps inputs x and y to embeddings v_x and v_y independently. The similarity score between x and y is defined as the cosine similarity between these two embedding vectors.

The Transformer encoder maps the inputs, x and y, to embeddings, v_x and v_y respectively, and the similarity between the two inputs is quantified by the cosine similarity between their embeddings (Figure 3):

v_x = E([SOS]_x ⊕ x ⊕ [EOS]_x)
v_y = E([SOS]_y ⊕ y ⊕ [EOS]_y)
sim(x, y) = (v_x · v_y) / (‖v_x‖ · ‖v_y‖)

where ⊕ is an operation that concatenates two strings together. We found that using different delimiters leads to more stable training. For x, we use ‘[’ as [SOS]_x and ‘]’ as [EOS]_x, while we use ‘{’ and ‘}’ as [SOS]_y and [EOS]_y respectively for y.

2.2. Training Objective

The paired samples in the training set are contrasted against in-batch negatives (Yih et al., 2011; Sohn, 2016). Contrastive learning with in-batch negatives has been widely used for unsupervised representation learning in prior work (Radford et al., 2021; Jia et al., 2021; Chen et al., 2020; Izacard et al., 2021). For each example in a mini-batch of M examples, the other (M − 1) examples in the batch are used as negative examples. The use of in-batch negatives enables re-use of computation in both the forward and the backward pass, making training highly efficient. The logits for one batch form an M × M matrix, where each entry logit(x_i, y_j) is given by

logit(x_i, y_j) = sim(x_i, y_j) · exp(τ),  ∀(i, j), i, j ∈ {1, 2, ..., M}

where τ is a trainable temperature parameter.

Only entries on the diagonal of the matrix are considered positive examples. The final training loss is the sum of the cross entropy losses along the row and the column directions, as described in the following numpy-style pseudocode:

labels = np.arange(M)
l_r = cross_entropy(logits, labels, axis=0)
l_c = cross_entropy(logits, labels, axis=1)
loss = (l_r + l_c) / 2

We initialize our models with pre-trained generative language models. cpt-text is initialized with GPT models (Brown et al., 2020) and cpt-code is initialized with Codex models (Chen et al., 2021). When fine-tuning our models (Section 3), the supervised training data, such as the NLI datasets, contain explicit negative examples and they are used along with the in-batch negatives.
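The pseudocode above leaves the similarity computation, the temperature scaling and the cross_entropy helper implicit. The following self-contained NumPy sketch spells out one way those pieces could fit together; the explicit cross_entropy implementation, the axis convention, and the random test embeddings are illustrative assumptions rather than the paper's actual code.

import numpy as np

def cross_entropy(logits, labels, axis):
    # Mean negative log-likelihood of the true labels under a softmax along `axis`.
    logits = logits - logits.max(axis=axis, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    idx = np.arange(len(labels))
    picked = log_probs[labels, idx] if axis == 0 else log_probs[idx, labels]
    return -picked.mean()

def contrastive_loss(v_x, v_y, tau):
    # Cosine similarity between every x and every y in the batch, scaled by exp(tau).
    v_x = v_x / np.linalg.norm(v_x, axis=1, keepdims=True)
    v_y = v_y / np.linalg.norm(v_y, axis=1, keepdims=True)
    logits = (v_x @ v_y.T) * np.exp(tau)          # M x M matrix; positives on the diagonal
    labels = np.arange(len(v_x))
    l_r = cross_entropy(logits, labels, axis=0)   # each y contrasted against all x in the batch
    l_c = cross_entropy(logits, labels, axis=1)   # each x contrasted against all y in the batch
    return (l_r + l_c) / 2

M, d = 8, 16
rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(M, d)), rng.normal(size=(M, d)), tau=0.0))

With M examples per batch, each positive pair is contrasted against the other M − 1 in-batch examples in both directions, which is why larger batches (Table 1, Section 3.4.1) supply more, and more likely hard, negatives.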
3. Results

Our models are trained on naturally occurring paired data. cpt-text models are trained on Internet data with neighboring pieces of text as positive pairs for the contrastive objective. The code embedding cpt-code models use (text, code) pairs extracted from open source code. As discussed in Section 3.4.1, a sufficiently large batch size is crucial to achieve good performance with our setup. Table 1 lists the batch sizes used to train the models of different sizes.

Model  Parameters  Embed Dimensions  Batch size
S      300M        1024              12288
M      1.2B        2048              6912
L      6B          4096              5896
XL     175B        12288             4976

Table 1. Batch size used to train the models of different sizes.
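The (text, code) pairs mentioned above come from the top-level docstring of a function and its implementation (Section 2). The paper does not describe the extraction pipeline, so the sketch below, which parses Python source with the ast module, is only one plausible way to produce such pairs; the helper name and filtering choices are assumptions.

import ast
import textwrap

def docstring_code_pairs(source: str):
    # Yield (text, code) pairs: the docstring is the text side,
    # the rest of the function body is the code side.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if not doc:
                continue
            # Drop the docstring statement, keep the implementation.
            body = node.body[1:] if isinstance(node.body[0], ast.Expr) else node.body
            code = "\n".join(ast.unparse(stmt) for stmt in body)
            yield doc.strip(), textwrap.dedent(code)

example = '''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
'''
print(list(docstring_code_pairs(example)))
# [('Return the sum of two numbers.', 'return a + b')]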

We evaluate our text embedding models on a broad range of tasks: linear-probe classification, sentence similarity, and semantic search. While sentence embedding (Reimers & Gurevych, 2019; Gao et al., 2021; Giorgi et al., 2020) methods report results only on embedding benchmarks and neural information retrieval methods (Lee et al.; Guu et al., 2020; Karpukhin et al., 2020a; Sachan et al., 2021; Izacard et al., 2021) report results only on search benchmarks, we use the same unsupervised model across all these tasks.

3.1. Text Embedding

The SentEval benchmark (Conneau & Kiela, 2018) is widely adopted to assess the quality of sentence embeddings, consisting of a broad collection of tasks in the categories of linear-probe classification and sentence similarity, and we use the same benchmark to evaluate ours.

3.1.1. Linear Probe Classification

When evaluated on linear-probe classification, the embeddings are used as features to train a linear classifier to solve a variety of downstream tasks. The results in Table 2 demonstrate a clear advantage of larger model sizes producing better features for improved classification performance. In the transfer learning setup, we fine-tune unsupervised cpt-text models on the SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets using entailment pairs as positive examples and contradiction pairs as negative examples. In both the unsupervised learning and transfer learning settings, we achieve state-of-the-art results.

3.1.2. Zero-shot and k-NN Classification

In this section, we discuss results using zero-shot classification and k-nearest neighbor classification on the SST-2 binary sentiment classification task (Socher et al., 2013). We experiment with the 6B (L) cpt-text model fine-tuned on NLI data for this study. In the first zero-shot experiment, each input text is assigned one of the two labels (‘positive’, ‘negative’) based on which label has its embedding closest to the input text embedding. The performance can be further improved by prompting, where we use a simple label description, ‘this is an example of a positive/negative movie review.’, instead of a single word. This zero-shot usage of embeddings is novel compared to prior work on embeddings, and it is interesting to note that our zero-shot results are better than the supervised neural network results reported along with the release of the dataset (Socher et al., 2013). In the k-NN classification experiment, given an input text, the prediction is the majority label among the 256 training examples closest to the test input in the embedding space. As shown in Table 3, the k-NN classifier, without any task-specific tuning of trainable parameters, achieves results comparable to a linear classifier.

3.1.3. Sentence Similarity

On sentence similarity tasks in SentEval, we find that our models perform worse than previous SOTA methods (Table 4). Sentence similarity is not a completely well-defined downstream task (e.g. are the sentences ‘Jack loves Jill’ and ‘Mary loves chocolates’ similar?).[1,2] For example, Goodman (1972) argues that two objects can be infinitely similar or dissimilar (Vervaeke et al., 2012). A possible explanation for why our models perform better than prior work on search and classification but not on these tasks is that our models might not be optimized for the specific definition used by these sentence similarity benchmarks. It is important to note that previous embedding search methods do not report performance on sentence similarity tasks (Karpukhin et al., 2020a; Sachan et al., 2021; Izacard et al., 2021). More discussion on this phenomenon is presented in Section 3.4.2.

[1] https://twitter.com/yoavgo/status/1431299645570011142
[2] https://twitter.com/yoavgo/status/1483565266575540225?s=20

3.2. Text Search

Previous work on training embedding methods for search typically requires fine-tuning on a particular text search dataset (Karpukhin et al., 2020a; Sachan et al., 2021; Qu et al., 2021). It is also common to have a multi-step setup where fine-tuned models rely on an expensive query and document cross-attention encoder in the final step (Qu et al., 2021; Wang et al., 2020). In contrast, we push the limits of using a single embedding model for large-scale semantic search.

3.2.1. Large-Scale Search

First, we evaluate our models on several large-scale text search benchmarks. MSMARCO (Nguyen et al., 2016) requires the model to search over 4M documents, while Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017) involve searching over 21M Wikipedia documents. We use the FAISS library (Johnson et al., 2019) to build the vector indices for approximate k-nearest neighbor search. The same unsupervised model discussed previously achieves impressive performance on semantic search. Table 5 demonstrates that cpt-text outperforms prior unsupervised approaches by a big margin and larger model sizes consistently lead to improved performance. Surprisingly, on TriviaQA, our model is even competitive with fine-tuned models.
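The paper states that FAISS is used to build the vector indices but gives no code; the snippet below is a minimal sketch of that kind of pipeline. The random vectors stand in for a hypothetical embed() call, and the exact flat index is an illustrative choice; a production setup would use real cpt-text embeddings and an approximate index (e.g. IVF or HNSW) rather than exact search.

import faiss
import numpy as np

d = 1024                                        # embedding dimension of the S model (Table 1)
rng = np.random.default_rng(0)

# Stand-ins for document and query embeddings from a hypothetical embed() call.
doc_vecs = rng.normal(size=(100_000, d)).astype("float32")
query_vecs = rng.normal(size=(3, d)).astype("float32")

# L2-normalize so that inner product equals cosine similarity.
faiss.normalize_L2(doc_vecs)
faiss.normalize_L2(query_vecs)

index = faiss.IndexFlatIP(d)                    # exact inner-product search over all documents
index.add(doc_vecs)

scores, ids = index.search(query_vecs, 20)      # top-20 document ids and scores per query
print(ids[0][:5], scores[0][:5])
# Metrics such as Recall@20 or MRR@10 (Tables 5 and 9) are then computed by
# checking where the gold passage appears in each ranked id list.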

MR CR SUBJ MPQA SST TREC MRPC Avg.


Unsupervised
BERT (Devlin et al., 2019) 78.7 86.2 94.4 88.7 84.4 92.8 69.4 84.9
SimCSE (Gao et al., 2021) 84.7 88.6 95.4 87.5 89.5 95.0 72.4 87.6
DECLUTR (Giorgi et al., 2020) 85.2 90.7 95.8 88.5 90.0 93.2 74.6 88.3
cpt-text S 87.1 90.1 94.9 88.3 91.8 95.2 71.6 88.4
cpt-text M 89.0 90.9 96.7 89.6 93.9 96.6 73.6 89.9
cpt-text L 90.6 92.6 97.0 90.6 95.3 97.0 73.6 90.9
cpt-text XL 92.2 93.5 97.4 91.5 96.2 97.4 74.1 91.8
Transfer from NLI data
SBERT (Reimers & Gurevych, 2019) 84.9 90.1 94.5 90.3 90.7 87.4 75.9 87.7
SimCSE (Gao et al., 2021) 88.4 92.5 95.2 90.1 93.3 93.8 77.7 90.2
cpt-text S 87.3 91.0 94.6 90.5 91.4 95.0 75.6 89.3
cpt-text M 89.8 92.7 95.7 91.3 95.3 96.6 76.5 91.1
cpt-text L 90.8 93.5 96.2 91.2 95.7 96.0 76.9 91.5
cpt-text XL 92.4 93.9 97.0 91.8 95.8 96.4 78.1 92.2

Table 2. cpt-text models of different sizes, ranging from 300M (S) to 175B (XL), are compared to previous work on linear-probe
classification tasks in SentEval. We report performance of unsupervised models, as well as those fine-tuned on NLI data.

Method                     Accuracy
Zero-shot                  88.1
Zero-shot with prompting   89.1
k-NN                       93.3
Linear-probe               95.7
Full fine-tuned SOTA       97.5

Table 3. Comparison of different classification strategies using the 6B cpt-text model fine-tuned on NLI data for SST-2 binary sentiment task (Socher et al., 2013). Our zero-shot results are better than the 85.4% accuracy obtained by supervised neural networks reported along with the release of the dataset (Socher et al., 2013).

STS                        -12    -13    -14    -15    -16    Avg
Unsupervised
SimCSE (Gao et al., 2021)  72.9   84.0   75.6   84.8   81.8   79.8
cpt-text S                 62.1   60.0   62.0   71.8   73.7   65.9
cpt-text M                 62.7   62.8   64.6   73.9   75.3   67.9
cpt-text L                 62.4   66.4   67.6   76.0   77.5   70.0
cpt-text XL                64.1   67.5   68.4   76.7   78.7   71.1
Transfer from NLI
SimCSE (Gao et al., 2021)  77.5   87.3   82.4   86.7   83.9   83.6
cpt-text S                 72.8   80.6   78.7   84.7   82.0   79.8
cpt-text M                 73.7   80.2   78.9   85.0   82.8   80.1
cpt-text L                 71.8   79.7   79.0   85.8   84.0   80.1
cpt-text XL                72.3   80.3   78.9   85.1   85.1   80.3

Table 4. cpt-text performs worse than the previous best sentence embedding method on sentence similarity tasks. We investigate this result in more detail in Section 3.4.2.

                 MSMARCO  NQ          TriviaQA
Fine-tuned SOTA  44.3     84.8, 89.8  84.1, 87.8
Unsupervised
BM25             18.4     62.9, 78.3  76.4, 83.2
ICT              -        50.9, 66.8  57.5, 73.6
MSS              -        59.8, 74.9  68.2, 79.4
Contriever       -        67.2, 81.3  74.2, 83.2
cpt-text S       19.9     65.5, 77.2  75.1, 81.7
cpt-text M       20.6     68.7, 79.6  78.0, 83.8
cpt-text L       21.5     73.0, 83.4  80.0, 86.8
cpt-text XL      22.7     78.8, 86.8  82.1, 86.9

Table 5. Evaluation of unsupervised cpt-text models of different sizes on several large-scale text search benchmarks. We report MRR@10 on MSMARCO and Recall@20, Recall@100 for NQ and TriviaQA as done in prior work. Results for training with Inverse Cloze Task (ICT) and masked salient spans (MSS) objectives are taken from Sachan et al. (2021). cpt-text achieves the best results among unsupervised methods, surpassing keyword search methods on MSMARCO (Robertson, 2009) and embedding based methods (Izacard et al., 2021) on NQ and TriviaQA.
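Table 3 above compares zero-shot, k-NN and linear-probe classification with the same embeddings. A minimal sketch of the zero-shot and k-NN procedures described in Section 3.1.2 is shown below; the embed() placeholder (random vectors here) stands in for the cpt-text model and is an assumption, while the label prompt follows the wording quoted in the paper.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def embed(texts, d=1024):
    # Placeholder for the cpt-text embedding call (assumption); returns random vectors.
    return rng.normal(size=(len(texts), d))

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Zero-shot: pick the label whose prompted description embeds closest to the input.
labels = ["positive", "negative"]
label_vecs = embed([f"this is an example of a {l} movie review." for l in labels])

def zero_shot(texts):
    sims = cosine(embed(texts), label_vecs)           # shape (num_texts, num_labels)
    return [labels[i] for i in sims.argmax(axis=1)]

# k-NN: majority label among the k closest training examples in embedding space.
def knn_predict(texts, train_vecs, train_labels, k=256):
    sims = cosine(embed(texts), train_vecs)
    preds = []
    for row in sims:
        top = np.argsort(-row)[:k]
        preds.append(Counter(train_labels[i] for i in top).most_common(1)[0][0])
    return preds

train_texts = ["a moving, beautifully acted film", "a dull and lifeless remake"]
train_labels = ["positive", "negative"]
print(zero_shot(["one of the best movies of the year"]))
print(knn_predict(["one of the best movies of the year"], embed(train_texts), train_labels, k=2))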

3.2.2. BEIR Search

Next, we evaluate our models on 11 zero-shot search tasks in the BEIR evaluation suite (Thakur et al., 2021). First, we observe that our unsupervised model performs competitively even with some previous embedding methods that leverage supervised MSMARCO data (Xiong et al., 2020; Hofstätter et al., 2021). Keyword-based BM25 (Robertson, 2009) achieves the best results in the unsupervised setting, while cpt-text achieves the best transfer learning results.

In the transfer setting, our models achieve a 5.2% relative improvement over the previous best embedding method (Izacard et al., 2021). They also outperform docT5query (Nogueira et al., 2019a), which relies on a fine-tuned T5 model (Raffel et al., 2019) for document expansion. cpt-text results are competitive even with methods that use substantially more compute at test time. BM25+CE (Wang et al., 2020) uses keyword search to select the top 100 documents, which are then re-ranked by a cross-attention neural network encoder. The ranking encoder network performs computationally expensive joint query and document attention and cannot exploit indexing and approximate nearest neighbor algorithms for fast and efficient search at query time. Several other existing works take this approach of leveraging more computational resources at query time to obtain better search performance. ColBERT v2 (Santhanam et al., 2021) is a multi-vector method that represents the query and the documents as a set of vectors, and employs a multi-step retrieval procedure to obtain relevant documents. Splade v2 (Formal et al., 2021) represents queries and documents as sparse vectors of size equivalent to the vocabulary of the BERT encoder (Devlin et al., 2019). Our cpt-text models compute only one dense embedding per document, which is indexed offline, and do not depend on any cross-attention re-ranker at query time.

3.3. Code Search

We evaluate our code embedding models on the code search task using the CodeSearchNet benchmark (Husain et al., 2020). Given a natural language query, the model is expected to retrieve the relevant code block among 1K candidates. The models are evaluated on 6 programming languages and our model achieves state-of-the-art results (Table 7). Unlike with text embeddings, we do not see a performance improvement with increased model size for code embeddings.

We also evaluate on a harder setting of finding the relevant code block among 10K candidates instead of 1K. Here, we compare the performance of cpt-text models against cpt-code models (Table 8). It is interesting to see that text embedding performs fairly well in code search, especially in Python. We see a drop in performance for code embedding models with increased distractors and still don't see bigger models giving a boost in search performance.

3.4. Analysis

3.4.1. Effect of Batch Size

Our ablation study highlights the effect of the model's batch size on the final performance. Table 9 compares the performance of the S (300M) cpt-text model trained with different batch sizes on the NQ development set. Since we train with in-batch negatives, a larger batch increases the chances of having hard negatives in a batch, resulting in a significant performance boost.

3.4.2. Training Behavior

We observe that as we train our models for longer, the performance on search and classification tasks increases while the performance on sentence similarity tasks decreases (Figure 4). As discussed previously, sentence similarity is not a well-defined task. A hypothesis is that search tasks and sentence similarity tasks might have contradicting definitions. For example, a sentence and its negation could be considered relevant during search, but not "similar" in sentence similarity tasks. It is also important to note that previous embedding search methods do not report performance on sentence similarity tasks (Karpukhin et al., 2020a; Sachan et al., 2021; Izacard et al., 2021) and previous sentence embedding methods do not evaluate on search tasks (Reimers & Gurevych, 2019; Giorgi et al., 2020; Gao et al., 2021). When deciding which model checkpoints to use for evaluation, we assigned higher importance to search and classification tasks as they are commonly associated with clearly defined real-world applications, while sentence similarity tasks are less so.

4. Related Work

The goal of representation learning (Bengio et al., 2012) is to learn an embedding space in which similar examples stay close to each other while dissimilar ones are far apart (Hadsell et al., 2006). In contrastive learning, the learning procedure is formulated as a classification problem given similar and dissimilar candidates (Chopra et al., 2005; Gutmann & Hyvärinen, 2010; Schroff et al., 2015; Sohn, 2016; van den Oord et al., 2018). Recent work relies on a contrastive objective to learn representations for images (Wu et al., 2018; He et al., 2020; Chen et al., 2020; Zbontar et al., 2021), text, or both jointly (Lu et al., 2019; Sun et al., 2019; Kim et al., 2021; Radford et al., 2021; Khosla et al., 2020). In self-supervised contrastive learning, positive samples can be collected in various ways, including by creating an augmented version of the original input without modifying the semantic meaning (Gao et al., 2021), by grouping samples within the same context (Giorgi et al., 2020; Izacard et al., 2021), or by collecting data about the same object from different views (Tian et al., 2019).

covid nfc fiqa arg. touche quora scifact climate dbp. hotpot fever Avg.
Unsupervised
BM25 (Robertson, 2009) 65.6 32.5 23.6 31.5 36.7 78.9 66.5 21.3 31.3 60.3 75.3 47.6
Contriever (Izacard et al., 2021) 27.4 31.7 24.5 37.9 19.3 83.5 64.9 15.5 29.2 48.1 68.2 40.9
cpt-text S 52.9 32.0 34.1 38.7 21.0 68.1 65.4 15.8 27.2 51.5 57.1 42.2
cpt-text M 44.3 34.5 37.3 41.2 23.3 70.3 68.3 15.6 29.6 53.0 58.2 43.2
cpt-text L 42.7 36.9 39.7 39.2 22.8 68.7 71.2 16.1 31.2 54.3 63.8 44.2
Transfer from MSMARCO
TAS-B (Hofstätter et al., 2021) 48.1 31.9 30.0 42.9 16.2 83.5 64.3 22.8 38.4 58.4 70.0 46.0
ANCE (Xiong et al., 2020) 65.4 23.7 29.5 41.5 24.0 85.2 50.7 19.8 28.1 45.6 66.9 43.7
Contriever (Izacard et al., 2021) 59.6 32.8 32.9 44.6 23.0 86.5 67.7 23.7 41.3 63.8 75.8 50.2
cpt-text S 67.9 33.2 38.4 47.0 28.5 70.6 67.2 18.5 36.2 59.4 72.1 49.0
cpt-text M 58.5 36.7 42.2 49.2 29.7 69.7 70.4 19.9 38.6 63.1 77.0 50.5
cpt-text L 56.2 38.0 45.2 46.9 30.9 67.7 74.4 19.4 41.2 64.8 75.6 50.9
cpt-text XL 64.9 40.7 51.2 43.5 29.1 63.8 75.4 22.3 43.2 68.8 77.5 52.8
docT5query (Nogueira et al., 2019a) 71.3 32.8 29.1 34.9 34.7 80.2 67.5 20.1 33.1 58.0 71.4 48.5
BM25+CE (Wang et al., 2020) 75.7 35.0 34.7 31.1 27.1 82.5 68.8 25.3 39.2 70.7 81.9 52.0
ColBERT v2 (Santhanam et al., 2021) 73.8 33.8 35.6 46.3 26.3 85.2 69.3 17.6 44.6 66.7 78.5 52.5
Splade v2 (Formal et al., 2021) 71.0 33.4 33.6 47.9 27.2 83.8 69.3 23.5 43.5 68.4 78.6 52.7

Table 6. Comparison of cpt-text to previous methods on 11 zero-shot search tasks in the BEIR evaluation suite (Thakur et al.,
2021). Results are reported both in the unsupervised data setting and in the transfer data setting. cpt-text outperforms previous best
embedding methods (Xiong et al., 2020; Hofstätter et al., 2021; Izacard et al., 2021) in both the settings. In the unsupervised setting,
BM25 (Robertson, 2009) still achieves the best performance while in the transfer setting cpt-text is competitive with methods that
use substantially more compute at test time (Wang et al., 2020; Santhanam et al., 2021; Formal et al., 2021).

              Go    Ruby  Python  Java  JS    PHP   Avg.
CodeBERT      69.3  70.6  84.0    86.8  74.8  70.6  76.0
GraphCodeBERT 84.1  73.2  87.9    75.7  71.1  72.5  77.4
cpt-code S    97.7  86.3  99.8    94.0  86.0  96.7  93.4
cpt-code M    97.5  85.5  99.9    94.4  86.5  97.2  93.5

Table 7. Comparison of cpt-code on code search across 6 programming languages (Husain et al., 2020) with CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2021). The task requires finding the relevant code block among 1K candidates for a given natural language query. cpt-code performs substantially better than previous methods on all the languages.

              Go    Ruby  Python  Java  JS    PHP   Avg.
cpt-text S    60.6  58.9  92.6    48.4  52.8  47.6  60.1
cpt-text M    65.4  63.1  91.4    47.9  53.5  43.1  60.7
cpt-code S    90.4  80.6  98.8    81.9  76.1  85.3  85.5
cpt-code M    90.0  89.1  98.9    81.1  75.6  85.1  85.0

Table 8. Comparison of cpt-code vs cpt-text on large scale code search (Husain et al., 2020). The task is to retrieve the relevant code block among 10K candidates for a given natural language query. It is interesting to note that cpt-text performs quite well on Python code search without explicitly training on (text, code) pairs.

Batch Size   MRR@10
1536         71.4
12288        84.7

Table 9. Performance of the cpt-text 300M model on NQ dev set given different training batch sizes.

Learning word embeddings is a well studied research area (Brown et al., 1992; Gutmann & Hyvärinen, 2010; Mikolov et al., 2013; Pennington et al., 2014). Learning low-dimensional representations of larger text pieces, denser than raw term-based vectors, has been studied extensively as well (Deerwester et al., 1990; Yih et al., 2011). Most of the recent models for learning sentence embeddings rely on supervised NLI datasets, using entailment pairs as positive examples and contradiction pairs as (hard) negatives. SBERT (Reimers & Gurevych, 2019) trained a siamese network to learn a representation where sentence similarity is estimated by the cosine similarity between embeddings. Li et al. (2020) improves the embedding space to be isotropic via normalizing flows.

Figure 4. Performance of the M (1.2B) cpt-text model on classification, search and sentence similarity tasks at different training steps. While the performance on search and classification improves with longer training, the performance on sentence similarity degrades. (Plot: performance versus training steps, with curves for SentEval, NQ, and STS-12 through STS-16.)

The whitening operation is another alternative operation to improve the isotropy of the embedding space (Su et al., 2021). It is typical to initialize such models with a pre-trained language model (Devlin et al., 2019) before training on NLI datasets.

Several methods have been studied for unsupervised or self-supervised sentence embedding learning (Logeswaran & Lee, 2018; Zhang et al., 2020; Gao et al., 2021). Common approaches consider sentences within the same context as semantically similar samples (Kiros et al., 2015; Logeswaran & Lee, 2018). To create positive training pairs with augmented samples, a diverse set of text augmentation operations has been explored, including lexicon-based distortion (Wei & Zou, 2019), synonym replacement (Kobayashi, 2018), back-translation (Fang & Xie, 2020), cut-off (Shen et al., 2020) and dropout (Gao et al., 2021). However, unsupervised sentence embedding models still perform notably worse than supervised sentence encoders.

Large-scale text search based on dense embeddings and neural information retrieval (neural IR) has the potential to generalize better than keyword matching in classic IR systems. Neural IR systems encode documents at the indexing stage and then perform nearest neighbor search (Johnson et al., 2019) at query time (Lin et al., 2021). Neural IR models are usually learned by fine-tuning a pre-trained language model on a supervised search corpus (Lee et al.; Guu et al., 2020; Karpukhin et al., 2020b; Lewis et al., 2020). Many SOTA search models combine classical IR with neural IR in a staged setup, where the candidates are first narrowed down by BM25 keyword search (Robertson, 2009) and then re-ranked by joint query and document neural encoders (Nogueira et al., 2019b; Qu et al., 2021). Xiong et al. (2020) proposed ANCE, a contrastive learning framework for learning text representations for dense retrieval using mined hard negatives. Other unsupervised retriever methods use the Inverse Cloze Task or masked salient spans to achieve significant improvement on ODQA tasks (Sachan et al., 2021). In comparison to most prior work, we find that with a large enough batch size, it is possible to achieve good search performance without using supervised data. Finally, the recently published Contriever (Izacard et al., 2021) is most similar to our work on learning text embeddings for text search using contrastive learning on unlabeled data.

Semantic code search refers to the task of retrieving code relevant to a query in natural language. The CodeSearchNet challenge (Husain et al., 2020) presents a set of benchmark code search tasks in different programming languages, as well as a simple baseline model to predict embeddings of query and code via contrastive learning on a dataset of (text, code) pairs. ContraCode (Jain et al., 2021) uses a contrastive learning task of identifying functionally similar programs, where the functionally similar samples are generated via source-to-source compiler transformations. CodeBERT (Feng et al., 2020) learns to predict semantic similarity with a pre-trained language model and GraphCodeBERT (Guo et al., 2021) further improves the performance on the CodeSearchNet benchmark by adding pre-training tasks on code structure.

5. Broader Impacts

Prior research has shown that text representation models encode the biases present in their training data, including those which are discriminatory towards protected groups such as Black people or women (Bolukbasi et al., 2016; Caliskan et al., 2017; May et al., 2019; Zhao et al., 2018; Rudinger et al., 2018). Biases encoded in embedding models may cause representational harms[3] by reinforcing existent societal biases in the text corpus, and further propagating them in the downstream tasks of embedding models.

[3] Representational harms occur when systems reinforce the subordination of some groups along the lines of identity, e.g. stereotyping or denigration (Crawford, 2017).

Therefore, we encourage further research on two research agendas: (a) developing robust evaluation methodologies for multiple classes of bias in training data and pre-trained models, and (b) developing and improving methods for mitigating encoded bias, including fine-tuning to reduce bias in pre-trained models (Caliskan et al., 2017; May et al., 2019; Bolukbasi et al., 2016; Liang et al., 2020; Park et al., 2018; Solaiman & Dennison, 2021). Until we have robust evaluation methodology, it is important to restrict and monitor the use of the model in downstream applications.

This is particularly important for applications where the risk of representational harm is great and those where biased representations may influence the allocation of resources and opportunities to people.

Our embedding models are trained with large batch sizes and require substantial computation resources. While this training regime is environmentally and computationally costly, there are promising paths forward to amortize and offset these costs while allowing users to benefit from the capabilities of these models, for example through safe public access to large pre-trained language models, and efficient training pipelines that leverage improved model architectures and training schemes. We encourage further research and implementation efforts in these areas.

6. Conclusion

We showed that contrastive pre-training on unsupervised data with a sufficiently large batch size can lead to high quality vector representations of text and code. Our models achieved new state-of-the-art results in linear-probe classification, text search and code search. We find that our models underperformed on sentence similarity tasks and observed unexpected training behavior with respect to these tasks. Finally, we discussed the broader impact of our work on society.

References

Bengio, Y., Courville, A. C., and Vincent, P. Representation learning: A review and new perspectives. Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2012.

Bolukbasi, T., Chang, K., Zou, J. Y., Saligrama, V., and Kalai, A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, 29, 2016.

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2015.

Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C., and Mercer, R. L. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–480, 1992.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

Caliskan, A., Bryson, J. J., and Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. E. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 2020.

Chopra, S., Hadsell, R., and LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2005.

Conneau, A. and Kiela, D. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.

Crawford, K. The trouble with bias. Keynote at NeurIPS, 2017.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). ACL, 2019.

Fang, H. and Xie, P. CERT: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766, 2020.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., and Zhou, M. CodeBERT: A pre-trained model for programming and natural languages. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

Formal, T., Lassance, C., Piwowarski, B., and Clinchant, S. SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086, 2021.

Gao, T., Yao, X., and Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.

Giorgi, J. M., Nitski, O., Bader, G. D., and Wang, B. DeCLUTR: Deep contrastive learning for unsupervised textual representations. In Proceedings of ACL/IJCNLP, 2020.

Goodman, N. Seven strictures on similarity. Bobbs Merrill, 1972.

Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., Zhou, L., Duan, N., Svyatkovskiy, A., Fu, S., Tufano, M., Deng, S. K., Clement, C. B., Drain, D., Sundaresan, N., Yin, J., Jiang, D., and Zhou, M. GraphCodeBERT: Pre-training code representations with data flow. In International Conference on Learning Representation (ICLR), 2021.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Conference on Artificial Intelligence and Statistics. PMLR, 2010.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Computer Vision and Pattern Recognition (CVPR), volume 2, pp. 1735–1742. IEEE, 2006.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. B. Momentum contrast for unsupervised visual representation learning. In Computer Vision and Pattern Recognition (CVPR), 2020.

Hofstätter, S., Lin, S., Yang, J., Lin, J., and Hanbury, A. Efficiently teaching an effective dense retriever with balanced topic aware sampling. arXiv preprint arXiv:2104.06967, 2021.

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2020.

Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. Towards unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021.

Jain, P., Jain, A., Zhang, T., Abbeel, P., Gonzalez, J. E., and Stoica, I. Contrastive code representation learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (ICML), 2021.

Johnson, J., Douze, M., and Jégou, H. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019.

Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Conference of the Association for Computational Linguistics (ACL). ACL, 2017.

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. Dense passage retrieval for open-domain question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020a.

Karpukhin, V., Oguz, B., Min, S., Wu, L., Edunov, S., Chen, D., and Yih, W. Dense passage retrieval for open-domain question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020b.

Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., and Krishnan, D. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.

Kim, W., Son, B., and Kim, I. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning (ICML), 2021.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representation (ICLR), 2014.

Kiros, J., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. Skip-thought vectors. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Kobayashi, S. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201, 2018.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.

Lee, K., Chang, M., and Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Conference of the Association for Computational Linguistics (ACL), pp. 6086–6096. ACL.

Lewis, P. S. H., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Li, B., Zhou, H., He, J., Wang, M., Yang, Y., and Li, L. On the sentence embeddings from pre-trained language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

Liang, P. P., Li, I. M., Zheng, E., Lim, Y. C., Salakhutdinov, R., and Morency, L. Towards debiasing sentence representations. In Conference of the Association for Computational Linguistics (ACL), 2020.

Lin, J., Nogueira, R., and Yates, A. Pretrained transformers for text ranking: BERT and beyond. Synthesis Lectures on Human Language Technologies, 14(4):1–325, 2021.

Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In International Conference on Learning Representation (ICLR), 2018.

Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.

May, C., Wang, A., Bordia, S., Bowman, S. R., and Rudinger, R. On measuring social biases in sentence encoders. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.

Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., and Deng, L. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.

Nogueira, R., Lin, J., and Epistemic, A. From doc2query to docTTTTTquery. Online preprint, 2019a.

Nogueira, R., Yang, W., Cho, K., and Lin, J. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424, 2019b.

Park, J. H., Shin, J., and Fung, P. Reducing gender bias in abusive language detection. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of NAACL, 2018.

Qu, Y., Ding, Y., Liu, J., Liu, K., Ren, R., Zhao, X., Dong, D., Wu, H., and Wang, H. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Conference of the Association for Computational Linguistics (ACL), 2021.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.

Robertson, S. The Probabilistic Relevance Framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 2009.

Rudinger, R., Naradowsky, J., Leonard, B., and Durme, B. V. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.

Sachan, D. S., Patwary, M., Shoeybi, M., Kant, N., Ping, W., Hamilton, W. L., and Catanzaro, B. End-to-end training of neural retrievers for open-domain question answering. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of ACL/IJCNLP, pp. 6648–6662. ACL, 2021.

Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488, 2021.

Schroff, F., Kalenichenko, D., and Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Computer Vision and Pattern Recognition (CVPR), 2015.

Shen, D., Zheng, M., Shen, Y., Qu, Y., and Chen, W. A simple but tough-to-beat data augmentation approach for natural language understanding and generation. arXiv preprint arXiv:2009.13818, 2020.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Solaiman, I. and Dennison, C. Process for adapting language models to society (PALMS) with values-targeted datasets. arXiv preprint arXiv:2106.10328, 2021.

Su, J., Cao, J., Liu, W., and Ou, Y. Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316, 2021.

Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. VideoBERT: A joint model for video and language representation learning. In International Conference on Computer Vision (ICCV), 2019.

Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. European Conference on Computer Vision (ECCV), 2019.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Vervaeke, J., Lillicrap, T. P., and Richards, B. A. Relevance realization and the emerging framework in cognitive science. Journal of Logic and Computation, 22(1):79–99, 2012.

Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957, 2020.

Wei, J. W. and Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196, 2019.

Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). ACL, 2018.

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance-level discrimination. In Computer Vision and Pattern Recognition (CVPR), 2018.

Xiong, L., Xiong, C., Li, Y., Tang, K., Liu, J., Bennett, P. N., Ahmed, J., and Overwijk, A. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020.

Yih, W.-t., Toutanova, K., Platt, J. C., and Meek, C. Learning discriminative projections for text similarity measures. In Conference on Computational Natural Language Learning (CoNLL). ACL, 2011.

Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning (ICML), 2021.

Zhang, Y., He, R., Liu, Z., Lim, K. H., and Bing, L. An unsupervised sentence embedding method by mutual information maximization. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.
