The Paraop corpus and models are available on GitHub.


What is it?

Paraop is a framework for annotating paraphrase operations: the specific changes made to a sentence when it is paraphrased.

To illustrate, consider the following pair of sentences:

John gave Mary a book

John gave a book to Mary

Then, the Paraop annotations for this sentence pair would be:

John gave Mary a book
0 0 3 3 3
John gave a book to Mary
0 0 3 3 1 3

The number below each word in the example above denotes the Paraop operation that applies to that word. For instance, 1 represents the addition of a function word, 3 represents a change of order, and 0 represents no change.
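The example above can be represented in code as two parallel sequences, one of tokens and one of labels. The following is a minimal sketch, not the repository's own format: the names `AnnotatedSentence`, `parse_annotation`, and `OPERATION_NAMES` are hypothetical, and only the three operation codes described above are named.

```python
from dataclasses import dataclass
from typing import List

# Only the codes explained in the text are named here; any other code
# is reported as undescribed rather than guessed.
OPERATION_NAMES = {
    0: "no change",
    1: "addition of a function word",
    3: "change of order",
}


@dataclass
class AnnotatedSentence:
    """A sentence with one Paraop label per whitespace-separated token."""
    tokens: List[str]
    labels: List[int]

    def __post_init__(self):
        if len(self.tokens) != len(self.labels):
            raise ValueError("each token needs exactly one label")


def parse_annotation(sentence_line: str, label_line: str) -> AnnotatedSentence:
    """Parse a sentence line and the label line printed beneath it."""
    tokens = sentence_line.split()
    labels = [int(code) for code in label_line.split()]
    return AnnotatedSentence(tokens, labels)


source = parse_annotation("John gave Mary a book", "0 0 3 3 3")
target = parse_annotation("John gave a book to Mary", "0 0 3 3 1 3")

for sentence in (source, target):
    for token, label in zip(sentence.tokens, sentence.labels):
        name = OPERATION_NAMES.get(label, "not described here")
        print(f"{token}\t{label}\t{name}")
    print()
```

Keeping tokens and labels the same length makes it easy to check an annotation for consistency before using it.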

Why annotate paraphrase operations?

Applications of paraphrase operation detection include:

  • Data augmentation
  • Machine translation
  • Textual entailment detection
  • Text summarization and simplification
  • Plagiarism detection

What resources are available?

The Paraop repository on GitHub contains:

  • The Paraop corpus
  • Automatic Paraop classifiers

The Paraop corpus is based on the Extended Typology Paraphrase Corpus (ETPC), which, in turn, is based on the Microsoft Research Paraphrase Corpus (MRPC).

The automatic Paraop classifiers are BERT models fine-tuned on the Paraop corpus.
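BERT models operate on subword pieces rather than whole words, so word-level labels such as those in the Paraop example usually have to be aligned to subwords before fine-tuning. The sketch below shows one common convention, and it is an assumption that the Paraop classifiers do anything similar: the first piece of each word keeps the word's label, and the remaining pieces receive -100, the default index that PyTorch's cross-entropy loss ignores. `toy_subword_split` is a hypothetical stand-in for a real WordPiece tokenizer, used only to keep the example self-contained.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def toy_subword_split(word, max_len=3):
    """Hypothetical stand-in for WordPiece: chop a word into fixed-size pieces."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    # Continuation pieces are marked with "##", as in BERT vocabularies.
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_labels(words, labels, split=toy_subword_split):
    """Spread word-level labels over subword pieces.

    The first piece of each word keeps the label; later pieces get
    IGNORE_INDEX so they do not contribute to the training loss.
    """
    if len(words) != len(labels):
        raise ValueError("each word needs exactly one label")
    subwords, aligned = [], []
    for word, label in zip(words, labels):
        pieces = split(word)
        subwords.extend(pieces)
        aligned.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    return subwords, aligned


words = "John gave a book to Mary".split()
subwords, aligned = align_labels(words, [0, 0, 3, 3, 1, 3])
for piece, label in zip(subwords, aligned):
    print(f"{piece}\t{label}")
```

With a real tokenizer, the same alignment would be driven by the tokenizer's own word-to-piece mapping instead of the toy splitter.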