Abstract
•
Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences
•
Formulate the extractive summarization task as a semantic text matching problem
→ a source document and candidate summaries (extracted from the original text) are matched in a semantic space
→ well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors
Introduction
•
Automatic text summarization: compress a textual document to a shorter highlight while keeping the salient information of the original text
•
Extractive Summarization
◦
Most of the neural extractive summarization systems
1) score and extract sentences one by one from the original text
2) model the relationship between the sentences
3) select several sentences to form a summary
▪
Cheng and Lapata (2016); Nallapati et al. (2017)
•
formulate the extractive summarization task as a sequence labeling problem
•
make independent binary decisions for each sentence, resulting in high redundancy
▪
Chen and Bansal, 2018; Jadhav and Rajan, 2018; Zhou et al., 2018
•
introduce an auto-regressive decoder
•
allow the scoring operations of different sentences to influence each other
▪
Trigram Blocking (Paulus et al., 2017; Liu and Lapata, 2019)
•
At the stage of selecting sentences to form a summary, it skips any sentence
that has trigram overlap with the previously selected sentences (a minimal sketch follows below).
⇒ The above systems of modeling the relationship between sentences are essentially sentence-level extractors.
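As referenced above, a minimal Python sketch of trigram blocking; the function names are ours, not from the cited papers, and sentence scoring/tokenization are assumed to happen elsewhere:

```python
def trigrams(tokens):
    """Return the set of trigrams in a token sequence."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_with_trigram_blocking(ranked_sentences, max_sentences=3):
    """Greedily pick sentences in score order, skipping any sentence whose
    trigrams overlap with the trigrams of already-selected sentences."""
    selected, seen = [], set()
    for sent in ranked_sentences:   # sentences sorted by extractor score, descending
        tri = trigrams(sent.split())
        if tri & seen:              # any trigram overlap -> skip this sentence
            continue
        selected.append(sent)
        seen |= tri
        if len(selected) == max_sentences:
            break
    return selected
```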
◦
We conduct an analysis on six benchmark datasets to better understand the advantages and limitations of sentence-level and summary-level approaches
→ There is indeed an inherent gap between the two approaches across these datasets
◦
MATCHSUM
▪
conceptualize extractive summarization as a semantic text matching problem
▪
"A good summary should be more semantically similar as a whole to the source document than the unqualified summaries."
▪
a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary
•
Siamese BERT leverages the pre-trained BERT in a Siamese network structure to derive semantically meaningful text embeddings that can be compared using cosine-similarity
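To make this concrete, a minimal sketch of the Siamese-BERT scoring step, assuming the Hugging Face transformers API and bert-base-uncased (the function names are ours; the paper uses the [CLS] token vector as the text embedding):

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

# One shared encoder embeds both inputs (Siamese structure: identical weights).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Encode a text and return its [CLS] embedding."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # vector of the [CLS] token

def score(document: str, candidate: str) -> float:
    """f(D, C) = cosine(r_D, r_C)."""
    return F.cosine_similarity(embed(document), embed(candidate)).item()
```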
Related Work: Two-Stage Summarization
•
the first stage is usually to extract some fragments of the original text
•
the second stage is to select from or modify these fragments
•
Chen and Bansal (2018) and Bae et al. (2019): a hybrid extract-then-rewrite architecture
•
Lebanoff et al. (2019); Xu and Durrett (2019); Mendes et al. (2019): extract-then-compress learning paradigm
•
MATCHSUM model can be viewed as an extract-then-match framework
Sentence-Level or Summary-Level?
Questions!
1) For extractive summarization, is the summary-level extractor better than the sentence-level extractor?
2) Given a dataset, which extractor should we choose based on the characteristics of the data,
and what is the inherent gap between these two extractors?
Definition
Document: $D = \{s_1, s_2, \dots, s_n\}$, consisting of n sentences
Candidate Summary: $C = \{s'_1, s'_2, \dots, s'_k\}$ $(k \le n)$, a set of k sentences extracted from D
•
Given a document D with its gold summary C*, we measure a candidate summary C by calculating the ROUGE value between C and C* at two levels (a minimal sketch follows this list):
1) Sentence-Level Score: $g^{sen}(C) = \frac{1}{|C|} \sum_{s \in C} R(s, C^*)$, the mean ROUGE score of the sentences in C, each evaluated individually
2) Summary-Level Score: $g^{sum}(C) = R(C, C^*)$, the ROUGE score of C evaluated as a whole
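A minimal sketch of the two scoring levels, assuming Google's rouge-score package and taking R(·) as the mean of ROUGE-1/2/L F1 (the paper's exact ROUGE configuration may differ):

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge(text: str, gold: str) -> float:
    """R(text, C*): mean of ROUGE-1/2/L F1 against the gold summary."""
    scores = _scorer.score(gold, text)
    return sum(s.fmeasure for s in scores.values()) / 3

def sentence_level_score(candidate_sents, gold):
    """g_sen(C): mean ROUGE of each sentence in C, scored individually."""
    return sum(rouge(s, gold) for s in candidate_sents) / len(candidate_sents)

def summary_level_score(candidate_sents, gold):
    """g_sum(C): ROUGE of the candidate summary taken as a whole."""
    return rouge(" ".join(candidate_sents), gold)
```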
•
Pearl-Summary
The summary that has a lower sentence-level score but a higher summary-level score
C1 (3 sentences): summary-level 44, sentence-level 40 → pearl-summary
C2 (3 sentences): summary-level 42, sentence-level 43
C1 scores lower at the sentence level (40 < 43) but higher at the summary level (44 > 42), so C1 is a pearl-summary.
Definition 1
A candidate summary C is defined as a pearl-summary if there exists another candidate
summary C' that satisfies the inequality: $g^{sen}(C') > g^{sen}(C)$ while $g^{sum}(C') < g^{sum}(C)$
•
Best-Summary
The best-summary refers to the summary with the highest summary-level score among all the candidate summaries.
Definition 2
A summary $\hat{C}$ is defined as the best-summary when it satisfies: $\hat{C} = \arg\max_{C \in \mathcal{C}} g^{sum}(C)$, where $\mathcal{C}$ denotes the set of all candidate summaries of the document
Ranking of Best-Summary
For each document, we sort all candidate summaries in descending order based on the sentence-level score, and then define z as the rank index of the best-summary (see the sketch below).
1) If z = 1: the best-summary is composed of the highest sentence-level-scoring sentences
2) If z > 1: the best-summary is a pearl-summary
•
Most of the best-summaries are not made up of the highest-scoring sentences.
•
In conclusion, the proportion of pearl-summaries among all best-summaries is a property
that characterizes a dataset, and it affects our choice of summarization extractor.
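A minimal sketch of this ranking analysis; representing each candidate as a (g_sen, g_sum) pair is our simplification:

```python
def analyze(candidates):
    """candidates: list of (g_sen, g_sum) tuples, one per candidate summary.
    Returns the 1-based sentence-level rank z of the best-summary and whether
    the best-summary is a pearl-summary (z > 1)."""
    best = max(range(len(candidates)), key=lambda i: candidates[i][1])  # argmax g_sum
    by_sen = sorted(range(len(candidates)), key=lambda i: -candidates[i][0])
    z = by_sen.index(best) + 1
    return z, z > 1
```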
Inherent Gap
•
How much does performance improve when a summary-level extractor is used instead of a sentence-level extractor?
→ Based on this potential gain, we can decide which of the two extractors to use.
1.
Inherent Gap: the performance gap between sentence-level and summary-level extractors that stems from pearl-summaries, which a sentence-level extractor inherently cannot capture
2.
Potential Gain at Document Level: $\Delta(D) = g^{sum}(\hat{C}) - g^{sum}(\bar{C})$, where $\hat{C}$ is the best-summary and $\bar{C}$ is the candidate with the highest sentence-level score
3.
Potential Gain at Dataset Level: $\Delta(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{D \in \mathcal{D}} \Delta(D)$ (a minimal sketch follows this list)
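A minimal sketch of these potential-gain computations, under the reconstruction above:

```python
def document_gain(candidates):
    """Δ(D) = g_sum(best-summary) - g_sum(candidate with the highest g_sen).
    candidates: list of (g_sen, g_sum) tuples for one document."""
    best_sum = max(g_sum for _, g_sum in candidates)        # g_sum of the best-summary
    top_sen = max(candidates, key=lambda c: c[0])           # highest-g_sen candidate
    return best_sum - top_sen[1]

def dataset_gain(documents):
    """Δ(𝒟): mean document-level potential gain over the whole dataset."""
    return sum(document_gain(c) for c in documents) / len(documents)
```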
Summarization as Matching
Siamese-BERT
•
Siamese Network
◦
Different inputs → Computing the similarity between two inputs
◦
Identical networks with shared weights
◦
Learning by optimizing similarity
•
In this paper, the researchers construct subnetworks by using the original BERT to derive semantically meaningful embeddings from document D and candidate summary C
•
Let $r_D$ and $r_C$ denote the embeddings of the document D and candidate summary C (the vector of the [CLS] token is used as the text embedding).
◦
Their similarity score is measured by $f(D, C) = \mathrm{cosine}(r_D, r_C)$
•
Loss Function
◦
In order to fine-tune Siamese-BERT, we use a margin-based triplet loss to update the weights.
◦
Semantically, the gold summary C* should be closest to the source document D (loss term 1): $L_1 = \max(0, f(D, C) - f(D, C^*) + \gamma_1)$
◦
Naturally, the candidate pair with a larger ranking gap should have a larger margin (loss term 2): $L_2 = \max(0, f(D, C_j) - f(D, C_i) + (j - i) \cdot \gamma_2)$ for $i < j$, where candidates are sorted in descending order of ROUGE score
◦
Finally, our margin-based triplet loss can be written as $L = L_1 + L_2$ (a minimal sketch follows below)
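A minimal PyTorch sketch of this loss, assuming the candidate scores are already sorted in descending ROUGE order; the function name and default margin values are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def matchsum_loss(gold_score, cand_scores, gamma1=0.0, gamma2=0.01):
    """gold_score: f(D, C*); cand_scores: 1-D tensor of f(D, C_i), ROUGE-descending."""
    # Term 1: the gold summary should score higher than every candidate.
    loss1 = F.relu(cand_scores - gold_score + gamma1).mean()
    # Term 2: for a pair ranked (i, j) with i < j, the margin grows with j - i.
    loss2 = torch.tensor(0.0)
    n = cand_scores.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            loss2 = loss2 + F.relu(cand_scores[j] - cand_scores[i] + (j - i) * gamma2)
    return loss1 + loss2
```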
Candidate Pruning
•
How should we determine the size of the candidate summary set? Should we score all possible candidates? → Candidate Pruning Strategy
⇒ Introduce a content selection module to pre-select salient sentences
•
The module learns to assign each sentence a salience score and prunes sentences irrelevant to the current document.
•
In this paper, we use BERTSUM (Liu and Lapata, 2019) without trigram blocking to score each sentence.
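A minimal sketch of this prune-then-enumerate step; the function name and the values k=5 and m=3 are illustrative (the paper tunes these per dataset), and the salience scores are assumed to come from a content selection module such as BERTSUM:

```python
from itertools import combinations

def generate_candidates(sentences, salience_scores, k=5, m=3):
    """Prune to the k most salient sentences, then enumerate every size-m
    combination as one candidate summary."""
    top = sorted(range(len(sentences)), key=lambda i: -salience_scores[i])[:k]
    top.sort()  # restore original document order within each candidate
    return [[sentences[i] for i in combo] for combo in combinations(top, m)]
```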
Experiment
Comparison between BERT-EXT and MATCHSUM
•
We define the ratio that MATCHSUM can learn on dataset $\mathcal{D}$ as the improvement over BERT-EXT divided by the inherent gap:
$\left( R_{\mathrm{MATCHSUM}}(\mathcal{D}) - R_{\mathrm{BERT\text{-}EXT}}(\mathcal{D}) \right) / \Delta(\mathcal{D})$, where $\Delta(\mathcal{D})$ is the inherent gap between sentence-level and summary-level extractors
•
When the best-summary is a pearl-summary, MATCHSUM shows better performance.