Extractive Summarization as Text Matching

2021/07/04 07:01


Instead of following the common framework of extracting sentences one by one and modeling the relationships between them, the paper formulates extractive summarization as a semantic text matching problem.
→ A source document and candidate summaries (extracted from the original text) are matched in a semantic space.
→ This formulation is grounded in a comprehensive analysis of the inherent gap between sentence-level and summary-level extractors.


Automatic text summarization: compress a textual document into a shorter highlight while keeping the salient information of the original text
Extractive Summarization
Most neural extractive summarization systems:
1) score and extract sentences one by one from the original text
2) model the relationship between the sentences
3) select several sentences to form a summary
Cheng and Lapata (2016); Nallapati et al. (2017)
formulate the extractive summarization task as a sequence labeling problem
make independent binary decisions for each sentence, resulting in high redundancy
Chen and Bansal, 2018; Jadhav and Rajan, 2018; Zhou et al., 2018
introduce an auto-regressive decoder
allow the scoring operations of different sentences to influence each other
Trigram Blocking (Paulus et al., 2017; Liu and Lapata, 2019)
When selecting sentences to form a summary, skip any sentence that shares a trigram with the previously selected sentences.
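Trigram blocking can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and sentences that are already sorted by model score; the function names are hypothetical and not taken from the cited papers.

```python
def trigrams(tokens):
    """Return the set of word trigrams in a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(ranked_sentences, max_sentences=3):
    """Greedily pick sentences in rank order, skipping trigram overlaps.

    ranked_sentences: sentences already sorted by model score, best first
    (an assumption of this sketch).
    """
    selected = []
    seen = set()
    for sentence in ranked_sentences:
        grams = trigrams(sentence.lower().split())
        if grams & seen:
            continue  # shares a trigram with an already selected sentence
        selected.append(sentence)
        seen |= grams
        if len(selected) == max_sentences:
            break
    return selected


ranked = [
    "the cat sat on the mat",
    "a cat sat on the sofa",   # shares "cat sat on" with the first sentence
    "dogs bark at night",
]
picked = select_with_trigram_blocking(ranked, max_sentences=2)
print(picked)
```

The second sentence is skipped even though it is ranked higher than the third, which is how the heuristic reduces redundancy without changing the per-sentence scores.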
⇒ The above systems of modeling the relationship between sentences are essentially sentence-level extractors.
We conduct an analysis on six benchmark datasets to better understand the advantages and limitations of sentence-level and summary-level approaches
→ There is indeed an inherent gap between the two approaches across these datasets
conceptualize extractive summarization as a semantic text matching problem
"A good summary should be more semantically similar as a whole to the source document than the unqualified summaries."
Use a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary
Siamese-BERT leverages the pre-trained BERT in a Siamese network structure to derive semantically meaningful text embeddings that can be compared using cosine similarity

Related Work: Two-Stage Summarization

the first stage is usually to extract some fragments of the original text
the second stage is to select or modify on the basis of these fragments
Chen and Bansal (2018) and Bae et al. (2019): a hybrid extract-then-rewrite architecture
Lebanoff et al. (2019); Xu and Durrett (2019); Mendes et al. (2019): extract-then-compress learning paradigm
The MATCHSUM model can be viewed as an extract-then-match framework

Sentence-Level or Summary-Level?


1) For extractive summarization, is the summary-level extractor better than the sentence-level extractor?
2) Given a dataset, which extractor should we choose based on the characteristics of the data, and what is the inherent gap between these two extractors?
Document: $D = \{s_1, \dots, s_n\}$
Candidate Summary: $C = \{s_1, \dots, s_k \mid s_i \in D\}$
Given a document $D$ with its gold summary $C^*$, we measure a candidate summary $C$ by calculating the ROUGE value $R$ between $C$ and $C^*$ at two levels:
1) Sentence-Level Score:
$$g^{sen}(C) = \frac{1}{|C|} \sum_{s \in C} R(s, C^*)$$
2) Summary-Level Score:
$$g^{sum}(C) = R(C, C^*)$$
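The two scores can be compared on a toy example. The sketch below assumes a simplified stand-in for R (unigram-overlap F1 on whitespace tokens), not the official ROUGE implementation; `rouge1_f`, `g_sen` and `g_sum` are illustrative names.

```python
from collections import Counter


def rouge1_f(candidate_tokens, reference_tokens):
    """Toy stand-in for R: unigram-overlap F1 (not the official ROUGE)."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def g_sen(candidate, gold):
    """Mean of R(s, C*) over the sentences s in the candidate."""
    gold_tokens = gold.split()
    return sum(rouge1_f(s.split(), gold_tokens) for s in candidate) / len(candidate)


def g_sum(candidate, gold):
    """R(C, C*) with the candidate sentences concatenated into one text."""
    return rouge1_f(" ".join(candidate).split(), gold.split())


gold = "a b c d"
c1 = ["a b", "c d"]        # complementary sentences
c2 = ["a b c", "a b c"]    # individually strong but redundant sentences

for name, cand in [("c1", c1), ("c2", c2)]:
    print(name, round(g_sen(cand, gold), 3), round(g_sum(cand, gold), 3))
```

In this toy case `c1` has the lower sentence-level score but the higher summary-level score, because its sentences cover different parts of the gold summary while the sentences of `c2` repeat each other.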
Pearl-summary: a summary that has a lower sentence-level score but a higher summary-level score
C1 (3 sentences): summary-level score 44, sentence-level score 40 (pearl-summary)
C2 (3 sentences): summary-level score 42, sentence-level score 43
Definition 1
A candidate summary $C$ is defined as a pearl-summary if there exists another candidate summary $C'$ that satisfies the inequality:
$$g^{sen}(C') > g^{sen}(C) \quad \text{while} \quad g^{sum}(C') < g^{sum}(C)$$
The best-summary is the summary that has the highest summary-level score among all the candidate summaries.
Definition 2
A summary $\hat{C}$ is defined as the best-summary when it satisfies:
$$\hat{C} = \underset{C \in \mathcal{C}_D}{\operatorname{argmax}} \; g^{sum}(C)$$
where $\mathcal{C}_D$ denotes the set of all candidate summaries of $D$.
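Definitions 1 and 2 translate directly into code. The sketch below reuses the C1/C2 scores from the example above and assumes the scores are already computed; the dictionary layout and function names are illustrative.

```python
def is_pearl(candidate, scores):
    """Definition 1: some other candidate has a higher sentence-level score
    but a lower summary-level score.

    scores maps a candidate name to a (g_sen, g_sum) pair.
    """
    sen, summ = scores[candidate]
    return any(
        other_sen > sen and other_sum < summ
        for name, (other_sen, other_sum) in scores.items()
        if name != candidate
    )


def best_summary(scores):
    """Definition 2: the candidate with the highest summary-level score."""
    return max(scores, key=lambda name: scores[name][1])


# (sentence-level score, summary-level score) for the two candidates
scores = {"C1": (40.0, 44.0), "C2": (43.0, 42.0)}

print(is_pearl("C1", scores), is_pearl("C2", scores), best_summary(scores))
```

Here C1 is both a pearl-summary and the best-summary, which is exactly the case a sentence-level extractor tends to miss.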

Ranking of Best-Summary

For each document, we sort all candidate summaries in descending order of sentence-level score, and then define $z$ as the rank index of the best-summary $\hat{C}$.
1) If z = 1: the best-summary is composed of the highest-scoring sentences
2) If z > 1: the best-summary is a pearl-summary
Most of the best-summaries are not made up of the highest-scoring sentences.
In conclusion, the proportion of the pearl-summaries in all the best-summaries is a property to characterize a dataset, which will affect our choices of summarization extractors.
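The rank index z and the resulting pearl-summary proportion can be computed as follows. This is a minimal sketch that assumes per-candidate scores are given as (sentence-level, summary-level) pairs and that ties are broken by list order; the numbers are made up for illustration.

```python
def best_summary_rank(scores):
    """Return z: the 1-based rank of the best-summary when candidates are
    sorted by sentence-level score in descending order.

    scores: list of (g_sen, g_sum) pairs, one per candidate.
    """
    best = max(range(len(scores)), key=lambda i: scores[i][1])
    by_sen = sorted(range(len(scores)), key=lambda i: scores[i][0], reverse=True)
    return by_sen.index(best) + 1


def pearl_proportion(dataset_scores):
    """Fraction of documents whose best-summary is a pearl-summary (z > 1)."""
    ranks = [best_summary_rank(doc_scores) for doc_scores in dataset_scores]
    return sum(z > 1 for z in ranks) / len(ranks)


doc_a = [(40.0, 44.0), (43.0, 42.0)]  # best-summary is ranked second by g_sen
doc_b = [(50.0, 50.0), (30.0, 20.0)]  # best-summary is ranked first by g_sen

print(best_summary_rank(doc_a), best_summary_rank(doc_b))
print(pearl_proportion([doc_a, doc_b]))
```

A high proportion suggests that a summary-level extractor has more room to help on that dataset.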

Inherent Gap

How much does performance improve when a summary-level extractor is used instead of a sentence-level extractor?
→ Depending on the potential gain, we can decide which of the two extractors to use.
Inherent Gap
$$\alpha^{sen}(D) = \max_{C \in \mathcal{C}_D} g^{sen}(C)$$
$$\alpha^{sum}(D) = \max_{C \in \mathcal{C}_D} g^{sum}(C)$$
Potential Gap at Document Level
$$\Delta(D) = \alpha^{sum}(D) - \alpha^{sen}(D)$$
Potential Gap at Dataset Level
$$\Delta(\mathrm{Dataset}) = \frac{1}{|\mathrm{Dataset}|} \sum_{D \in \mathrm{Dataset}} \Delta(D)$$
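The document-level and dataset-level quantities follow directly from the formulas above. The sketch below assumes the candidate scores are already available as (sentence-level, summary-level) pairs; the values are made up.

```python
def potential_gain(scores):
    """Delta(D) = alpha_sum(D) - alpha_sen(D) for one document, following
    the formulas above.

    scores: list of (g_sen, g_sum) pairs over the candidate set of D.
    """
    alpha_sen = max(sen for sen, _ in scores)
    alpha_sum = max(summ for _, summ in scores)
    return alpha_sum - alpha_sen


def dataset_gain(dataset_scores):
    """Mean of Delta(D) over all documents in the dataset."""
    return sum(potential_gain(s) for s in dataset_scores) / len(dataset_scores)


doc_a = [(40.0, 44.0), (43.0, 42.0)]
doc_b = [(50.0, 53.0), (30.0, 20.0)]

print(potential_gain(doc_a), potential_gain(doc_b))
print(dataset_gain([doc_a, doc_b]))
```

A larger dataset-level value indicates more to gain from moving to a summary-level extractor on that dataset.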

Summarization as Matching

Siamese-BERT

Siamese Network
Different inputs → Computing the similarity between two inputs
Identical networks with shared weights
Learning by optimizing similarity
In this paper, the researchers construct the subnetworks from the original BERT to derive semantically meaningful embeddings from document $D$ and candidate summary $C$
Let $r_D$ and $r_C$ denote the embeddings of the document $D$ and candidate summary $C$.
Their similarity score is measured by $f(D, C) = \operatorname{cosine}(r_D, r_C)$
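The similarity score itself is just the cosine of the two embedding vectors. The sketch below uses made-up 3-dimensional vectors in place of real BERT embeddings and assumes neither vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; a real system would take them from the encoder.
r_D = [1.0, 2.0, 0.0]
r_C_close = [2.0, 4.0, 0.0]  # points in the same direction as r_D
r_C_far = [0.0, 0.0, 3.0]    # orthogonal to r_D

print(cosine(r_D, r_C_close), cosine(r_D, r_C_far))
```

Because cosine similarity ignores vector length, a short candidate summary and a long document can still be compared in the same space.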
Loss Function
In order to fine-tune Siamese-BERT, we use a margin-based triplet loss to update the weights.
Semantically, the gold summary should be closest to the source document (loss term 1)
$$L_1 = \max(0, f(D, C) - f(D, C^*) + \gamma_1)$$
Naturally, the candidate pair with a larger ranking gap should have a larger margin
$$L_2 = \max(0, f(D, C_j) - f(D, C_i) + (j - i) \cdot \gamma_2) \quad (i < j)$$
where $C_i$ denotes the candidate summary ranked $i$.
Finally, our margin-based triplet loss can be written as
$$L = L_1 + L_2$$
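A small numeric sketch of the loss is given below. It works on precomputed similarity scores and makes two assumptions that the notes do not specify: both terms are summed over all candidates and all pairs, and the pairwise margin grows with the rank gap (j - i), matching the "larger ranking gap, larger margin" idea. The margin values are illustrative, not the paper's settings.

```python
def l1(f_gold, f_cand, gamma1):
    """Gold summary should score at least gamma1 above a candidate."""
    return max(0.0, f_cand - f_gold + gamma1)


def l2(f_i, f_j, i, j, gamma2):
    """Better-ranked candidate i should beat candidate j by (j - i) * gamma2."""
    return max(0.0, f_j - f_i + (j - i) * gamma2)


def total_loss(f_gold, f_cands, gamma1, gamma2):
    """f_cands[i] is f(D, C_i), listed in rank order (best candidate first).

    Summing over all candidates and pairs is an assumption of this sketch.
    """
    loss = sum(l1(f_gold, f, gamma1) for f in f_cands)
    for i in range(len(f_cands)):
        for j in range(i + 1, len(f_cands)):
            loss += l2(f_cands[i], f_cands[j], i, j, gamma2)
    return loss


well_ordered = total_loss(0.9, [0.8, 0.7, 0.5], gamma1=0.1, gamma2=0.05)
mis_ordered = total_loss(0.9, [0.8, 0.7, 0.75], gamma1=0.1, gamma2=0.05)
print(well_ordered, mis_ordered)
```

The mis-ordered case is penalized because the third-ranked candidate scores higher than the second-ranked one, so the pairwise term pushes the model to restore the intended ranking.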

Candidate Pruning

How should we determine the size of the candidate summary set, or should we score all possible candidates? → Candidate pruning strategy
⇒ Introduce a content selection module to pre-select salient sentences
The module learns to assign each sentence a salience score and prunes sentences that are irrelevant to the current document.
$$D' = \{s'_1, \dots, s'_{ext} \mid s'_i \in D\}$$
In this paper, we use BERTSUM (Liu and Lapata, 2019) without trigram blocking to score each sentence.
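Pruning followed by candidate enumeration can be sketched as below. The salience scores are made up rather than produced by BERTSUM, and the parameters `ext` (number of sentences kept) and `sizes` (allowed numbers of sentences per candidate) are hypothetical names and values chosen for illustration. Building candidates as combinations of the kept sentences is also an assumption of this sketch.

```python
from itertools import combinations


def prune_and_enumerate(sentences, salience, ext=5, sizes=(2, 3)):
    """Keep the `ext` most salient sentences, then enumerate candidates.

    Each candidate is a tuple of sentences in original document order.
    """
    top = sorted(range(len(sentences)), key=lambda i: salience[i], reverse=True)[:ext]
    kept = sorted(top)  # restore document order
    candidates = []
    for k in sizes:
        for combo in combinations(kept, k):
            candidates.append(tuple(sentences[i] for i in combo))
    return candidates


sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
salience = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]

candidates = prune_and_enumerate(sentences, salience, ext=4, sizes=(2,))
print(len(candidates), candidates[0])
```

Without pruning, the number of candidates grows combinatorially with document length, so restricting the pool to a handful of salient sentences keeps the matching stage tractable.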


Comparison between BERT-EXT and MATCHSUM

We define the ratio that MATCHSUM can learn on dataset D as
where $\Delta(D)$ is the inherent gap between sentence-level and summary-level extractors
When the pearl-summary is the best-summary, MATCHSUM performs better.