Introduction to AI News Summarization
In an era of information overload, AI-powered news summarization has become essential for staying informed without spending hours reading. From morning briefings to research alerts, intelligent summarization systems help users consume more information in less time.
In this tutorial, you’ll build a complete news summarization system that aggregates content from multiple sources, generates concise summaries using state-of-the-art NLP models, and presents them through a clean web interface. It’s the same class of technology behind apps like Artifact, Feedly, and Google News.
What You’ll Build
By the end of this tutorial, you’ll have a news summarizer that:
- Fetches news from multiple RSS feeds and APIs
- Extracts article content intelligently
- Generates abstractive summaries using transformers
- Groups related articles into topics
- Provides customizable summary lengths
- Includes a web dashboard for browsing summaries
Understanding Text Summarization
Types of Summarization
Extractive Summarization: Selects important sentences from the original text. Like highlighting key passages.
Abstractive Summarization: Generates new sentences that capture the essence. Like writing a brief in your own words.
Hybrid Approaches: Combines both methods for best results.
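To make the distinction concrete, here is a minimal sketch that runs both approaches on the same paragraph. The extractive side is a deliberately naive stand-in (copy the two longest sentences), not the TF-IDF scorer we build later; the abstractive side assumes the transformers library and the facebook/bart-large-cnn model used throughout this tutorial:
# Quick illustration of extractive vs. abstractive summarization.
# Assumes transformers and torch are installed (see Prerequisites below).
import re
from transformers import pipeline
text = (
    "The city council approved a new transit budget on Tuesday. "
    "The plan adds 40 electric buses and extends two light-rail lines. "
    "Officials said fares will stay flat through next year. "
    "Construction on the rail extensions begins in March."
)
# Extractive (toy version): copy the two longest sentences verbatim.
sentences = re.split(r"(?<=[.!?])\s+", text)
extractive_summary = " ".join(sorted(sentences, key=len, reverse=True)[:2])
# Abstractive: a seq2seq model writes new sentences that paraphrase the text.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
abstractive_summary = summarizer(
    text, max_length=40, min_length=10, do_sample=False
)[0]["summary_text"]
print("Extractive:", extractive_summary)
print("Abstractive:", abstractive_summary)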
Challenges in News Summarization
- Maintaining factual accuracy without hallucination
- Preserving key entities (names, dates, numbers); a simple guard is sketched after this list
- Handling multiple perspectives on the same story
- Generating coherent multi-document summaries
- Adapting to different news domains (tech, politics, sports)
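A lightweight guard against the first two challenges is to check that the numbers and capitalized names appearing in a summary also appear in the source article. The sketch below is only a heuristic (sentence-initial words will trigger false positives) and is not wired into the system we build; it illustrates the kind of post-processing a production summarizer might add:
# Heuristic entity check: flag numbers and capitalized words that appear in the
# summary but not in the source text. A rough guard, not a factuality checker.
import re
ENTITY_PATTERN = r"\b(?:[A-Z][a-z]+|\d[\d,.]*)\b"  # capitalized words or numbers
def unsupported_entities(source: str, summary: str) -> set:
    """Return summary tokens that look like entities but never occur in the source."""
    source_tokens = set(re.findall(ENTITY_PATTERN, source))
    summary_tokens = set(re.findall(ENTITY_PATTERN, summary))
    return summary_tokens - source_tokens
# Usage idea: if unsupported_entities(article_text, summary) is non-empty,
# fall back to an extractive summary or flag the item for review.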
Prerequisites and Setup
Required Libraries
# Create virtual environment
python -m venv news_summarizer_env
source news_summarizer_env/bin/activate
# Core dependencies
pip install transformers torch
pip install newspaper3k lxml_html_clean
pip install feedparser requests
pip install beautifulsoup4 trafilatura
# Additional tools
pip install schedule python-dotenv
pip install flask flask-cors
# For better summaries
pip install sentencepiece
Project Structure
news_summarizer/
├── app.py # Flask web application
├── fetcher/
│ ├── __init__.py
│ ├── rss_fetcher.py # RSS feed parsing
│ ├── article_extractor.py # Full article extraction
│ └── news_api.py # News API integration
├── summarizer/
│ ├── __init__.py
│ ├── extractive.py # Extractive summarization
│ ├── abstractive.py # Transformer-based summarization
│ └── multi_doc.py # Multi-document summarization
├── models/
│ └── config.py # Model configurations
├── templates/
│ └── index.html # Web dashboard
├── data/
│ └── feeds.json # RSS feed list
└── requirements.txt
Step 1: News Fetcher
First, let’s create a robust news fetching system:
# fetcher/rss_fetcher.py
import feedparser
from typing import List, Dict, Optional
from dataclasses import dataclass, field
from datetime import datetime
import hashlib
@dataclass
class NewsArticle:
title: str
url: str
source: str
published: Optional[datetime] = None
summary: str = ""
content: str = ""
author: str = ""
tags: List[str] = field(default_factory=list)
image_url: str = ""
article_id: str = ""
def __post_init__(self):
if not self.article_id:
self.article_id = hashlib.md5(self.url.encode()).hexdigest()[:12]
class RSSFetcher:
"""Fetch news articles from RSS feeds."""
def __init__(self):
self.default_feeds = {
"tech": [
"https://feeds.arstechnica.com/arstechnica/technology-lab",
"https://www.theverge.com/rss/index.xml",
"https://techcrunch.com/feed/",
],
"general": [
"https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
"https://feeds.bbci.co.uk/news/rss.xml",
"https://www.theguardian.com/world/rss",
],
"science": [
"https://www.sciencedaily.com/rss/all.xml",
"https://www.nature.com/nature.rss",
]
}
def fetch_feed(self, feed_url: str, source_name: str = "") -> List[NewsArticle]:
"""Fetch articles from a single RSS feed."""
articles = []
try:
feed = feedparser.parse(feed_url)
# Get source name from feed if not provided
if not source_name:
source_name = feed.feed.get("title", "Unknown")
for entry in feed.entries[:20]: # Limit to 20 per feed
article = NewsArticle(
title=entry.get("title", ""),
url=entry.get("link", ""),
source=source_name,
summary=self._clean_html(entry.get("summary", "")),
author=entry.get("author", ""),
tags=[tag.term for tag in entry.get("tags", [])],
)
# Parse published date
if "published_parsed" in entry and entry.published_parsed:
article.published = datetime(*entry.published_parsed[:6])
# Try to get image
if "media_content" in entry:
for media in entry.media_content:
if "url" in media:
article.image_url = media["url"]
break
articles.append(article)
except Exception as e:
print(f"Error fetching {feed_url}: {e}")
return articles
def _clean_html(self, text: str) -> str:
"""Remove HTML tags from text."""
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, "html.parser")
return soup.get_text(separator=" ").strip()
def fetch_category(self, category: str) -> List[NewsArticle]:
"""Fetch all articles from a category."""
all_articles = []
feeds = self.default_feeds.get(category, [])
for feed_url in feeds:
articles = self.fetch_feed(feed_url)
all_articles.extend(articles)
# Sort by date, newest first
all_articles.sort(
key=lambda x: x.published or datetime.min,
reverse=True
)
return all_articles
def fetch_all(self) -> Dict[str, List[NewsArticle]]:
"""Fetch articles from all categories."""
results = {}
for category in self.default_feeds:
results[category] = self.fetch_category(category)
        return results
Step 2: Article Content Extractor
Extract full article content from URLs:
# fetcher/article_extractor.py
from typing import Optional
import requests
from newspaper import Article
import trafilatura
class ArticleExtractor:
"""Extract full article content from URLs."""
def __init__(self):
self.headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
self.timeout = 10
def extract_with_newspaper(self, url: str) -> Optional[str]:
"""Extract article using newspaper3k."""
try:
article = Article(url)
article.download()
article.parse()
return article.text
except Exception as e:
print(f"Newspaper extraction failed: {e}")
return None
def extract_with_trafilatura(self, url: str) -> Optional[str]:
"""Extract article using trafilatura."""
try:
downloaded = trafilatura.fetch_url(url)
if downloaded:
text = trafilatura.extract(downloaded)
return text
except Exception as e:
print(f"Trafilatura extraction failed: {e}")
return None
def extract(self, url: str) -> str:
"""Extract article content using multiple methods."""
# Try trafilatura first (usually better)
content = self.extract_with_trafilatura(url)
if content and len(content) > 200:
return content
# Fallback to newspaper3k
content = self.extract_with_newspaper(url)
if content and len(content) > 200:
return content
return ""
def extract_batch(self, urls: list) -> dict:
"""Extract content from multiple URLs."""
results = {}
for url in urls:
results[url] = self.extract(url)
        return results
Step 3: Extractive Summarizer
Create a fast extractive summarizer for initial processing:
# summarizer/extractive.py
import re
from typing import List, Tuple
from collections import Counter
import math
class ExtractiveSummarizer:
"""Extractive summarization using TF-IDF scoring."""
def __init__(self):
self.stop_words = set([
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
"for", "of", "with", "by", "from", "as", "is", "was", "are",
"were", "been", "be", "have", "has", "had", "do", "does", "did",
"will", "would", "could", "should", "may", "might", "must",
"that", "which", "who", "whom", "this", "these", "those",
"it", "its", "they", "them", "their", "we", "us", "our",
"you", "your", "he", "him", "his", "she", "her", "i", "me", "my"
])
def _tokenize(self, text: str) -> List[str]:
"""Simple tokenization."""
text = text.lower()
words = re.findall(r'\b[a-z]+\b', text)
return [w for w in words if w not in self.stop_words and len(w) > 2]
def _split_sentences(self, text: str) -> List[str]:
"""Split text into sentences."""
sentences = re.split(r'(?<=[.!?])\s+', text)
return [s.strip() for s in sentences if len(s.strip()) > 20]
def _compute_tf(self, words: List[str]) -> dict:
"""Compute term frequency."""
tf = Counter(words)
total = len(words)
return {word: count / total for word, count in tf.items()}
def _compute_idf(self, sentences: List[str]) -> dict:
"""Compute inverse document frequency."""
n_docs = len(sentences)
word_doc_count = Counter()
for sentence in sentences:
words = set(self._tokenize(sentence))
word_doc_count.update(words)
idf = {}
for word, count in word_doc_count.items():
idf[word] = math.log(n_docs / (1 + count))
return idf
def _score_sentence(
self,
sentence: str,
tf: dict,
idf: dict,
position: int,
total_sentences: int
) -> float:
"""Score a sentence based on multiple factors."""
words = self._tokenize(sentence)
if not words:
return 0.0
# TF-IDF score
tfidf_score = sum(tf.get(w, 0) * idf.get(w, 0) for w in words) / len(words)
# Position score (first sentences are often more important)
position_score = 1.0 - (position / total_sentences) * 0.3
# Length score (prefer medium-length sentences)
length = len(words)
if 10 <= length <= 30:
length_score = 1.0
elif length < 10:
length_score = length / 10
else:
length_score = 30 / length
return tfidf_score * position_score * length_score
def summarize(
self,
text: str,
num_sentences: int = 3,
max_length: int = 500
) -> str:
"""Generate extractive summary."""
sentences = self._split_sentences(text)
if len(sentences) <= num_sentences:
return text
# Compute TF-IDF
all_words = self._tokenize(text)
tf = self._compute_tf(all_words)
idf = self._compute_idf(sentences)
# Score sentences
scored_sentences = []
for i, sentence in enumerate(sentences):
score = self._score_sentence(sentence, tf, idf, i, len(sentences))
scored_sentences.append((i, sentence, score))
# Select top sentences
scored_sentences.sort(key=lambda x: x[2], reverse=True)
selected = scored_sentences[:num_sentences]
# Sort by original position for coherence
selected.sort(key=lambda x: x[0])
summary = " ".join(s[1] for s in selected)
# Truncate if needed
if len(summary) > max_length:
summary = summary[:max_length].rsplit(" ", 1)[0] + "..."
        return summary
Step 4: Abstractive Summarizer
Use transformer models for high-quality summaries:
# summarizer/abstractive.py
from typing import List, Optional
import torch
from transformers import (
AutoTokenizer,
AutoModelForSeq2SeqLM,
pipeline
)
class AbstractiveSummarizer:
"""Abstractive summarization using transformer models."""
def __init__(
self,
model_name: str = "facebook/bart-large-cnn",
        device: Optional[str] = None
):
self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
self.model_name = model_name
print(f"Loading {model_name} on {self.device}...")
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(self.device)
# Create pipeline for easier use
self.summarizer = pipeline(
"summarization",
model=self.model,
tokenizer=self.tokenizer,
device=0 if self.device == "cuda" else -1
)
print("Model loaded successfully!")
def summarize(
self,
text: str,
max_length: int = 150,
min_length: int = 50,
do_sample: bool = False
) -> str:
"""Generate abstractive summary."""
# Truncate input if too long
max_input_length = 1024
inputs = self.tokenizer(
text,
max_length=max_input_length,
truncation=True,
return_tensors="pt"
)
if len(inputs["input_ids"][0]) < 50:
return text # Text too short to summarize
# Generate summary
summary = self.summarizer(
text,
max_length=max_length,
min_length=min_length,
do_sample=do_sample,
truncation=True
)
return summary[0]["summary_text"]
def summarize_batch(
self,
texts: List[str],
max_length: int = 150,
min_length: int = 50
) -> List[str]:
"""Summarize multiple texts efficiently."""
summaries = self.summarizer(
texts,
max_length=max_length,
min_length=min_length,
truncation=True,
batch_size=4
)
return [s["summary_text"] for s in summaries]
def summarize_news(
self,
text: str,
style: str = "brief"
) -> str:
"""Generate news-optimized summary."""
length_configs = {
"brief": {"max_length": 75, "min_length": 30},
"standard": {"max_length": 150, "min_length": 50},
"detailed": {"max_length": 300, "min_length": 100}
}
config = length_configs.get(style, length_configs["standard"])
        return self.summarize(text, **config)
Step 5: Multi-Document Summarizer
Summarize multiple related articles:
# summarizer/multi_doc.py
from typing import List, Dict
from dataclasses import dataclass
from .extractive import ExtractiveSummarizer
from .abstractive import AbstractiveSummarizer
@dataclass
class ArticleCluster:
topic: str
articles: List[Dict]
combined_summary: str
key_points: List[str]
class MultiDocSummarizer:
"""Summarize multiple related articles."""
def __init__(self, abstractive_summarizer: AbstractiveSummarizer = None):
self.extractive = ExtractiveSummarizer()
self.abstractive = abstractive_summarizer or AbstractiveSummarizer()
def _extract_key_points(self, texts: List[str], n_points: int = 5) -> List[str]:
"""Extract key points from multiple texts."""
# Combine all texts
combined = " ".join(texts)
# Get extractive summary
sentences = self.extractive._split_sentences(combined)
# Score all sentences
all_words = self.extractive._tokenize(combined)
tf = self.extractive._compute_tf(all_words)
idf = self.extractive._compute_idf(sentences)
scored = []
for i, sent in enumerate(sentences):
score = self.extractive._score_sentence(sent, tf, idf, i, len(sentences))
scored.append((sent, score))
# Get top unique points
scored.sort(key=lambda x: x[1], reverse=True)
key_points = []
seen = set()
for sent, score in scored:
# Avoid similar sentences
words = set(sent.lower().split()[:5])
if not words & seen:
key_points.append(sent)
seen.update(words)
if len(key_points) >= n_points:
break
return key_points
def summarize_cluster(
self,
articles: List[Dict],
topic: str = "News"
) -> ArticleCluster:
"""Summarize a cluster of related articles."""
# Extract content from articles
texts = [a.get("content", a.get("summary", "")) for a in articles]
texts = [t for t in texts if t]
if not texts:
return ArticleCluster(
topic=topic,
articles=articles,
combined_summary="No content available",
key_points=[]
)
# Create combined text with source attribution
combined_parts = []
for i, (article, text) in enumerate(zip(articles, texts)):
source = article.get("source", f"Source {i+1}")
# Take first 500 chars from each
combined_parts.append(f"[{source}]: {text[:500]}")
combined_text = " ".join(combined_parts)
# Generate abstractive summary
try:
summary = self.abstractive.summarize(
combined_text,
max_length=200,
min_length=75
)
except Exception as e:
print(f"Abstractive summarization failed: {e}")
summary = self.extractive.summarize(combined_text, num_sentences=3)
# Extract key points
key_points = self._extract_key_points(texts)
return ArticleCluster(
topic=topic,
articles=articles,
combined_summary=summary,
key_points=key_points
)
def create_daily_digest(
self,
articles_by_category: Dict[str, List[Dict]]
) -> Dict[str, ArticleCluster]:
"""Create a daily digest from categorized articles."""
digest = {}
for category, articles in articles_by_category.items():
if articles:
# Take top 5 articles per category
top_articles = articles[:5]
cluster = self.summarize_cluster(top_articles, topic=category)
digest[category] = cluster
        return digest
Step 6: Flask Application
Create the web interface:
# app.py
import os
from flask import Flask, render_template, jsonify, request
from flask_cors import CORS
from datetime import datetime
from fetcher.rss_fetcher import RSSFetcher
from fetcher.article_extractor import ArticleExtractor
from summarizer.abstractive import AbstractiveSummarizer
from summarizer.extractive import ExtractiveSummarizer
app = Flask(__name__)
CORS(app)
# Initialize components
rss_fetcher = RSSFetcher()
article_extractor = ArticleExtractor()
extractive_summarizer = ExtractiveSummarizer()
# Lazy load abstractive summarizer (heavy model)
abstractive_summarizer = None
def get_abstractive_summarizer():
global abstractive_summarizer
if abstractive_summarizer is None:
abstractive_summarizer = AbstractiveSummarizer()
return abstractive_summarizer
@app.route("/")
def index():
return render_template("index.html")
@app.route("/api/news")
def get_news():
"""Fetch latest news from all categories."""
category = request.args.get("category", "all")
if category == "all":
all_news = rss_fetcher.fetch_all()
# Flatten and sort
articles = []
for cat, arts in all_news.items():
for art in arts[:10]: # 10 per category
articles.append({
"id": art.article_id,
"title": art.title,
"url": art.url,
"source": art.source,
"category": cat,
"summary": art.summary[:200] if art.summary else "",
"published": art.published.isoformat() if art.published else None,
"image": art.image_url
})
else:
articles_list = rss_fetcher.fetch_category(category)
articles = [
{
"id": art.article_id,
"title": art.title,
"url": art.url,
"source": art.source,
"category": category,
"summary": art.summary[:200] if art.summary else "",
"published": art.published.isoformat() if art.published else None,
"image": art.image_url
}
for art in articles_list[:20]
]
return jsonify({"articles": articles})
@app.route("/api/summarize", methods=["POST"])
def summarize_article():
"""Summarize a specific article."""
data = request.get_json()
url = data.get("url")
style = data.get("style", "standard")
method = data.get("method", "abstractive")
if not url:
return jsonify({"error": "URL required"}), 400
# Extract article content
content = article_extractor.extract(url)
if not content:
return jsonify({"error": "Could not extract article content"}), 400
# Generate summary
if method == "extractive":
num_sentences = {"brief": 2, "standard": 3, "detailed": 5}.get(style, 3)
summary = extractive_summarizer.summarize(content, num_sentences=num_sentences)
else:
summarizer = get_abstractive_summarizer()
summary = summarizer.summarize_news(content, style=style)
return jsonify({
"summary": summary,
"content_length": len(content),
"summary_length": len(summary),
"compression_ratio": f"{len(summary)/len(content)*100:.1f}%"
})
@app.route("/api/digest")
def get_digest():
"""Get daily news digest."""
all_news = rss_fetcher.fetch_all()
digest = {}
for category, articles in all_news.items():
if articles:
# Get summaries for top 3 articles
summaries = []
for art in articles[:3]:
summaries.append({
"title": art.title,
"source": art.source,
"summary": art.summary[:150] + "..." if art.summary else ""
})
digest[category] = summaries
return jsonify({"digest": digest, "generated_at": datetime.now().isoformat()})
if __name__ == "__main__":
    app.run(debug=True, port=5000)
Step 7: Web Dashboard
<!-- templates/index.html -->
<!DOCTYPE html>
<html>
<head>
<title>AI News Summarizer</title>
<style>
* { box-sizing: border-box; margin: 0; padding: 0; }
body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; background: #f0f2f5; }
.container { max-width: 1200px; margin: 0 auto; padding: 20px; }
header { background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 30px 20px; margin-bottom: 20px; border-radius: 12px; }
h1 { font-size: 28px; margin-bottom: 10px; }
.subtitle { opacity: 0.9; }
.categories { display: flex; gap: 10px; margin-bottom: 20px; flex-wrap: wrap; }
.category-btn { padding: 8px 16px; border: none; background: white; border-radius: 20px; cursor: pointer; font-size: 14px; transition: all 0.2s; }
.category-btn:hover, .category-btn.active { background: #667eea; color: white; }
.news-grid { display: grid; grid-template-columns: repeat(auto-fill, minmax(350px, 1fr)); gap: 20px; }
.news-card { background: white; border-radius: 12px; overflow: hidden; box-shadow: 0 2px 10px rgba(0,0,0,0.1); }
.news-card img { width: 100%; height: 180px; object-fit: cover; }
.news-card .content { padding: 20px; }
.news-card h3 { font-size: 16px; margin-bottom: 10px; line-height: 1.4; }
.news-card .meta { font-size: 12px; color: #666; margin-bottom: 10px; }
.news-card .summary { font-size: 14px; color: #444; line-height: 1.6; }
.news-card .actions { margin-top: 15px; display: flex; gap: 10px; }
button { padding: 8px 16px; border: none; border-radius: 6px; cursor: pointer; font-size: 13px; }
.btn-primary { background: #667eea; color: white; }
.btn-secondary { background: #e0e0e0; color: #333; }
.modal { display: none; position: fixed; top: 0; left: 0; width: 100%; height: 100%; background: rgba(0,0,0,0.5); align-items: center; justify-content: center; }
.modal.active { display: flex; }
.modal-content { background: white; padding: 30px; border-radius: 12px; max-width: 600px; width: 90%; max-height: 80vh; overflow-y: auto; }
.modal-content h2 { margin-bottom: 20px; }
.modal-content .summary-text { line-height: 1.8; font-size: 16px; }
.close-btn { float: right; font-size: 24px; cursor: pointer; }
.loading { opacity: 0.5; pointer-events: none; }
</style>
</head>
<body>
<div class="container">
<header>
<h1>AI News Summarizer</h1>
<p class="subtitle">Stay informed with AI-powered summaries</p>
</header>
<div class="categories">
<button class="category-btn active" data-category="all">All News</button>
<button class="category-btn" data-category="tech">Technology</button>
<button class="category-btn" data-category="general">General</button>
<button class="category-btn" data-category="science">Science</button>
</div>
<div class="news-grid" id="news-grid">
<p>Loading news...</p>
</div>
</div>
<div class="modal" id="summary-modal">
<div class="modal-content">
<span class="close-btn" onclick="closeModal()">×</span>
<h2 id="modal-title"></h2>
<div class="summary-text" id="modal-summary"></div>
</div>
</div>
<script>
let currentCategory = "all";
async function loadNews(category) {
currentCategory = category;
const grid = document.getElementById("news-grid");
grid.innerHTML = "<p>Loading...</p>";
document.querySelectorAll(".category-btn").forEach(btn => {
btn.classList.toggle("active", btn.dataset.category === category);
});
try {
const response = await fetch(`/api/news?category=${category}`);
const data = await response.json();
displayNews(data.articles);
} catch (error) {
grid.innerHTML = "<p>Error loading news</p>";
}
}
function displayNews(articles) {
const grid = document.getElementById("news-grid");
grid.innerHTML = articles.map(article => `
<div class="news-card">
${article.image ? `<img src="${article.image}" alt="">` : ""}
<div class="content">
<h3>${article.title}</h3>
<div class="meta">${article.source} | ${article.category}</div>
<p class="summary">${article.summary}</p>
<div class="actions">
<button class="btn-primary" onclick="summarize('${article.url}', '${article.title.replace(/'/g, "")}')">
AI Summary
</button>
<button class="btn-secondary" onclick="window.open('${article.url}', '_blank')">
Read Full
</button>
</div>
</div>
</div>
`).join("");
}
async function summarize(url, title) {
const modal = document.getElementById("summary-modal");
const modalTitle = document.getElementById("modal-title");
const modalSummary = document.getElementById("modal-summary");
modalTitle.textContent = title;
modalSummary.innerHTML = "Generating AI summary...";
modal.classList.add("active");
try {
const response = await fetch("/api/summarize", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, style: "standard", method: "abstractive" })
});
const data = await response.json();
if (data.error) {
modalSummary.innerHTML = `Error: ${data.error}`;
} else {
modalSummary.innerHTML = `
<p>${data.summary}</p>
<hr style="margin: 20px 0">
<small>Compression: ${data.compression_ratio} | Original: ${data.content_length} chars</small>
`;
}
} catch (error) {
modalSummary.innerHTML = "Failed to generate summary";
}
}
function closeModal() {
document.getElementById("summary-modal").classList.remove("active");
}
document.querySelectorAll(".category-btn").forEach(btn => {
btn.addEventListener("click", () => loadNews(btn.dataset.category));
});
// Close modal on outside click
document.getElementById("summary-modal").addEventListener("click", (e) => {
if (e.target.id === "summary-modal") closeModal();
});
// Load initial news
loadNews("all");
</script>
</body>
</html>
Running the Application
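Before starting the server, it can help to smoke-test the fetch, extract, and summarize steps from a Python shell. A minimal sketch, run from the project root (the category and slice sizes are arbitrary, and the extractive summarizer is used so no model download is needed):
# Quick pipeline smoke test (run from the project root).
from fetcher.rss_fetcher import RSSFetcher
from fetcher.article_extractor import ArticleExtractor
from summarizer.extractive import ExtractiveSummarizer
articles = RSSFetcher().fetch_category("tech")[:3]   # a few recent tech articles
extractor = ArticleExtractor()
summarizer = ExtractiveSummarizer()                  # fast, no model download
for article in articles:
    content = extractor.extract(article.url)
    if content:
        print(article.title)
        print(summarizer.summarize(content, num_sentences=2))
        print("-" * 60)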
# Start the application
python app.py
# Access at http://localhost:5000
Conclusion
You’ve built a complete AI news summarization system that fetches, processes, and summarizes news from multiple sources. This system demonstrates production patterns including multi-source aggregation, hybrid summarization, and an interactive web interface.
Key takeaways:
- Hybrid summarization combines extractive speed with abstractive quality
- RSS feeds provide reliable, structured news access
- Trafilatura excels at content extraction from web pages
- BART and similar models generate fluent abstractive summaries
- Lazy-loading the heavy transformer model keeps startup fast; adding summary caching would further cut repeated work
This foundation can be extended with personalization based on reading history, topic clustering for related stories, email digest delivery, and integration with mobile push notifications.
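For example, topic clustering could be added with a few lines of scikit-learn, grouping fetched articles by TF-IDF similarity before handing each group to the MultiDocSummarizer. A rough sketch, assuming scikit-learn is installed and treating the distance threshold as a tuning knob:
# Sketch: group related articles into topics by TF-IDF similarity.
# Assumes scikit-learn is installed; the distance threshold is illustrative.
from typing import Dict, List
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
def cluster_articles(articles: List[Dict], distance_threshold: float = 1.2) -> Dict[int, List[Dict]]:
    """Group article dicts (with 'title' and 'summary' keys) into topic clusters."""
    texts = [f"{a['title']} {a.get('summary', '')}" for a in articles]
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
    ).fit_predict(vectors)
    clusters: Dict[int, List[Dict]] = {}
    for label, article in zip(labels, articles):
        clusters.setdefault(int(label), []).append(article)
    return clusters
# Each cluster can then be passed to MultiDocSummarizer.summarize_cluster(...)
# to produce one combined summary per story.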
