设计搜索引擎
问题
如何用 Rust 设计一个全文搜索引擎?
答案
倒排索引架构
简化的倒排索引实现
use std::collections::HashMap;
/// 倒排索引
pub struct InvertedIndex {
/// term → (doc_id, term_frequency)
index: HashMap<String, Vec<(usize, u32)>>,
/// doc_id → 原始文档
documents: Vec<String>,
}
impl InvertedIndex {
pub fn new() -> Self {
Self {
index: HashMap::new(),
documents: Vec::new(),
}
}
/// 索引一个文档
pub fn add_document(&mut self, content: &str) {
let doc_id = self.documents.len();
self.documents.push(content.to_string());
// 分词并统计词频
let mut term_freq: HashMap<String, u32> = HashMap::new();
for word in content.split_whitespace() {
let normalized = word.to_lowercase();
*term_freq.entry(normalized).or_default() += 1;
}
// 更新倒排索引
for (term, freq) in term_freq {
self.index.entry(term)
.or_default()
.push((doc_id, freq));
}
}
/// 搜索:返回匹配的文档 ID(按相关性排序)
pub fn search(&self, query: &str) -> Vec<(usize, f64)> {
let terms: Vec<String> = query.split_whitespace()
.map(|w| w.to_lowercase())
.collect();
let mut scores: HashMap<usize, f64> = HashMap::new();
let total_docs = self.documents.len() as f64;
for term in &terms {
if let Some(postings) = self.index.get(term) {
// IDF = log(N / df)
let idf = (total_docs / postings.len() as f64).ln();
for &(doc_id, tf) in postings {
// TF-IDF 评分
let tf_score = 1.0 + (tf as f64).ln();
*scores.entry(doc_id).or_default() += tf_score * idf;
}
}
}
let mut results: Vec<(usize, f64)> = scores.into_iter().collect();
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
results
}
}
Tantivy:Rust 版 Lucene
Tantivy 是 Rust 实现的全文搜索引擎库,相当于 Java 的 Lucene:
use tantivy::schema::*;
use tantivy::{Index, doc};
// 定义 Schema
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// 创建索引
let index = Index::create_in_ram(schema);
let mut writer = index.writer(50_000_000)?; // 50MB 缓冲
writer.add_document(doc!(
title => "Rust Programming",
body => "Rust is a systems programming language"
))?;
writer.commit()?;
// 搜索
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = tantivy::query::QueryParser::for_index(&index, vec![title, body]);
let query = query_parser.parse_query("rust programming")?;
let top_docs = searcher.search(&query, &tantivy::collector::TopDocs::with_limit(10))?;
| 维度 | Elasticsearch | Tantivy |
|---|---|---|
| 语言 | Java (Lucene) | Rust |
| 部署 | 独立服务 | 嵌入式库 |
| 性能 | 高 | 更高(无 GC) |
| 分布式 | 内置 | 需要 Quickwit |
常见面试问题
Q1: 倒排索引的空间优化?
答案:
- 前缀压缩:相邻 term 共享前缀(FST: Finite State Transducer)
- 差值编码:Posting List 存差值而非绝对值
[1, 5, 8]→[1, 4, 3] - VByte 编码:变长字节编码,小数字用更少字节
- Roaring Bitmap:高效的位图存储 Doc ID 集合
Tantivy 内置了这些优化。