设计搜索引擎

问题

如何用 Rust 设计一个全文搜索引擎？

答案

倒排索引架构

简化的倒排索引实现

use std::collections::HashMap;

/// 倒排索引
pub struct InvertedIndex {
    /// term → (doc_id, term_frequency)
    index: HashMap<String, Vec<(usize, u32)>>,
    /// doc_id → 原始文档
    documents: Vec<String>,
}

impl InvertedIndex {
    pub fn new() -> Self {
        Self {
            index: HashMap::new(),
            documents: Vec::new(),
        }
    }

    /// 索引一个文档
    pub fn add_document(&mut self, content: &str) {
        let doc_id = self.documents.len();
        self.documents.push(content.to_string());

        // 分词并统计词频
        let mut term_freq: HashMap<String, u32> = HashMap::new();
        for word in content.split_whitespace() {
            let normalized = word.to_lowercase();
            *term_freq.entry(normalized).or_default() += 1;
        }

        // 更新倒排索引
        for (term, freq) in term_freq {
            self.index.entry(term)
                .or_default()
                .push((doc_id, freq));
        }
    }

    /// 搜索：返回匹配的文档 ID（按相关性排序）
    pub fn search(&self, query: &str) -> Vec<(usize, f64)> {
        let terms: Vec<String> = query.split_whitespace()
            .map(|w| w.to_lowercase())
            .collect();

        let mut scores: HashMap<usize, f64> = HashMap::new();
        let total_docs = self.documents.len() as f64;

        for term in &terms {
            if let Some(postings) = self.index.get(term) {
                // IDF = log(N / df)
                let idf = (total_docs / postings.len() as f64).ln();

                for &(doc_id, tf) in postings {
                    // TF-IDF 评分
                    let tf_score = 1.0 + (tf as f64).ln();
                    *scores.entry(doc_id).or_default() += tf_score * idf;
                }
            }
        }

        let mut results: Vec<(usize, f64)> = scores.into_iter().collect();
        results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        results
    }
}

Tantivy：Rust 版 Lucene

Tantivy 是 Rust 实现的全文搜索引擎库，相当于 Java 的 Lucene：

use tantivy::schema::*;
use tantivy::{Index, doc};

// 定义 Schema
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();

// 创建索引
let index = Index::create_in_ram(schema);
let mut writer = index.writer(50_000_000)?; // 50MB 缓冲

writer.add_document(doc!(
    title => "Rust Programming",
    body => "Rust is a systems programming language"
))?;
writer.commit()?;

// 搜索
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = tantivy::query::QueryParser::for_index(&index, vec![title, body]);
let query = query_parser.parse_query("rust programming")?;
let top_docs = searcher.search(&query, &tantivy::collector::TopDocs::with_limit(10))?;

维度	Elasticsearch	Tantivy
语言	Java (Lucene)	Rust
部署	独立服务	嵌入式库
性能	高	更高（无 GC）
分布式	内置	需要 Quickwit

常见面试问题

Q1: 倒排索引的空间优化？

答案：

前缀压缩：相邻 term 共享前缀（FST: Finite State Transducer）
差值编码：Posting List 存差值而非绝对值 [1, 5, 8] → [1, 4, 3]
VByte 编码：变长字节编码，小数字用更少字节
Roaring Bitmap：高效的位图存储 Doc ID 集合

Tantivy 内置了这些优化。

问题​

答案​

倒排索引架构​

简化的倒排索引实现​

Tantivy：Rust 版 Lucene​

常见面试问题​

Q1: 倒排索引的空间优化？​

相关链接​

问题

答案

倒排索引架构

简化的倒排索引实现

Tantivy：Rust 版 Lucene

常见面试问题

Q1: 倒排索引的空间优化？

相关链接