PaperFlux
PaperFlux helps you read scientific papers faster by automatically extracting exact quotations, organizing them by category (e.g., contributions, limitations, claims, evidence), and annotating your PDFs with precise highlights. It also produces a concise, structured summary you can share or extend. PaperFlux works with either OpenAI or Anthropic (Claude) models; you select the backend in config.yaml and provide the matching API key.
The Idea
- Give the model the paper and let it find the most relevant passages: OpenAI uses server-side file search over a temporary vector store, while Anthropic (Claude) reads the PDF directly in context.
- Ask the model to return structured results (JSON) with exact quotes and page numbers per category.
- Turn that into actionable artifacts: an annotated PDF with color-coded highlights and a clean Markdown summary.
- Keep everything reproducible: save the extracted quotes alongside your outputs so you can re-annotate without re-running extraction.
Features
- Pluggable LLM backend: OpenAI or Anthropic (Claude), selected via
providerin config - Batch CLI: [options] *.pdf
- YAML config with LLMs, prompts, colors, defaults
- Three detail levels (low / medium / high)
- RAG retrieval and summarization pipeline
- User-editable prompt templates (Jinja2)
- RGB highlight colors editable in config
- Optional output directory for generated artifacts
- Stage-level CLI progress with elapsed time during extraction and annotation
- Markdown summary injected as sticky note on page 1
- Color-coded highlights: contributions (Y), limitations (O), claims (B), evidence (G)
- Quote-match report with matched/skipped counts, pages, scores, and verbose layout-gap diagnostics
Getting Started
1. Installation
Install the latest release from PyPI:
python -m pip install paperflux
For local development from a cloned repository:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
2. Choose a Provider and Configure the API Key
PaperFlux supports two backends, selected by the provider key in config.yaml:
# "openai" (default) or "anthropic"
provider: "openai"
openai:
api_key: "ENV:PAPERFLUX_OPENAI_API_KEY"
model: "gpt-5.4-mini"
# Used when provider is "anthropic"
anthropic:
api_key: "ENV:PAPERFLUX_ANTHROPIC_API_KEY"
model: "claude-opus-4-8"
Only the selected provider’s block is required; PaperFlux validates that the chosen provider has a matching configuration block. The ENV: prefix means PaperFlux expands the named environment variable at runtime. A PaperFlux-specific key makes it easier to monitor package-related usage and cost separately.
Choose one of the following setups (substitute PAPERFLUX_ANTHROPIC_API_KEY and an sk-ant- key when using Anthropic):
-
Temporary (current shell only):
export PAPERFLUX_OPENAI_API_KEY="sk-your-key" -
Persistent for zsh (macOS default):
echo 'export PAPERFLUX_OPENAI_API_KEY="sk-your-key"' >> ~/.zshrc source ~/.zshrc -
Using a
.envfile (auto-loaded):Create a file named
.envin the repository root:echo 'PAPERFLUX_OPENAI_API_KEY=sk-your-key' > .envPaperFlux loads
.envautomatically because it callsload_dotenv()during config load. -
Inline in
config.yaml(not recommended for commits):openai: api_key: "sk-your-key"
3. Usage
Run the CLI:
paperflux init
paperflux --config config.yaml path/to/your.pdf
# or reuse prior quotes without extraction
paperflux --config config.yaml --quotes-file your_paper_quotes.json path/to/your.pdf
Options
--config,-c: Path to configuration file (required)--detail,-d: Detail level (low/medium/high, overrides config)--verbose: Enable internal logs, per-quote match details, and layout-gap diagnostics--output-dir,-o: Directory where annotated PDFs and summaries will be saved--progress/--no-progress: Show or hide stage-level progress updates with elapsed time (default: shown)--quotes-file: Path to JSON quotes file to annotate without rerunning extraction--version,-V: Show the PaperFlux version and exit
Use paperflux init [directory] to create a starter config.yaml and editable prompt templates in prompts/. Existing files are not overwritten unless you pass --force.
Customizing Prompts
PaperFlux uses three editable Jinja2 templates in the prompts/ directory to control how the AI extracts and summarizes information:
rag_category_system_prompt.txt and rag_category_system_prompt_anthropic.txt
The system prompt that instructs the AI assistant on its role and behavior. PaperFlux uses a provider-specific variant: rag_category_system_prompt.txt for OpenAI (which retrieves passages via file_search) and rag_category_system_prompt_anthropic.txt for Anthropic (which reads the attached PDF directly). Both define the same extraction rules:
- Find the relevant passages (file_search for OpenAI; the attached PDF for Anthropic)
- Extract near-verbatim quotations with accurate page numbers
- Avoid including section/table/figure references in quotes
- Return structured JSON without code fences
This is the “personality” of the extraction assistant. The file used for each provider is configurable via rag.category_system_prompt_file and rag.category_system_prompt_file_anthropic.
rag_category_prompt.j2
The user prompt template for category extraction. It:
- Lists all extraction categories (contributions, limitations, claims, evidence) with their descriptions
- Specifies the JSON output format with categories, quotes, pages, and category summaries
- Gets rendered with variables from
config.yaml(categories list)
Edit config.yaml (under extraction_categories) to change what categories are extracted. Edit this template to modify the output structure or instructions.
rag_summary_prompt.j2
The prompt template for generating the final Markdown summary. It:
- Receives all category summaries as input
- Uses the detail level (low/medium/high) to control summary length
- Produces a cohesive narrative summary integrating all categories
Edit this to change how the final summary is structured or what information it emphasizes.
All three templates can be customized without modifying any Python code—just edit the files in prompts/ and rerun PaperFlux.
Model and Retrieval Settings
OpenAI
The default OpenAI model is gpt-5.4-mini, which supports the Responses API, structured outputs, and file search. You can change it in config.yaml:
openai:
model: "gpt-5.4-mini"
File-search retrieval can also be tuned without changing Python code:
ui:
reasoning_effort: "medium"
max_output_tokens: 32768
matching:
min_similarity: 0.88
max_window_tokens: 80
rag:
max_num_results: # leave empty to let OpenAI choose
max_quotes_per_category: 6
include_search_results: false
vector_store_expires_after_days: 1
Use max_num_results to cap retrieved passages when latency or cost matters. Use max_quotes_per_category to keep the structured JSON response bounded. The local highlighter uses matching.min_similarity and matching.max_window_tokens to align returned quotes to real PDF word spans; raising the similarity threshold improves precision, while lowering it improves recall. Enabling include_search_results is useful for debugging retrieval, but it increases the response payload.
Anthropic (Claude)
When provider: "anthropic", set the model under the anthropic block:
provider: "anthropic"
anthropic:
model: "claude-opus-4-8"
Claude reads the PDF directly (sent as a document in the request), so there is no vector store and the rag.max_num_results, rag.include_search_results, and rag.vector_store_expires_after_days settings do not apply. The shared ui and rag knobs still apply:
ui.reasoning_effortmaps to Claude’s thinking:"none"disables thinking, andlow/medium/high/xhighenable adaptive thinking at the corresponding effort.ui.max_output_tokenscaps the response (requests stream, so large values are supported).rag.max_quotes_per_categorybounds the structured JSON response, and the samematching.*settings align quotes to PDF word spans.
Page numbers are returned as a field in the structured JSON (citations and structured output cannot be combined in the Messages API), keeping the output identical to the OpenAI path.
Quote matching
The highlighter first tries exact and fuzzy contiguous matching. If those fail, it can fall back to layout-gap matching for quotes split by page layout interruptions such as tables, figures, captions, or column breaks. Layout-gap matches highlight only the words that belong to the quote and skip intervening layout artifacts.
Brief History
- v1.0.20250525: Initial version using local text extraction, LLM-based processing of extracted text, and fuzzy quote matching for PDF annotation.
- v2.0.20250625: Migrated to OpenAI Responses API, adopted file_search with vector stores.
- v2.1.20251115: Added GPT-5 support, and bundled multi-category extraction into a single structured call.
- v2.2.20251115: Added n-gram quote matching for improved PDF annotation.
- v3.0.20251116: Added token-based quote matching for improved PDF annotation. Removed fuzzy- and n-gram-based matching.
- v3.1.20251116: Introduced a
--quotes-fileflow to re-annotate quickly without re-running extraction. - v3.2.20260527: Updated default model to GPT-5.4 mini and added file-search tuning options.
- v3.3.20260527: Added local quote span alignment for more accurate PDF highlights.
- v3.4.20260528: Added layout-gap quote matching for quotes split by tables, figures, captions, or column breaks; added quote-match reports, stage-level CLI progress, concise default CLI output with verbose diagnostics, package metadata, CLI validation improvements, and the
PAPERFLUX_OPENAI_API_KEYdefault for easier cost tracking. - v4.0.20260530: Added Anthropic (Claude) as a selectable LLM backend via a
providerconfig key, alongside a pluggable provider registry. The Anthropic path reads the PDF directly (no vector store), returns the same structured JSON with page numbers, and mapsreasoning_effortto adaptive thinking. - v4.1.20260613: Added a
--version/-VCLI flag that reports the installed package version.
Contributing
Contributions are very welcome! Whether you’re fixing a bug, improving extraction accuracy, or adding a new workflow, here’s how to get started:
- Fork the repository and create a feature branch.
- Set up the project and run tests locally.
- Open a PR describing the change and its impact.
Issues, ideas, or questions? Please open an issue or start a discussion—feedback helps shape the roadmap.
For more information about configuration options and advanced features, see config.yaml in the repository.