PaperFlux
PaperFlux helps you read scientific papers faster by automatically extracting exact quotations, organizing them by category (e.g., contributions, limitations, claims, evidence), and annotating your PDFs with precise highlights. It also produces a concise, structured summary you can share or extend. The current version requires an OpenAI API key.
The Idea
- Use Retrieval-Augmented Generation (RAG) to find the most relevant passages inside your paper via server-side file search.
- Ask the model to return structured results (JSON) with exact quotes and page numbers per category.
- Turn that into actionable artifacts: an annotated PDF with color-coded highlights and a clean Markdown summary.
- Keep everything reproducible: save the extracted quotes alongside your outputs so you can re-annotate without re-running extraction.
Features
- Batch CLI: [options] *.pdf
- YAML config with LLMs, prompts, colors, defaults
- Three detail levels (low / medium / high)
- RAG retrieval and summarization pipeline
- User-editable prompt templates (Jinja2)
- RGB highlight colors editable in config
- Optional output directory for generated artifacts
- Markdown summary injected as sticky note on page 1
- Color-coded highlights: contributions (Y), limitations (O), claims (B), evidence (G)
Getting Started
1. Installation
Quick start (local):
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
2. Configure the API Key
PaperFlux requires an OpenAI API key. The key is read from an environment variable referenced in config.yaml:
-
In
config.yaml, the default is:openai: api_key: "ENV:OPENAI_API_KEY"The
ENV:prefix means PaperFlux will expand the environment variableOPENAI_API_KEYat runtime.
Choose one of the following setups:
-
Temporary (current shell only):
export OPENAI_API_KEY="sk-your-key" -
Persistent for zsh (macOS default):
echo 'export OPENAI_API_KEY="sk-your-key"' >> ~/.zshrc source ~/.zshrc -
Using a
.envfile (auto-loaded):Create a file named
.envin the repository root:echo 'OPENAI_API_KEY=sk-your-key' > .envPaperFlux loads
.envautomatically because it callsload_dotenv()during config load. -
Inline in
config.yaml(not recommended for commits):openai: api_key: "sk-your-key"
3. Usage
Run the CLI:
python -m src.cli --config config.yaml path/to/your.pdf
# or reuse prior quotes without extraction
python -m src.cli --config config.yaml --quotes-file your_paper_quotes.json path/to/your.pdf
Options
--config,-c: Path to configuration file (required)--detail,-d: Detail level (low/medium/high, overrides config)--verbose: Enable verbose output--output-dir,-o: Directory where annotated PDFs and summaries will be saved--progress/--no-progress: Show or hide per-file progress updates (default: shown)--quotes-file: Path to JSON quotes file to annotate without rerunning extraction
Customizing Prompts
PaperFlux uses three editable Jinja2 templates in the prompts/ directory to control how the AI extracts and summarizes information:
rag_category_system_prompt.txt
The system prompt that instructs the AI assistant on its role and behavior. It defines the extraction rules:
- Use RAG file_search to find relevant passages
- Extract near-verbatim quotations with accurate page numbers
- Avoid including section/table/figure references in quotes
- Return structured JSON without code fences
This is the “personality” of the extraction assistant.
rag_category_prompt.j2
The user prompt template for category extraction. It:
- Lists all extraction categories (contributions, limitations, claims, evidence) with their descriptions
- Specifies the JSON output format with categories, quotes, pages, and category summaries
- Gets rendered with variables from
config.yaml(categories list)
Edit config.yaml (under extraction_categories) to change what categories are extracted. Edit this template to modify the output structure or instructions.
rag_summary_prompt.j2
The prompt template for generating the final Markdown summary. It:
- Receives all category summaries as input
- Uses the detail level (low/medium/high) to control summary length
- Produces a cohesive narrative summary integrating all categories
Edit this to change how the final summary is structured or what information it emphasizes.
All three templates can be customized without modifying any Python code—just edit the files in prompts/ and rerun PaperFlux.
Brief History
- v1.0.20250525: Initial version using local text extraction, LLM-based processing of extracted text, and fuzzy quote matching for PDF annotation.
- v2.0.20250625: Migrated to OpenAI Responses API, adopted file_search with vector stores.
- v2.1.20251115: Added GPT-5 support, and bundled multi-category extraction into a single structured call.
- v2.2.20251115: Added n-gram quote matching for improved PDF annotation.
- v3.0.20251116: Added token-based quote matching for improved PDF annotation. Removed fuzzy- and n-gram-based matching.
- v3.1.20251116: Introduced a
--quotes-fileflow to re-annotate quickly without re-running extraction.
Contributing
Contributions are very welcome! Whether you’re fixing a bug, improving extraction accuracy, or adding a new workflow, here’s how to get started:
- Fork the repository and create a feature branch.
- Set up the project and run tests locally.
- Open a PR describing the change and its impact.
Issues, ideas, or questions? Please open an issue or start a discussion—feedback helps shape the roadmap.
For more information about configuration options and advanced features, see config.yaml in the repository.