Show HN: ArXiv-txt, LLM-friendly ArXiv papers
arxiv-txt.org
Just change arxiv.org to arxiv-txt.org in the URL to get the paper info in markdown.
Example:
Original URL: https://arxiv.org/abs/1706.03762
Change to: https://arxiv-txt.org/abs/1706.03762
To fetch the raw text directly, use https://arxiv-txt.org/raw/abs/1706.03762. This is particularly useful for APIs and agents.
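For anyone scripting this, here is a minimal Python sketch of the URL swap described above. The function names `to_arxiv_txt_url` and `fetch_paper_text` are made up for illustration and are not part of the site. The sketch assumes the `/abs/...` and `/raw/abs/...` paths behave as in the example, and that the raw endpoint returns UTF-8 text, which has not been verified here.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def to_arxiv_txt_url(arxiv_url: str, raw: bool = False) -> str:
    """Rewrite an arxiv.org URL to the matching arxiv-txt.org URL.

    Hypothetical helper: it only swaps the host and, if raw=True,
    prefixes the path with /raw, mirroring the example in the post.
    """
    parts = urlsplit(arxiv_url)
    if parts.netloc.lower() not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {arxiv_url}")
    path = "/raw" + parts.path if raw else parts.path
    return urlunsplit(("https", "arxiv-txt.org", path, parts.query, parts.fragment))


def fetch_paper_text(arxiv_url: str, timeout: float = 10.0) -> str:
    """Download the raw text for a paper (performs a network request)."""
    url = to_arxiv_txt_url(arxiv_url, raw=True)
    with urlopen(url, timeout=timeout) as resp:
        # Assumption: the response body is UTF-8 encoded text.
        return resp.read().decode("utf-8", errors="replace")
```

Calling `to_arxiv_txt_url("https://arxiv.org/abs/1706.03762", raw=True)` gives the raw URL from the example, and `fetch_paper_text` passes that URL to `urlopen`. Only the standard library is used, so it should drop into an agent tool without extra dependencies.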
It just extracts the abstracts?
For now, yes - abstracts and other metadata
do you plan on adding descriptions of figures and tables?
will probably focus on getting the text out of the papers first, figures might be a good next step after that
This would be awesome wrapped in an MCP server/tool call :)
whoa - I haven't played with MCP yet - might be a good first project!
The example you give doesn't seem to work - the raw txt does not have authors.
you're right - I hadn't noticed! I fixed it now, thanks for pointing it out
If you train an LLM on only formally verified code, it should not be expected to generate formally verified code.
Similarly, if you train an LLM on only published ScholarlyArticles ['s abstracts], it should not be expected to generate publishable or true text.
Traceability for Retraction would be necessary to prevent lossy feedback.