senkora 2 days ago

> turned up nothing but vietnamese computer scientists, and nothing about the famous blog post “ORM is the vietnam of computer science”. [emphasis added]

This points in the direction of the kinds of queries that I tend to use with Marginalia. I've found it to be very helpful in finding well-written blog posts about a variety of subjects, not just technical ones. I tend to use Marginalia when I am in the mood to find and read such articles.

This is also largely the same reason that I read HN. My current approach is to 1) read HN on a regular schedule, 2) search Marginalia if there is a specific topic that I want, and then 3) add interesting blogs from either to my RSS reader app.

ColinHayhurst 2 days ago

Congrats, Viktor.

> The feedback cycle in web search engine development is very long. [...] Overall the approach taken to improving search result quality is looking at a query that does not give good results, asking what needs to change for that to improve, and then making that change. Sometimes it’s a small improvement, sometimes it’s a huge game changer.

Yes, this resonates with our experience.

  • outime 2 days ago

    This comment felt like indirect spam, as it doesn't really contribute anything IMHO. The phrase "in our experience" implies they’re in the same business, and upon checking the profile, I found that aside from the bio linking to their own thing, most comments resemble (in)direct spam. Everyone has their strategies, but I really disliked seeing this.

    • marginalia_nu 2 days ago

      Eh, I think it's always interesting to compare notes with other search projects.

      It's a small niche, and I think we're all rooting for each other.

pmdulaney 2 days ago

Amazing! "bicycle touring in France" as a search target produces a huge number of spot-on, beautifully formatted results.

gary_0 2 days ago

> To make the most of phrase matching, stop words need to go.

Perhaps I am misunderstanding; does this mean occurrences of stop words like "the" are stored now instead of ignored? That seems like it would add a lot of bloat. Are there any optimizations in place?

Just a shot-in-the-dark suggestion, but if you are storing some bits with each keyword occurrence, can you add a few more bits to store whether the term is adjacent to a common stop word? So maybe, encoding to=0 and or=1, "to be or not to be" would be able to match the data `0be 1not 0be`, where only "be" and "not" are actual keywords. But the extra metadata bits can be ignored, so pages containing "The Clash" will match both the literal query (via the "the" bit) and just "clash" (without the "the" bit).
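
Roughly what I'm imagining, as a toy sketch (the names, bit layout, and stop word list here are made up for illustration, not anything Marginalia actually does):

  import java.util.*;

  // Toy encoding: alongside each kept keyword, store a small code saying
  // which stop word (if any) immediately precedes it. Entirely hypothetical,
  // not Marginalia's actual on-disk format.
  class StopWordBits {
      static final List<String> STOP_WORDS = List.of("to", "or", "the"); // code = list index

      record Occurrence(String term, int position, int precedingStop) {} // -1 = none

      static List<Occurrence> index(String[] tokens) {
          List<Occurrence> out = new ArrayList<>();
          for (int i = 0; i < tokens.length; i++) {
              if (STOP_WORDS.contains(tokens[i])) continue; // stop words themselves aren't stored
              int prev = i > 0 ? STOP_WORDS.indexOf(tokens[i - 1]) : -1;
              out.add(new Occurrence(tokens[i], i, prev));
          }
          return out;
      }

      public static void main(String[] args) {
          for (Occurrence o : index("to be or not to be".split(" ")))
              System.out.printf("%d%s%n", o.precedingStop(), o.term());
          // prints 0be, 1not, 0be -- a query for just "clash" can ignore the
          // stop bits, while a literal "the clash" requires code 2 ("the").
      }
  }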

  • heikkilevanto 2 days ago

    One of the problems with stop words is that they vary between languages. "The" is a good candidate in English, but in Danish it just means "tea", which should be a valid search term. And even in English, what looks like an obvious stop word can be an integral part of a phrase: "How to use The in English".
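
    To make that concrete, a toy sketch of a language-keyed stop word filter (word lists abbreviated and purely illustrative):

      import java.util.*;

      // Stop word filtering keyed by language; the lists are abbreviated toys.
      class LangStopWords {
          static final Map<String, Set<String>> STOP = Map.of(
                  "en", Set.of("the", "to", "or"),
                  "da", Set.of("og", "i", "at")); // "the" (tea) is NOT a Danish stop word

          static List<String> filter(String lang, String query) {
              Set<String> stop = STOP.getOrDefault(lang, Set.of());
              return Arrays.stream(query.toLowerCase().split("\\s+"))
                           .filter(t -> !stop.contains(t))
                           .toList();
          }

          public static void main(String[] args) {
              System.out.println(filter("en", "How to use The in English")); // [how, use, in, english] -- query ruined
              System.out.println(filter("da", "en kop the"));                // [en, kop, the] -- "the" (tea) survives
          }
      }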

  • marginalia_nu 2 days ago

    It's not as bad as you might think; we're talking dozens of GB across the entire index.

    I don't think stopwords as an optimization make sense when you go beyond BM25. The search engine behaves worse, and adding a bunch of optimizations makes an already incredibly complex piece of software more so.

    So overall I don't think the juice is worth the squeeze.

  • joking a day ago

    In reality, terms are stored with their positions, so if you use stopwords, "to be or not to be" would be indexed as 2-be 6-be, and a phrase match should match exactly. The problem is that it would also match "be not to or be", since the two occurrences of "be" are the same distance apart.

    A long time ago it was necessary, but nowadays you lose more than you gain by using stopwords.
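
    A quick sketch of that false positive (naive and purely illustrative, not any particular engine's implementation):

      import java.util.*;

      // Toy demo: strip stop words but keep positions, then phrase-match by
      // relative offsets. For brevity the whole "document" is the candidate span.
      class PhraseDistance {
          static final Set<String> STOP = Set.of("to", "or", "not");

          record Occ(String term, int pos) {}

          static List<Occ> keep(String text) {
              List<Occ> out = new ArrayList<>();
              String[] toks = text.split(" ");
              for (int i = 0; i < toks.length; i++)
                  if (!STOP.contains(toks[i]))
                      out.add(new Occ(toks[i], i + 1)); // 1-based position, stop words skipped
              return out;
          }

          // Same surviving terms at the same relative distances?
          static boolean phraseMatch(String doc, String query) {
              List<Occ> d = keep(doc), q = keep(query);
              if (q.isEmpty() || d.size() != q.size()) return false;
              int shift = d.get(0).pos() - q.get(0).pos();
              for (int i = 0; i < q.size(); i++)
                  if (!d.get(i).term().equals(q.get(i).term())
                          || d.get(i).pos() != q.get(i).pos() + shift)
                      return false;
              return true;
          }

          public static void main(String[] args) {
              System.out.println(keep("to be or not to be"));        // [Occ[term=be, pos=2], Occ[term=be, pos=6]]
              System.out.println(phraseMatch("to be or not to be",
                                             "to be or not to be")); // true, as expected
              System.out.println(phraseMatch("be not to or be",
                                             "to be or not to be")); // true -- the false positive
          }
      }

    The relative distances survive the reordering, so the naive check can't tell the two apart.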

  • ValleZ 2 days ago

    Removing stop words is usually bad advice, beneficial only in a limited set of circumstances. Google keeps all the “the”s: https://www.google.com/search?q=the

    • efilife 16 hours ago

      I don't think it's as simple for them as just keeping all the “the”s. It's probably very context-dependent.

hosteur 2 days ago

Always nice to see updates on marginalia.

arromatic 2 days ago

1. Is the index public? 2. Any chance of an RSS feed search?

  • marginalia_nu 2 days ago

    1. I'm not sure what you mean. The code is open source[3], but the data is, for logistical reasons, not available. Common Crawl is far more comprehensive though.

    2. I've got such plans in the pipeline. Not sure when I'll have time to implement it, as I'm in the middle of moving in with my girlfriend this month. Soon-ish.

    [3] at https://git.marginalia.nu/ , though there are still some rough edges to sand down before it's easy to self-host (as easy as hosting a full-blown internet search engine gets).

    • arromatic 2 days ago

      Thanks. What you answered at 1 is what I meant. I was looking for a small web dataset, but CC is too big for me to process.

      1. Do you know of any dataset of RSS feeds that isn't hundreds of GBs?

      2. How does your crawler handle malicious sites when crawling?

      • marginalia_nu 2 days ago

        1. Here are all RSS feeds known to the search engine as of some point in 2023: https://downloads.marginalia.nu/exports/feeds.csv -- it's quite noisy though; a fair number of them are anything but small web. You should be able to fetch them all in a few hours, I'd reckon, and have a sample dataset to play with (rough fetch sketch at the end of this comment). There's also more data at https://downloads.marginalia.nu/exports/ , e.g. a domain-level link graph, if you want to experiment more in this space.

        2. It's a constant whack-a-mole to reverse-engineer and prevent search engine spam. Luckily I kinda like the game. It also helps that it's a search engine, so it's quite possible to use the search engine itself to find the malicious results, by searching for the sorts of topics where they tend to crop up, e.g. e-pharma, prostitution, etc.
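
        As for fetching the feed list, a minimal sketch -- assuming the feed URL is in the first CSV column, which you should verify against the actual file:

          import java.net.URI;
          import java.net.http.*;
          import java.nio.file.*;
          import java.time.Duration;

          // Sketch: download feeds.csv, then fetch each feed sequentially.
          class FeedFetcher {
              public static void main(String[] args) throws Exception {
                  HttpClient client = HttpClient.newBuilder()
                          .followRedirects(HttpClient.Redirect.NORMAL)
                          .connectTimeout(Duration.ofSeconds(10))
                          .build();

                  String csv = client.send(
                          HttpRequest.newBuilder(URI.create("https://downloads.marginalia.nu/exports/feeds.csv")).build(),
                          HttpResponse.BodyHandlers.ofString()).body();

                  Files.createDirectories(Path.of("feeds"));
                  int n = 0;
                  for (String line : csv.split("\n")) {
                      String url = line.split(",")[0].trim(); // assumption: feed URL in the first column
                      if (!url.startsWith("http")) continue;  // skips headers and junk rows
                      try {
                          var resp = client.send(
                                  HttpRequest.newBuilder(URI.create(url))
                                             .timeout(Duration.ofSeconds(30)).build(),
                                  HttpResponse.BodyHandlers.ofFile(Path.of("feeds", n++ + ".xml")));
                          System.out.println(resp.statusCode() + " " + url);
                      } catch (Exception e) {
                          System.out.println("failed: " + url); // dead feeds are expected in a noisy list
                      }
                  }
              }
          }

        Sequential fetching is slow but fine for a one-off sample; parallelize if you're in a hurry.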

        • arromatic 2 days ago

          Apologies for so many questions, but resources on search engines are scarce. How do I visualize or process the link graphs? Is there any tool, preferably FOSS? Majestic seems to have one, but it's their own.

          • marginalia_nu a day ago

            I don't know if there are any really good answers. It's hard to visualize a graph of this size, but most graph libraries will at least consume it, assuming you have a decent amount of RAM.
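
            For a cheap first look you don't even need a graph library; a minimal sketch, assuming the export is a simple source/destination edge list (check the actual format):

              import java.io.*;
              import java.nio.file.*;
              import java.util.*;

              // Minimal sketch: load an edge list into an adjacency map and print
              // the highest out-degree nodes. Assumes "source destination" per line;
              // verify the export's actual delimiter and columns first.
              class LinkGraph {
                  public static void main(String[] args) throws IOException {
                      Map<String, List<String>> adj = new HashMap<>();
                      try (BufferedReader r = Files.newBufferedReader(Path.of(args[0]))) {
                          String line;
                          while ((line = r.readLine()) != null) {
                              String[] e = line.split("[,\\s]+");
                              if (e.length >= 2)
                                  adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
                          }
                      }
                      adj.entrySet().stream()
                         .sorted((a, b) -> b.getValue().size() - a.getValue().size())
                         .limit(10)
                         .forEach(en -> System.out.println(
                                 en.getKey() + " -> " + en.getValue().size() + " outlinks"));
                  }
              }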

        • arromatic 2 days ago

          On 2, I meant malware that could affect your crawling server, not spam. And thanks for the data.

          • marginalia_nu 2 days ago

            Malware authors typically focus on more common targets, like web browsers. I'm quite possibly the only person doing crawling with the stack I'm on, which means it's not a very appealing target. It also helps that the crawler is written in Java, which is a relatively robust language.

efilife a day ago

I wrote my own search engine some time ago and was impressed by how well it worked on my relatively small index. And then I see this. Marginalia's dev is just unmatched in the persistence and knowledge needed to pull all of this off. I wouldn't even know where to start on some of the things he did with his search engine.

  • marginalia_nu a day ago

    To be honest it's mostly persistence. I didn't know most of this stuff when I started out, at least not as well as I do now. Having gotten the opportunity to work full time on this for a year now has also helped.