deduplication


deduplication (de-duplication, AKA deduping/de-duping) is the process of comparing responses (and sometimes posts) to see whether they are exactly or essentially the same, then keeping only the earliest or most canonical version, optionally keeping track of alternative URLs such as syndicated copies.

How to deduplicate responses

Replies and other responses are often duplicated in different places, e.g. via backfeed of POSSEd replies by Bridgy. Ideally, recipients should try to de-dupe webmention sources, preferring an original post (see below). Getting this perfect is hard, but getting close is pretty easy (see one IRC discussion and another) by doing both of the following:

  1. Preferring original replies
  2. Comparing an incoming reply (etc) to existing replies based on:
    • u-uid
    • u-url
    • u-syndication (also compare to u-url, and vice versa)
    • other u-in-reply-to links in the incoming reply
    • rel=alternate / rel=canonical
    • full text, after stripping HTML tags and probably ignoring whitespace differences
    • text prefix, after also stripping leading @username, RT/MT, trailing ..., etc.
    • edit distance, longest common subsequence, or other fuzzy match
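The comparison steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the flattened reply dict with `uid`, `url`, `syndication`, and `content` keys, the function names, the 0.9 similarity threshold, and the 20-character minimum for prefix matches are all assumptions made for the example. It covers the URL, text, prefix, and fuzzy checks; it does not look at u-in-reply-to or rel=alternate / rel=canonical.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(html):
    """Strip HTML tags and collapse whitespace runs to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def text_prefix_key(html):
    """Plain text without leading RT/MT and @username, or a trailing ellipsis."""
    text = plain_text(html)
    text = re.sub(r"^(?:(?:RT|MT)\b:?\s*|@\w+:?\s*)+", "", text)
    text = re.sub(r"(?:\.{3}|…)\s*$", "", text)
    return text.strip().lower()


def link_set(reply):
    """URLs that identify a reply: u-uid, u-url, and u-syndication together."""
    urls = set()
    for key in ("uid", "url", "syndication"):
        value = reply.get(key) or []
        urls.update([value] if isinstance(value, str) else value)
    return {u.rstrip("/") for u in urls}


def is_duplicate(incoming, existing, threshold=0.9, min_prefix=20):
    """Hypothetical check: shared identifying URL first, then text comparison."""
    if link_set(incoming) & link_set(existing):
        return True
    a = text_prefix_key(incoming.get("content", ""))
    b = text_prefix_key(existing.get("content", ""))
    if not a or not b:
        return False
    if a == b:
        return True
    # Prefix match handles silo copies truncated with "...".
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) >= min_prefix and longer.startswith(shorter):
        return True
    # Fuzzy fallback; the threshold is a guess and would need tuning.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Comparing u-url and u-syndication in one combined set is what makes an original reply match its backfed silo copy even when their text differs.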

Response challenges

Examples / challenges for de-duping (use these as source material to check any de-duping approaches / algorithms):

  • comments on https://waterpigs.co.uk/notes/4Y38Ts/
  • security / identification / preventing hijacking. An attacker could overwrite or delete an existing webmention by sending a new one from their own site with the same u-url. To prevent this, receivers can compare the source domain as well as u-uid, u-url, etc., and only treat two webmentions as duplicates if both match.
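The hijacking check described above might look something like the sketch below. The function names and argument layout are hypothetical; the idea shown is only that a matching u-url is not enough on its own, and that the webmention source must come from the same domain as the stored one before the two are treated as the same response.

```python
from urllib.parse import urlsplit


def host(url):
    """Lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def may_replace(existing_source, incoming_source, existing_url, incoming_url):
    """True only if both the claimed u-url and the source domain match."""
    same_claimed_url = existing_url.rstrip("/") == incoming_url.rstrip("/")
    existing_host = host(existing_source)
    same_domain = existing_host != "" and existing_host == host(incoming_source)
    return same_claimed_url and same_domain
```

With this rule, a page on another domain that merely claims the same u-url is stored as a separate response instead of overwriting the existing one.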

IndieWeb Examples

Kyle Mahan

Kyle Mahan has de-duplicated comments on his site since at least 2015-06.

Aaron Parecki

Aaron Parecki has de-duplicated comments on his site since 2017-09-01, with a partially working implementation since ~2016.

Silo Examples

Twitter

  • Twitter: ~24hr(?) dedupe. In the web compose UI, if you enter the same text as one of your tweets from the past 24 hours and attempt to "Tweet", Twitter won't post it and instead shows the error message "You have already sent this Tweet." The 24-hour window is an educated guess (tested at intervals of minutes and of years).

See Also

  • https://en.wikipedia.org/wiki/Data_deduplication
  • NBC apparently syndicates the same article to multiple domains, and thus if that article links to you, you will receive dozens of webmentions from essentially the same source content, across many domains. https://twitter.com/Mappletons/status/1635555293563060224
    • "Good lord.

      I did one quick chat for an NBC news piece and they linked to my website.

      Now my WebMentions are a dumpster fire - filled with all their crappy, low-quality syndicated copies of the article.

      Counted 38 bunk domains all pointing back to the canonical NBC domain" @Mappletons March 14, 2023
  • Brainstorming: de-dupe copies of/in content as well as copies of posts. E.g. social readers can/should de-dupe posts that copy-paste one line of text repeatedly (a posting trend on Threads, e.g. https://www.threads.net/@h2nate/post/C1qQJcVs2BF/), which is a deliberate attempt to manipulate the reader (repetition causes more belief/acceptance).
  • Example of the above:

    Stop supporting CVS.
    Stop supporting CVS.
    Stop supporting CVS.
    [...]
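
For repeated-line content like the example above, a social reader could collapse runs of identical consecutive lines before display. The following is a small sketch of that idea only; the function name and the "(xN)" count suffix are made up for illustration, and lines are compared after ignoring case and whitespace differences.

```python
from itertools import groupby


def collapse_repeated_lines(text):
    """Collapse runs of identical consecutive lines into one line plus a count."""
    out = []
    normalized = lambda line: " ".join(line.split()).lower()
    for key, run in groupby(text.splitlines(), key=normalized):
        lines = list(run)
        if key and len(lines) > 1:
            # Keep the first copy and note how many times it appeared.
            out.append(f"{lines[0].strip()} (x{len(lines)})")
        else:
            # Single lines and blank lines pass through unchanged.
            out.extend(lines)
    return "\n".join(out)
```

Showing a count instead of silently dropping the repeats keeps the reader informed that the post was repetitive without giving the repetition its intended effect.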