original-post-discovery


Original post discovery is an algorithm that starts with a POSSE copy of a post and finds the original post.

Use Cases

Reply to original

First, as part of How to make a comment, it would be more indieweb-friendly if post authoring implementations:

  • automatically detected when a user is trying to reply to a POSSE'd copy (e.g. a tweet),
  • auto-discovered the original post, and
  • linked to the original post instead

In question form:

  • How do I find an original post of a POSSE'd copy that I'm replying to?

Thread original posts and POSSE copies

Second, when POSSEing reply posts, it's useful to automatically:

  • mark up your reply post with in-reply-to markup pointing to the original post
  • when POSSEing your reply post to Twitter, set the in-reply-to-status-id to the status-id of the POSSE'd tweet copy of the original post.
  • for more details see: How to POSSE a comment

Algorithm

How to discover an original post URL from a copy of that post at a POSSE permalink

  1. retrieve the POSSE permalink
  2. in the h-entry that represents the POSSE copy, look for a link with "u-url" and "u-uid" - use that href as the original post URL.
  3. otherwise look for a rel=canonical link in the POSSE'd copy that links back to an original - use that as the original post URL.
  4. otherwise look for a link with link text of "See Original" in the POSSE'd copy page that links back to an original - use that as the original post URL.
  5. otherwise if a parenthetical permashortlink citation is the last thing in the POSSE'd copy content, convert that to a URL, use that as the candidate URL
  6. otherwise if a URL is the last thing in the POSSE'd copy content, use that as the candidate URL
  7. retrieve the candidate URL and parse it for hyperlinks
  8. iterate across hyperlinks with rel=syndication or u-syndication URLs (syndication URLs)
    1. if a syndication URL matches the POSSE permalink, then the candidate URL is the original post URL.
    2. else if a syndication URL has the same domain as the POSSE permalink
      1. retrieve the syndication URL
      2. if its redirect destination matches the POSSE permalink, then the candidate URL is the original post URL. (implementations may check such URLs' redirect destinations one at a time and should stop when they find a match in order to minimize HTTP requests)
    3. end if
  9. end iteration
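
Steps 8 and 9 above can be sketched in Python roughly as follows. This is a minimal sketch, not a reference implementation: verify_candidate and resolve_redirect are hypothetical names, the syndication URLs are assumed to have already been parsed out of the candidate page (step 7), and "matches" is assumed to mean exact string equality. It checks all exact matches first and only then follows same-domain redirects, which should reach the same result as the per-URL iteration while deferring HTTP requests.

```python
from urllib.parse import urlsplit


def same_domain(url_a, url_b):
    """True if both URLs have the same host name."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


def verify_candidate(posse_permalink, syndication_urls, resolve_redirect):
    """Return True if the syndication links confirm the candidate as the original.

    syndication_urls: rel=syndication / u-syndication hrefs from the candidate page.
    resolve_redirect: hypothetical callable that retrieves a URL and returns
    its final redirect destination.
    """
    # Step 8.1: an exact match needs no extra HTTP request.
    if posse_permalink in syndication_urls:
        return True
    # Step 8.2: a same-domain link may be an older form that now redirects.
    for url in syndication_urls:
        if same_domain(url, posse_permalink):
            if resolve_redirect(url) == posse_permalink:
                return True  # stop at the first match to minimize requests
    return False
```

In practice resolve_redirect would wrap an HTTP client; passing it in as a parameter keeps the matching logic testable without network access.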

A parenthetical permashortlink citation looks something like:

  • (ttk.me t4Pc2)

The specific format of a parenthetical permashortlink citation is:

  1. literal '('
  2. domain name, likely a short domain name (to avoid having Twitter auto-link it, as Twitter auto-links .com, .net, and .org TLDs)
  3. literal space ' '
  4. id consisting of a-zA-Z0-9
  5. literal ')'

Convert a parenthetical permashortlink citation to URL by:

  • start with string "http://"
  • append the domain name from (2) above to the string
  • append a literal slash '/' to the string
  • append the id from (4) above to the string
  • the resulting string is a permashorturl
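
The citation format and the conversion steps above can be sketched together in Python. This is an illustration only: permashortlink_to_url is a hypothetical name, and the set of characters accepted in the domain name (letters, digits, dots, hyphens) is an assumption, since the format above only says "domain name".

```python
import re

# '(' domain ' ' id ')' as the last thing in the text.
# The domain character class is an assumption; the id is a-zA-Z0-9 as specified.
CITATION_RE = re.compile(r"\(([A-Za-z0-9.-]+) ([A-Za-z0-9]+)\)\s*$")


def permashortlink_to_url(text):
    """Return the permashorturl for a trailing citation, or None if there is none."""
    match = CITATION_RE.search(text)
    if match is None:
        return None
    domain, short_id = match.groups()
    # Start with "http://", append the domain, a slash, then the id.
    return "http://" + domain + "/" + short_id
```

For example, permashortlink_to_url("some note text (ttk.me t4Pc2)") returns "http://ttk.me/t4Pc2".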

Implementations

Algorithm Notes

Use-cases that were used to add steps to the algorithm

  • "syndication URL has the same domain as the POSSE permalink" and substeps. From the time of posting of the original post (and its POSSE permalink) to when this algorithm is run on the POSSE permalink, it's possible that the POSSE destination has changed its permalinks in some way. The following two have been seen in practice and thus are handled by this step in the algorithm:
    1. http/https differences. E.g. Twitter permalinks used to be "http:" but are now (as of 2012+?) canonically "https:". Any implementation that saved POSSE tweet permalinks before that change would likely publish/link to "http:" URLs which require a retrieval of their redirect destination for comparison.
    2. change of path. Silos have in the past changed implementation specifics about how their permalinks work, leaving redirects behind for the original paths. Silos may also allow users to alter part of the permalink of a post, e.g. editing the slug, after publishing, and still support the old URL either by tracking all past permalinks for a post, or perhaps by only requiring non-post-slug portions of the permalink for unique retrieval.

POSSE Post Discovery

Main article: posse-post-discovery

Some prefer not to include permalinks/citations in POSSEd copies for aesthetic (Twitter's length limit) or technical (Instagram's lack of posting API) reasons. Is it possible to do original-post-discovery on a syndicated copy that contains no permalink or citation? See posse-post-discovery. (Spoiler: yes, with syndication links, described above.)


Brainstorming

Extended backlink interpretation

TL;DR: new heuristic for determining when a link in a silo post implies it's a POSSE: the silo post's text is a "close enough" duplicate of the linked page's name, summary, or content.

Problem

Implementing Backfeed has raised some additional subtleties that aren't handled by this algorithm. In particular, Bridgy generally only backfeeds responses to a POSSE post, not the POSSE post itself, which is effectively a duplicate of the original post.

However, many posts with backlinks aren't POSSE posts, and are worth backfeeding along with their responses. For example, this tweet from Kevin Marks:

@debcha @quinnnorton there, I fixed it: http://svgur.com/s/1c - need to use area for that ratio.

is a mention, not a POSSE, but it's to one of his domains, and it was posted right after the original svgur.com post, so it's hard to determine that it's a non-POSSE.

The backlink isn't at the end of the tweet, which we do currently use as a heuristic. That's not always dependable, though. For example, I could have easily written this non-POSSE tweet with the snarfed.org backlink at the end, e.g.:

@caseorganic @aaronpk thanks again! great to meet everyone, had a blast. btw, podcasts: http://99percentinvisible.org/faq/ and http://snarfed.org/podcasts

The heuristics of at-the-end, on-one-of-user's-domains, and within-24-hrs are all useful, but may not be enough for some use cases, e.g. Bridgy's high volume backfeed across many users with many different POSSE patterns.

Proposed solution

When considering a backlink in a silo post, use most or all of these heuristics to determine whether it's a POSSE:

  • The backlink must be at or near the end. (Allow e.g. a close paren after the link.)
  • The backlink must point to one of the user's domains, as determined by rel-me and links in their silo profile.
  • The silo post must be published within 24h of the original post.
  • New: compare the silo post's text and the original post's name, summary, and/or content, taking prefixes if they're meaningfully longer. (If the silo post has an ellipsis at or near the end, that's a strong hint to use a prefix.) The edit distance should be below a certain threshold, disregarding common differences like @-usernames in silo posts vs human names in original posts (e.g. this OP vs this POSSE).
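
The new text-comparison heuristic could be sketched in Python along these lines. Everything here is an assumption made for illustration: looks_like_posse and normalize are hypothetical names, the 0.8 threshold is arbitrary, and difflib's similarity ratio stands in for a true edit distance. The sketch does not handle @-usernames vs human names.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Drop URLs, lowercase, and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    return " ".join(text.lower().split())


def looks_like_posse(silo_text, original_text, threshold=0.8):
    """Guess whether silo_text is a (possibly truncated) copy of original_text."""
    # Strip a trailing ellipsis or period left behind by truncation.
    silo = normalize(silo_text).rstrip(". …")
    original = normalize(original_text)
    # Compare against a prefix when the original is longer than the silo copy.
    if len(original) > len(silo):
        original = original[:len(silo)]
    # Similarity ratio in [0, 1]; a stand-in for "edit distance below a threshold".
    return SequenceMatcher(None, silo, original).ratio() >= threshold
```

A real implementation would likely combine this with the other heuristics in the list rather than rely on it alone.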

Thanks to Kevin Marks for helping work this out in IRC!

POSSE copy domain approximation

If by following the discovery algorithm you're unable to verify that a candidate URL is the original post for the current (apparent) POSSE permalink, perhaps simply check the following:

  • Does the profile of the POSSE permalink (e.g. Twitter profile) have a rel=me link (e.g. in "Website:" field) to the domain of the candidate URL?
  • If so, treat the candidate URL as the original post URL.

Advantages: This may help discover more original post URLs.

Disadvantages: This may produce false positives, e.g. multiple tweets from someone about the same original blog post would all be treated as POSSE copies of that original blog post.
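
As a rough illustration, the rel=me domain check could look like this in Python. The function name domain_matches_profile is hypothetical, and the rel=me URLs are assumed to have already been extracted from the silo profile.

```python
from urllib.parse import urlsplit


def domain_matches_profile(candidate_url, profile_rel_me_urls):
    """True if any rel=me link on the silo profile shares the candidate URL's host."""
    candidate_host = urlsplit(candidate_url).hostname
    if candidate_host is None:
        return False  # not an absolute URL, so nothing to compare
    return any(urlsplit(url).hostname == candidate_host
               for url in profile_rel_me_urls)
```

This compares host names exactly, so a profile linking to a "www." variant of the domain would not match; whether to treat such variants as equal is left open here.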

24hr mitigation

To reduce the number of false positives:

  • If the (apparent) POSSE permalink was published more than 24 hours after the candidate URL (per h-entry dt-published), then DO NOT treat the candidate URL as the original post URL.

This way only apparent POSSE permalinks created less than 24 hours after an original can count as actual POSSE copies. You have 24hrs to publish a POSSE copy or else it's considered a mention/comment, not a POSSE copy.
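
The 24-hour window could be checked with a few lines of Python. This is a sketch under assumptions: within_posse_window is a hypothetical name, both timestamps are assumed to be timezone-aware datetimes already parsed from dt-published (or the silo's equivalent), and rejecting a copy that predates the original is an added assumption not stated above.

```python
from datetime import datetime, timedelta, timezone

MAX_POSSE_DELAY = timedelta(hours=24)


def within_posse_window(original_published, posse_published):
    """True if the copy was published less than 24 hours after the original."""
    delay = posse_published - original_published
    # Assumption: a copy published before the original is not a POSSE copy.
    return timedelta(0) <= delay < MAX_POSSE_DELAY
```

For example, a copy posted two hours after the original passes, while one posted 30 hours later is treated as a mention/comment instead.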

See Also