Original post discovery is an algorithm that starts with a POSSE copy of a post and finds the original post.


Use Cases

Reply to original

First, as part of How to make a comment, it would be more indieweb-friendly if post authoring implementations:

  • automatically detected when a user is trying to reply to a POSSE'd copy (e.g. a tweet),
  • auto-discovered the original post, and
  • linked to the original post instead

In question form:

  • How do I find an original post of a POSSE'd copy that I'm replying to?

Thread original posts and POSSE copies

Second, when POSSEing reply posts, it's useful to automatically:

  • mark up your reply post with in-reply-to markup pointing to the original post
  • when POSSEing your reply post to Twitter, set the in-reply-to-status-id to the status-id of the POSSE'd tweet copy of the original post.
  • for more details see: How to POSSE a comment


How to discover an original post URL from a copy of that post at a POSSE permalink

  1. retrieve the POSSE permalink
  2. in the h-entry that represents the POSSE copy, look for a link with "u-url" and "u-uid" - use that href as the original post URL.
  3. otherwise look for a rel=canonical link in the POSSE'd copy that links back to an original - use that as the original post URL.
  4. otherwise look for a link with link text of "See Original" in the POSSE'd copy page that links back to an original - use that as the original post URL.
  5. otherwise if a parenthetical permashortlink citation is the last thing in the POSSE'd copy content, convert that to a URL and use that as the candidate URL
  6. otherwise if a URL is the last thing in the POSSE'd copy content, use that as the candidate URL
  7. retrieve the candidate URL and parse it for hyperlinks
  8. iterate across hyperlinks with rel=syndication or u-syndication URLs (syndication URLs)
    1. if a syndication URL matches the POSSE permalink, then the candidate URL is the original post URL.
    2. else if a syndication URL has the same domain as the POSSE permalink
      1. retrieve the syndication URL
      2. if its redirect destination matches the POSSE permalink, then the candidate URL is the original post URL. (Implementations may check such URLs' redirect destinations one at a time and should stop when they find a match, in order to minimize HTTP requests.)
    3. end if
  9. end iteration
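
Steps 7 through 9 above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name is_original and the two callables get_syndication_urls and resolve_redirect are hypothetical helpers (assumed to fetch and parse the candidate page, and to return the final URL after following redirects, respectively). They are passed in so the matching logic can be read and tested without network access.

```python
from urllib.parse import urlparse


def is_original(candidate_url, posse_permalink,
                get_syndication_urls, resolve_redirect):
    """Return True if the candidate page claims the POSSE permalink as a copy.

    get_syndication_urls and resolve_redirect are hypothetical callables
    supplied by the caller; they are assumptions, not part of the algorithm.
    """
    posse_domain = urlparse(posse_permalink).hostname
    # Steps 7-8: retrieve the candidate and iterate over its syndication URLs.
    for syndication_url in get_syndication_urls(candidate_url):
        # Step 8.1: an exact match confirms the candidate.
        if syndication_url == posse_permalink:
            return True
        # Step 8.2: only same-domain URLs are worth an extra HTTP request.
        if urlparse(syndication_url).hostname == posse_domain:
            # Steps 8.2.1-8.2.2: compare the redirect destination and
            # stop at the first match to minimize requests.
            if resolve_redirect(syndication_url) == posse_permalink:
                return True
    # Step 9: no syndication URL matched, so the candidate is unverified.
    return False
```

A caller would supply real implementations of the two helpers, for example one built on a microformats parser and one that issues a request and reads the final URL.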

A parenthetical permashortlink citation looks something like:

  • ( t4Pc2)

The specific format of a parenthetical permashortlink citation is:

  1. literal '('
  2. domain name, likely a short domain name (to avoid having Twitter auto-link it, as Twitter auto-links .com, .net, and .org TLDs)
  3. literal space ' '
  4. id consisting of a-zA-Z0-9
  5. literal ')'

Convert a parenthetical permashortlink citation to URL by:

  • start with string "http://"
  • append the domain name from (2) above to the string
  • append a literal slash '/' to the string
  • append the id from (4) above to the string
  • the resulting string is a permashorturl
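
As a rough sketch, the format and conversion above could be expressed with a regular expression. The function name citation_to_url is hypothetical, and the character class used for the domain name is an assumption, since the format above only says "domain name".

```python
import re

# "(domain id)" as the last thing in the content.
# The domain character class is an assumption; the id is a-zA-Z0-9 per (4).
CITATION_RE = re.compile(r"\(([A-Za-z0-9.-]+) ([A-Za-z0-9]+)\)\s*$")


def citation_to_url(content):
    """Return the permashorturl for a trailing citation, or None if absent."""
    match = CITATION_RE.search(content)
    if match is None:
        return None
    domain, short_id = match.groups()
    # Start with "http://", append the domain, a slash, then the id.
    return "http://" + domain + "/" + short_id
```

For content ending in a citation with the hypothetical domain example.com and id abc123, this yields "http://example.com/abc123". A citation that is not the last thing in the content is ignored.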


Algorithm Notes

Use-cases that were used to add steps to the algorithm

  • "syndication URL has the same domain as the POSSE permalink" and substeps. Between the time the original post (and its POSSE permalink) is posted and the time this algorithm is run on the POSSE permalink, the POSSE destination may have changed its permalinks in some way. The following two changes have been seen in practice and are thus handled by this step in the algorithm:
    1. http/https differences. E.g. Twitter permalinks used to be "http:" but are now (as of 2012+?) canonically "https:". Any implementation that saved POSSE tweet permalinks before that change would likely publish/link to "http:" URLs which require a retrieval of their redirect destination for comparison.
    2. change of path. Silos have in the past changed implementation specifics about how their permalinks work, leaving redirects behind for the original paths. Silos may also allow users to alter part of the permalink of a post, e.g. editing the slug, after publishing, and still support the old URL either by tracking all past permalinks for a post, or perhaps by only requiring non-post-slug portions of the permalink for unique retrieval.


POSSE Post Discovery

Main article: posse-post-discovery

Some prefer not to include permalinks/citations in POSSEd copies for aesthetic (Twitter's length limit) or technical (Instagram's lack of posting API) reasons. Is it possible to do original-post-discovery on a syndicated copy that contains no permalink or citation? See posse-post-discovery.

POSSE copy domain approximation

If by following the discovery algorithm you're unable to verify that a candidate URL is the original post for the current (apparent) POSSE permalink, perhaps simply check the following:

  • Does the profile of the POSSE permalink (e.g. Twitter profile) have a rel=me link (e.g. in "Website:" field) to the domain of the candidate URL?
  • If so, treat the candidate URL as the original post URL.
This may help discover more original post URLs.
It may also produce false positives, e.g. multiple tweets from someone about the same original blog post would all be treated as POSSE copies of that blog post.
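
The check above might look something like this, assuming the rel=me links have already been extracted from the profile page. The function name profile_links_to_candidate and its list-of-URLs input are hypothetical, and "domain" is interpreted here as an exact hostname match, which is an assumption.

```python
from urllib.parse import urlparse


def profile_links_to_candidate(rel_me_urls, candidate_url):
    """True if any rel=me URL on the profile shares the candidate's hostname."""
    candidate_domain = urlparse(candidate_url).hostname
    if candidate_domain is None:
        return False
    # urlparse lowercases hostnames, so the comparison is case-insensitive.
    return any(urlparse(url).hostname == candidate_domain
               for url in rel_me_urls)
```

An exact hostname comparison keeps the sketch simple; an implementation might also want to treat "www." variants as equivalent.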

24hr mitigation

To reduce the number of false positives:

  • If the (apparent) POSSE permalink is more than 24 hours more recent (per h-entry dt-published) than the candidate URL, then DO NOT treat the candidate URL as the original post URL.

This way only apparent POSSE permalinks created less than 24 hours after an original can count as actual POSSE copies. You have 24hrs to publish a POSSE copy or else it's considered a mention/comment, not a POSSE copy.
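
A minimal sketch of this window check, following the summary sentence above (a copy counts only if published less than 24 hours after the original). The function and constant names are hypothetical, and both timestamps are assumed to be timezone-aware datetimes already parsed from dt-published. How to treat a copy dated before the original is not specified above; this sketch enforces only the upper bound.

```python
from datetime import datetime, timedelta, timezone

# Upper bound on the delay between an original and its POSSE copy.
MAX_POSSE_DELAY = timedelta(hours=24)


def within_posse_window(original_published, posse_published):
    """True if the copy was published less than 24 hours after the original."""
    return posse_published - original_published < MAX_POSSE_DELAY


# Example with made-up timestamps:
original = datetime(2015, 7, 1, 12, 0, tzinfo=timezone.utc)
same_day_copy = original + timedelta(hours=2)
later_mention = original + timedelta(days=4)
```

With these values, same_day_copy falls inside the window and later_mention falls outside it, so the later tweet would be treated as a mention rather than a POSSE copy.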

See Also