Allow Google to locked down categories

I had a great conversation with @fzngagan in PM’s on Meta today and he suggested to move the conversation to here, suggesting that @angus would have some good ideas about this.

My initial question was if it would be possible to allow crawlers access to locked down categories when the category-lockdown plugin is being used. This turned out to be pretty hard since there are a number of architectural cough “issues” in Discourse.

Summary:

@richard

The crawl branch does not return any content. It successfully circumvents the 402 from happening but somehow the content is not returned to the client. I will play around with this a bit.

Thanks for making the branch. I did try something similar but the use_crawler_layout was not accessible from the Guardian and I didn’t get the smart idea of trapping this in the ApplicationController. Although I am afraid that although rescue_discourse_actions is not called, the code after the exception is not executed.

@fzngagan

Yeah, its tricky for sure. You need to somehow pass a boolean value to the TopicView object(likely in opts param) while creating it so you can decide from check_and_raise_exceptions whether to raise an exception or not.

@richard

So the issue is that TopicView creates its own Guardian (line 57) instead of reusing the guardian that was created by ApplicationController. The Guardian created in TopicView does not have access to the request object.

Passing stuff in opts was what I thought about as well but it would require completely overriding the show method of the TopicViewController, and that would be a hell to maintain.

The only thing I can think of is moving the logic out of TopicView and passing it into TopicController. Overriding the show method is a bad option but at the end of the show method, the perform_show_response is called, and it is only called from that single place (I guess they have been splitting up a large method to keep their code quality indicators green).

Overriding perform_show_response would be a good place I think, because it would be

def perform_show_response
   # logic here, with access to use_crawler_layout
   super
end

However, it would only cover the show topic method and all other places where content could be seen is not covered by this. So maybe we would be throwing the baby out with the bath water…

When I started thinking about this 2 days ago, I thought this would be easy :thinking:

@fzngagan

Yeah, I reached exactly the same conclusions you reached.

Yeah, TopicView has been used in many places which aren’t covered if we use this solution.

I was thinking about whether we can set some global state before the TopicView object created and access that global state. Its still tricky because we need to make sure what we’re reading is correct and we need to destroy it correctly.

Ok, you can do this. Its possible to render in the rescue block.

@richard

Maybe I am misunderstanding, but check_and_raise_exceptions is called at the (almost) very top of the TopicView constructor. So when the rescue code runs, a large part of the code has not been executed?

I like your idea of setting some kind of “global” flag. This way we could skip raising the exception in the TopicController.show flow and in all other cases the code would behave like it does now.

I think it’s really dirty but it looks like this works:

  • override TopicController.show to have
def show
 $crawler = use_crawler_layout?
 super
end

add one simple line to the is_locked function

  module ::CategoryLockdown
    def self.is_locked(guardian, topic)
      return false if $crawler 
      ...
      (the rest of the code)
      ...
    end

    class NoAccessLocked < StandardError; end
  end

What would be really cool is if we could play nice with Google and indicate that the content is paywalled by including some schema.org JSON-LD metadata. I hope that does not require a second global :wink:

1 Like

After a night of sleep I have an additional or maybe complimentary or alternative idea.

Currently a crawler has current_user nil just like any anonymous user.
If we had a (separate) plugin that would treat a crawler like a predefined user (“Crawler”) then we could give that user the correct groups, and crawlers would even be able to index categories that are really closed, and be able to index login_required forums.

How hard would that be?

1 Like

@richard
On the idea of using render in the rescue block, If we patch the initialize method of the TopicView class and raise the exception at the end, all of.the processing has been already done at this point so we should be able to render the correct stuff?

The problem is that as long as we are in TopicView, we have no clue of whether we are dealing with a crawler or not. We only know in TopicController or ApplicationController.

If we’re going to be using a global variable to “cheat” anyway, then I think it’s better to avoid the exception than to rescue from it.

1 Like

@richard
Ok, here’s what we can do actually.
Let’s say we add a site setting allow_crawler_bypass
in check_and_raise_exceptions we add another condition that if allow_crawler_bypass is enabled, then exception won’t be raised.

now in perform_show_response, we check whether its crawler, if not and the category is locked too, then we raise the same exception from there. This is the best bet we have IMO.

I think this should be a good starting point

We can do a similar thing for a handful other places where TopicView class is used on a case by case basis.

It sounds pretty complex, what would be a reason not to do it like here ?

But then using

return false if SiteSetting.category_lockdown_crawler_bypass && $crawler

in the is_locked function?

Yeah, this is almost the same thing. agreed. I was unsure about if whether the context is shared or not between the overriding and overridden methods but if it works, then its fine :slight_smile:

1 Like

I am thinking what happens to this global if we have concurrent sessions within one Unicorn process… maybe we should tie the global var to Thread. @angus do you know the best way to approach this?

I don’t think its a shared variable accross requests. It would be tied only to the request in question ?

I’m not sure if Unicorn is multi threaded and if globals are thread safe.

Revisiting this:

lib/guardian.rb has an AnonymousUser, it would be cool if the guardian would have a CrawlerUser that it uses instead whenever a crawler is detected.

That way we would not have to mess with an actual session.

@richard
Good news for you here.

Another POC, but this time its working. GitHub - paviliondev/discourse-category-lockdown at bypass-crawler

I’ve used the request_store gem to make the is_crawler global but only in the request context which is exactly what we need.

@merefield @angus
This gem will give you an ember like feel i.e. everything being accessible from everywhere :wink: but its helpful when discourse isn’t passing stuff down and a PR isn’t possible :wink:

1 Like

That is so cool, I’m going to experiment with it later today!

1 Like

I have (finally) been playing around with this branch and it works very well. I will also be making a pull request so it will include the correct Schema data as described here.

Along the way, I came across two issues:

1 Like

@fzngagan do you have a rough indication when you would be able to take a look at the bug mentioned above?

(not meant to push but just to be able to plan my work better :slight_smile: )

@fzngagan I have built upon your code to add crawler access functionality to the category lockdown plugin.
I have also added the options to inject schema.org paywall metadata and to inject a noarchive meta header.

Pull request is here Bypass crawler by discoursehosting · Pull Request #9 · paviliondev/discourse-category-lockdown · GitHub

Can you please take a look and merge it when you have time?

1 Like

@richard
The plugin itself doesn’t work currently. It would need some work to be back on track which I plan to do within a couple of days. Its not your code at all, the actual logic the plugin was based on has changed in core.

Currently, the user is redirected to a non existant url.

Yes, that’s what I found too

1 Like

Oopsie, I didn’t notice that.
I think we’ll move to our custom and concise logic which would prevent such breakage. It was fragile anyway.

Yeah, it looks like the error message was misused for the URL.
The question was not if it would break, but when :slight_smile:

1 Like

I hope I’ve grown as a dev since then. :wink: