seanplynch.com

My New Twitter Bot

Something a Little Bit Different

Now that I have updated my blog and gotten the Obligatory Blogging Like a Hacker Post out of the way, I would like to introduce you to @bucketobytes, my new Twitter bot. Before we get into the details about him, I'd like to talk about his inspiration. As I was sorting through my RSS feeds over the holidays, I came across Random Shopper, a bot Darius Kazemi wrote that randomly buys him presents on Amazon. I thought:

Well that is just a really cool idea, I wish I had thought of it.

I then went on a hunt for other bots and came across @ticbot and a few others, and decided I wanted to try my hand at such a thing. So I set out on a quest to build a Twitter spam bot that (hopefully) wouldn't actually do any spamming.

What Should a Twitter Bot Do?

I have grand plans for @bucketobytes, but to start off I wanted to keep things simple to ensure that I would actually get him out there. I decided that the bare minimum was that he should follow people who follow him, reply to @mentions, and tweet randomly. As for content, I came across this GitHub repository that contains a vast number of fortunes. After paring the content down to the tweet-sized fortunes, there are about 2000 available. I figured that would be a good start for @bucketobytes, so I blatantly stole them and put them in my own project (thank you @brianclapper). In the future I am thinking about incorporating chatoms, and I'm sure there are countless other resources I can look into. I thought about more useful content, such as replying to a mention that includes a movie title with the times that movie is playing. Upon reconsideration, however, I think the guiding principle I will follow is that @bucketobytes should not really do anything all that useful.

As of this writing, the current implementation of @bucketobytes is as follows. If you follow him, he will follow you back. If you @mention him, he will @reply to you with a random fortune. If you @mention him with the hashtag #cc, he will retweet what you've sent him. And finally, he will spontaneously tweet fortunes at random, approximately 22 times per day. There are some other behaviors I am considering for the future. The first is a mechanism through which you can submit suggestions, perhaps with the hashtag #iwantbucketobytesto. Another is to have him lie: for example, he could subscribe to a Rotten Tomatoes RSS feed and say how much he loved such and such movie. I'm excited to implement new ideas as they occur to me. You can find an up-to-date list of @bucketobytes actions here: @bucketobytes Info Page.

I will now outline his construction, so if you aren't into reading a lot of code feel free to duck out now.

The Guts of a Twitter Bot

You can find the source code for @bucketobytes on GitHub here: seepel/bucketobytes.

Disclaimer: I would ask that you not do anything nefarious with my code, but I imagine there is other more advanced software for real spam bots. If you arrived at my site looking for a spam bot implementation, you will probably be better served to look elsewhere.

To Python or Not to Python

I decided to write my bot in Python, as I am very familiar with it from my Physics days and it would be easy to get things up and running. For the Twitter API I tried a few different GitHub projects and decided on Twython. Twython ended up being the most robust and gave me the most freedom while taking away as many headaches as possible. Other libraries were very good, but ended up being a bit restrictive. To get things up and running I did have to modify things quite a bit. At the time of writing, Twython was still primarily based on the v1.0 Twitter API, and there were some problems with the most recent version of the requests module, in particular in dealing with OAuth. So I updated all the Twitter endpoints to v1.1 and modified the streaming portion of Twython to handle the new OAuth user endpoint. You can find my changes here: seepel/Twython. I didn't end up pushing my changes back to the main repository, as they are a bit of a hack at the moment. I may come back and revisit this if someone else doesn't beat me to it.

@bucketobytes: A history

My first implementation was rather fragile and relied entirely upon the REST API. That meant everything had to be done in discrete chunks: the script would be run periodically, and the bot would have to catch up from its most recent state. Since it was just a prototype, I was saving the most recent tweet id in a text file and stopping processing whenever I hit that tweet. As one would guess, this was problematic. If there was ever a problem with that text file, @bucketobytes would get out of sync and potentially re-run any number of actions, which would be annoying for anyone who interacted with him. From here I decided to move to the streaming API.

The streaming version was a bit better. It consisted of two scripts: the first would listen to the user stream and respond accordingly, and the second, when run, would post a random fortune. So the respond script acted as a long-running process to respond to actions, and the post script was run periodically from a crontab. This worked a lot better, and was stable enough to let run for a while. But I had grander plans. What if I wanted to incorporate the random posting into the replying and have finer control over how many tweets @bucketobytes was making? This approach also led to rather lengthy scripts, so it would be nice to refactor things.

So I set up a modular design. To do this I had to learn about threading in Python. Luckily I found this article over at IBM that was dead simple to understand. The script is centered around two long-running threads. The first listens to the user streaming endpoint to trigger replies and follows. Setting up the stream didn't change much from the respond script of the previous iteration. What changed was how I funneled input into Twitter. When a JSON object comes down from the user stream, it is placed into a queue to be processed. To process the pending queue, I set up a Thread class, PostScheduler, that is responsible for coordinating posting and following. Here is how that class is set up.

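import threading
import Queue  # Python 2 module name; it is "queue" in Python 3

class PostScheduler(threading.Thread):
  def __init__(self, api, simulate=False, controllers=None, default_time_to_sleep=60):
    threading.Thread.__init__(self)
    self.api = api                    # Twython instance used for all posting
    self.simulate = simulate          # read later by evaluate_tweet()
    self.controllers = controllers or []
    self.queue = Queue.Queue()        # stream events and sleep intervals
    self.post_objects = []            # events waiting for the next heartbeat
    self.count = 0                    # heartbeats so far
    self.posts = 0                    # posts made so far
    self.default_time_to_sleep = default_time_to_sleep
    self.setDaemon(True)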

The class has an instance of a Twython object in the api variable, a list of controllers (we'll get to that in a bit), a Queue, and a list of post_objects. The heartbeat of the bot is determined by the run method.

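  def run(self):
    while True:
      queue_object = self.queue.get()

      # Never let the queue run dry: schedule the next sleep interval.
      if self.queue.empty():
        self.queue.put(self.default_time_to_sleep)

      if isinstance(queue_object, (int, long, float)):
        # A number means "wait this many seconds, then evaluate".
        time_to_sleep = queue_object
        if time_to_sleep > 0:
          time.sleep(time_to_sleep)
          self.evaluate_tweets()
      else:
        # Anything else is a dictionary from the stream; hold it for later.
        self.post_objects.append(queue_object)

      self.queue.task_done()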

What happens here is that the scheduler removes the first item in the queue. If that item is a number, it signals to the scheduler that it needs to wait before it handles more actions. If it is not a number, it is assumed to be a dictionary representing some Twitter output (post_object) that needs to be handled, for example a JSON object representing a mention from the streaming endpoint. That dictionary is then appended to post_objects for later processing by evaluate_tweets().
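The queue convention described above (numbers mean "sleep", dictionaries mean "work") can be tried in isolation. The following is only a minimal sketch written for Python 3, unlike the Python 2 code in this post; HeartbeatWorker, pending, and heartbeats are hypothetical names, and a counter stands in for the call to evaluate_tweets().

```python
import queue
import threading
import time


class HeartbeatWorker(threading.Thread):
    """Consumes a queue where numbers mean 'sleep' and dicts mean 'work'."""

    def __init__(self, default_sleep=0.01):
        super().__init__(daemon=True)
        self.queue = queue.Queue()
        self.pending = []        # dicts waiting for the next heartbeat
        self.heartbeats = 0      # how many times the sleep branch has fired
        self.default_sleep = default_sleep

    def run(self):
        while True:
            item = self.queue.get()
            # Keep the loop alive: an empty queue gets a new sleep marker.
            if self.queue.empty():
                self.queue.put(self.default_sleep)
            if isinstance(item, (int, float)):
                time.sleep(item)
                self.heartbeats += 1   # the bot would evaluate its tweets here
            else:
                self.pending.append(item)
            self.queue.task_done()


worker = HeartbeatWorker()
worker.queue.put({"text": "@bucketobytes hello"})   # stand-in for a stream event
worker.start()
time.sleep(0.2)                                     # let a few heartbeats pass
print(worker.pending)
```

Because the worker re-queues a sleep interval whenever the queue is empty, it never blocks forever on get(), and anything the stream thread adds is picked up on a later pass through the loop.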

  def evaluate_tweets(self):
    self.count += 1
    # Seconds elapsed since midnight, so controllers can depend on the time of day.
    seconds_from_midnight = (datetime.today() - datetime.min).seconds
    post_objects_to_remove = []

    # First pass: discard stream objects that no controller will ever handle.
    for post_object in self.post_objects:
      can_be_handled = False
      for controller in self.controllers:
        if controller.can_handle_object(post_object):
          can_be_handled = True
          break
      if not can_be_handled:
        post_objects_to_remove.append(post_object)

    for post_object in post_objects_to_remove:
      self.post_objects.remove(post_object)

    # Second pass: let each controller try the pending objects in turn.
    for controller in self.controllers:
      chosen_object = None
      for post_object in self.post_objects:
        if self.evaluate_tweet(controller, post_object, seconds_from_midnight):
          chosen_object = post_object
          break
      if chosen_object is not None:
        # At most one pending object is handled per heartbeat.
        self.post_objects.remove(chosen_object)
        break
      # An empty dictionary stands for a spontaneous tweet.
      self.evaluate_tweet(controller, { }, seconds_from_midnight)

The evaluate_tweets() method is where the controllers come in. The controllers allow the whole bot to be configured. The first thing evaluate_tweets() does is figure out how many seconds it has been since midnight, so that an action can be made dependent on the time of day. Next, for each post_object, it determines whether any of its controllers can handle the object; if not, the object is removed. A lot of what comes in from the user stream is neither a mention nor a follow, and this pass removes the junk that will never be responded to. The next chunk of code runs through all the controllers and gives each an opportunity to act on a post_object via evaluate_tweet(). If a controller handles the object, it is of course removed from the list of post_objects to be handled. The very last line evaluates a post_object represented by an empty dictionary. This is how spontaneous tweets are generated.
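The junk-removal pass can be illustrated on its own. This is a rough Python 3 sketch, not the bot's actual code: MentionController, SpontaneousController, and drop_unhandled are hypothetical names, and the sample dictionaries are simplified stand-ins for user-stream messages.

```python
class MentionController:
    """Hypothetical controller: handles dicts that carry a 'mention' key."""

    def can_handle_object(self, post_object):
        return "mention" in post_object


class SpontaneousController:
    """Hypothetical controller: handles only the empty dict."""

    def can_handle_object(self, post_object):
        return len(post_object) == 0


def drop_unhandled(post_objects, controllers):
    """Keep only the objects that at least one controller claims."""
    return [
        post_object
        for post_object in post_objects
        if any(c.can_handle_object(post_object) for c in controllers)
    ]


controllers = [MentionController(), SpontaneousController()]
incoming = [
    {"mention": "@bucketobytes hi"},   # a mention: kept
    {"friends": [1, 2, 3]},            # stream noise: dropped
    {"delete": {"status": {}}},        # stream noise: dropped
]
kept = drop_unhandled(incoming, controllers)
print(kept)
```

Building a new list sidesteps the need for a separate list of objects to remove, at the cost of rebinding the list rather than mutating it in place.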

  def evaluate_tweet(self, controller, post_object, seconds_from_midnight):
    probability = controller.probabilityToPost(post_object, seconds_from_midnight, self.default_time_to_sleep, self.simulate)
    if probability == 0:
      return False
    steps = 10000.0
    random_number = random.randrange(steps)/steps
    if random_number <= probability:
      self.posts += 1
      print controller
      print controller.composePost(self.api, post_object, self.simulate)
      controller.postUpdateStatus(self.api, post_object)
      return True
    return False

The evaluate_tweet() method asks a controller for the probability of responding to an object, based on the object, the time of day, and the size of the time step that governs the scheduler's heartbeat. The method then generates a random number and, if that number is no greater than the probability, prompts the controller to respond to the object. This lets a controller do something such as make tweets happen at certain times of day while still being somewhat random. It also lets a controller back off on certain actions. For example, say there is a controller that handles replies, and someone decides to spam the account with 1000 mentions. If my bot were to respond to all those mentions at once, it would probably hit Twitter's rate limit and potentially get blocked. The probability lets the bot cut certain users off from replies.
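The random gate at the heart of evaluate_tweet() amounts to a single Bernoulli trial. Here is a small Python 3 sketch of that idea; should_post is a hypothetical helper, and it uses random.random() with a strict comparison rather than the randrange-over-10000-steps draw shown above.

```python
import random


def should_post(probability, rng=random):
    """Return True with the given probability (one Bernoulli trial)."""
    if probability <= 0:
        return False
    # rng.random() is uniform on [0.0, 1.0), so this is True
    # in roughly `probability` of all calls.
    return rng.random() < probability


rng = random.Random(42)   # seeded so the run is repeatable
trials = 100000
hits = sum(should_post(0.25, rng) for _ in range(trials))
print(hits / float(trials))   # close to 0.25
```

A probability of 1 always posts and a probability of 0 never does, which is what lets a controller switch an action fully on or off.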

Now that we have seen how the bot handles the flow of tweets, let's talk controllers. Controllers are responsible for determining how and when the bot should tweet. For example, the PostController is responsible for spontaneously posting tweets, the ReplyController deals with mentions, the RetweetController handles instances where someone retweets one of @bucketobytes's tweets, and so on. Here is the PostController:

class PostController(object):
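  def __init__(self, post_composers=None, postControllers=None, current_user=None):
    # Use None as the default so instances never share one mutable list.
    self.post_composers = post_composers if post_composers is not None else []
    self.current_user = current_user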

  def can_handle_object(self, post_object):
    return len(post_object) == 0

  def probabilityToPost(self, post_object, seconds_from_midnight, time_step, simulate=False):
    if len(post_object) != 0:
      return 0
    if self.isCurrentUser(post_object):
      return 0
    # flat distribution 22 tweets per day
    one_day = 60.*60.*24./float(time_step)
    if simulate:
      one_day /= 60
    return 22./one_day

  def isCurrentUser(self, post_object):
    if self.current_user == None:
      print 'No current user skipping'
      return False
    # don't respond if the tweet belongs to the current user -- would be infinite loop!
    if post_object.has_key('user'):
      if post_object['user'].has_key('id_str'):
        return post_object['user']['id_str'] == self.current_user['id_str']
    return False

  def choosePostComposer(self):
    # A composer that claims 100% always wins.
    total_percent = 0
    for post_composer in self.post_composers:
      if post_composer.percent() == 100:
        return post_composer
      total_percent += post_composer.percent()
    # Weighted choice: draw a point in [0, total_percent) and walk the
    # cumulative percentages until the running total passes it.
    draw = random.randrange(total_percent)
    threshold = 0
    for post_composer in self.post_composers:
      threshold += post_composer.percent()
      if draw < threshold:
        return post_composer
    return post_composer

  def composePost(self, api, post_object, simulate):
    return self.choosePostComposer().compose(api, post_object, simulate)
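The weighted pick that choosePostComposer aims for can be sketched on its own. The following Python 3 example is an illustration under assumptions: choose_weighted is a hypothetical helper, and the 40/60 split between "fortune" and "chatom" is just an example weighting.

```python
import random


def choose_weighted(weighted_names, rng=random):
    """Pick a name from [(name, integer percent), ...] in proportion to its percent."""
    total = sum(percent for _, percent in weighted_names)
    draw = rng.randrange(total)        # integer in [0, total)
    running = 0
    for name, percent in weighted_names:
        running += percent
        if draw < running:
            return name
    raise ValueError("percentages must be positive integers")


rng = random.Random(7)   # seeded so the run is repeatable
composers = [("fortune", 40), ("chatom", 60)]
picks = [choose_weighted(composers, rng) for _ in range(10000)]
print(picks.count("fortune") / 10000.0)   # close to 0.40
```

The key detail is that the random draw is compared against the running total of the percentages, so each entry owns a slice of the range as wide as its weight.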

The post controller handles post objects that are empty dictionaries; one is evaluated at the end of each scheduler cycle. Thus far, it has a constant probability to post, such that the average should be about 22 tweets per day. In the future I will look into making it more likely to tweet at certain times of day.
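To see where the 22 tweets per day comes from, the arithmetic in probabilityToPost can be worked through with the scheduler's default 60-second time step. This is a small Python 3 sketch; probability_per_step is a hypothetical name for the same calculation.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def probability_per_step(tweets_per_day, time_step_seconds):
    """Chance of posting on one heartbeat so the daily average is tweets_per_day."""
    steps_per_day = SECONDS_PER_DAY / float(time_step_seconds)
    return tweets_per_day / steps_per_day


p = probability_per_step(22, 60)   # 60-second heartbeat
steps = SECONDS_PER_DAY // 60      # 1440 heartbeats per day
print(round(p, 5))                 # 0.01528
print(round(p * steps, 6))         # 22.0
```

Each heartbeat is an independent trial with a chance of about 1.5%, so the count on any given day will scatter around 22 rather than hit it exactly.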

Each controller has a list of PostComposers. This will later give me the ability to tweet different things, for example spontaneously tweeting a fortune 40% of the time and posting a chatom 60% of the time. The controller object can also decide whether it should respond to a post_object based on its content. For example, here is the ReplyController, which only handles objects that mention @bucketobytes.

class ReplyController(post.PostController):
  def __init__(self, post_composers=None, postControllers=None, current_user=None):
    # Initialise the base class on this instance rather than building a throwaway one.
    post.PostController.__init__(self, post_composers or [], postControllers, current_user)
    self.reply_ids = { }   # per-user back-off state, keyed by user id string

  def can_handle_object(self, post_object):
    if self.isCurrentUser(post_object):
      return False
    if not post_object.has_key('entities'):
      return False
    if not post_object['entities'].has_key('user_mentions'):
      return False
    for user_mention in post_object['entities']['user_mentions']:
      if user_mention['id_str'] == self.current_user['id_str']:
        return True
    return False

  def probabilityToPost(self, post_object, seconds_from_midnight, time_step, simulate=False):
    if self.isCurrentUser(post_object):
      return 0
    if not post_object.has_key('entities'):
      return 0
    if not post_object['entities'].has_key('user_mentions'):
      return 0
    for user_mention in post_object['entities']['user_mentions']:
      if user_mention['id_str'] == self.current_user['id_str']:
        return self.probabilityForId(post_object, seconds_from_midnight, time_step)
    return 0

  def probabilityForId(self, post_object, seconds_from_midnight, time_step):
    if not post_object.has_key('user'):
      return 0
    if not post_object['user'].has_key('id_str'):
      return 0
    user_id = post_object['user']['id_str']
    if not self.reply_ids.has_key(user_id):
      self.reply_ids[user_id] = { 'probability' : 1, 'first_reply' : datetime.today(), 'last_attempt' : datetime.min }

    current_datetime = datetime.today()
    # Reset the back-off after a full day; timedelta.seconds never exceeds 86399, so compare days.
    if (current_datetime - self.reply_ids[user_id]['first_reply']).days >= 1:
      self.reply_ids[user_id] = { 'probability' : 1, 'first_reply' : datetime.today(), 'last_attempt' : datetime.min }
      return 1

    probability = self.reply_ids[user_id]['probability']
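    delta = (datetime.today() - self.reply_ids[user_id]['last_attempt'])
    # Ignore attempts within 500 microseconds of the previous one.
    # total_seconds() covers the whole interval; .microseconds is only the sub-second part.
    if delta.total_seconds() < 0.0005:
      probability = 0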

    self.reply_ids[user_id]['last_attempt'] = datetime.today()

    return probability

  def postUpdateStatus(self, api, post_object):
    user_id = post_object['user']['id_str']
    probability = float(self.reply_ids[user_id]['probability'])
    self.reply_ids[user_id]['probability'] = probability * 0.5
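The back-off in postUpdateStatus halves a user's reply probability after every reply, so the chance of answering falls off geometrically. A quick Python 3 sketch of that arithmetic, where reply_probability is a hypothetical helper:

```python
def reply_probability(replies_sent, factor=0.5):
    """Probability of answering a user after `replies_sent` earlier replies."""
    return factor ** replies_sent


for n in range(5):
    print(n, reply_probability(n))
# 0 1.0
# 1 0.5
# 2 0.25
# 3 0.125
# 4 0.0625
```

After ten replies the chance of an eleventh is 1 in 1024, which is what keeps a flood of mentions from one account from turning into a flood of replies.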

Finally, the flow trickles down to a PostComposer, which creates the actual tweet. Here is the FortuneComposer:

class FortuneComposer(PostComposer):
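  def __init__(self):
    # Fortunes in the file are separated by a line containing only '%'.
    with open('fortunes') as fortune_file:
      fortunes = fortune_file.read().split('\n%\n')
    # Build a new list; removing items while iterating over the same list skips entries.
    self.fortunes = [fortune for fortune in fortunes if len(fortune) <= 140]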

  def compose(self, api, post_object, simulate):
    fortune = None
    screen_name = None
    if post_object.has_key('user'):
      if post_object['user'].has_key('screen_name'):
        screen_name = post_object['user']['screen_name']
    if screen_name != None:
      fortune = self.chooseFortune(140, screen_name)
    else:
      fortune = self.chooseFortune()
    if fortune == None:
      return None
    if simulate:
      return fortune
    if post_object.has_key('id_str') and screen_name != None:
      return api.updateStatus(status=fortune, in_reply_to_status_id=post_object['id_str'])
    else:
      return api.updateStatus(status=fortune)

  def chooseFortune(self, max_len=140, screen_name=None):
    fortune = ''
    if screen_name != None:
      fortune += '@' + screen_name + ' '
      max_len -= len(fortune)
    tmp_fortune = random.choice(self.fortunes)
    count = 0
    while len(tmp_fortune) > max_len:
      if count > 1000:
        return None
      tmp_fortune = random.choice(self.fortunes)
      count += 1
    fortune += tmp_fortune
    return fortune

This should be pretty self-explanatory. It chooses a random fortune, prepending a screen name if it is constructing a reply. It then ensures that the tweet will fit in the allotted 140 characters, and finally uses the provided Twython api object to send it to Twitter.
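The parsing and length check can be tried without a fortunes file or a Twitter connection. This is a minimal Python 3 sketch under assumptions: parse_fortunes and compose_reply are hypothetical helpers, the sample text is made up, and the candidates are filtered up front instead of retrying random picks as chooseFortune does.

```python
import random

TWEET_LIMIT = 140


def parse_fortunes(text, limit=TWEET_LIMIT):
    """Split a fortune file body on '%' separator lines and keep short entries."""
    entries = [entry.strip() for entry in text.split("\n%\n")]
    return [entry for entry in entries if entry and len(entry) <= limit]


def compose_reply(fortunes, screen_name=None, limit=TWEET_LIMIT, rng=random):
    """Return a fortune, prefixed with @screen_name if given, within the limit."""
    prefix = ("@" + screen_name + " ") if screen_name else ""
    fitting = [f for f in fortunes if len(prefix) + len(f) <= limit]
    if not fitting:
        return None   # nothing short enough once the prefix is added
    return prefix + rng.choice(fitting)


sample = "Short and sweet.\n%\n" + "x" * 200 + "\n%\nYou will tweet today."
fortunes = parse_fortunes(sample)
print(fortunes)   # ['Short and sweet.', 'You will tweet today.']
print(compose_reply(fortunes, "seepel", rng=random.Random(1)))
```

Filtering first means the function either returns a tweet that fits or returns None straight away, with no retry counter needed.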

To sum everything up, I feel that I have a nice implementation of a Twitter bot that I can expand on down the line. It should be relatively easy to add new actions as I think of them. So why not give it a try and send @bucketobytes an @mention?