Tuesday, May 14, 2013

Twitter is a great data source for Machine Learning experiments - quick recipe.

Why did I need Twitter's live stream?

I'm working on several machine learning algorithms. Research and experimentation in this area require data to run the algorithms on. Previously, I had local corpora, Twitter dumps and a bit of data captured from users (Grammarly users). These are not bad data sources, but the problem is that they are static: each is a snapshot taken once.
My further work required dynamic data, a live stream of messages. There are several ways to get one, and Twitter is a good option. So I went through the Twitter API documentation and obtained all the streamed data I needed. Here is a quick overview of how to use it in Python.

First of all, after some quick googling I got this working example in Python:

import urllib
import json

# Query the public search API and parse the JSON response (Python 2)
response = urllib.urlopen("http://search.twitter.com/search.json?q=britain")
print json.load(response)

This returned the following JSON, shown here parsed:

{
    "completed_in": 0.021,
    "max_id": 334060350016200704,
    "max_id_str": "334060350016200704",
    "next_page": "?page=2&max_id=334060350016200704&q=britain",
    "page": 1,
    "query": "britain",
    "refresh_url": "?since_id=334060350016200704&q=britain",
    "results": [
        {
            "created_at": "Mon, 13 May 2013 21:39:28 +0000",
            "from_user": "PhoebeHunt",
            "from_user_id": 67871080,
            "from_user_id_str": "67871080",
            "from_user_name": "Phoebe Hunt",
            "geo": null,
            "id": 334060350016200704,
            "id_str": "334060350016200704",
            "iso_language_code": "en",
            "metadata": {
                "result_type": "recent"
            },
            "profile_image_url": "http://a0.twimg.com/profile_images/3623547765/2fa1b209d088c2fffcd2b186b0e260c5_normal.jpeg",
            "profile_image_url_https": "https://si0.twimg.com/profile_images/3623547765/2fa1b209d088c2fffcd2b186b0e260c5_normal.jpeg",
            "source": "<a href=\"http://blackberry.com/twitter\">Twitter for BlackBerry®</a>",
            "text": "Dunno why everyone is so shocked at the programme skint, welcome to the real Britain.."
        },
        {
            "created_at": "Mon, 13 May 2013 21:39:22 +0000",
            "from_user": "Ovidus",
            "from_user_id": 30197745,
            "from_user_id_str": "30197745",
............................................
............................................
............................................


This is a simple public Twitter search service with a JSON interface. It can be used for many purposes, but this data still cannot be called "dynamic". Sure, you can request the search results again and they will be up to date, but I needed a stream.
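
To make the polling idea concrete, here is a minimal sketch that works on a response shaped like the one shown above. It is written for Python 3, unlike the Python 2 snippet earlier in the post. The helper names `extract_texts` and `next_poll_url` are made up for illustration and are not part of any Twitter library, and the sketch assumes that `refresh_url` is a query string meant to be appended to the same search endpoint.

```python
import json

# Endpoint taken from the search example; the query string is appended to it.
SEARCH_BASE = "http://search.twitter.com/search.json"


def extract_texts(payload):
    """Return (from_user, text) pairs from a parsed search response."""
    return [(r["from_user"], r["text"]) for r in payload.get("results", [])]


def next_poll_url(payload, base=SEARCH_BASE):
    """Build the URL for the next poll from the response's refresh_url."""
    return base + payload["refresh_url"]


# A trimmed copy of the response shown earlier, used instead of a live request.
sample = json.loads("""
{
    "query": "britain",
    "refresh_url": "?since_id=334060350016200704&q=britain",
    "results": [
        {"from_user": "PhoebeHunt",
         "text": "Dunno why everyone is so shocked at the programme skint, welcome to the real Britain.."}
    ]
}
""")

for user, text in extract_texts(sample):
    print(user, ":", text)
print(next_poll_url(sample))
```

Every call still returns a batch of results, so this is polling rather than streaming.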

I found information about streamed data in the Twitter documentation.

Steps I had to go through:
  1. Created a Twitter account.
  2. Went to https://dev.twitter.com/apps and logged in with my Twitter credentials.
  3. Clicked "create an application".
  4. Filled in the form and agreed to the terms.
  5. On the next page, clicked "Create my access token".
  6. Copied my "Consumer key" and "Consumer secret".
  7. Clicked "Create my access token" (for OAuth authorization).

My Python installation didn't include oauth2, so I had to install it with pip install oauth2.

Twitter doesn't grant access to the firehose stream. This means that you won't be able to get the stream of all the tweets on earth; only 1% is available for free. But that was quite enough for me.
This sample stream is available at this endpoint: https://stream.twitter.com/1/statuses/sample.json

Here is the Python code that reads the stream from this endpoint. It is slightly modified for my purposes:

# Twitter 1% sample stream requesting script
# Usage python tstream.py > output

import oauth2 as oauth # Authentication
import urllib2 as urllib # Http Requests

access_token_key = "SECRET FIELD"
access_token_secret = "SECRET FIELD"

consumer_key = "SECRET FIELD"
consumer_secret = "SECRET FIELD"

oa_token    = oauth.Token(key=access_token_key, secret=access_token_secret)
o_consumer = oauth.Consumer(key=consumer_key, secret=consumer_secret)

sig_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()
http_method = "GET"

http_handler  = urllib.HTTPHandler()
https_handler = urllib.HTTPSHandler()

def twitter_request(url, method, parameters):
  # Build an OAuth request from the consumer and access token credentials
  request = oauth.Request.from_consumer_and_token(o_consumer,
                                             token=oa_token,
                                             http_method=method,
                                             http_url=url,
                                             parameters=parameters)

  request.sign_request(sig_method_hmac_sha1, o_consumer, oa_token)

  headers = request.to_header()

  # For GET, the signed OAuth parameters travel in the URL itself
  if method == "POST":
    encoded_post_data = request.to_postdata()
  else:
    encoded_post_data = None
    url = request.to_url()

  opener = urllib.OpenerDirector()
  opener.add_handler(http_handler)
  opener.add_handler(https_handler)

  response = opener.open(url, encoded_post_data)

  return response

def start_samples_receiving():
  url = "https://stream.twitter.com/1/statuses/sample.json"
  parameters = []
  response = twitter_request(url, "GET", parameters)
  for line in response:
    print line.strip()

start_samples_receiving()
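
For readers curious what `sign_request` does with `SignatureMethod_HMAC_SHA1`, here is a rough sketch of an OAuth 1.0 HMAC-SHA1 signature as described in RFC 5849, using only the Python 3 standard library. It is not the oauth2 library's own code. The function names and all credential values are made up for illustration, and a real request also needs fresh `oauth_nonce` and `oauth_timestamp` values, which the library generates for you.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode everything except unreserved characters (RFC 5849, 3.6)."""
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    """Join the method, the URL and the sorted parameters into the string to sign."""
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(k + "=" + v for k, v in encoded)
    return "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])


def hmac_sha1_signature(method, url, params, consumer_secret, token_secret):
    """Sign the base string with a key built from the two secrets."""
    key = percent_encode(consumer_secret) + "&" + percent_encode(token_secret)
    base = signature_base_string(method, url, params)
    digest = hmac.new(key.encode("utf-8"), base.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical values, for illustration only.
example_params = {
    "oauth_consumer_key": "my-consumer-key",
    "oauth_token": "my-access-token",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_nonce": "abc123",
    "oauth_timestamp": "1368482400",
    "oauth_version": "1.0",
}
example_url = "https://stream.twitter.com/1/statuses/sample.json"

print(hmac_sha1_signature("GET", example_url, example_params,
                          "my-consumer-secret", "my-access-token-secret"))
```

The server repeats the same computation with its copy of the secrets, which is why the secrets themselves never have to be sent with the request.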


The objects I received were JSON, structured as described at https://dev.twitter.com/docs/platform-objects/tweets. After filtering, I had a stream of objects like the following:

{
    "created_at": "Mon May 13 22:15:36 +0000 2013",
    "id": 334069443342782464,
    "id_str": "334069443342782464",
    "text": "@CalaceSofia con mucho gusto ajajja, na mentira yo te quiero :)",
    "source": "web",
    "truncated": false,
    "in_reply_to_status_id": 334054695234572289,
    "in_reply_to_status_id_str": "334054695234572289",
    "in_reply_to_user_id": 864772136,
    "in_reply_to_user_id_str": "864772136",
    "in_reply_to_screen_name": "CalaceSofia",
    "user": {
        "id": 1413481183,
        "id_str": "1413481183",
        "name": "Julian Bertoldi",
        "screen_name": "NanooJBertoldi",
        "location": "",
        "url": null,
        "description": null,
        "protected": false,
        "followers_count": 6,
        "friends_count": 62,
        "listed_count": 0,
        "created_at": "Wed May 08 17:59:21 +0000 2013",
        "favourites_count": 0,
        "utc_offset": null,
        "time_zone": null,
        "geo_enabled": false,
        "verified": false,
        "statuses_count": 17,
        "lang": "es",
        "contributors_enabled": false,
        "is_translator": false,
        "profile_background_color": "C0DEED",
        "profile_background_image_url": "http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_image_url_https": "https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_tile": false,
        "profile_image_url": "http:\/\/a0.twimg.com\/profile_images\/3631853451\/ef30046384f63f5bbb1dbfc7dbf7719f_normal.png",
        "profile_image_url_https": "https:\/\/si0.twimg.com\/profile_images\/3631853451\/ef30046384f63f5bbb1dbfc7dbf7719f_normal.png",
        "profile_link_color": "0084B4",
        "profile_sidebar_border_color": "C0DEED",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "default_profile": true,
        "default_profile_image": false,
        "following": null,
        "follow_request_sent": null,
        "notifications": null
    },
    "geo": null,
    "coordinates": null,
    "place": null,
    "contributors": null,
    "retweet_count": 0,
    "favorite_count": 0,
    "entities": {
        "hashtags": [
            
        ],
        "symbols": [
            
        ],
        "urls": [
            
        ],
        "user_mentions": [
            {
                "screen_name": "CalaceSofia",
                "name": " \u2655 \u221e ",
                "id": 864772136,
                "id_str": "864772136",
                "indices": [
                    0,
                    12
                ]
            }
        ]
    },
    "favorited": false,
    "retweeted": false,
    "filter_level": "medium",
    "lang": "es"
}
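
A filter of this kind can be sketched as follows (Python 3, standard library only). It is a minimal illustration, not necessarily the filter used to produce the object above. It assumes the stream delivers one JSON message per line, that blank keep-alive lines and non-tweet messages such as delete notices may appear between tweets, and that a message counts as a tweet if it has a `text` field. The names `iter_tweets` and `to_features`, and the exact shape of the delete notice in the fake input, are illustrative.

```python
import json


def iter_tweets(lines):
    """Yield parsed tweets from an iterable of raw stream lines."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # blank keep-alive line
        try:
            message = json.loads(raw)
        except ValueError:
            continue  # partial or malformed line
        if isinstance(message, dict) and "text" in message:
            yield message


def to_features(tweet):
    """Keep only the fields that are handy for text experiments."""
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    return {
        "id": tweet["id_str"],
        "lang": tweet.get("lang"),
        "text": tweet["text"],
        "mentions": [m["screen_name"] for m in mentions],
    }


# Stand-in for the HTTP response: a tweet, a keep-alive line and a delete notice.
fake_stream = [
    '{"id_str": "334069443342782464", "lang": "es", '
    '"text": "@CalaceSofia con mucho gusto ajajja, na mentira yo te quiero :)", '
    '"entities": {"user_mentions": [{"screen_name": "CalaceSofia"}]}}',
    "",
    '{"delete": {"status": {"id_str": "1", "user_id_str": "2"}}}',
]

for tweet in iter_tweets(fake_stream):
    print(to_features(tweet))
```

Reading `id_str` rather than `id` keeps the identifier intact even when it passes through tools that store large numbers as floating point.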
