A misconception about the Twitter API - カブトボーグを写経するblog

http://twitter.com/statuses/public_timeline.json
↑ So things like this were the Twitter API all along

Twitter API Wiki / Twitter REST API Method: statuses public_timeline
http://apiwiki.twitter.com/Twitter-REST-API-Method%3A-statuses-public_timeline

Which means that for collecting data, I get the same thing if I skip handling the JSON myself and just use Twitter4J's getPublicTimeline

What's left is how to fetch the PublicTimeline without missing anything
That is the biggest problem
At 10-second intervals I miss nearly everything

Maybe launch about 10 instances, stagger their start by 1 second each, let each one collect on its own, and merge everything at the end?
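A rough sketch of that stagger-and-merge idea: every worker polls on the same period but starts a little later than the previous one, and all results land in one set so duplicates vanish at merge time. `StaggeredPoller`, `collect`, and the `fetch` supplier are names made up for illustration; `fetch` stands in for a single public timeline request and is not a Twitter4J call. Every poll would still count against the API limit.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class StaggeredPoller {
    // fetch is a hypothetical stand-in for one public timeline request.
    public static Set<String> collect(Supplier<List<String>> fetch, int workers,
                                      long staggerMillis, long periodMillis,
                                      long runMillis) {
        // Thread-safe set: merging and de-duplicating happen in one place.
        Set<String> merged = ConcurrentHashMap.newKeySet();
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            // Worker i starts i * staggerMillis late, then repeats every periodMillis.
            pool.scheduleAtFixedRate(() -> merged.addAll(fetch.get()),
                    i * staggerMillis, periodMillis, TimeUnit.MILLISECONDS);
        }
        try {
            Thread.sleep(runMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return merged;
    }

    public static void main(String[] args) {
        // Scaled-down demo; the idea in the post would be stagger 1000 ms, period 10000 ms.
        Set<String> ids = collect(() -> List.of("status-1", "status-2"), 10, 100, 1000, 500);
        System.out.println(ids.size() + " unique statuses");
    }
}
```

With 10 workers, a 1000 ms stagger, and a 10000 ms period, the combined effect is one request per second.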

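Of course the best option would be getting access to the Streaming API's firehose, but it has been quite a while since I sent the email and there has been no reply, so that looks like a no
Whether spritzer() or getPublicTimeline() is better is another question
I'll build both and test them
Now testing spritzer() vs. getPublicTimeline() at 1-second intervals
The public one hit the API limit in no time lol
Going with spritzer()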

At this point I really want to use firehose()...

Found someone with the same problem

Twitter検索 : Search http://twitter.1x1.jp/search/ How is a Twitter search like this implemented? (How do you think it is implemented?) On Twitter, the search AP.. - 人力検索はてな
 http://q.hatena.ne.jp/1195485924

What happens if I launch several spritzer() instances? Trying it
They all fetch exactly the same thing, so it's pointless

http://pcod.no-ip.org/yats/public_timeline?json
Using this one might be the smartest option

JSON was hard, so I tried RSS instead and it worked

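import java.net.URL;
import java.util.List;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.fetcher.FeedFetcher;
import com.sun.syndication.fetcher.impl.HttpURLFeedFetcher;

// Fetches the public timeline RSS feed and prints date, author, and text per entry.
public class RSSTest {
	public static void main(String[] args) throws Exception {
		FeedFetcher fetcher = new HttpURLFeedFetcher();
		String url = "http://pcod.no-ip.org/yats/public_timeline?rss";
		SyndFeed feed = fetcher.retrieveFeed(new URL(url));
		for (SyndEntry entry : (List<SyndEntry>) feed.getEntries()) {
			System.out.print(entry.getUpdatedDate()+"\t\t");
			System.out.print(entry.getAuthor()+"\t\t");
			System.out.println(entry.getDescription().getValue());
		}
	}
}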

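Next, append the data picked up this way to a text file or a DB
Then add handling so the same information is not appended twice, and strip things like @snkken and http:// from the tweets
With that, the crawler is complete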
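Those two steps could look something like this, as a minimal sketch. `TweetFilter`, `strip`, and `isNew` are hypothetical names, and what counts as "the same information" is an assumption here: the caller picks the key (for example author plus text).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class TweetFilter {
    // @name mentions and http/https URLs up to the next whitespace.
    private static final Pattern MENTION = Pattern.compile("@\\w+");
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Keys already written out; kept in memory only in this sketch.
    private final Set<String> seen = new HashSet<>();

    // Removes URLs and @mentions, then trims the ends.
    public static String strip(String text) {
        String noUrls = URL.matcher(text).replaceAll("");
        return MENTION.matcher(noUrls).replaceAll("").trim();
    }

    // Returns true only the first time a given key is offered.
    public boolean isNew(String key) {
        return seen.add(key);
    }

    public static void main(String[] args) {
        TweetFilter filter = new TweetFilter();
        String text = strip("@snkken hello http://example.com/a");
        if (filter.isNew(text)) {
            System.out.println(text);
        }
    }
}
```

Removing a mention from the middle of a tweet leaves a double space behind, so whitespace cleanup still has to happen afterwards.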

Done
Running it as a test now
Next up is probably adding an NG-ID list
I want to filter out the Chinese spam bots
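A minimal sketch of an NG-ID check, assuming the list is just a set of author IDs and that IDs are compared case-insensitively. `NgIdFilter`, `isBlocked`, and the sample ID are made up for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class NgIdFilter {
    // NG IDs stored in lower case so lookups ignore case.
    private final Set<String> ngIds = new HashSet<>();

    public NgIdFilter(Iterable<String> ids) {
        for (String id : ids) {
            ngIds.add(id.toLowerCase(Locale.ROOT));
        }
    }

    // True when the author is on the NG list and the entry should be skipped.
    public boolean isBlocked(String author) {
        return author != null && ngIds.contains(author.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        NgIdFilter filter = new NgIdFilter(List.of("spam_bot_1")); // hypothetical ID
        System.out.println(filter.isBlocked("Spam_Bot_1"));
        System.out.println(filter.isBlocked("snkken"));
    }
}
```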

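Added a feature to the crawler
Leading spaces are removed, and runs of consecutive spaces are collapsed into one
I want to save every byte I can, so I trimmed them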
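The space handling described above can be sketched in a couple of regex replacements. `SpaceNormalizer` and `normalize` are hypothetical names, and this version only looks at the ASCII space, which is an assumption (full-width spaces are left alone).

```java
public class SpaceNormalizer {
    // Drops leading spaces, then collapses each run of 2+ spaces into one.
    public static String normalize(String text) {
        return text.replaceAll("^ +", "").replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("   a   b c") + "]");
    }
}
```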

The crawler had been running fine, then it suddenly died with an error after reading an ESC character
What should I do about this?

Exception in thread "main" com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 6: An invalid XML character (Unicode: 0x1b) was found in the element content of the document.
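
One possible way around this, as a sketch: drop every character that XML 1.0 does not allow (ESC, 0x1b, is one of them) from the raw feed text before it reaches the parser. `XmlSanitizer` and `stripInvalid` are hypothetical names, and this assumes the feed body is downloaded as a String first instead of letting the fetcher parse the URL directly.

```java
public class XmlSanitizer {
    // True for code points allowed by the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops characters such as ESC (0x1b) that an XML 1.0 parser rejects.
    public static String stripInvalid(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        xml.codePoints()
                .filter(XmlSanitizer::isValidXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<description>abc\u001bdef</description>";
        System.out.println(stripInvalid(dirty));
    }
}
```

The cleaned string could then be handed to ROME through `SyndFeedInput.build(new StringReader(cleaned))`.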