Extracting and parsing tweets from your Twitter archive

I recently gave Micro.blog a try, and shortly after I thought of importing all my tweets here, because … why not own my content? This post is about extracting your Twitter archive and converting it into simpler objects with just text and timestamp. There are many more fields available, but these were the only ones I was particularly interested in.

First things first, we need to request the archive: in our Twitter profile settings, all the way at the bottom, we can find “Your Tweet archive”. We click on “Request your archive”, and after a while we’ll receive an email with a link to the download.

The archive contains a single folder with a website structure (so you can open index.html, if you’d like), but we’re only interested in the tweets.csv file at the root, next to index.html. This file has the following structure:

"tweet_id","in_reply_to_status_id","in_reply_to_user_id","timestamp","source","text","retweeted_status_id","retweeted_status_user_id","retweeted_status_timestamp","expanded_urls"

Twitter replaces any URLs you post with https://t.co URLs, and I, personally, didn’t want to use those. That’s what expanded_urls is there for: it’s a comma-delimited string with the actual URLs for all the t.co URLs in a tweet (if any).
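As a minimal sketch of what reading that field could look like in Swift (the language used for the rest of this post): the helper name expandedURLs(from:) and the example.com URLs are hypothetical, made up for illustration.

```swift
import Foundation

// Hypothetical helper: splits the raw expanded_urls field into individual URLs.
func expandedURLs(from field: String) -> [String] {
   // Splitting an empty string yields [""], so empty pieces are dropped.
   return field
      .components(separatedBy: ",")
      .filter { !$0.isEmpty }
}

let none = expandedURLs(from: "")
let two = expandedURLs(from: "https://example.com/a,https://example.com/b")

print(none.count) // 0
print(two)        // ["https://example.com/a", "https://example.com/b"]
```

One caveat of this approach: a URL that itself contains a comma would be split into two pieces.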

Off we go: there’s a nice CSV parser (thanks, naoty!), which we’ll need to add to the Sources folder of a new Playground and update to Swift 4; the compiler will tell us everything that needs to be done. We’ll then add the tweets.csv file to the Playground’s Resources, and we’re good to go. There will be a lot of force unwrapping, but this is intentional, so we can find problems or exceptions and treat them individually.

First, a Tweet struct, to hold the final data, and a Syntax enum to dictate what we do with the URLs found in tweets:

struct Tweet {

   let text: String
   let timestamp: String

}

enum Syntax {
   case markdown
   case html
   case none
}

Then, some more preparations:

let syntax: Syntax = .markdown // 1
let dataDetector = try! NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue) // 2
let handleRegex = try! NSRegularExpression(pattern: "@[^.,:;\\-()\\[\\]{}\\s]+", options: .caseInsensitive) // 3

Since I wanted to write this post and also create a gist about it, I added a property that determines the syntax of the URLs (1). Then, an NSDataDetector was needed (2), to find all the t.co URLs and replace them with the actual URLs from expanded_urls. Lastly, we’ll also want to create URLs for all the handles, so an NSRegularExpression is required (3): match all strings that start with @, followed by one or more characters of any kind, except the ones listed between the square brackets (punctuation, brackets and whitespace).
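To get a feel for what the handle pattern matches, here is a small standalone sketch. The sample sentence is made up for illustration; only the pattern comes from the code above.

```swift
import Foundation

// Same pattern as handleRegex: "@" followed by one or more characters
// that are neither whitespace nor one of . , : ; - ( ) [ ] { }
let pattern = "@[^.,:;\\-()\\[\\]{}\\s]+"
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)

// Hypothetical sample text, not taken from the archive.
let sample = "Thanks @naoty, and hello (@rolandleth)!"
let nsSample = sample as NSString
let fullRange = NSRange(location: 0, length: nsSample.length)

let handles = regex
   .matches(in: sample, options: [], range: fullRange)
   .map { nsSample.substring(with: $0.range) }

print(handles) // ["@naoty", "@rolandleth"]
```

Note that the pattern doesn't know about context: it would also match the "@example" part of an email address like me@example.com.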

Using the library is totally straightforward:

let file = Bundle.main.path(forResource: "tweets", ofType: "csv")! // 1
let csv = try! CSV(name: file) // 2
let rawTweets = csv.rows.filter { // 3
   // Used a lot of clients throughout the years, each with its own retweeting format ...
   let isRetweet = $0["retweeted_status_user_id"]?.isEmpty == false
      || $0["expanded_urls"]?.contains("https://twitter.com") == true
      || $0["expanded_urls"]?.contains("favd.net") == true
      || $0["text"]?.contains("via @") == true
      || $0["text"]?.contains("RT @") == true
      || $0["text"]?.contains("\"@") == true
      || $0["text"]?.contains("“@") == true
      || $0["text"] == "." // Don't ask, I have no idea ...
   let isReply = $0["in_reply_to_status_id"]?.isEmpty == false
      || $0["text"]?.hasPrefix("@") == true
   let isLinkToBlog = $0["expanded_urls"]?.contains("rolandleth.com") == true

   return !isRetweet && !isReply && !isLinkToBlog // 4
}
let tweets = rawTweets.map { rawTweet -> Tweet in // 5
   // [...]

   return Tweet(text: text, timestamp: rawTweet["timestamp"]!) // 6
}

We create a path to our file (1) and pass it to CSV (2); we then have access to all tweets via the rows property (3). You might want to skip the next part, but for my own purposes, I only cared about tweets (no retweets, no direct tweets and no replies), because that’s the “standalone” content (4). I also didn’t care about any tweets that link to my blog (4), since those are most likely just tweets where I shared posts. The map will create an array of Tweets (5, 6).
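The optional-chaining comparisons in the filter can look a bit odd at first, so here is a reduced sketch of the same idea. It assumes each row is a [String: String] dictionary, and the three rows are hypothetical, not real archive data.

```swift
import Foundation

// Hypothetical rows, keyed by column name.
let rows: [[String: String]] = [
   ["text": "Just a thought", "in_reply_to_status_id": "", "retweeted_status_user_id": ""],
   ["text": "@someone I agree", "in_reply_to_status_id": "123", "retweeted_status_user_id": ""],
   ["text": "RT @someone: hello", "in_reply_to_status_id": "", "retweeted_status_user_id": ""]
]

let standalone = rows.filter { row in
   // A missing column gives nil, and nil is neither == false nor == true,
   // so a missing column never flags a row.
   let isRetweet = row["retweeted_status_user_id"]?.isEmpty == false
      || row["text"]?.contains("RT @") == true
   let isReply = row["in_reply_to_status_id"]?.isEmpty == false
      || row["text"]?.hasPrefix("@") == true

   return !isRetweet && !isReply
}

print(standalone.map { $0["text"]! }) // ["Just a thought"]
```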

All the code from this point onward is part of the map, corresponding to a single tweet — it will be easier to go through the code like this.

var text = rawTweet["text"]! // 1

if syntax == .markdown { // 2
   text = text.replacingOccurrences(of: "\n", with: "  \n")
}

var nsText: NSString { return text as NSString } // 3
var textRange: NSRange { return NSRange(location: 0, length: text.utf16.count) } // 4

let expandedURLs = rawTweet["expanded_urls"]!
   .components(separatedBy: ",") // 5
   .filter { !$0.isEmpty } // An empty field would otherwise produce [""]

let reversedMatches = dataDetector
   .matches(in: text, options: [], range: textRange) // 6
   .reversed() // 7
let matchesCount = reversedMatches.count
var nonTcoURLs = 0 // 8

First, we extract our text (1); if the syntax is set to Markdown, we add two spaces before every newline (2), since that’s what Markdown requires for a line break. Then, instead of creating a Range<String.Index> out of an NSRange, it was easier to just bridge our text to NSString (3), but it has to be a computed property, because we’ll be modifying text.

For the range we use a computed NSRange with a length of .utf16.count (4), because NSRange counts UTF-16 code units and there will be emojis, which take up more than one! We extract the expanded_urls into an array of Strings (5), so we can replace the matches found by the dataDetector (6). This will be done in reversed order (7), otherwise replacing the first occurrence would break the ranges of the next occurrences, if any.
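Both points, the UTF-16 length and the reversed order, can be seen in isolation in a small sketch. The sample text and the replacements are made up, and plain range(of:) lookups stand in for the data detector's matches.

```swift
import Foundation

// Hypothetical sample text; the emoji occupies two UTF-16 code units.
var text = "😁 a and b"
let characterCount = text.count   // counts Characters (grapheme clusters)
let utf16Count = text.utf16.count // counts UTF-16 code units, which is what NSRange uses

// All ranges are computed once, up front, against the unmodified text.
let ns = text as NSString
let edits: [(range: NSRange, replacement: String)] = [
   (ns.range(of: "a"), "alpha"),
   (ns.range(of: "b"), "beta")
]

// Applying the edits back to front keeps the earlier ranges valid.
for edit in edits.reversed() {
   text = (text as NSString).replacingCharacters(in: edit.range, with: edit.replacement)
}

print(characterCount, utf16Count) // 9 10
print(text)                       // 😁 alpha and beta
```

Applying the same edits front to back would first grow "a" into "alpha", shifting "b" four code units to the right, so the second range would no longer point at it.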

Since Twitter does not convert URLs that are not “full” (rolandleth.com, for example), but NSDataDetector does detect them, we need to make sure we’re properly replacing t.co occurrences: for each tweet, we have to keep track of the number of non-t.co URLs (8), so we can adjust the index when reading from our expandedURLs (explained below).

reversedMatches.enumerated().forEach { i, m in
   var url = nsText.substring(with: m.range) // 1
   let correctURL: String

   // 2
   if matchesCount > expandedURLs.count, !url.hasPrefix("http") {
      url = "http://" + url
      nonTcoURLs += 1
   }
   else {
      url = expandedURLs[i - nonTcoURLs] // 3
   }

   let urlName = url
      .replacingOccurrences(of: "http://", with: "")
      .replacingOccurrences(of: "https://", with: "") // 4

   switch syntax {
   case .markdown: correctURL = "[\(urlName)](\(url))" // 5
   case .html: correctURL = "<a href=\"\(url)\">\(urlName)</a>" // 6
   case .none: correctURL = url // 7
   }

   text = nsText.replacingCharacters(in: m.range, with: correctURL) // 8
}

We fall back on the original matched URL (1), and if we have more matches than expandedURLs and the current occurrence doesn’t have a prefix of http, we add the prefix ourselves and increase our nonTcoURLs count (2); otherwise, we replace the t.co occurrence with the corresponding expandedURL, taking into account the number of nonTcoURLs (3).

The reason behind adding the http prefix ourselves is that, without a scheme, the link is treated as a relative path and gets resolved against the current site: for example, on https://rolandleth.com, a Markdown URL of example.com would be rendered as https://rolandleth.com/example.com.

We then remove any http:// or https:// to be used as the URL’s display name (4) and then we can see correctURL’s purpose: to add Markdown syntax (5), HTML syntax (6), or none (7). We then replace the occurrence of the t.co URL with the real one (8).
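Pulled out of the loop, the formatting step could be sketched as a small function. The names LinkSyntax and formattedLink(for:syntax:) are hypothetical, introduced only so the sketch stands on its own; the logic mirrors steps 4 to 7 above.

```swift
import Foundation

// Mirrors the Syntax enum from earlier, renamed to keep the sketch standalone.
enum LinkSyntax {
   case markdown
   case html
   case none
}

// Hypothetical helper wrapping the display name and the syntax switch.
func formattedLink(for url: String, syntax: LinkSyntax) -> String {
   // The display name is the URL without its scheme.
   let name = url
      .replacingOccurrences(of: "http://", with: "")
      .replacingOccurrences(of: "https://", with: "")

   switch syntax {
   case .markdown: return "[\(name)](\(url))"
   case .html: return "<a href=\"\(url)\">\(name)</a>"
   case .none: return url
   }
}

print(formattedLink(for: "https://rolandleth.com", syntax: .markdown))
// [rolandleth.com](https://rolandleth.com)
print(formattedLink(for: "https://rolandleth.com", syntax: .html))
// <a href="https://rolandleth.com">rolandleth.com</a>
```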

let reversedHandleMatches = handleRegex
   .matches(in: text, options: [], range: textRange)
   .reversed() // 1

reversedHandleMatches.forEach {
   let accountRange = NSRange(location: $0.range.location + 1, length: $0.range.length - 1)
   let account = nsText.substring(with: accountRange) // 2
   let correctHandleURL: String
   let handleURL = "https://twitter.com/\(account)"

   switch syntax { // 3
   case .markdown: correctHandleURL = "[@\(account)](\(handleURL))"
   case .html: correctHandleURL = "<a href=\"\(handleURL)\">@\(account)</a>"
   case .none: correctHandleURL = handleURL
   }

   text = nsText.replacingCharacters(in: $0.range, with: correctHandleURL) // 4
}

Finally, for the last piece of the puzzle, we will replace all @handles in a tweet by finding all the matches our handleRegex gives us, also in reverse (1). We first extract the handle (2), then we add the correct syntax (3), and we replace the occurrence in the original text with the handle’s URL (4).

The whole gist can be found here, and all my tweets can now be found here. This was fun to do 😁