Self-Creating Franken Post

May 4th, 2010

A Markov chain is a system whose next state depends only on its current state. A text generator built on one works like this: the next word in a phrase is chosen at random from the words that have followed the current phrase in the source text. For example, suppose the current phrase is “6 year”, and somewhere in our previous blog posts someone mentioned a “6 year old” or perhaps a “6 year plague” (probably not, but it's just an example). The generator then picks either “old” or “plague” at random to follow “6 year”. The entire text is generated this way, one step at a time.
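The “6 year” example can be sketched in a few lines of Python. This is only an illustration of the idea, not the generator used for this post: the function name `build_table`, the two-word phrase length, and the sample sentence are all made up for the example.

```python
import random
from collections import defaultdict

def build_table(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        table[key].append(words[i + order])
    return table

# Made-up sample text standing in for the real blog posts.
sample = "my 6 year old laptop survived a 6 year plague of coffee spills"
table = build_table(sample)

# Every word that followed the phrase "6 year" in the sample.
followers = table[("6", "year")]

# The generator's next word: one of the followers, chosen at random.
next_word = random.choice(followers)
```

Because followers are stored once per occurrence, a word that follows a phrase more often in the source is proportionally more likely to be picked.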

Since it was crunch time and I really had no idea what to write about, I decided to create a Franken-post from all of the previous Border Stylo blog posts. First I downloaded the content via wget:

This grabbed all the HTML data, and ignored all of the image files, so I was able to download everything very quickly. Next I took all the output files from the /posts directory and ran them through html2text to strip all the markup:

The scrubit program is a small Ruby script I wrote that strips out the header, footer, and nonessential items from the page:

Afterwards, I took all the posts and concatenated them into one big file, stripping out the non-printing characters in the process:
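The original command for this step is not reproduced here. As a rough Python sketch of the same idea, assuming the cleaned posts are `.txt` files in one directory (the directory layout and the names `strip_nonprinting` and `concatenate_posts` are hypothetical), it might look like this:

```python
import string
from pathlib import Path

# Printable ASCII plus ordinary whitespace; everything else is dropped.
PRINTABLE = set(string.printable)

def strip_nonprinting(text):
    """Remove control characters and anything outside printable ASCII."""
    return "".join(ch for ch in text if ch in PRINTABLE)

def concatenate_posts(post_dir, out_file):
    """Join every .txt file in post_dir into one cleaned seed file."""
    parts = []
    for path in sorted(Path(post_dir).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        parts.append(strip_nonprinting(raw))
    combined = "\n".join(parts)
    Path(out_file).write_text(combined, encoding="utf-8")
    return combined
```

Note that this filter is blunt: it also throws away legitimate non-ASCII characters such as curly quotes, which may or may not be what you want in the seed data.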

This seed data was fed into a Markov chain text generator of order 6 with an output length of 7,500 characters to produce some raw text. I then formatted that text for line breaks and trimmed it for length (it ended up being about 4,000 characters), and that is the text you see above.
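For the curious, here is a minimal sketch of what such a generator might look like in Python. It is not the generator used for this post. It assumes “order 6” means each state is the previous six characters, and the function names and the dead-end handling are choices made for this example.

```python
import random
from collections import defaultdict

def build_char_model(text, order=6):
    """Map each `order`-character state to the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(text, order=6, length=7500, rng=None):
    """Generate `length` characters from a seed text longer than `order`."""
    rng = rng or random.Random()
    model = build_char_model(text, order)
    # Start from a random state that is known to have a follower.
    start = rng.randrange(len(text) - order)
    out = text[start:start + order]
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:
            # Dead end (state seen only at the very end of the seed):
            # jump to another random state and carry on.
            start = rng.randrange(len(text) - order)
            out += text[start:start + order]
            continue
        out += rng.choice(followers)
    return out[:length]
```

Calling `generate(seed_text)` with the concatenated posts as `seed_text` would produce 7,500 characters of raw output. A higher order copies longer runs of the source verbatim; a lower order drifts into gibberish faster.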

Tagged with: markov chain, franken post

Author

Min Huang

Min is a Developer on the extensions team. Professional title: Mook. Fun fact: What’s a mook?

3 Comments

4 months ago

Welcome to the world of spamming circa 1998.

4 months ago

Interesting. I’ve assessed student dissertations that read just like that. Now I get it.

Joss
4 months ago

Lots of fun – could be improved with integration with a service like After The Deadline to check grammar… the one thing that really gives it away is the false verbs, so if there was a way of reading over it and correcting all the wrong tenses and fixing other grammatical quirks…. you’d be unstoppable.
