Closed Captioning File Cleanup

August 29, 2018

Code
JavaScript

The station is pledging right now, which means everyone is super busy and on edge. Four times a year we do pledge drives to raise funds. It's stressful enough interrupting programming, but there's also so much planning for each pledge drive: finding talent for pledge breaks, booking the studio, figuring out which programs we are filming breaks for, writing scripts, and on and on.

One of the programs we are pledging this time around is Leonard Cohen Tower of Song. I think it's a concert film. Other PBS stations have done pledge breaks on this film before, so the team writing the script put out a call asking if any other stations had a script prepared. This would help save them the time of writing one themselves. Another station responded, saying that while they didn't have a script -- their talent usually just ad libbed -- they did have a closed captioning file, which they shared with us so we could use bits from it to write our own script.

Great! But there was a problem: The formatting on the closed captioning file was super wonky. Here's what the first page looked like:

1 00:00:00,200 --> 00:00:01,568 <b>>> Ron Sexsmith playing</b>
2 00:00:01,568--> 00:00:03,236 <b>"Suzanne" from Leonard Cohen.</b>
3 00:00:03,236 -->00:00:04,804 <b>This is a wonderful tribute</b>
4 00:00:04,804 --> 00:00:06,272<b>concert, "Tower of Song."</b>
5 00:00:06,272 --> 00:00:07,974<b>And we are pleased to bring it</b>
6 00:00:07,974 --> 00:00:09,576<b>to you tonight right here on</b>
7 00:00:09,576 --> 00:00:11,344<b>this station, all made possible</b>
8 00:00:11,344 --> 00:00:12,379<b>by your support.</b>
9 00:00:12,379 --> 00:00:13,747<b>My name is Eric Luskin.</b>
10 00:00:13,747 --> 00:00:15,348<b>I'm here with Melissa Jones.</b>
11 00:00:15,348 --> 00:00:16,616<b>And this extraordinary</b>
12 00:00:16,616 --> 00:00:18,051<b>concert -- there's so much</b>

It's a mess of line numbers, time codes, and bold tags, and this was just one of eighty pages. It would take that team forever to strip out all of the extraneous stuff and group the lines into proper paragraphs.

Or, it could take me just a few lines of code, and some easy Sublime Text tricks. Here's how I did it.

Fix that Script!

So the basic breakdown of what I did was as follows:

Take all of those lines, put them into one, single-line string.
Use the .split() method on that single string to return an array of these various lines.
Map over that array, and for each element remove the time codes.
Join those arrays back together into a new, massive string.
Separate the new, massive string into paragraphs as one would expect in a Word document.

Let's get started.

Take all of those lines, put them into one, single-line string

Okay, I'll admit that I didn't do the programming for this step. I had other things on my plate besides cleaning up a closed captioning file. Instead, I used the Knowledge Walls Online Multi Line to Single Line Converter tool.
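For anyone who would rather skip the online tool, the same collapse can be sketched in a few lines of JavaScript. This is only a sketch: toSingleLine is a hypothetical helper name, and the sample assumes the caption file uses the usual one-item-per-line subtitle layout, which the file I received may not have followed exactly.

```javascript
// Hypothetical helper: collapse a multi-line caption file into a single line.
function toSingleLine(multiLineText) {
  return multiLineText
    .split(/\r?\n/)                      // break on Unix or Windows line endings
    .map((line) => line.trim())          // drop stray leading/trailing spaces
    .filter((line) => line.length > 0)   // skip the blank separator lines
    .join(" ");                          // glue everything back with single spaces
}

// Assumed layout: cue number, time code, and caption text each on their own line.
const sample = [
  "1",
  "00:00:00,200 --> 00:00:01,568",
  "<b>>> Ron Sexsmith playing</b>",
  "",
  "2",
  "00:00:01,568 --> 00:00:03,236",
  '<b>"Suzanne" from Leonard Cohen.</b>',
].join("\n");

console.log(toSingleLine(sample));
```

The result has the same shape as the string the converter tool produced: every cue number, time code, and caption on one line, separated by spaces.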

I stored this huge string in a variable named originalScript, which looked like this:

const originalScript =  '1 00:00:00,200 --> 00:00:01,568 <b>>> Ron Sexsmith playing</b> 2 00:00:01,568 --> 00:00:03,236 <b>"Suzanne" from Leonard Cohen.</b> 3 00:00:03,236 --> 00:00:04,804 <b>This is a wonderful tribute</b>[...]';

I've shortened it here as the string is truly massive.
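One thing to watch for when pasting the full text: captions like "I'm here with Melissa Jones." contain apostrophes, which would end a single-quoted string early unless they are escaped. A possible workaround, sketched here on two lines from the sample above, is a template literal (backticks). This assumes the caption text contains no backticks, backslashes, or ${ sequences of its own.

```javascript
// Backticks let the caption text keep its apostrophes and double quotes unescaped.
// Assumption: the text itself contains no backticks, backslashes, or "${".
const originalScript = `9 00:00:12,379 --> 00:00:13,747<b>My name is Eric Luskin.</b> 10 00:00:13,747 --> 00:00:15,348<b>I'm here with Melissa Jones.</b>`;

// The rest of the cleanup works the same way on this string.
const arrayFromScript = originalScript.split("<b>");

console.log(arrayFromScript.length); // 3
console.log(arrayFromScript[2]);     // I'm here with Melissa Jones.</b>
```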

Use the .split() method on that single string to return an array of these various lines

Now that I have my originalScript variable, I want to split it into an array. Once it's an array, I have a ton of useful array methods at my disposal. The common element here is the HTML <b> tag that exists on every line, so I start there.

const arrayFromScript = originalScript.split("<b>");

This gives me an array that looks like the following:

["1 00:00:00,200 --> 00:00:01,568 ", ">> Ron Sexsmith playing</b> 2 00:00:01,568 --> 00:00:03,236 ", '"Suzanne" from Leonard Cohen.</b> 3 00:00:03,236 --> 00:00:04,804 ', "This is a wonderful tribute</b>"]
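To see what the split does on a small scale, here is a minimal sketch using just the first two captions. The variable names sampleScript and chunks are made up for this example.

```javascript
// Two captions from the sample, already collapsed onto a single line.
const sampleScript =
  '1 00:00:00,200 --> 00:00:01,568 <b>>> Ron Sexsmith playing</b> ' +
  '2 00:00:01,568 --> 00:00:03,236 <b>"Suzanne" from Leonard Cohen.</b>';

// Splitting on "<b>" removes the tag and cuts the string wherever it appeared.
const chunks = sampleScript.split("<b>");

console.log(chunks.length); // 3, one more chunk than there are <b> tags
console.log(chunks[0]);     // the first cue number and time code, with no caption text
console.log(chunks[1]);     // caption text, then "</b>", then the NEXT cue's time code
```

Note that every chunk after the first starts with caption text, and that the very first chunk holds nothing but a time code.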

Map over that array, and for each element remove the time codes

This step breaks down into a couple of smaller ones.

First, each of the strings in that array can now be split along the closing bold tag.

const arrayOfLines = arrayFromScript.map((line) => {
  return line.split("</b>");
});

This portion of the code produces an array for each line that looks something like this:

[">> Ron Sexsmith playing", " 2 00:00:01,568 --> 00:00:03,236 "]
['"Suzanne" from Leonard Cohen.', " 3 00:00:03,236 --> 00:00:04,804 "]

From here, I want to keep only the first element of each of those arrays. So I call .shift() on the result of the split, right inside the map callback, to isolate the string that I want and discard the time stamps.
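Put together, the split and the shift might look like the sketch below. One wrinkle: the very first chunk contains no closing tag at all, so .shift() would hand back its time code untouched. The .slice(1) that skips it is an addition in this sketch, not one of the steps described above, and captionsOnly is a made-up name.

```javascript
// The array produced by splitting the shortened sample on "<b>".
const arrayFromScript = [
  "1 00:00:00,200 --> 00:00:01,568 ",
  ">> Ron Sexsmith playing</b> 2 00:00:01,568 --> 00:00:03,236 ",
  '"Suzanne" from Leonard Cohen.</b> 3 00:00:03,236 --> 00:00:04,804 ',
  "This is a wonderful tribute</b>",
];

const captionsOnly = arrayFromScript
  .slice(1) // assumption: the first chunk is only the leading time code, so skip it
  .map((line) => line.split("</b>").shift()); // keep the text before the closing tag

console.log(captionsOnly);
// [ '>> Ron Sexsmith playing', '"Suzanne" from Leonard Cohen.', 'This is a wonderful tribute' ]
```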

Join those arrays back together into a new, massive string

Easy.

const newscript = arrayOfLines.join(" ");

And now we are back to having one enormous string, but without time codes.
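The join only comes out clean if each element is already a plain string. As a rough illustration (the names nested and flat are made up here), joining the nested arrays from the split step directly would let the commas and time codes leak back in:

```javascript
// What the split on "</b>" produces: one [text, timeCode] pair per caption.
const nested = [
  [">> Ron Sexsmith playing", " 2 00:00:01,568 --> 00:00:03,236 "],
  ['"Suzanne" from Leonard Cohen.', " 3 00:00:03,236 --> 00:00:04,804 "],
];

// Joining the pairs directly stringifies each inner array, commas and all.
const leaky = nested.join(" ");
console.log(leaky.includes("00:00:01,568")); // true, the time codes are still there

// Keeping only the first element of each pair gives the clean result.
const flat = nested.map((pair) => pair[0]);
console.log(flat.join(" ")); // >> Ron Sexsmith playing "Suzanne" from Leonard Cohen.
```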

Separate the new, massive string into paragraphs as one would expect in a Word document

This was easily accomplished using Sublime Text. The closed caption file handily provided ">>" at the beginning of each paragraph, so I did a Find and Replace All with regular expression mode turned on, replacing each instance of ">>" with \n, the escape sequence for a new line. Copy and paste the result into a Word document, and we are good to go.
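The same find and replace could also be done in JavaScript rather than in the editor. Here is a minimal sketch, assuming ">>" only ever appears as a paragraph marker. The second paragraph in the sample is placeholder text, not from the real file, and swallowing the spaces around each marker is an extra nicety on top of what I did in Sublime Text.

```javascript
// Cleaned-up string with ">>" marking the start of each paragraph.
// The second paragraph is placeholder text for illustration only.
const newscript =
  '>> Ron Sexsmith playing "Suzanne" from Leonard Cohen. >> Placeholder for the next paragraph.';

const paragraphs = newscript
  .replace(/\s*>>\s*/g, "\n") // swap each marker, plus surrounding spaces, for a newline
  .trim();                    // drop the newline left over from the very first marker

console.log(paragraphs);
```

Each paragraph now sits on its own line, ready to paste into a document.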

The Final Code

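const originalScript =
  '1 00:00:00,200 --> 00:00:01,568 <b>>> Ron Sexsmith playing</b> 2 00:00:01,568 --> 00:00:03,236 <b>"Suzanne" from Leonard Cohen.</b> 3 00:00:03,236 --> 00:00:04,804 <b>This is a wonderful tribute</b>';

// Split on the opening bold tag that starts every caption.
const arrayFromScript = originalScript.split("<b>");

// Skip the first chunk (it holds only the leading time code, so .shift() would keep it),
// then keep just the text that comes before each closing tag.
const arrayOfLines = arrayFromScript.slice(1).map((line) => {
  return line.split("</b>").shift();
});

// Stitch the caption text back into one string.
const newscript = arrayOfLines.join(" ");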

All told, that's a quick 9 lines, an online tool, and some work in Sublime Text that saved my teammates an entire afternoon.