Removing all XML or HTML tags using Notepad++

Let’s say you have an XML or HTML document and you want to remove the tags, leaving only the text:

<h2>Shopping List</h2>
<ol>
	<li>Milk</li>
	<li>eggs</li>
	<li>butter</li>
	<li>cereal</li>
	<li>bananas</li>
	<li>apples</li>
	<li>orange juice</li>
	<li>yogurt</li>
	<li>bread</li>
	<li>cheese</li>
</ol>

This can be done quickly in a tool like Notepad++ using find and replace with regular expressions.

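  1. Go to Find and Replace (Search > Replace, or Ctrl+H).
  2. In "Find what", enter this regular expression: <[^>]+>
  3. Leave "Replace with" empty, so each match is deleted.
  4. Under "Search Mode", select "Regular expression".
  5. Make sure the cursor is at the start of the document.
  6. Click "Replace All".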

That is it.
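
For anyone who wants to script the same replacement instead of using the dialog, here is a minimal Python sketch. It assumes Python's `re` module treats this particular pattern the same way Notepad++ does (the pattern uses only basic syntax), and the name `strip_tags` is made up for this example.

```python
import re

# Same pattern as in step 2: "<", then one or more characters
# that are not ">", then the closing ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Delete every tag-like match, like Replace All with an empty replacement."""
    return TAG_PATTERN.sub("", markup)


# A shortened version of the shopping list from the post.
sample = "<h2>Shopping List</h2>\n<ol>\n\t<li>Milk</li>\n\t<li>eggs</li>\n</ol>"
print(strip_tags(sample))
```

As in Notepad++, only the tags are removed. Line breaks and indentation stay, so the result still has blank lines where tags such as `<ol>` stood alone on a line.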

6 Comments

  1. Clifton Willard says:

    Does nothing. Replace All: 0 occurrences were replaced.

  2. Chris says:

    Thanks, great help!

  3. Natasha C says:

    God Bless. I have no idea what I'm doing, and you've saved me. I downloaded corpora, wanting .txt files with no markup, but it gave me .xml and I was annoyed, but you've given me an easy way out.

  4. Rhyous says:

    Yeah... This is for quick and dirty lists from HTML code. Say, for example, you want to grab the list from an HTML drop-down menu to add to documentation. There are 100 items in the drop-down menu. So I right-click and "Inspect element" in Google Chrome, grab the HTML, and stick it in Notepad++. I use the steps to remove the HTML, and then I have a nice list to put in my documentation.

  5. Bob Pelerson says:

    echo "" | sed -E 's/<[^>]+>//g'

    It removes the comment. Don't parse XML with regex.
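
    One concrete way the regex can misfire is a comment that contains a ">" character. That may or may not be the case this commenter had in mind, since the input in the command above did not survive. The Python sketch below compares the regex with the standard library's `html.parser`, which treats a comment as a single unit. The names `TextExtractor`, `extract_text`, and `regex_strip` are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only text nodes; tags and comments are ignored by default."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return "".join(parser.chunks)


def regex_strip(markup: str) -> str:
    # The regex from the post, for comparison.
    return re.sub(r"<[^>]+>", "", markup)


tricky = "<!-- a > b --><li>Milk</li>"
print(repr(regex_strip(tricky)))   # the regex stops at the first ">" inside the comment
print(repr(extract_text(tricky)))  # the parser skips the whole comment
```

    For quick one-off lists the regex is usually enough. A parser is the safer choice when the markup may contain comments, scripts, or attribute values with ">" in them.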
