Splitting sentences in C# using Stanford.NLP

I need to split some text into sentences. I have a pretty cool regex that does this; however, I want to try out Stanford.NLP for it. Let’s check it out.

  1. Create a Visual Studio C# project.
    I chose a New Console Project and named it SentenceSplitter.
  2. Right-click on the project and choose “Manage NuGet Packages.”
  3. Add the Stanford.NLP.CoreNLP NuGet package.
  4. Add the following code to Program.cs. (This is a variation of the code provided here: http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordCoreNLP.html)
    using edu.stanford.nlp.ling;
    using edu.stanford.nlp.pipeline;
    using java.util;
    using System;
    using System.IO;
    using Console = System.Console;
    
    namespace SentenceSplitter
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Path to the folder with models extracted from `stanford-corenlp-3.4-models.jar`
                var jarRoot = @"stanford-corenlp-3.4-models\";
    
                const string text = "I went or a run. Then I went to work. I had a good lunch meeting with a friend name John Jr. The commute home was pretty good.";
    
                // Annotation pipeline configuration
                var props = new Properties();
                props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
                props.setProperty("sutime.binders", "0");
    
                // We should change current directory, so StanfordCoreNLP could find all the model files automatically 
                var curDir = Environment.CurrentDirectory;
                Directory.SetCurrentDirectory(jarRoot);
                var pipeline = new StanfordCoreNLP(props);
                Directory.SetCurrentDirectory(curDir);
    
                // Annotation
                var annotation = new Annotation(text);
                pipeline.annotate(annotation);
    
                // these are all the sentences in this document
                // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
                var sentences = annotation.get(typeof(CoreAnnotations.SentencesAnnotation));
                if (sentences == null)
                {
                    return;
                }
                foreach (Annotation sentence in sentences as ArrayList)
                {
                    Console.WriteLine(sentence);
                }
            }
        }
    }
    

    Warning! If you try to run the code at this point, you will get the following exception: Unrecoverable error while loading a tagger model

    java.lang.RuntimeException was unhandled
      HResult=-2146233088
      Message=edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
      Source=stanford-corenlp-3.4
      StackTrace:
           at edu.stanford.nlp.pipeline.StanfordCoreNLP.4.create()
           at edu.stanford.nlp.pipeline.AnnotatorPool.get(String name)
           at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(Properties A_1, Boolean A_2)
           at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements)
           at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props)
           at SentenceSplitter.Program.Main(String[] args) in c:\Users\jbarneck\Documents\Projects\NLP\SentenceSplitter\SentenceSplitter\Program.cs:line 20
           at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
           at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
           at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
           at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
           at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
           at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
           at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
           at System.Threading.ThreadHelper.ThreadStart()
      InnerException: edu.stanford.nlp.io.RuntimeIOException
           HResult=-2146233088
           Message=Unrecoverable error while loading a tagger model
           Source=stanford-corenlp-3.4
           StackTrace:
                at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties config, String modelFileOrUrl, Boolean printLoading)
                at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile, Properties config, Boolean printLoading)
                at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile)
                at edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(String A_0, Boolean A_1)
                at edu.stanford.nlp.pipeline.POSTaggerAnnotator..ctor(String annotatorName, Properties props)
                at edu.stanford.nlp.pipeline.StanfordCoreNLP.4.create()
           InnerException: java.io.IOException
                HResult=-2146233088
                Message=Unable to resolve "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
                Source=stanford-corenlp-3.4
                StackTrace:
                     at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(String textFileOrUrl)
                     at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties config, String modelFileOrUrl, Boolean printLoading)
                InnerException: 
    
    
  5. Download the stanford-corenlp-full-3.4.x.zip file from here: http://nlp.stanford.edu/software/corenlp.shtml#Download
  6. Extract the stanford-corenlp-full-2014-6-16.x.zip.
    Note: Over time, as new versions come out, make sure the version you download matches the version of your NuGet package.
  7. Extract the stanford-corenlp-3.4-models.jar file to stanford-corenlp-3.4-models.
    I used 7zip to extract the jar file.
  8. Copy the stanford-corenlp-3.4-models folder to your Visual Studio project files.
    Note: This is one way to include the contents of the jar file in your project. Other options are a copy action or, better, an app.config appSetting that points to the folder. I chose this way because it makes all my files part of the project for this demo. I would probably use the app.config method in production.
  9. In Visual Studio, use Ctrl + left-click to highlight the stanford-corenlp-3.4-models folder and all subfolders.
  10. Open Properties (press F4), and change the Namespace Provider setting to False.
  11. In Visual Studio, use Ctrl + left-click to highlight the files under the stanford-corenlp-3.4-models folder and all files in all subfolders.
  12. Open Properties (Press F4), and change the Build Action to Content and the Copy to Output Directory setting to Copy if newer.
  13. Run the code.
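
The app.config approach mentioned in the note on step 8 could look something like the sketch below. The key name StanfordModelsRoot and the ModelLocator class are hypothetical names made up for this example, and ConfigurationManager requires a project reference to System.Configuration.

```csharp
using System;
using System.Configuration; // needs a project reference to System.Configuration
using System.IO;

namespace SentenceSplitter
{
    // Hypothetical helper for this post; it is not part of Stanford.NLP.
    public static class ModelLocator
    {
        // Hypothetical appSettings key. Example app.config entry:
        //   <appSettings><add key="StanfordModelsRoot" value="D:\nlp\models" /></appSettings>
        public const string SettingKey = "StanfordModelsRoot";

        // Folder name used in this post when nothing is configured.
        public const string DefaultFolder = "stanford-corenlp-3.4-models";

        // Pure function: falls back to the default folder and makes
        // relative paths absolute against the given base directory.
        public static string Resolve(string configuredValue, string baseDirectory)
        {
            var root = string.IsNullOrWhiteSpace(configuredValue)
                ? DefaultFolder
                : configuredValue.Trim();
            return Path.IsPathRooted(root) ? root : Path.Combine(baseDirectory, root);
        }

        // Reads the setting (null when the key is missing) and resolves it
        // relative to the folder the executable runs from.
        public static string GetModelsRoot()
        {
            return Resolve(
                ConfigurationManager.AppSettings[SettingKey],
                AppDomain.CurrentDomain.BaseDirectory);
        }
    }
}
```

With this in place, the hard-coded path in Program.cs could become var jarRoot = ModelLocator.GetModelsRoot(); and the models folder would no longer have to live inside the project.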

 

Note: At first I tried to just load the model file. That doesn’t work; I got an exception. I had to set jarRoot as shown above, and I needed to copy all the contents of the jar file.
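
If sentence boundaries are all you need, the later annotators may be unnecessary. The sketch below assumes that the tokenize and ssplit annotators are rule-based in CoreNLP 3.4 and load no model files, which would also sidestep the tagger exception shown earlier; treat that as an assumption to verify against your package version. The class name SplitOnlyProgram is made up for this example.

```csharp
using edu.stanford.nlp.ling;
using edu.stanford.nlp.pipeline;
using java.util;
using Console = System.Console;

namespace SentenceSplitter
{
    // Hypothetical variation of Program.cs that only tokenizes and splits.
    class SplitOnlyProgram
    {
        static void Main(string[] args)
        {
            const string text = "Exit Room A. Turn right. Go down the hall to the first door. Enter Room B.";

            // Only the two annotators needed for sentence boundaries.
            // Assumption: neither one needs the extracted models folder.
            var props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit");
            var pipeline = new StanfordCoreNLP(props);

            var annotation = new Annotation(text);
            pipeline.annotate(annotation);

            // Same lookup as in the full example above.
            var sentences = annotation.get(typeof(CoreAnnotations.SentencesAnnotation)) as ArrayList;
            if (sentences == null)
            {
                return;
            }
            foreach (var sentence in sentences)
            {
                Console.WriteLine(sentence);
            }
        }
    }
}
```

A shorter annotator list should also start faster, since the parser and coreference models are never loaded. It is not expected to change where the sentence breaks fall.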

Results

Notice that I threw it a curveball by ending a sentence with Jr. It still figured it out.

I went or a run. Then I went to work. I had a good lunch meeting with a friend name John Jr. The commute home was pretty good.

However, I just tried this paragraph and it did NOT detect the break after the first sentence.

Exit Room A. Turn right. Go down the hall to the first door. Enter Room B.

I am pretty sure this failure happens because the string looks just like a legitimate first name, middle initial, and last name.

Jared A. Barneck
Room A. Turn

Now the question is, how do I train it to not make such mistakes?
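
To see why this case is hard, here is a minimal sketch of a naive regex splitter. It is not the regex mentioned at the top of this post, just an illustration, and NaiveSplitter is a hypothetical name. It breaks after a period, exclamation mark, or question mark whenever whitespace and a capital letter follow.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace SentenceSplitter
{
    // Hypothetical illustration; not the author's regex and not Stanford.NLP.
    public static class NaiveSplitter
    {
        // Match whitespace that follows ., ! or ? and precedes an uppercase letter.
        private static readonly Regex Boundary = new Regex(@"(?<=[.!?])\s+(?=[A-Z])");

        // Splits the text at every boundary match and returns the pieces.
        public static IList<string> Split(string text)
        {
            return Boundary.Split(text.Trim());
        }
    }
}
```

NaiveSplitter.Split("Exit Room A. Turn right.") returns two pieces, which is what we want here. But NaiveSplitter.Split("Jared A. Barneck went home.") is also cut after “A.”, which is wrong. Both inputs share the same surface pattern (capital letter, period, space, capital letter), so a rule that only looks at the characters cannot tell them apart.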

The two-clause BSD License

Here is the two-clause BSD License, sometimes called the FreeBSD License or the Simplified BSD License.

Copyright (c) <Year>, <Owner>
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

How to compile WinNFSd with Visual Studio?

Recently I needed an NFS server on Windows, preferably written in C++. A long time ago I had used WinNFSd, and sure enough the project still exists on SourceForge, though unfortunately it hasn’t been updated since 2005.

However, I found that someone had updated it here: https://github.com/noodle1983/winnfsd-nd

So the big question is: how do you compile this on Windows with Visual Studio?

Step 1 – Download and extract the WinNFSd source

  1. Go to https://sourceforge.net/projects/winnfsd and download the source.
    Note: Alternatively, you can download the GitHub source, as it has an update you might like:
    https://github.com/noodle1983/winnfsd-nd
  2. Click the Zip button at the top of the page to download the source as a zip.
    Note: Alternatively, if you already have git working, you can clone this repo.
  3. Extract the zip file to a directory.  Remember where you extracted it as we will copy the source files later.

Step 2 – Create a new Visual Studio solution

  1. In Visual Studio, go to File | New | Project.
  2. Select Other Languages | Visual C++ | Empty Project.
    Note: Depending on your Visual Studio configuration, you may find Visual C++ in a different place.
  3. Name the solution WinNFSd.
  4. Click OK.

Step 3 – Add the WinNFSd files to your solution

  1. In Visual Studio, right-click on your Project and click Open Folder in Windows Explorer.
  2. Create a new folder to hold your source code.
    Note: I simply named my folder src.
  3. Copy the source you extracted in Step 1 into the src directory.
  4. Highlight all the files in the src directory.
  5. Drag the files into Visual Studio and drop them on your project.
Note: If you try to build now, you will get 22 errors in debug mode and maybe 17 in release mode.

Step 4 – Configure the project properties

  1. In Visual Studio, right-click on your project and choose Properties.
    Note: The Configuration should say Active(Debug) currently.
  2. Go to Configuration Properties | Linker | Input.
  3. Add ws2_32.lib to the Additional Dependencies.
  4. Change the Configuration to Release and add ws2_32.lib for release as well.

Step 5 – Handle the Visual Studio C++ Runtime

If you were to compile now and try to run your project on a different machine (not the one running Visual Studio), you would likely get an error due to a missing DLL. Here is the error you will likely receive.

The program can’t start because MSVCR100.dll is missing from your computer. Try reinstalling the program to fix this problem.

I am not going to explain the solution again here, because it is all documented here:
Avoiding the MSVCR100.dll or MSVCR100D.dll is missing error

Choose the best of the three solutions for you from the link above.

Note: For this single file exe, I prefer the statically linked option.

Step 6 – Build WinNFSd

  1. You should now be able to click Build | Build Solution and it should build.

You should be able to test both debug and release.

Note: I received 37 warnings, which would be nice to resolve, but I wouldn’t worry too much about them.

 

Using Open Source in Proprietary Software

I was consulted by a lawyer about open source licenses, and since it is not that difficult, I thought I would share some of the simplicity with you.

Open source software might seem like a nightmare only lawyers can understand. But this is not actually true. Determining whether an open source library is usable in a commercial product can become a simple task.

I am going to make this easy for you.

Think of open source licenses as you would a traffic light. Green is good to go, yellow is proceed with caution but be ready to stop, and red is stop, though with time it may become green.

  • Green licenses
    BSD/MIT/Apache licenses (and a few others)
  • Yellow Licenses
    LGPL, Sun/Oracle
  • Red Licenses
    GPL, 3rd Party Commercial

Green licenses

These are close to 100% free, but not quite. Usually the only restrictions are simple things you wouldn’t do anyway:

• “don’t remove license from the code files”
• “don’t use the name of the author or their place of business to promote your product”.

Since we don’t do those things anyway, the code is essentially 100% free.

These licenses are usually short and don’t require a lawyer as anyone can pretty much read and understand them.

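You usually don’t have to include these licenses in any “pop-up” license agreement. As long as the license remains in the source files where we store them internally, we are mostly fine. Do check the binary clause, though: the two-clause BSD license, for example, asks that binary distributions reproduce the copyright notice and disclaimer in the documentation or other materials provided with the distribution.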

You usually don’t even have to provide a mirror for this code and you can ignore requests for the source.

Yellow Licenses

LGPL and Sun/Oracle licenses aren’t as bad, but they can be a pain. Usually you are free to use these libraries. You cannot compile the source into your product or statically link to the library, or your code is essentially GPL (a Red license); but if you dynamically link to an LGPL library, you are usually free to go. You can make changes to the library, but you have to treat those changes as GPL.

So basically if you ship their software using their installer or a loose file but you don’t change the file, you are likely good to go with minimal effort.

You often have to include these licenses in a popup or wherever your own license is shown.

Red Licenses

GPL
GPL means that anything you write that touches the GPL code must be licensed with GPL as well. Also, if you distribute GPL code, you must provide a “mirror” or be able to provide the GPL code when asked for it. You must also provide your own code (which is now GPL) when asked for it.

People often call GPL a virus as anything that touches it is infected by the GPL as well. However, from another point of view, someone has spent a lot of time giving their code/knowledge to the world and they deserve the right to keep this knowledge free and it is just a way to prevent someone from stealing their knowledge.

Your company can use GPL just fine if two things are true: 1) the feature is isolated and 2) the feature is not really a “differentiator” in your market. For example, take a TFTP service. It really wouldn’t matter if a commercial company used a GPL TFTP server, because TFTP is isolated to its own service and is ubiquitous: there are hundreds of different TFTP servers out there.

It may also be possible that GPL won’t hurt you if you are first to market. If you release code and become a dominant player or market share leader, it might not matter that others can see your code.

3rd Party Commercial
3rd Party Commercial licenses absolutely cannot be used unless the company is willing to sell us the license and we are willing to pay the license fee. There may also be other agreements made in the license negotiation.

You often have to include these licenses in a popup.

Track your licenses

It is a simple task to track the software you use and the license it comes with. Create three lists (Green, Yellow, and Red) and life will be easy for you. You may only need to review a license once. Once you have placed a license in the Green list, any software that uses that license is in the Green list.

It is also important to keep a copy of the software you use and the license under which you obtained it. Imagine if your software uses a BSD licensed tool that is re-licensed using GPL in the next version. You can keep the old BSD licensed version and continue to use it.
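
The three lists can be as simple as a lookup table. The sketch below is only one way to do it; the names LicenseColor and LicenseLists are hypothetical, and the entries are seeded from the lists earlier in this post.

```csharp
using System;
using System.Collections.Generic;

namespace LicenseTracking
{
    public enum LicenseColor { Green, Yellow, Red }

    // Hypothetical tracker; extend the table as each license is reviewed.
    public static class LicenseLists
    {
        // Case-insensitive so "mit" and "MIT" map to the same entry.
        private static readonly Dictionary<string, LicenseColor> Reviewed =
            new Dictionary<string, LicenseColor>(StringComparer.OrdinalIgnoreCase)
            {
                { "BSD", LicenseColor.Green },
                { "MIT", LicenseColor.Green },
                { "Apache", LicenseColor.Green },
                { "LGPL", LicenseColor.Yellow },
                { "Sun/Oracle", LicenseColor.Yellow },
                { "GPL", LicenseColor.Red },
                { "3rd Party Commercial", LicenseColor.Red },
            };

        // Returns null when the license has not been reviewed yet,
        // so unknown licenses are never treated as Green by accident.
        public static LicenseColor? Lookup(string licenseName)
        {
            LicenseColor color;
            if (licenseName != null && Reviewed.TryGetValue(licenseName.Trim(), out color))
            {
                return color;
            }
            return null;
        }
    }
}
```

A call such as LicenseLists.Lookup("MIT") returns Green, while a license that is not in the table returns null, which is the signal that someone still needs to review it.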

Conclusion

Every license except a 3rd Party Commercial license is always a possibility. Only 3rd Party Commercial licenses may be denied to you, and usually they are available for enough money.

Be thorough and be careful.

Just be aware of the licenses and know when to use them and when not to.

See a previous post of mine:

Differences between the BSD/FreeBSD Copyrights and the GNU Public License (GPL)

DISCLAIMER

I am not a lawyer. I am not responsible in any way for the misuse of a license based on this post, even if the post has some piece of data that is blatantly wrong. It is the responsibility of the user of licensed or copyrighted software to make sure the license agreement or copyright is adhered to properly.