
Why does Java read a big file faster than C++?

nicescript 2021. 1. 15. 07:53


I have a 2 GB file (inputfile.txt) in which every line is a single word, like this:

apple
red
beautiful
smell
spark
input

I need to write a program that reads every word in the file and prints the word count. I wrote it in both Java and C++, and the result is surprising: Java runs 2.3 times faster than C++. My code is as follows:

C++:

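#include <cstdio>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

// Nanoseconds per second. NANO was not defined in the posted snippet;
// 1e9 is assumed here.
const double NANO = 1e9;

int main() {
    struct timespec ts, te;
    double cost;
    clock_gettime(CLOCK_REALTIME, &ts);   // start of the timed region

    ifstream fin("inputfile.txt");
    string word;
    int count = 0;
    // operator>> skips whitespace and extracts one token per iteration
    while(fin >> word) {
        count++;
    }
    cout << count << endl;

    clock_gettime(CLOCK_REALTIME, &te);   // end of the timed region
    cost = te.tv_sec - ts.tv_sec + (double)(te.tv_nsec-ts.tv_nsec)/NANO;
    printf("Run time: %-15.10f s\n", cost);

    return 0;
}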

Output:

5e+08
Run time: 69.311 s

Java:

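import java.io.BufferedReader;
import java.io.FileReader;

// The class name is hypothetical; the posted snippet showed only main().
public class ReadLines {
    public static void main(String[] args) throws Exception {

        long startTime = System.currentTimeMillis();

        FileReader reader = new FileReader("inputfile.txt");
        BufferedReader br = new BufferedReader(reader);
        String str = null;
        int count = 0;
        // readLine() returns one line per call, or null at end of file
        while((str = br.readLine()) != null) {
            count++;
        }
        System.out.println(count);

        long endTime = System.currentTimeMillis();
        // integer division: the elapsed time is reported in whole seconds
        System.out.println("Run time : " + (endTime - startTime)/1000 + "s");
    }
}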

Output:

5.0E8
Run time: 29 s

Why is Java faster than C++ in this situation, and how can I improve the performance of the C++ version?


You aren't comparing the same thing. The Java program reads lines, splitting on newlines, while the C++ program reads whitespace-delimited "words", which is a little extra work.

Try istream::getline.
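
A minimal sketch of the line-based loop, so that the C++ side does the same work as the Java readLine() loop. It uses the free function std::getline with a std::string instead of the member istream::getline, purely to avoid choosing a fixed buffer size; the function name count_lines_getline is made up for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Counts the lines in a file by reading up to each '\n'.
// Returns 0 if the file cannot be opened.
std::size_t count_lines_getline(const std::string& path) {
    std::ifstream fin(path);
    std::string line;
    std::size_t count = 0;
    // std::getline only looks for the delimiter; it does not classify
    // every character as whitespace or non-whitespace the way >> does.
    while (std::getline(fin, line)) {
        ++count;
    }
    return count;
}
```

Calling count_lines_getline("inputfile.txt") in place of the `fin >> word` loop gives the same count as long as every line holds exactly one word.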

Later

You could also try a basic read operation into a byte array and scan it for newlines yourself.

Even later

On my old Linux notebook, jdk1.7.0_21 and don't-tell-me-it's-old 4.3.3 take about the same time, comparing with C++ getline. (We have established that reading words is slower.) There isn't much difference between -O0 and -O2, which doesn't surprise me, given the simplicity of the code in the loop.

Last note

As I suggested, fin.read(buffer,LEN) with LEN = 1MB and using memchr to scan for '\n' results in another speed improvement of about 20%, which makes C (there isn't any C++ left by now) faster than Java.
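
The code for that last measurement was not posted, so the following is only a sketch of what a read-plus-memchr loop could look like. The function name count_newlines_chunked is an assumption, and the file is opened in binary mode here so that no text-mode translation happens.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Counts '\n' bytes by reading the file in 1 MB chunks and scanning each
// chunk with memchr. A final line without a trailing newline is not counted.
std::size_t count_newlines_chunked(const std::string& path) {
    constexpr std::size_t LEN = 1 << 20;  // 1 MB, as in the answer
    std::ifstream fin(path, std::ios::binary);
    std::vector<char> buffer(LEN);
    std::size_t count = 0;
    while (fin) {
        fin.read(buffer.data(), static_cast<std::streamsize>(LEN));
        // gcount() is the number of bytes actually read; it is smaller
        // than LEN for the last chunk.
        const std::size_t got = static_cast<std::size_t>(fin.gcount());
        const char* p = buffer.data();
        const char* end = p + got;
        while (p < end) {
            const void* hit =
                std::memchr(p, '\n', static_cast<std::size_t>(end - p));
            if (hit == nullptr) break;  // no more newlines in this chunk
            ++count;
            p = static_cast<const char*>(hit) + 1;
        }
    }
    return count;
}
```

Because no std::string is built per line, the only per-byte work left is the memchr scan.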


There are a number of significant differences in the way the languages handle I/O, all of which can make a difference, one way or another.

Perhaps the first (and most important) question is: how is the data encoded in the text file. If it is single-byte characters (ISO 8859-1 or UTF-8), then Java has to convert it into UTF-16 before processing; depending on the locale, C++ may (or may not) also convert or do some additional checking.

As has been pointed out (partially, at least), in C++, >> uses a locale specific isspace, getline will simply compare for '\n', which is probably faster. (Typical implementations of isspace will use a bitmap, which means an additional memory access for each character.)

Optimization levels and specific library implementations may also vary. It's not unusual in C++ for one library implementation to be 2 or 3 times faster than another.

Finally, a most significant difference: C++ distinguishes between text files and binary files. You've opened the file in text mode; this means that it will be "preprocessed" at the lowest level, before even the extraction operators see it. This depends on the platform: for Unix platforms, the "preprocessing" is a no-op; on Windows, it will convert CRLF pairs into '\n', which will have a definite impact on performance. If I recall correctly (I've not used Java for some years), Java expects higher level functions to handle this, so functions like readLine will be slightly more complicated. Just guessing here, but I suspect that the additional logic at the higher level costs less in runtime than the buffer preprocessing at the lower level. (If you are testing under Windows, you might experiment with opening the file in binary mode in C++. This should make no difference in the behavior of the program when you use >>; any extra CR will be considered white space. With getline, you'll have to add logic to remove any trailing '\r' to your code.)
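
As a rough illustration of that last suggestion, here is a sketch that opens the file in binary mode, reads with std::getline, and strips a trailing '\r' by hand. The function name is hypothetical, and it counts non-empty lines only so that the effect of the stripping is visible.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Counts non-empty lines of a file opened in binary mode.
// In binary mode CRLF is not translated, so each line read by getline
// may still end in '\r'; that byte is removed before the line is used.
std::size_t count_nonempty_lines_binary(const std::string& path) {
    std::ifstream fin(path, std::ios::binary);
    std::string line;
    std::size_t count = 0;
    while (std::getline(fin, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();  // drop the CR of a CRLF pair
        }
        if (!line.empty()) {
            ++count;
        }
    }
    return count;
}
```

Without the pop_back step, a blank line in a CRLF file would arrive as "\r" and be counted as non-empty.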


I would suspect that the main difference is that java.io.BufferedReader performs better than the std::ifstream because it buffers, while the ifstream does not. The BufferedReader reads large chunks of the file in advance and hands them to your program from RAM when you call readLine(), while the std::ifstream only reads a few bytes at a time when you prompt it to by calling the >>-operator.

Sequential access of large amounts of data from the hard drive is usually much faster than accessing many small chunks one at a time.

A fairer comparison would be to compare std::ifstream to the unbuffered java.io.FileReader.
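
One way to test the buffering hypothesis on the C++ side is to hand the stream an explicit, larger buffer and measure again. In the mainstream standard libraries std::ifstream reads through a std::filebuf that keeps its own internal buffer, whose default size is implementation-defined, so this sketch only changes the size. The effect of pubsetbuf is itself implementation-defined, the call generally has to come before open(), and the function name and the 1 MB default are assumptions.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Counts whitespace-delimited words while asking the stream to use a
// caller-supplied buffer of buffer_bytes bytes (1 MB by default).
std::size_t count_words_big_buffer(const std::string& path,
                                   std::size_t buffer_bytes = 1 << 20) {
    // Declared before the stream so that it outlives the stream.
    std::vector<char> buffer(buffer_bytes);
    std::ifstream fin;
    // Request the custom buffer before the file is opened; whether and
    // how the library honors this is implementation-defined.
    fin.rdbuf()->pubsetbuf(buffer.data(),
                           static_cast<std::streamsize>(buffer.size()));
    fin.open(path);
    std::string word;
    std::size_t count = 0;
    while (fin >> word) {
        ++count;
    }
    return count;
}
```

If the run time barely changes across buffer sizes, buffering is probably not the bottleneck in the original program.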


I am not an expert in C++, but at least the following can affect performance:

  1. OS-level caching of the file.
  2. For Java you are using a buffered reader, and the buffer size defaults to a page or something. I am not sure how C++ streams do this.
  3. Since the file is so big, the JIT would probably kick in, and it probably compiles the Java bytecode better than a C++ compiler does if you don't turn any optimization on.

Since I/O cost is the major cost here, I guess 1 and 2 are the major reasons.


I would also try using mmap instead of standard file read/write. This should let your OS handle the reading and writing while your application is only concerned with the data.

There's no situation where C++ can't be faster than Java, but sometimes it takes a lot of work from very talented people. But I don't think this one should be too hard to beat as it is a straightforward task.

mmap for Windows is described in File Mapping (MSDN).

ReferenceURL : https://stackoverflow.com/questions/22955178/why-does-java-read-a-big-file-faster-than-c
